Mining the social web ch3

Post on 08-Jul-2015

Mining The Social Web

NAVER community: People Who Dream of Becoming Architects (아키텍트를 꿈꾸는 사람들)

Presenter: 김연기

Mail Boxes

Who sends mail?

Is there a particular time of day when replies arrive?

Who sends mail most often?

What are the hot topics these days?

Mbox

From santa@northpole.example.org Fri Dec 25 00:06:42 2009
Message-ID: <16159836.1075855377439@mail.northpole.example.org>
References: <88364590.8837464573838@mail.northpole.example.org>
In-Reply-To: <194756537.0293874783209@mail.northpole.example.org>
Date: Fri, 25 Dec 2001 00:06:42 -0000 (GMT)
From: St. Nick <santa@northpole.example.org>
To: rudolph@northpole.example.org
Subject: RE: FWD: Tonight
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Sounds good. See you at the usual location.

Thanks,
-S

-----Original Message-----
From: Rudolph
Sent: Friday, December 25, 2009 12:04 AM
To: Claus, Santa
Subject: FWD: Tonight

Santa -

Running a bit late. Will come grab you shortly. Standby.

Rudy

Begin forwarded message:

> Last batch of toys was just loaded onto sleigh.
>
> Please proceed per the norm.
>
> Regards,
> Buddy
>
> --
> Buddy the Elf
> Chief Elf
> Workshop Operations
> North Pole
> buddy.the.elf@northpole.example.org

From buddy.the.elf@northpole.example.org Fri Dec 25 00:03:34 2009
Message-ID: <88364590.8837464573838@mail.northpole.example.org>
Date: Fri, 25 Dec 2001 00:03:34 -0000 (GMT)
From: Buddy <buddy.the.elf@northpole.example.org>
To: workshop@northpole.example.org
Subject: Tonight
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Last batch of toys was just loaded onto sleigh.

Please proceed per the norm.

Regards,
Buddy

--
Buddy the Elf
Chief Elf
Workshop Operations
North Pole
buddy.the.elf@northpole.example.org

Mbox (converted to JSON)

[
  {
    "From": "St. Nick <santa@northpole.example.org>",
    "Content-Transfer-Encoding": "7bit",
    "To": [
      "rudolph@northpole.example.org"
    ],
    "parts": [
      {
        "content": "Sounds good. See you at the usual location.\n\nThanks,...",
        "contentType": "text/plain"
      }
    ],
    "References": "<88364590.8837464573838@mail.northpole.example.org>",
    "Mime-Version": "1.0",
    "In-Reply-To": "<194756537.0293874783209@mail.northpole.example.org>",
    "Date": "Fri, 25 Dec 2001 00:06:42 -0000 (GMT)",
    "Message-ID": "<16159836.1075855377439@mail.northpole.example.org>",
    "Content-Type": "text/plain; charset=us-ascii",
    "Subject": "RE: FWD: Tonight"
  },
  {
    "From": "Buddy <buddy.the.elf@northpole.example.org>",
    "Content-Transfer-Encoding": "7bit",
    "To": [
      "workshop@northpole.example.org"
    ],
    "parts": [
      {
        "content": "Last batch of toys was just loaded onto sleigh. \n\n...",
        "contentType": "text/plain"
      }
    ],
    "Mime-Version": "1.0",
    "Date": "Fri, 25 Dec 2001 00:03:34 -0000 (GMT)",
    "Message-ID": "<88364590.8837464573838@mail.northpole.example.org>",
    "Content-Type": "text/plain; charset=us-ascii",
    "Subject": "Tonight"
  }
]
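The slides do not show how the mbox text becomes JSON. One possible way, sketched here with only the Python 3 standard library (`mailbox`, `email.utils`, `json`), is to read each message and copy its headers and body into a dict. The helper names `message_to_doc` and `mbox_to_docs` and the abbreviated sample data are assumptions made for illustration, not the presenter's converter.

```python
import json
import mailbox
import os
import tempfile
from email.utils import getaddresses

# Tiny two-message mbox, abbreviated from the example on the slide.
SAMPLE_MBOX = """\
From santa@northpole.example.org Fri Dec 25 00:06:42 2009
Message-ID: <16159836.1075855377439@mail.northpole.example.org>
From: St. Nick <santa@northpole.example.org>
To: rudolph@northpole.example.org
Subject: RE: FWD: Tonight
Content-Type: text/plain; charset=us-ascii

Sounds good. See you at the usual location.

From buddy.the.elf@northpole.example.org Fri Dec 25 00:03:34 2009
Message-ID: <88364590.8837464573838@mail.northpole.example.org>
From: Buddy <buddy.the.elf@northpole.example.org>
To: workshop@northpole.example.org
Subject: Tonight
Content-Type: text/plain; charset=us-ascii

Last batch of toys was just loaded onto sleigh.
"""


def message_to_doc(msg):
    """Turn one (non-multipart) message into a JSON-serializable dict."""
    doc = dict(msg.items())
    # Store recipients as a list of bare addresses, as in the slide's JSON.
    if "To" in doc:
        doc["To"] = [addr for _, addr in getaddresses([doc["To"]])]
    doc["parts"] = [{
        "content": msg.get_payload().strip(),
        "contentType": msg.get_content_type(),
    }]
    return doc


def mbox_to_docs(path):
    """Read every message in the mbox file at `path` into a list of dicts."""
    box = mailbox.mbox(path)
    try:
        return [message_to_doc(m) for m in box]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w") as fh:
        fh.write(SAMPLE_MBOX)
    docs = mbox_to_docs(path)

print(json.dumps(docs, indent=2))
```

The `From ` line that starts each message is the mbox separator, so it is consumed by the parser and does not appear among the headers of the resulting document.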

Mbox + couchDB

Storing the messages in a database makes it possible to compute statistics on them.

Provides a JSON API

couchDB

Document-oriented database server

Provides a JSON API

Views

Schema-Free
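To make the "JSON API" and "Schema-Free" points concrete, the sketch below builds (but does not send) the HTTP requests that would store two differently shaped documents in one database. The database name `northpole`, the document IDs, and the helper `build_put_document` are hypothetical; the server address is the default one used in the slides' own code.

```python
import json
from urllib.request import Request

# Assumption: CouchDB listening on its default port, as in the slides' code.
COUCHDB_URL = "http://localhost:5984"


def build_put_document(db_name, doc_id, doc):
    """Build the PUT request that stores `doc` under `doc_id` (not sent here)."""
    body = json.dumps(doc).encode("utf-8")
    return Request(
        url=f"{COUCHDB_URL}/{db_name}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Schema-free: both documents go into the same database without a shared field list.
mail = {"From": "Buddy <buddy.the.elf@northpole.example.org>", "Subject": "Tonight"}
note = {"author": "santa", "tags": ["sleigh", "toys"]}

requests_to_send = [
    build_put_document("northpole", "mail-1", mail),
    build_put_document("northpole", "note-1", note),
]

for req in requests_to_send:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

With a server running, each request could be sent with `urllib.request.urlopen(req)`; the sketch stops short of that so it runs without CouchDB installed.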

couchDB

Install couchdb on CentOS

yum install couchdb
/etc/init.d/couchdb start

couchDB + Python

Install Couchdb Kit (on CentOS)

$ curl -O http://peak.telecommunity.com/dist/ez_setup.py
$ sudo python ez_setup.py -U setuptools

Reference: http://pypi.python.org/pypi/setuptools#rpm-based-systems

Python – Couchdb API http://packages.python.org/CouchDB

couchDB + Python

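# -*- coding: utf-8 -*-

import sys
import os
import couchdb

# Prefer the faster jsonlib2 if it is installed; fall back to the standard library.
try:
    import jsonlib2 as json
except ImportError:
    import json

JSON_MBOX = sys.argv[1]  # i.e. enron.mbox.json
# Name the database after the file: enron.mbox.json -> enron
DB = os.path.basename(JSON_MBOX).split('.')[0]

server = couchdb.Server('http://localhost:5984')
db = server.create(DB)

# Load every message and store them all in one bulk update.
docs = json.loads(open(JSON_MBOX).read())
db.update(docs, all_or_nothing=True)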

couchDB - Views

from couchdb.design import ViewDefinition

# db is the couchdb database object the messages were loaded into.

def dateTimeToDocMapper(doc):
    # Note that you need to include imports used by your mapper
    # inside the function definition
    from dateutil.parser import parse
    from datetime import datetime as dt

    if doc.get('Date'):
        # [year, month, day, hour, min, sec]
        _date = list(dt.timetuple(parse(doc['Date']))[:-3])
        yield (_date, doc)

# Specify an index to back the query. Note that the index won't be
# created until the first time the query is run
view = ViewDefinition('index', 'by_date_time', dateTimeToDocMapper,
                      language='python')
view.sync(db)
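What such a view gives you can be illustrated without a CouchDB server. The sketch below emits the same `[year, month, day, hour, min, sec]` key for each document, keeps the rows sorted by key, and answers a key-range query. It swaps `dateutil` for the standard library's `email.utils.parsedate_to_datetime`, and `date_time_key`, `build_index`, and `query` are hypothetical helpers that only imitate a view; they are not part of couchdb-python.

```python
from email.utils import parsedate_to_datetime


def date_time_key(doc):
    """Mirror the mapper's key: [year, month, day, hour, min, sec]."""
    if doc.get("Date"):
        return list(parsedate_to_datetime(doc["Date"]).timetuple()[:6])
    return None


def build_index(docs):
    """Emit (key, doc) rows for dated docs and keep them sorted by key."""
    rows = [(date_time_key(d), d) for d in docs if date_time_key(d)]
    return sorted(rows, key=lambda row: row[0])


def query(index, startkey, endkey):
    """Return the rows whose key lies in [startkey, endkey], both inclusive."""
    return [row for row in index if startkey <= row[0] <= endkey]


docs = [
    {"Subject": "RE: FWD: Tonight", "Date": "Fri, 25 Dec 2001 00:06:42 -0000 (GMT)"},
    {"Subject": "Tonight", "Date": "Fri, 25 Dec 2001 00:03:34 -0000 (GMT)"},
    {"Subject": "No date header"},
]

index = build_index(docs)
# Everything sent in the first five minutes after midnight on 25 Dec 2001.
early = query(index, [2001, 12, 25, 0, 0, 0], [2001, 12, 25, 0, 5, 0])
print([doc["Subject"] for _, doc in early])  # ['Tonight']
```

Because the key is a list ordered from year down to second, a range over keys is a range over time, which is what makes questions such as "when do replies arrive?" cheap to ask.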

couchDB – Map/Reduce

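from couchdb.design import ViewDefinition

# db is the couchdb database object the messages were loaded into.

def dateTimeCountMapper(doc):
    from dateutil.parser import parse
    from datetime import datetime as dt

    if doc.get('Date'):
        # Emit 1 per message, keyed by [year, month, day, hour, min, sec]
        _date = list(dt.timetuple(parse(doc['Date']))[:-3])
        yield (_date, 1)

def summingReducer(keys, values, rereduce):
    # Summing works for both the reduce and the rereduce pass
    return sum(values)

view = ViewDefinition('index', 'doc_count_by_date_time', dateTimeCountMapper,
                      reduce_fun=summingReducer, language='python')
view.sync(db)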
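The effect of the summing reducer is easiest to see when results are grouped by a prefix of the key, which CouchDB exposes through the `group_level` query parameter. The following is a minimal in-memory imitation, assuming the same date key as the mapper above; `count_by_date` is a hypothetical helper, the sample dates are made up for illustration, and `email.utils` again stands in for `dateutil`.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def count_by_date(docs, group_level):
    """Sum the mapper's 1s per key prefix of length `group_level`."""
    counts = Counter()
    for doc in docs:
        if not doc.get("Date"):
            continue
        key = parsedate_to_datetime(doc["Date"]).timetuple()[:6]
        counts[key[:group_level]] += 1  # same result as sum(values) per group
    return dict(counts)


docs = [
    {"Date": "25 Dec 2001 00:06:42 -0000"},
    {"Date": "25 Dec 2001 00:03:34 -0000"},
    {"Date": "26 Dec 2001 09:15:00 -0000"},
]

per_day = count_by_date(docs, group_level=3)   # keys are (year, month, day)
per_hour = count_by_date(docs, group_level=4)  # keys are (year, month, day, hour)
print(per_day)
print(per_hour)
```

Grouping at level 3 counts messages per day and level 4 counts them per hour, so one view can serve several granularities of the same statistic.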

couchDB – Lucene

Java-based search engine library

Look Who’s Talking

Query couchdb-lucene for the message IDs that match the search term.

Find every mail that carries one of those message IDs.

From those mails, extract the unique e-mail addresses.
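The three steps can be sketched in memory as follows. Two things here are assumptions rather than the presenter's method: step 1 uses a naive substring match as a stand-in for couchdb-lucene, and step 2 treats a mail as "carrying" a message ID if it either has that ID or lists it in its `References` header. The sample mails, their short message IDs, and all function names are hypothetical.

```python
from email.utils import getaddresses, parseaddr

MAILS = [
    {"Message-ID": "<1@x>", "From": "Buddy <buddy.the.elf@northpole.example.org>",
     "To": ["workshop@northpole.example.org"], "Subject": "Tonight",
     "body": "Last batch of toys was just loaded onto sleigh."},
    {"Message-ID": "<2@x>", "From": "St. Nick <santa@northpole.example.org>",
     "To": ["rudolph@northpole.example.org"], "Subject": "RE: FWD: Tonight",
     "References": "<1@x>", "body": "Sounds good."},
    {"Message-ID": "<3@x>", "From": "Rudolph <rudolph@northpole.example.org>",
     "To": ["santa@northpole.example.org"], "Subject": "Carrots",
     "body": "Any carrots left?"},
]


def search_message_ids(mails, term):
    """Step 1 stand-in: substring search instead of a couchdb-lucene query."""
    return {m["Message-ID"] for m in mails if term.lower() in m["body"].lower()}


def mails_for_ids(mails, message_ids):
    """Step 2: mails that have, or refer back to, one of the message IDs."""
    return [m for m in mails
            if m["Message-ID"] in message_ids
            or set(m.get("References", "").split()) & message_ids]


def unique_addresses(mails):
    """Step 3: unique sender and recipient addresses, sorted for display."""
    found = set()
    for m in mails:
        found.add(parseaddr(m["From"])[1])
        found.update(addr for _, addr in getaddresses(m.get("To", [])))
    return sorted(found)


ids = search_message_ids(MAILS, "sleigh")
talkers = unique_addresses(mails_for_ids(MAILS, ids))
print(talkers)
```

Searching for "sleigh" matches only Buddy's message, but the reply that references it is pulled in at step 2, so Santa and Rudolph show up among the people talking about the topic while the unrelated "Carrots" mail is ignored.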
