Page 1: 1 How to Build a Search Engine 樂倍達數位科技股份有限公司 范綱岷( Kung-Ming Fung ) kmfung@doubleservice.com 2008/04/01.

How to Build a Search Engine

樂倍達數位科技股份有限公司
范綱岷 (Kung-Ming Fung), kmfung@doubleservice.com

2008/04/01

樂倍達數位科技股份有限公司, http://www.doubleservice.com/

Outline

Introduction
Different Kinds of Search Engine
Architecture
  Robot, Spider, Crawler
  HTML and HTTP
  Indexing
  Keyword Search
Evaluation Criteria
Related Work
Discussion
  About Google
  Ajax: A New Approach to Web Applications
References

Introduction

Different Kinds of Search Engine

Directory Search
Full Text Search
  Web pages
  News
  Images
  …
Meta Search

Number of pages covered: Directory < Full text < Meta

Directory Search
  ODP: Open Directory Project, http://dmoz.org/
Full-Text Search
  Google, http://www.google.com/

Meta Search (integrated search)
  MetaCrawler, http://www.metacrawler.com/
  愛幫 (Aibang), http://www.aibang.com/

Simplified control flow of the meta search engine

Reference: Context and Page Analysis for Improved Web Search, http://www.neci.nec.com/~lawrence/papers.html

Architecture

Simple Architecture (figure components): WWW; Robot, Spider, Crawler; Database; Indexing; Keyword Search

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

Typical high-level architecture of a Web crawler

Typical anatomy of a large-scale crawler.

Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

High-Level Google Architecture

Reference: A Survey On Web Information Retrieval Technologies

The architecture of a standard meta search engine.

Reference: Web Search – Your Way

Reference: Web Search – Your Way

The architecture of a meta search engine.

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

Cyclic architecture for search engines

Robot, Spider, Crawler

The Robot is the software in a search engine that is responsible for collecting data; it is also called a Spider or a Crawler. It automatically gathers web pages from sites on a regular schedule within a configured period. It usually starts from a set of predefined seed sites, visits the sites they link to, and repeats this recursively so that collection keeps chaining from link to link.

A major performance bottleneck is DNS lookup.
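One common way to reduce the DNS cost is to cache lookups so that each hostname is resolved only once. The sketch below is a minimal illustration, not a method taken from the slides: the `DnsCache` class, the injectable `resolver` parameter, and the example hostname and address are all assumptions made for the example.

```python
import socket
from typing import Callable, Dict


class DnsCache:
    """Remember hostname -> IP results so each host is resolved only once."""

    def __init__(self, resolver: Callable[[str], str] = socket.gethostbyname):
        # `resolver` is injectable so the cache can be exercised without a network.
        self._resolver = resolver
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of real lookups performed

    def resolve(self, hostname: str) -> str:
        key = hostname.lower()  # hostnames are case-insensitive
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._resolver(key)
        return self._cache[key]


# Hypothetical resolver standing in for real DNS.
fake_dns = {"www.example.com": "192.0.2.10"}
cache = DnsCache(resolver=lambda host: fake_dns[host])
first = cache.resolve("www.example.com")
second = cache.resolve("WWW.EXAMPLE.COM")  # served from the cache
```

A production crawler would also expire entries and resolve many names concurrently; neither is shown here.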

Goal

Resolving the hostname in the URL to an IP address using DNS (Domain Name System).
Connecting a socket to the server and sending the request.
Receiving the requested page in response.
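The three steps can be sketched with Python's standard `socket` module. This is a minimal sketch that assumes plain HTTP on port 80 unless the URL says otherwise; the function names `build_request` and `fetch` are illustrative, and there is no redirect, HTTPS, or error handling.

```python
import socket
from urllib.parse import urlsplit


def build_request(url: str) -> bytes:
    """Build a minimal HTTP/1.0 GET request for `url`."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"GET {path} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n".encode("ascii")


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Resolve, connect, send, and receive; returns the raw response bytes."""
    parts = urlsplit(url)
    # Step 1: resolve the hostname to an IP address using DNS.
    ip_address = socket.gethostbyname(parts.hostname)
    # Step 2: connect a socket to the server and send the request.
    with socket.create_connection((ip_address, parts.port or 80), timeout=timeout) as sock:
        sock.sendall(build_request(url))
        # Step 3: receive the response until the server closes the connection.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("http://www.ntust.edu.tw/")` would perform a real network request, so it is left out of the example.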

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

Amount of static and dynamic pages at a given depth

Dynamic pages: 5 levels
Static pages: 15 levels

Policy

A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading Web sites.
A parallelization policy that states how to coordinate distributed Web crawlers.

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
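As an illustration of the politeness policy, one simple approach is to enforce a minimum delay between two requests to the same host. The sketch below assumes that approach; the `PolitenessGate` class, the 10-second delay, and the example URLs are hypothetical and are not taken from the cited thesis.

```python
from typing import Dict
from urllib.parse import urlsplit


class PolitenessGate:
    """Enforce a minimum delay between two requests to the same host."""

    def __init__(self, min_delay: float = 10.0):
        self.min_delay = min_delay
        self._next_allowed: Dict[str, float] = {}  # host -> earliest next fetch time

    def wait_time(self, url: str, now: float) -> float:
        """Seconds the crawler should still wait before fetching `url`."""
        host = urlsplit(url).hostname or ""
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Note that `url` was fetched at time `now`."""
        host = urlsplit(url).hostname or ""
        self._next_allowed[host] = now + self.min_delay


gate = PolitenessGate(min_delay=10.0)
gate.record_fetch("http://example.com/a", now=100.0)
same_host = gate.wait_time("http://example.com/b", now=103.0)   # 7.0 seconds left
other_host = gate.wait_time("http://example.org/", now=103.0)   # 0.0, never fetched
```

Time is passed in explicitly so the behaviour is easy to check; a crawler would typically pass `time.monotonic()`.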

The view of Web Crawler

Reference: Structural abstractions of hypertext documents for Web-based retrieval

Flow of a basic sequential crawler

Reference: Crawling the Web.
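The figure itself is not in this transcript. A basic sequential crawler is commonly described as a loop over a frontier of unvisited URLs, and the sketch below assumes that structure with a first-in, first-out frontier. The `crawl` function, the `fetch_links` callback, and the toy link graph are illustrative assumptions, not the flow chart from the cited paper.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          max_pages: int = 100) -> List[str]:
    """Visit pages breadth-first from `seeds`; return URLs in visit order."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(frontier)      # URLs already queued, to avoid revisiting
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):   # fetch the page and extract its links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "web": page -> outgoing links.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: toy_web.get(url, []))
```

Swapping the deque for a priority queue would turn the same loop into one of the ordered strategies such as backlink-count.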

Reference: Crawling the Web.

A multi-threaded crawler model

HTML and HTTP

HTML – Hypertext Markup Language
HTTP – Hypertext Transfer Protocol
TCP – Transmission Control Protocol
  HTTP is built on top of TCP.
Hyperlink
  A hyperlink is expressed as an anchor tag with an href attribute.
  <a href="http://www.ntust.edu.tw/">NTUST</a>
URL – Uniform Resource Locator (http://www.ntust.edu.tw/)
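A crawler finds new URLs by reading the href attribute of each anchor tag. The following is a minimal sketch using the standard `html.parser` and `urllib.parse.urljoin`; the `LinkExtractor` class and the `/about.html` link are made up for the example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links become absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<p><a href="http://www.ntust.edu.tw/">NTUST</a> <a href="/about.html">About</a></p>'
links = extract_links(page, "http://www.ntust.edu.tw/index.html")
```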

GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Sat, 13 Jan 2001 09:01:02 GMT
Server: Apache/1.3.0 (Unix) PHP/3.0.4
Last-Modified: Wed, 20 Dec 2000 13:18:38 GMT
Accept-Ranges: bytes
Content-Length: 5437
Connection: Close
Content-Type: text/html

<html><head><title>NTUST</title></head><body>…</body></html>

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

For checking a URL

Operation of a crawler

Reference: Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering.

Reference: Crawling on the World Wide Web.

Get new URLs

HTML Tag Tree

Reference: Crawling the Web.

Strategies

Breadth-first
Backlink-count
Batch-pagerank
Partial-pagerank
OPIC (On-line Page Importance Computation)
Larger-sites-first

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

Re-visit policy

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:
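The formula was an image on the slide and is missing from this transcript. The cited Wikipedia article gives the definition below; the notation F_p(t) follows that article rather than the slide text.

```latex
F_p(t) =
\begin{cases}
1 & \text{if } p \text{ is equal to the local copy at time } t \\
0 & \text{otherwise}
\end{cases}
```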

Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository, at time t is defined as:
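The formula was an image on the slide and is missing from this transcript. The cited Wikipedia article gives the definition below; the notation A_p(t) follows that article rather than the slide text.

```latex
A_p(t) =
\begin{cases}
0 & \text{if } p \text{ is not modified at time } t \\
t - \text{modification time of } p & \text{otherwise}
\end{cases}
```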

Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.

Robot Exclusion

http://www.robotstxt.org/wc/exclusion.html
The robots exclusion protocol
The robots META tag

The Robots Exclusion Protocol - /robots.txt

Where to create the robots.txt file?

Example:

Site URL              Corresponding robots.txt URL
http://www.w3.org/    http://www.w3.org/robots.txt

URLs are case sensitive, and "/robots.txt" must be all lower-case.

Examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

To exclude all files except one

The original protocol has no "Allow" field, so one option is to move every file to be excluded into a separate directory and disallow that directory:
User-agent: *
Disallow: /~joe/docs/

Alternatively, explicitly disallow each excluded page:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

A sample robots.txt file

# AltaVista Search
User-agent: AltaVista Intranet V2.0 W3C Webreq
Disallow: /Out-Of-Date/

# Exclude some access-controlled areas
User-agent: *
Disallow: /Team/
Disallow: /Project/
Disallow: /Sytems/
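On the crawler side, rules like these can be checked with Python's standard `urllib.robotparser`. The sketch below combines two of the robots.txt examples shown earlier in the deck (exclude a single robot, exclude part of the server) into one rule set; the user-agent name "MyCrawler" and the host example.com are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules combined from the robots.txt examples; normally fetched from /robots.txt.
rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "MyCrawler" and example.com are placeholders for a real crawler and site.
allowed_home = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_cgi = parser.can_fetch("MyCrawler", "http://example.com/cgi-bin/search")
allowed_badbot = parser.can_fetch("BadBot", "http://example.com/index.html")
```

Here `allowed_home` is true, while `allowed_cgi` and `allowed_badbot` are false. For a live site, `set_url()` followed by `read()` would download the file instead of parsing a string.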

The Robots META Tag

<meta name="robots" content="noindex,nofollow">

Like any META tag it should be placed in the HEAD section of an HTML page:

<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...

Examples:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

Index: whether an indexing robot should index the page
Follow: whether a robot is to follow links on the page

The defaults are INDEX and FOLLOW.
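A crawler can read these directives while parsing the page. The sketch below is one minimal way to do it with the standard `html.parser`, starting from the INDEX and FOLLOW defaults; the class and function names are illustrative, and other content values are ignored.

```python
from html.parser import HTMLParser
from typing import Tuple


class RobotsMetaParser(HTMLParser):
    """Track the index/follow directives of <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.index = True    # default: INDEX
        self.follow = True   # default: FOLLOW

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for token in (attributes.get("content") or "").lower().split(","):
            token = token.strip()
            if token == "noindex":
                self.index = False
            elif token == "index":
                self.index = True
            elif token == "nofollow":
                self.follow = False
            elif token == "follow":
                self.follow = True


def robots_directives(html: str) -> Tuple[bool, bool]:
    """Return (may_index, may_follow) for an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


directives = robots_directives(
    '<html><head><meta name="robots" content="noindex,follow"></head></html>'
)
```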

Indexing

In general, an index is built by storing every word or phrase of a web page in a keyword index file. Besides the page content itself, the keywords that the page author defines in the Meta tag are also commonly included in the index.

TF, IDF, Reverse (Inverted) Index
Stop words

Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing

(b) is an inverted index of (a)

d1: My(1) care(2) is(3) loss(4) of(5) care(6) with(7) old(8) care(9) done(10).
d2: Your(1) care(2) is(3) gain(4) of(5) care(6) with(7) new(8) care(9) won(10).

tid: token ID
did: document ID
pos: position

tid    did  pos
my     1    1
care   1    2
is     1    3
…
new    2    8
care   2    9
won    2    10

Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

d1: My care is loss of care with old care done.
d2: Your care is gain of care with new care won.

Document IDs only:
my -> d1
care -> d1; d2
is -> d1; d2
loss -> d1
of -> d1; d2
with -> d1; d2
old -> d1
done -> d1
your -> d2
gain -> d2
new -> d2
won -> d2

Document IDs with positions:
my -> d1/1
care -> d1/2,6,9; d2/2,6,9
is -> d1/3; d2/3
loss -> d1/4
of -> d1/5; d2/5
with -> d1/7; d2/7
old -> d1/8
done -> d1/10
your -> d2/1
gain -> d2/4
new -> d2/8
won -> d2/10

Two variants of the inverted index data structure.
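Both variants can be built from the two example documents in a few lines. The sketch below is a minimal in-memory version; the tokenizer (lower-cased runs of letters), the nested-dictionary layout, and the function name are assumptions for illustration, not the book's implementation.

```python
import re
from collections import defaultdict
from typing import Dict, List

documents = {
    "d1": "My care is loss of care with old care done.",
    "d2": "Your care is gain of care with new care won.",
}


def build_positional_index(docs: Dict[str, str]) -> Dict[str, Dict[str, List[int]]]:
    """Map each term to {document ID: [1-based word positions]}."""
    index: Dict[str, Dict[str, List[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())  # simplistic tokenizer
        for position, token in enumerate(tokens, start=1):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


# Variant with positions.
positional = build_positional_index(documents)
# Variant with document IDs only, derived by dropping the positions.
doc_only = {term: sorted(postings) for term, postings in positional.items()}
```

With this data, `positional["care"]` is `{"d1": [2, 6, 9], "d2": [2, 6, 9]}` and `doc_only["my"]` is `["d1"]`, matching the postings listed above.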

Usually stored on disk
Implemented using a B-tree or a hash table

Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.

Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

Keyword Search

The retrieval software is the key factor that decides whether a search engine is widely adopted, because users can usually judge a system only by its search speed and its search results, and both fall within the scope of the retrieval software.

Artificial intelligence, natural language
Ranking: PageRank, HITS
Query Expansion

WAIS: Wide Area Information Server (WAIS) is software that builds full-text indexes and provides full-text retrieval of network resources. It consists of three main parts: the server, the client, and the protocol.

Query methods:
Keyword
Concept-based
Fuzzy
Natural Language

PageRank

A page can have a high PageRank if there are many pages pointing to it, or if there are some pages that point to it and have a high PageRank.

Reference: A Survey On Web Information Retrieval Technologies

We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. Google usually sets d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
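The formula was an image on the slide and is missing from this transcript. With the symbols defined above, the PageRank formula as published by Brin and Page reads:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```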

Reference: A Survey On Web Information Retrieval Technologies
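PageRank is usually computed iteratively: start every page at the same value and repeatedly apply PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The sketch below assumes a tiny hypothetical link graph in which every page has at least one outgoing link and no duplicate links; the function name and iteration count are illustrative choices.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], d: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = list(links)
    rank = {page: 1.0 for page in pages}   # initial guess
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each page T that links here contributes PR(T) / C(T).
            incoming = sum(rank[src] / len(out)
                           for src, out in links.items() if page in out)
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_graph)
```

In this graph C ends up with the highest rank because both A and B point to it, and A comes second because the highly ranked C points to it.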

HITS: Hyperlink Induced Topic Search

A good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.

Authorities: good sources of content
Hubs: good sources of links

Reference: A Survey On Web Information Retrieval Technologies
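The mutual definition of hubs and authorities can be turned into an iteration: an authority score is the sum of the hub scores of the pages pointing to it, and a hub score is the sum of the authority scores of the pages it points to. The sketch below assumes that standard formulation with Euclidean normalization after each round; the toy graph and names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def hits(links: Dict[str, List[str]],
         iterations: int = 20) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub scores, authority scores) for a link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that point to it.
        auth = {p: sum(hub[s] for s, targets in links.items() if p in targets)
                for p in pages}
        # Hub: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: h1 points to a1 and a2, h2 points to a1.
toy_graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy_graph)
```

Here a1 gets the highest authority score and h1 the highest hub score.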

Query Expansion

Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing

Evaluation Criteria

Recall: the proportion of the relevant documents that the query returns

$$\text{recall} = \frac{\text{Number of retrieved items that are relevant}}{\text{Total number of relevant documents in the database}}$$

Example:

A query is run against a database that contains 80 relevant documents. Of the 50 items retrieved, only 20 are relevant and the other 30 are not, so recall = 20/80 = 0.25.

Precision: the proportion of the retrieved items that are relevant

$$\text{precision} = \frac{\text{Number of retrieved items that are relevant}}{\text{Total number of documents retrieved}}$$

From the example above (20 relevant items among the 50 retrieved):

precision = 20/50 = 0.4
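The two measures are simple ratios, as this short sketch shows with the numbers from the example (80 relevant documents in the database, 20 relevant and 30 non-relevant items retrieved). The function and parameter names are chosen for illustration.

```python
def recall(relevant_retrieved: int, relevant_total: int) -> float:
    """Share of all relevant documents that were retrieved."""
    return relevant_retrieved / relevant_total


def precision(relevant_retrieved: int, retrieved_total: int) -> float:
    """Share of retrieved documents that are relevant."""
    return relevant_retrieved / retrieved_total


r = recall(relevant_retrieved=20, relevant_total=80)              # 0.25
p = precision(relevant_retrieved=20, retrieved_total=20 + 30)     # 0.4
```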

Related Work

Robot, Spider, Crawler
  Performance issues
  URL path optimization
  Robot Exclusion
Indexing
  TF: Term Frequency
  IDF: Inverse Document Frequency

The term frequency (TF) is the number of times the word appears in a document divided by the number of total words in the document.

If a document contains 100 total words and the word cow appears 3 times, then the term frequency of the word cow in the document is 0.03 (3/100).

One way of calculating document frequency (DF) is to determine how many documents contain the word cow divided by the total number of documents in the collection.

So if cow appears in 1,000 documents out of a total of 10,000,000 then the document frequency is 0.0001 (1000/10,000,000).

Reference: tf–idf, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Tf%E2%80%93idf.

The final tf-idf score is then calculated by dividing the term frequency by the document frequency.

For our example, the tf-idf score for cow in the collection would be 300 (0.03/0.0001).

A common alternative is to multiply the term frequency by the logarithm of the inverse document frequency; the natural logarithm is commonly used. In this example we would have idf = ln(10,000,000/1,000) = 9.21, so tf-idf = 0.03 * 9.21 = 0.28.

Reference: tf–idf, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
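The cow example can be reproduced in a few lines. This sketch computes both variants described above, the plain ratio and the logarithmic form; the function and variable names are chosen for illustration.

```python
import math


def term_frequency(term_count: int, total_words: int) -> float:
    """Occurrences of the term divided by the words in the document."""
    return term_count / total_words


def document_frequency(docs_with_term: int, total_docs: int) -> float:
    """Documents containing the term divided by documents in the collection."""
    return docs_with_term / total_docs


tf = term_frequency(3, 100)                  # 0.03
df = document_frequency(1_000, 10_000_000)   # 0.0001
ratio_score = tf / df                        # about 300
idf = math.log(10_000_000 / 1_000)           # natural log, about 9.21
log_score = tf * idf                         # about 0.276
```

The logarithm keeps very rare terms from dominating the score, which is why the second form is the one more commonly used.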

Keyword Search
  Ranking
  Query Expansion
Clustering and Classification
Caching
Information Retrieval
Information Extraction

Discussion

Image Search
Voice Search
Video Search
Multimedia Search
…

About Google …

Google query-serving architecture

Reference: Web Search for a Planet: The Google Cluster Architecture

About Google …

Maps - http://maps.google.com/
Product - http://www.google.com/products
Blog - http://blogsearch.google.com/

News - http://news.google.com/

Images - http://images.google.com/

Desktop - http://desktop.google.com/

Code - http://www.google.com/codesearch
Catalogs - http://catalogs.google.com/
More, more, more … http://www.google.com/intl/en/options/index.html

Ajax: A New Approach to Web Applications
http://www.adaptivepath.com/publications/essays/archives/000385.php

Ajax

Web P2P Search Model

Reference: Search Engine-Crawler Symbiosis.

References

Search Engine Strategies 2000, http://www.jupiterevents.com/sew/sf00/index.html
Google Technology, http://www.google.com/technology/pigeonrank.html
Teoma, http://www.teoma.com/
WiseNut, http://www.wisenut.com/
Architectural design and evaluation of an efficient Web-crawling System, http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=CONTENTS&_method=citationSearch&_piikey=S0164121201000917&_version=1&md5=398c9045272cc2249d9323b1418af198
Searching the World Wide Web, http://www.neci.nec.com/~lawrence/papers.html
A Survey On Web Information Retrieval Technologies, http://citeseer.nj.nec.com/336617.html
ASPSeek, http://www.aspseek.org/
Wen-Syan Li, Divyakant Agrawal, "Supporting web query expansion efficiently using multi-granularity indexing and query processing," Data and Knowledge Engineering, Vol. 35 (3), pp. 239-257, 2000.
Web Search – Your Way, http://citeseer.nj.nec.com/glover00web.html
Web Search for a Planet: The Google Cluster Architecture, http://www.computer.org/micro/mi2003/m2022.pdf
The PageRank Citation Ranking: Bringing Order to the Web, http://citeseer.nj.nec.com/368196.html
Structural abstractions of hypertext documents for Web-based retrieval, http://citeseer.ist.psu.edu/140117.html
Crawling the Web, http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf
Effective Web Crawling, http://www.chato.cl/534/article-63160.html#h2_2