Tiny Search Engine: Design and Implementation
YAN Hongfei (闫宏飞), [email protected]
Network Group, Oct. 2003
(source: read.pudn.com/downloads145/doc/630837/TSE_tutorial.pdf, 38 slides)



Outline

• analysis – which deals with the design requirements and overall architecture of a system;
• design – which translates a system architecture into programming constructs (such as interfaces, classes, and method descriptions);
• programming – which implements these programming constructs.


Defining System Requirements and Capabilities

• Supports the capability to crawl pages with multiple threads
– Supports persistent HTTP connections
– Supports a DNS cache
– Supports IP blocking
– Supports the capability to filter unreachable sites
– Supports the capability to parse links
– Supports the capability to crawl recursively
• Supports Tianwang-format output
• Supports ISAM output
• Supports the capability to enumerate a page according to a URL
• Supports the capability to search a key word in the depot


Three main components of the Web

• HyperText Markup Language
– A language for specifying the contents and layout of pages
• Uniform Resource Locators
– Identify documents and other resources
• A client-server architecture with HTTP
– By which browsers and other clients fetch documents and other resources from web servers


HTML

<IMG SRC="http://www.cdk3.net/WebExample/Images/earth.jpg">
<P>Welcome to Earth! Visitors may also be interested in taking a look at the
<A HREF="http://www.cdk3.net/WebExample/moon.html">Moon</A>.</P>
(etcetera)

HTML text is stored in a file on a web server. A browser retrieves the contents of this file from the web server.
- The browser interprets the HTML text.
- The server can infer the content type from the filename extension.


URL

scheme: scheme-specific-location
e.g.:
mailto:[email protected]
ftp://ftp.downloadIt.com/software/aProg.exe
http://net.pku.cn/….

HTTP URLs are the most widely used. An HTTP URL has two main jobs to do:
- To identify which web server maintains the resource
- To identify which of the resources at that server is required


HTTP URLs

• http://servername[:port][/pathNameOnServer][?arguments]
• e.g.
http://www.cdk3.net/
http://www.w3c.org/Protocols/Activity.html
http://e.pku.cn/cgi-bin/allsearch?word=distributed+system

Server DNS name   Pathname on server        Arguments
www.cdk3.net      (default)                 (none)
www.w3c.org       Protocols/Activity.html   (none)
e.pku.cn          cgi-bin/allsearch         word=distributed+system


HTTP

• Defines the ways in which browsers and any other types of client interact with web servers (RFC 2616)
• Main features
– Request-reply interaction
– Content types. The strings that denote the type of content are called MIME types (RFC 2045, 2046)
– One resource per request (HTTP version 1.0)
– Simple access control


More features: services and dynamic pages

• Dynamic content
– Common Gateway Interface: a program that web servers run to generate content for their clients
• Downloaded code
– JavaScript
– Applets


What do we need?

• Intel x86/Linux (Red Hat Linux) platform
• C++
(photo: Linus Torvalds)


Get the homepage of the PKU site

[webg@BigPc ]$ telnet www.pku.cn 80           (connect to port 80 on the server)
Trying 162.105.129.12...                      (output from the telnet client)
Connected to rock.pku.cn (162.105.129.12).    (output from the telnet client)
Escape character is '^]'.                     (output from the telnet client)
GET /                                         (the only line we type)
<html>                                        (first line of the web server's output)
<head><title>北京大学</title>
......                                        (many lines of output omitted here)
</body></html>
Connection closed by foreign host.            (output from the telnet client)


Outline

• analysis – which deals with the design requirements and overall architecture of a system;
• design – which translates a system architecture into programming constructs (such as interfaces, classes, and method descriptions);
• programming – which implements these programming constructs.


Defining system objects

URL
– <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
– Apart from the scheme, the other parts need not all be present in a URL.
– scheme ":" ::= protocol name.
– "//" net_loc ::= network location / host name, login information.
– "/" path ::= URL path.
– ";" params ::= object parameters.
– "?" query ::= query information.

Page….
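The component grammar above can be exercised directly with the generic URL regular expression from RFC 2396, Appendix B. This is an illustrative sketch, not code from the TSE package: the SplitUrl name and UrlParts struct are hypothetical, and std::regex is assumed ("params" is left inside the path, as in the RFC's own grammar).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Components of a URL, named as on the slide above.
struct UrlParts {
    std::string scheme, net_loc, path, query, fragment;
};

// Split a URL with the RFC 2396 Appendix B regular expression.
bool SplitUrl(const std::string& url, UrlParts& out)
{
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (!std::regex_match(url, m, re))
        return false;
    out.scheme   = m[2].str();  // protocol name
    out.net_loc  = m[4].str();  // network location / host name
    out.path     = m[5].str();  // URL path (may still carry ";params")
    out.query    = m[7].str();  // query information
    out.fragment = m[9].str();
    return true;
}
```

For http://www.w3c.org/Protocols/Activity.html?word=x#top this yields scheme "http", net_loc "www.w3c.org", path "/Protocols/Activity.html", query "word=x" and fragment "top".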


Class CUrl

class CUrl
{
public:
    string m_sUrl;              // the URL string
    enum url_scheme m_eScheme;  // URL scheme (protocol name)
    string m_sHost;             // host string
    int m_nPort;                // port number

    /* URL components (URL-quoted). */
    string m_sPath, m_sParams, m_sQuery, m_sFragment;

    /* Extracted path info (unquoted). */
    string m_sDir, m_sFile;

    /* Username and password (unquoted). */
    string m_sUser, m_sPasswd;

public:
    CUrl();
    ~CUrl();
    bool ParseUrl( string strUrl );

private:
    void ParseScheme ( const char *url );
};


CUrl::CUrl()

CUrl::CUrl()
{
    this->m_sUrl = "";
    this->m_eScheme = SCHEME_INVALID;
    this->m_sHost = "";
    this->m_nPort = DEFAULT_HTTP_PORT;
    this->m_sPath = "";
    this->m_sParams = "";
    this->m_sQuery = "";
    this->m_sFragment = "";
    this->m_sDir = "";
    this->m_sFile = "";
    this->m_sUser = "";
    this->m_sPasswd = "";
}


CUrl::ParseUrl

bool CUrl::ParseUrl( string strUrl )
{
    string::size_type idx;

    this->ParseScheme( strUrl.c_str() );
    if( this->m_eScheme != SCHEME_HTTP )
        return false;

    // get host name: skip "http://" (7 characters),
    // then cut at the first '/' if there is one
    this->m_sHost = strUrl.substr(7);
    idx = m_sHost.find('/');
    if( idx != string::npos )
    {
        m_sHost = m_sHost.substr( 0, idx );
    }
    this->m_sUrl = strUrl;
    return true;
}


Defining system objects

URL
– <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
– Apart from the scheme, the other parts need not all be present in a URL.
– scheme ":" ::= protocol name.
– "//" net_loc ::= network location / host name, login information.
– "/" path ::= URL path.
– ";" params ::= object parameters.
– "?" query ::= query information.

Page….


Class Page

class CPage
{
public:
    string m_sUrl;
    string m_sLocation;
    string m_sHeader;           int m_nLenHeader;
    string m_sCharset;
    string m_sContentEncoding;
    string m_sContentType;
    string m_sContent;          int m_nLenContent;

    string m_sContentLinkInfo;
    string m_sLinkInfo4SE;      int m_nLenLinkInfo4SE;
    string m_sLinkInfo4History; int m_nLenLinkInfo4History;

    string m_sContentNoTags;
    int m_nRefLink4SENum;
    int m_nRefLink4HistoryNum;
    enum page_type m_eType;

    RefLink4SE m_RefLink4SE[MAX_URL_REFERENCES];
    RefLink4History m_RefLink4History[MAX_URL_REFERENCES/2];
    map<string,string,less<string> > m_mapLink4SE;
    vector<string> m_vecLink4History;


Class Page …continued

public:
    CPage();
    CPage(string strUrl, string strLocation, char* header, char* body, int nLenBody);
    ~CPage();
    int GetCharset();
    int GetContentEncoding();
    int GetContentType();
    int GetContentLinkInfo();
    int GetLinkInfo4SE();
    int GetLinkInfo4History();
    void FindRefLink4SE();
    void FindRefLink4History();

private:
    int NormallizeUrl(string& strUrl);
    bool IsFilterLink(string plink);
};


Sockets used for streams

Requesting a connection (client):

    s = socket(AF_INET, SOCK_STREAM, 0)
    connect(s, ServerAddress)
    write(s, "message", length)

Listening and accepting a connection (server):

    s = socket(AF_INET, SOCK_STREAM, 0)
    bind(s, ServerAddress)
    listen(s, 5)
    sNew = accept(s, ClientAddress)
    n = read(sNew, buffer, amount)

ServerAddress and ClientAddress are socket addresses
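A minimal runnable version of this call sequence, assuming Linux/POSIX sockets and exercising both sides over the loopback interface in a single process. This is an illustrative sketch, not code from the TSE package; a real crawler adds error handling and the non-blocking connect with timeout that a later slide mentions.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Run the slide's call sequence end to end on the loopback interface:
// server side does socket/bind/listen, client side does socket/connect/
// write, then the server does accept/read and we check the message.
bool SocketRoundTrip()
{
    // server: socket / bind to a kernel-chosen loopback port / listen
    int s = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                        // let the kernel pick a port
    bind(s, (sockaddr*)&addr, sizeof(addr));
    listen(s, 5);
    socklen_t len = sizeof(addr);
    getsockname(s, (sockaddr*)&addr, &len);   // learn the chosen port

    // client: socket / connect / write
    int c = socket(AF_INET, SOCK_STREAM, 0);
    connect(c, (sockaddr*)&addr, sizeof(addr));
    const char msg[] = "GET /";
    write(c, msg, sizeof(msg));

    // server: accept the queued connection and read the request
    int sNew = accept(s, nullptr, nullptr);
    char buffer[64] = {0};
    ssize_t n = read(sNew, buffer, sizeof(buffer));

    close(c); close(sNew); close(s);
    return n == (ssize_t)sizeof(msg) && strcmp(buffer, msg) == 0;
}
```

Because the handshake is completed by the kernel once listen() has been called, the client's connect() can precede the server's accept() even within one thread.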


Issues to consider when connecting to a server

DNS cache
– There are hundreds of millions of URLs, but only millions of hosts.

Is the site within the allowed crawl scope?
– Some sites do not want a crawler to take away their resources.
– Crawls targeted at specific content, e.g. campus-network search or news-site search.
– Charging rules such as: domestic sites reachable through CERNET incur no fee.

Is the site reachable?
– Use a non-blocking connect when connecting to the server.
– Give up once a timeout expires.
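The DNS-cache point can be sketched as a small look-through cache. The resolver is injected as a function here so the sketch stays self-contained and testable; the DnsCache name is hypothetical, and a real crawler would plug in gethostbyname() or getaddrinfo().

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Since millions of hosts are shared by hundreds of millions of URLs,
// resolve each host name once and cache the resulting IP address.
class DnsCache {
public:
    explicit DnsCache(std::function<std::string(const std::string&)> resolver)
        : m_resolve(resolver) {}

    std::string Lookup(const std::string& host) {
        auto it = m_cache.find(host);
        if (it != m_cache.end())
            return it->second;              // cache hit: no DNS traffic
        std::string ip = m_resolve(host);   // cache miss: real resolution
        m_cache[host] = ip;
        return ip;
    }

private:
    std::function<std::string(const std::string&)> m_resolve;
    std::map<std::string, std::string> m_cache;
};
```

Repeated lookups of the same host then cost one map search instead of one DNS round trip.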


Build the request message and send it to the server (1/3)

Implementation
– int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
– Based on int http_fetch(const char *url_tmp, char **fileBuf) from http://fetch.sourceforge.net
– Allocate memory, assemble the request message, send it


Get the header (2/3)

Implementation
– int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
– e.g.

HTTP/1.1 200 OK
Date: Tue, 16 Sep 2003 14:19:15 GMT
Server: Apache/2.0.40 (Red Hat Linux)
Last-Modified: Tue, 16 Sep 2003 13:18:19 GMT
ETag: "10f7a5-2c8e-375a5cc0"
Accept-Ranges: bytes
Content-Length: 11406
Connection: close
Content-Type: text/html; charset=GB2312
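Once the header arrives, the crawler needs the status code and a few properties out of it. A minimal sketch, assuming headers shaped like the example above; StatusCode and HeaderField are hypothetical helper names, not TSE functions.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// "HTTP/1.1 200 OK" -> 200 (the second token of the status line).
int StatusCode(const std::string& header)
{
    std::istringstream line(header);
    std::string version;
    int code = -1;
    line >> version >> code;
    return code;
}

// Return the value of a "Name: value" header line, or "" if absent.
std::string HeaderField(const std::string& header, const std::string& name)
{
    std::istringstream in(header);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type colon = line.find(':');
        if (colon != std::string::npos && line.substr(0, colon) == name) {
            std::string::size_type start =
                line.find_first_not_of(' ', colon + 1);
            if (start == std::string::npos)
                return "";
            return line.substr(start);
        }
    }
    return "";
}
```

From the example header this extracts status 200, Content-Length 11406 and the Content-Type with its charset.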


Get the body (3/3)

Implementation
– int HttpFetch(string strUrl, char **fileBuf, char **fileHeadBuf, char **location, int* nPSock)
– e.g.

<html>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>Computer Networks and Distributed System</title>
</head>
….


Multiple crawler processes working in parallel

LAN latency is 1-10 ms, with 10-1000 Mbps of bandwidth.

Internet latency is 100-500 ms, with 0.010-2 Mbps of bandwidth.

Run several machines on the same LAN, with multiple concurrent processes per machine:
– the LAN's high bandwidth and low latency let the nodes exchange data freely;
– concurrent processes reduce the side effect of the Internet's high latency.


How many nodes should crawl in parallel? How many robots should each node start? (1/2)

Theoretical estimate:
– The average plain-text page is about 13 KB.
– On 100 Mbps Fast Ethernet, assuming 100% line utilization, at most (1.0e+8 b/s) / (1500 B * 8 b/B) ≈ 8333 data frames can be in transit at once, i.e. about 8333 pages being transferred simultaneously.
– Assuming the LAN's connection to the Internet is 100 Mbps and Internet bandwidth utilization stays below 50% (above 80% load, network performance tends to degrade; routing also interferes), fewer than 4000 pages can be in transit at once on average.
– So in a crawling system of n nodes, each node should start fewer than 4000/n robots.


How many nodes should crawl in parallel? How many robots should each node start? (2/2)

Empirical values:
– Real distributed crawling nodes must also consider CPU and disk utilization: CPU usage should stay below 50% and disk usage below 80%; otherwise the machine responds slowly and the program cannot run properly.
– In the production Tianwang system the LAN is 100 Mbps Ethernet; assuming the LAN's Internet connection is also 100 Mbps (this figure is not available to us; it is our estimate), fewer than 1000 robots are started.
– This number of robots is sufficient for a search engine at the scale of hundreds of millions of pages (http://e.pku.cn/).


Single-node crawling efficiency

The physical characteristics of an Ethernet data frame require its length to be between 46 and 1500 bytes.

On a wide-area network with a round-trip time (RTT) of 200 ms and a server processing time (SPT) of 100 ms, a TCP transaction takes about 500 ms (2 RTT + SPT).

A page is sent as a series of frames, so sending one page takes at least (13 KB / 1500 B) * 500 ms ≈ 4 s. If a node runs 100 robot programs, it should collect (24 * 60 * 60 s / 4 s) * 100 = 2,160,000 pages per day.

Allowing for timeouts and dead pages encountered while the robots actually run, each node collects fewer than 2,160,000 pages per day.


Multi-threading in TSE

Multiple crawling threads concurrently take tasks from the queue of URLs waiting to be fetched.

Limit the number of concurrent crawlers per site:
– A machine providing WWW service can handle only a bounded number of pending TCP connections; pending connection requests are placed in a backlog queue.
– If many crawler processes work in parallel without such control, the effect on the crawled site inevitably resembles a denial-of-service attack.


How to avoid collecting pages twice?

Record information about unvisited and visited URLs
– ISAM storage: when a new URL is parsed out, look it up in WebData.idx; if it is already there, discard the URL.
– .md5.visitedurl
  E.g. 0007e11f6732fffee6ee156d892dd57e
– .unvisit.tmp
  E.g.
  http://dean.pku.edu.cn/zhaosheng/北京大学2001年各省理科录取分数线.files/filelist.xml
  http://mba.pku.edu.cn/Chinese/xinwenzhongxin/xwzx.htm
  http://mba.pku.edu.cn/paragraph.css
  http://www.pku.org.cn/xyh/oversea.htm
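The discard-if-seen logic can be sketched with a set of visited URLs. TSE keys the set on MD5 digests (as in .md5.visitedurl); to stay self-contained, this sketch stores the URL strings themselves, which expresses the same test without an MD5 implementation. The VisitedUrls name is hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>

// Fingerprint store for visited URLs: a newly parsed URL is kept only
// if it has not been seen before.
class VisitedUrls {
public:
    // Returns true if the URL was new (and is now recorded),
    // false if it was seen before and should be discarded.
    bool TryAdd(const std::string& url) {
        return m_visited.insert(url).second;
    }

private:
    std::set<std::string> m_visited;
};
```

Storing fixed-size MD5 digests instead of full URL strings keeps the memory per entry constant regardless of URL length, which matters at hundreds of millions of URLs.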


The domain-name vs. IP correspondence problem

Four cases exist:
– one-to-one, one-to-many, many-to-one, many-to-many. One-to-one cannot cause duplicate collection; the other three all can.
– The host may be a virtual host.
– It may be DNS round-robin.
– One site may have several domain names.


ISAM

Fetched pages are stored in ISAM-style files:
– a data file (WebData.dat)
– and an index file (WebData.idx)

The index file stores each page's starting offset and its URL.

Function prototype:
– int isamfile(char * buffer, int len);
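The two-file layout can be sketched in memory: one growing buffer stands in for WebData.dat, and an index mapping each URL to an (offset, length) pair stands in for WebData.idx. The IsamDepot name is hypothetical; the real code writes the two files to disk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// In-memory sketch of the ISAM depot: records appended back to back,
// with an index from URL to the record's offset and length.
class IsamDepot {
public:
    void Append(const std::string& url, const std::string& page) {
        m_idx[url] = { m_dat.size(), page.size() };  // offset + length
        m_dat += page;                               // records back to back
    }

    // FindUrl-style lookup: fetch a page back by its URL.
    bool Find(const std::string& url, std::string& out) const {
        auto it = m_idx.find(url);
        if (it == m_idx.end())
            return false;
        out = m_dat.substr(it->second.first, it->second.second);
        return true;
    }

private:
    std::string m_dat;                                       // WebData.dat
    std::map<std::string, std::pair<std::size_t, std::size_t>> m_idx;  // WebData.idx
};
```

Keeping only (offset, length) in the index means a lookup needs one index search plus one seek into the data file, regardless of depot size.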


Enumerate a page according to a URL

Using WebData.dat and WebData.idx, look up the given URL and display the first part of that page on the screen.

Function prototype:
– int FindUrl(char * url, char * buffer, int buffersize);


Search a key word in the depot

Scan WebData.dat for pages containing the given keyword and print the context around each match.

Function prototype:
– void FindKey(char *key);
– The function prints each matching URL and the content around the key; after each match it prompts the user to continue printing, quit, or display the whole page file.
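The context-printing part can be sketched as a function that returns a window of text around every occurrence of the key. FindKeyContexts is a hypothetical helper, not the TSE implementation (which also prompts the user interactively).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collect "radius" characters on each side of every occurrence of key,
// the way FindKey prints the content near each match.
std::vector<std::string> FindKeyContexts(const std::string& text,
                                         const std::string& key,
                                         std::size_t radius)
{
    std::vector<std::string> contexts;
    if (key.empty())
        return contexts;
    for (std::string::size_type pos = text.find(key);
         pos != std::string::npos;
         pos = text.find(key, pos + key.size())) {
        std::size_t from = pos > radius ? pos - radius : 0;   // clamp at start
        std::size_t to = std::min(text.size(), pos + key.size() + radius);
        contexts.push_back(text.substr(from, to - from));
    }
    return contexts;
}
```

The real function would stream records out of WebData.dat and apply this per page.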


Tianwang format output

A raw page depot consists of records; each record holds the raw data of one page. Records are stored sequentially, with no delimiter between records.
A record consists of a header (HEAD), data (DATA) and a line feed ('\n'), i.e. HEAD + blank line + DATA + '\n'.
A header consists of properties. Each property is one non-blank line; blank lines are forbidden in the header.
A property consists of a name and a value, delimited by ":".
The first property of the header must be the version property, e.g.: version: 1.0
The last property of the header must be the length property, e.g.: length: 1800
For simplicity, all property names are lowercase.
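A record obeying these rules can be assembled mechanically. A minimal sketch, assuming the length property counts the DATA bytes and writing only the two mandatory properties (a real record would carry its additional properties between version and length); TianwangRecord is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Build one Tianwang-format record: lowercase properties, version
// first, length last, then blank line + DATA + '\n'.
std::string TianwangRecord(const std::string& data)
{
    std::string head;
    head += "version: 1.0\n";
    head += "length: " + std::to_string(data.size()) + "\n";
    return head + "\n" + data + "\n";   // HEAD + blank line + DATA + '\n'
}
```

Because records carry their own length, a reader can walk the depot sequentially without any delimiter between records.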


Summary

• Supports the capability to crawl pages with multiple threads
– Supports persistent HTTP connections
– Supports a DNS cache
– Supports IP blocking
– Supports the capability to filter unreachable sites
– Supports the capability to parse links
– Supports the capability to crawl recursively
• Supports Tianwang-format output
• Supports ISAM output
• Supports the capability to enumerate a page according to a URL
• Supports the capability to search a key word in the depot


TSE package

http://net.cs.pku.edu.cn/~yhf/tse.031004-2033-HTTP1.1.Linux.tar.gz

tar xvfz tse.031004-2033-HTTP1.1.Linux.tar.gz
cd tse; make
nohup ./Tse -c seed.pku 100 &

To stop the crawling process:
– ps -ef
– kill ??? (the crawler's process ID)


Thank you for your attention!