Myanmar Search Engine

Post on 27-Apr-2015

Nyi Lynn Seck, EC (MCPA)

Search Engine Evolution

● 1st generation (uses only “on-page” data)
– Text data, word frequency, language

● 2nd generation (uses off-page, web-specific data)
– Link (or connectivity) analysis
– Click-through data (what people click)
– Anchor text (how people refer to this page)

● 3rd generation (answers “the need behind the query”)
– Semantic analysis: what is this about?
– Focus on user need, rather than on query
– Context determination

Text Mining Research Area

● Information Retrieval (IR)
– Search Engines
– Classification
– Recommendation

● Information Extraction (IE)
– Screen scraping
– Product information (e.g. price) scraping

● Information Understanding
– Natural Language Processing (NLP)
– Question Answering
– Concept Extraction from Newsgroups
– Visualization
– Summarization

● Cross-Lingual Text Mining

● Trend Detection
– Outlier Detection

Classical Indexing

Indexing

– Keyword Indexing

– Subject Indexing (Classification)

– Collocate subjects
– Define & assign a code (call number) to each document

Tokenization

Tokenization is the process of breaking a stream of text into individual units (tokens), such as words or syllables, which become the basic units for indexing and search.

Assign a unique ID to each word and keep it in a lexicon

Remove stop (noise) words before or after tokenization
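
The three steps above can be sketched in a few lines. This is only an illustrative sketch, not the presenter's implementation: the tiny stop-word set and the helper names (`tokenize`, `build_lexicon`, `remove_stop_words`) are assumptions, and the regular expression only suits space-delimited Latin-script text.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "about", "above", "the", "is"}

def tokenize(text):
    """Split text into lowercase word tokens (Latin-script text only)."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def build_lexicon(tokens, lexicon=None):
    """Assign a unique integer ID to each distinct word, in first-seen order."""
    lexicon = {} if lexicon is None else lexicon
    for tok in tokens:
        if tok not in lexicon:
            lexicon[tok] = len(lexicon)
    return lexicon

tokens = remove_stop_words(tokenize("The crawler is above the index"))
lexicon = build_lexicon(tokens)
word_ids = [lexicon[t] for t in tokens]
```

Myanmar script does not reliably separate words with spaces, so a real Myanmar tokenizer would need a segmentation step in place of the regular expression used here.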

Stemming, Lemmatization

Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form.

Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or head word form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.

[Myanmar examples: two groups of inflected word forms, each shown with its common root. The Myanmar text was damaged in extraction.]
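
The difference between the two definitions can be illustrated with English words. The sketch below is a toy under stated assumptions: the suffix list and the lemma table are hypothetical, and real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Hypothetical suffix list and lemma dictionary, for illustration only.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {"went": "go", "better": "good", "mice": "mouse"}

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in a dictionary of irregular forms."""
    return LEMMAS.get(word, word)

stems = [naive_stem(w) for w in ("playing", "played", "plays", "went")]
lemmas = [lemmatize(w) for w in ("went", "mice", "play")]
```

Stemming works by rule and cannot relate "went" to "go"; the dictionary lookup can, but only for the forms it lists.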

Inverted Index
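
As a minimal sketch of the idea (not taken from the slides), an inverted index maps each term to the set of documents that contain it. The sample documents and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents that contain every query term (AND query)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample documents.
docs = {
    1: "myanmar search engine",
    2: "search engine evolution",
    3: "myanmar word segmentation",
}
index = build_inverted_index(docs)
hits = search(index, "myanmar search")
```

A query is answered by intersecting posting sets rather than by scanning every document.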

Formula & Algorithm?

The weight of a term that occurs in all documents
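
The slide does not name a weighting formula. Assuming it refers to the common TF-IDF scheme, where idf(t) = log(N / df(t)) for N documents and df(t) documents containing term t, a term that occurs in every document gets weight zero. The sketch below shows this with hypothetical sample documents.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df), or 0.0 if the term is unseen."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """Term frequency within one document, scaled by idf over the collection."""
    tf = doc.count(term) / len(doc)
    return tf * idf(term, docs)

# Hypothetical tokenized documents; "search" occurs in all of them.
docs = [
    ["myanmar", "search", "engine"],
    ["search", "engine", "evolution"],
    ["search", "index"],
]
weight_everywhere = tf_idf("search", docs[0], docs)
weight_rare = tf_idf("myanmar", docs[0], docs)
```

Under this assumption, such a term cannot help rank one document above another, which is one motivation for treating very common words as stop words.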

Stop Words (English)

a, able, about, above, abroad, according, accordingly, across, actually, adj, after, afterwards, again, against, ago, ahead, ain't, all, allow, allows, almost, alone

What stop words will be used in a Myanmar search engine?

NGram

[Slide example: a Myanmar sentence split into syllables (delimited by "|") and then grouped into 2-gram, 3-gram and 4-gram sequences. The Myanmar text was damaged in extraction.]
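
A minimal sketch of n-gram generation over an already segmented sequence follows. It produces overlapping n-grams, which is an assumption about what the slides show. The Latin letters stand in for Myanmar syllables, and the function name is hypothetical.

```python
def ngrams(units, n):
    """Return all overlapping n-grams of a list of units, joined as strings."""
    return ["".join(units[i:i + n]) for i in range(len(units) - n + 1)]

# Placeholder units standing in for Myanmar syllables.
syllables = ["A", "B", "C", "D"]
bigrams = ngrams(syllables, 2)
trigrams = ngrams(syllables, 3)
```

When `n` exceeds the number of units, the result is simply an empty list.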

Myanmar Word Segmentation using Syllable-level Longest Matching: Hla Hla Htay
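
The general idea of longest matching over syllables can be sketched as follows. This is a generic greedy matcher, not the cited author's algorithm: the romanized syllables, the two-word lexicon and the `max_len` limit are all assumptions for illustration.

```python
def segment_longest_match(syllables, lexicon, max_len=4):
    """Greedily group syllables into the longest words found in the lexicon.

    A single syllable is emitted as its own word when no longer match exists.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink one syllable at a time.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + size])
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical romanized syllables and lexicon entries.
lexicon = {"myanmar", "zaga"}
words = segment_longest_match(["myan", "mar", "x", "za", "ga"], lexicon)
```

Greedy matching is simple and fast, but its output depends entirely on the coverage of the lexicon.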

Simple Myanmar Syllable Structure

[Diagram: syllable patterns built from five character classes]

C = Consonant, M = Medial, V = Vowel, K = Killer, D = Diacritic

Syllable patterns:

C
C+M
C+M+V
C+M+V+K
C+M+V+K+D
C+M+V+D
C+M+K
C+M+K+D
C+M+D
C+V
C+V+K
C+V+K+D
C+V+D
C+K
C+K+D
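
The fifteen patterns listed on this slide can be stored as a lookup table and used to check a sequence of character classes. The sketch below works on class labels only; mapping actual Myanmar code points to the classes C, M, V, K and D is left out, and the function name is an assumption.

```python
# The fifteen class patterns from the slide, written without the "+" signs.
SLIDE_PATTERNS = {
    "C", "CM", "CMV", "CMVK", "CMVKD", "CMVD", "CMK", "CMKD", "CMD",
    "CV", "CVK", "CVKD", "CVD", "CK", "CKD",
}

def is_listed_pattern(classes):
    """Return True if the class sequence matches one of the slide's patterns."""
    return "".join(classes) in SLIDE_PATTERNS

checks = {
    "CMVK": is_listed_pattern(["C", "M", "V", "K"]),
    "CD": is_listed_pattern(["C", "D"]),
    "MV": is_listed_pattern(["M", "V"]),
}
```

Every listed pattern starts with a consonant, and "C+D" is not among them, so a lookup table follows the slide more closely than a pattern that simply makes each class after C optional.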

Language Specific Search Engine: Basic Architecture

[Architecture diagram. Labels shown: WWW, Crawler, Language Identification, Language specific crawler, Page repository, Parser, Indexer, Corpus/Lexicon, Query engine, Ranking engine, query, results.]

Pann Yu Mon, Management and Information System Engineering Department, Nagaoka University of Technology, Japan
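
The slides do not say how the language identification component works. One simple possibility, shown here purely as an assumption, is to measure the share of characters that fall in the Unicode Myanmar block (U+1000 to U+109F). The threshold value and the function names are hypothetical.

```python
MYANMAR_START, MYANMAR_END = 0x1000, 0x109F  # Unicode Myanmar block

def myanmar_ratio(text):
    """Fraction of non-whitespace characters that lie in the Myanmar block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if MYANMAR_START <= ord(ch) <= MYANMAR_END)
    return hits / len(chars)

def is_myanmar_page(text, threshold=0.5):
    """Classify a page as Myanmar if enough of its characters are Myanmar."""
    return myanmar_ratio(text) >= threshold

# "\u1019\u103c\u1014\u103a\u1019\u102c" is the word "Myanmar" in Myanmar script.
myanmar_word = "\u1019\u103c\u1014\u103a\u1019\u102c"
mixed = myanmar_word + " abcdef"
```

A language specific crawler could use a check of this kind to decide which fetched pages to keep in the page repository.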

Crawling Coverage

Crawling Parameters

Seed URLs: 35
Level of depth: 6
Crawling time: 2 weeks
CPU: 2.40 GHz
Memory: 1 GB
Connection: 100 Mbit per second
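
Assuming "level of depth" means the maximum number of links followed from a seed URL, a depth-limited breadth-first crawl can be sketched as below. The link graph is an in-memory stand-in for real page fetching, and `crawl`, `get_links` and `LINKS` are hypothetical names.

```python
from collections import deque

def crawl(seeds, get_links, max_depth=6):
    """Visit pages breadth-first, following links at most max_depth hops from a seed."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited

# Hypothetical link graph used in place of real HTTP requests.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
pages = crawl(["a"], lambda url: LINKS.get(url, []), max_depth=2)
```

With `max_depth=2`, page "e" is three hops from the seed and is never fetched.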

Domains        Number of Pages Collected
.mm            3,555 [1.1%]
.com           276,554 [83.2%]
Other gTLDs    52,245 [15.7%]
Total          332,354 [100.0%]

10th July 2008