Myanmar Search Engine

Nyi Lynn Seck, EC (MCPA)

Search Engine Evolution

● 1st generation (uses only “on-page” data)
– Text data, word frequency, language

● 2nd generation (uses off-page, web-specific data)
– Link (or connectivity) analysis
– Click-through data (what people click)
– Anchor text (how people refer to this page)

● 3rd generation (answers “the need behind the query”)
– Semantic analysis: what is this about?
– Focus on the user's need rather than on the query
– Context determination

Text Mining Research Area

● Information Retrieval (IR)
– Search Engines
– Classification
– Recommendation

● Information Extraction (IE)
– Screen scraping
– Product information (e.g. price) scraping

● Information Understanding
– Natural Language Processing (NLP)
– Question Answering
– Concept Extraction from Newsgroups
– Visualization
– Summarization

● Cross-Lingual Text Mining

● Trend Detection
– Outlier Detection

Classical Indexing

Indexing

– Keyword Indexing

– Subject Indexing (Classification)

– Collocate subjects
– Define & assign a code (call number) to each document

Tokenization

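Tokenization is the process of splitting a stream of text into words, phrases, symbols, or other meaningful units called tokens. (The same word is also used in data security for replacing sensitive data with surrogate identifiers; that is a different sense from the one used in search engines.)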

Assign unique ID to each word & keep in a lexicon

Remove Stop/Noise words before/after tokenization
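The three steps on this slide (tokenize, assign IDs in a lexicon, drop stop words) can be sketched roughly as follows. This is a minimal illustration for space-delimited English text, not the presenter's implementation; the names `tokenize`, `build_lexicon`, `encode`, and the tiny `STOP_WORDS` set are hypothetical. Myanmar text is not written with spaces between words, so it would need a segmentation step in place of the regular expression used here.

```python
import re

# Hypothetical, deliberately tiny stop-word sample for illustration only.
STOP_WORDS = {"a", "able", "about", "above", "after", "all"}


def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_lexicon(tokens, lexicon=None):
    """Assign the next free integer ID to every token not yet in the lexicon."""
    lexicon = {} if lexicon is None else lexicon
    for tok in tokens:
        lexicon.setdefault(tok, len(lexicon))
    return lexicon


def encode(text, lexicon):
    """Tokenize, remove stop words, and return the list of word IDs."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    build_lexicon(tokens, lexicon)
    return [lexicon[t] for t in tokens]


lexicon = {}
first = encode("All about a search engine", lexicon)   # [0, 1]
second = encode("Search engine index", lexicon)        # [0, 1, 2]
```

Here stop words are removed after tokenization; removing them before would require matching them in the raw text, which is harder to do reliably.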

Stemming, Lemmatization

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form.

Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or headword form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.

[Example: a Myanmar stem shown with several of its inflected forms; the Myanmar text is garbled in the source.]
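The difference between the two definitions can be illustrated with a toy English example. This is only a sketch under simplifying assumptions: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical and far smaller than what a real stemmer or lemmatizer would use, and nothing here is specific to Myanmar.

```python
# Hypothetical suffix list for a crude suffix-stripping stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical dictionary mapping inflected forms to their lemma.
LEMMAS = {
    "played": "play",
    "playing": "play",
    "plays": "play",
    "went": "go",
}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


print(stem("playing"), lemmatize("playing"))  # play play
print(stem("studies"))                        # studi (a stem need not be a dictionary word)
print(stem("went"), lemmatize("went"))        # went go
```

The stemmer works by rule and can produce non-words, while the lemmatizer depends on a dictionary and can handle irregular forms that no suffix rule would catch.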

Inverted Index

Formula & Algorithm?

The weight of a term that occurs in all documents
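The slide does not spell out the formula. Assuming it refers to the classic tf-idf weighting, w(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t, a term that occurs in all documents gets weight zero because log(N / N) = 0. The sketch below builds a small inverted index and computes that weight; the function names and the three sample documents are made up for illustration.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def tf_idf(index, term, doc_id, n_docs):
    """Classic tf-idf: term frequency times log(N / document frequency)."""
    postings = index.get(term, {})
    tf = postings.get(doc_id, 0)
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / len(postings))


docs = [
    "myanmar search engine",
    "myanmar word segmentation",
    "myanmar web crawler",
]
index = build_inverted_index(docs)

common = tf_idf(index, "myanmar", 0, len(docs))  # 0.0, term is in every document
rare = tf_idf(index, "search", 0, len(docs))     # log(3), term is in one document
```

Under this assumption, terms with zero or near-zero weight are natural stop-word candidates, since they do not help distinguish one document from another.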

Stop Words

a, able, about, above, abroad, according, accordingly, across, actually, adj, after, afterwards, again, against, ago, ahead, ain't, all, allow, allows, almost, alone

What stop words will be used in a Myanmar search engine?

NGram

[Figure: a Myanmar sentence split into syllable units, followed by its overlapping n-gram sequences and its 2 Gram, 3 Gram, and 4 Gram groupings; the Myanmar text is garbled in the source.]
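As a rough sketch of how overlapping n-grams are generated from a sequence of units, the helper below slides a window of size n over a list. The function name `ngrams` and the placeholder units `s1` to `s4` are hypothetical; in the setting of this slide the units would be Myanmar syllables produced by a separate syllable-splitting step.

```python
def ngrams(units, n):
    """Return all overlapping windows of n consecutive units as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


units = ["s1", "s2", "s3", "s4"]  # stand-ins for four syllables

bigrams = ngrams(units, 2)    # [('s1', 's2'), ('s2', 's3'), ('s3', 's4')]
trigrams = ngrams(units, 3)   # [('s1', 's2', 's3'), ('s2', 's3', 's4')]
fourgrams = ngrams(units, 4)  # [('s1', 's2', 's3', 's4')]
```

A sequence shorter than n simply yields an empty list.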

Myanmar Word Segmentation using Syllable level Longest Matching: Hla Hla Htay
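The general idea of longest matching over syllables can be sketched as a greedy loop: at each position, try the longest run of syllables that forms a lexicon entry, and fall back to a single syllable when nothing matches. This is a generic illustration of the technique named in the title, not the cited work's algorithm in detail; the function name, the `max_len` cap, and the romanized sample data are assumptions.

```python
def longest_match(syllables, lexicon, max_len=4):
    """Greedily group syllables into the longest words found in the lexicon."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking down to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Romanized stand-ins for syllables, used only to keep the example readable.
syllables = ["my", "an", "mar", "sa", "ka"]
lexicon = {"myanmar", "saka"}

print(longest_match(syllables, lexicon))  # ['myanmar', 'saka']
```

Greedy longest matching is simple and fast, but it can pick a wrong boundary when a long lexicon entry overlaps the start of the next word.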

Simple Myanmar Syllable Structure

Syllable components: Consonant (C), Medial (M), Vowel (V), Killer (K), Diacritic (D)

Syllable patterns:

C, C+M, C+M+V, C+M+V+K, C+M+V+K+D, C+M+V+D, C+M+K, C+M+K+D, C+M+D, C+V, C+V+K, C+V+K+D, C+V+D, C+K, C+K+D
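One common way to approximate syllable boundaries in Unicode-encoded Myanmar text is to start a new syllable at every consonant that is not itself "killed" (followed by the asat sign U+103A) and is not part of a stacked pair (adjacent to the virama U+1039). The sketch below follows that simplified rule. It is an assumption-laden illustration, not the method from the slides: it treats U+1000 to U+1021 as consonants, ignores independent vowels, digits, and punctuation, and does not handle non-Unicode encodings.

```python
import re

CONSONANT = "\u1000-\u1021"  # Myanmar letters KA .. A
ASAT = "\u103a"              # "killer" sign
VIRAMA = "\u1039"            # sign used for stacked consonants

# A syllable starts at a consonant that is not stacked under the previous
# letter and is not a final consonant carrying the killer sign.
SYLLABLE_START = re.compile(
    f"(?<!{VIRAMA})[{CONSONANT}](?![{ASAT}{VIRAMA}])"
)


def split_syllables(text):
    """Split text into chunks that each begin at a detected syllable start."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading non-syllable text
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends) if a != b]


word = "\u1019\u103c\u1014\u103a\u1019\u102c"  # the word "Myanmar" in Myanmar script
print(split_syllables(word))  # two chunks: C+M+(final C+K) and C+V
```

In this example the first chunk corresponds to a consonant, a medial, and a killed final consonant, and the second to a consonant plus a vowel sign.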

Language Specific Search Engine: Basic Architecture

[Figure: architecture diagram with the components WWW, Crawler, Language Identification, Language specific crawler, Page repository, Parser, Indexer, Corpus/Lexicon, Query engine, and Ranking engine; a query goes in and results come out.]

Pann Yu Mon, Management and Information System Engineering Department, Nagaoka University of Technology, Japan
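The slides do not say how the language identification component decides whether a page is in Myanmar. One simple heuristic, shown here purely as an assumption, is to measure the share of characters that fall in the Unicode Myanmar block (U+1000 to U+109F). The function names and the 0.5 threshold are hypothetical.

```python
def myanmar_ratio(text):
    """Fraction of non-whitespace characters in the Myanmar Unicode block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u1000" <= ch <= "\u109f")
    return hits / len(chars)


def is_myanmar_page(text, threshold=0.5):
    """Treat a page as Myanmar-script if enough of its characters qualify."""
    return myanmar_ratio(text) >= threshold


print(is_myanmar_page("\u1019\u103c\u1014\u103a\u1019\u102c"))  # True
print(is_myanmar_page("search engine"))                          # False
```

A script check like this cannot distinguish Burmese from other languages written in the same script, and it says nothing about which font encoding a page uses, so a real crawler would likely need further checks.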

Crawling Coverage

Crawling Parameters

Seed URLs: 35
Level of depth: 6
Crawling time: 2 weeks
CPU: 2.40 GHz
Memory: 1 GB
Connection: 100 Mbit per second
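How seed URLs and a depth limit interact can be sketched as a breadth-first traversal that stops expanding pages once they are `max_depth` links away from a seed. This is a generic illustration, not the crawler used in the study; `crawl` and `fetch_links` are hypothetical names, and the example uses an in-memory link graph instead of real HTTP requests.

```python
from collections import deque


def crawl(seeds, fetch_links, max_depth=6):
    """Breadth-first crawl from the seeds, following links up to max_depth."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # collect the page but do not follow its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for the web.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
pages = crawl(["a"], lambda url: graph.get(url, []), max_depth=2)
print(pages)  # ['a', 'b', 'c', 'd']
```

With a depth limit of 2, page `e` is never reached because it sits three links away from the seed.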

Domains        Number of Pages Collected
.mm                  3,555 [  1.1%]
.com               276,554 [ 83.2%]
Other gTLDs         52,245 [ 15.7%]
Total              332,354 [100.0%]

10th July 2008
