Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Getting Started Japanese Search and Calculate Similarity with Apache Lucene May 2016 Eiji Shinohara


Page 1: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Getting Started Japanese Search and Calculate Similarity with Apache Lucene

May 2016 Eiji Shinohara

Page 2: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Who am I?

Name: Eiji Shinohara / 篠原 英治 / @shinodogg

Role: AWS Solutions Architect, Subject Matter Expert
• Amazon CloudSearch
• Amazon Elasticsearch Service

Page 3: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Which Search Engine/Service do you use?

• Apache Solr

• Elasticsearch

• Amazon CloudSearch

• Amazon Elasticsearch Service

Page 4: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

On top of Apache Lucene

• Apache Solr

• Elasticsearch

• Amazon CloudSearch

• Amazon Elasticsearch Service

Page 5: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Have you used Apache Lucene?

• Apache Lucene is a free and open-source information retrieval software library, originally written in Java by Doug Cutting.

• It is supported by the Apache Software Foundation and is released under the Apache Software License.

https://en.wikipedia.org/wiki/Lucene

Page 6: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Doug Cutting – Hadoop/Nutch/Lucene

• Hadoop: MapReduce
  • The name my kid gave a stuffed yellow elephant.

• Nutch: Crawler
  • Nutch was the way my oldest son when he was two, I think it came from lunch.

• Lucene: Search
  • Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.

http://www.mwsoft.jp/programming/hadoop/where_come_from.html

Page 7: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Doug Cutting – Hadoop/Nutch/Lucene

• Hadoop: MapReduce
  • The name my kid gave a stuffed yellow elephant.

• Nutch: Crawler
  • Nutch was the way my oldest son when he was two, I think it came from lunch.

• Lucene: Search
  • Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.

http://www.mwsoft.jp/programming/hadoop/where_come_from.html

Maybe the most proper naming :)

Page 8: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Apache Lucene

• Full-text search
• Easy to use

http://www.lucenetutorial.com/lucene-in-5-minutes.html

Page 9: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Apache Lucene

• Full-text search
• Easy to use

1. Index
  • new Document → addDocument → commit

2. Query
  • Generate the query string

3. Search
  • Search and fetch the matching documents

4. Display
  • Get contents from the fetched documents to show

http://www.lucenetutorial.com/lucene-in-5-minutes.html
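The four steps above can be sketched without any library. The class below is a toy in-memory inverted index in plain Java; it is not Lucene's API, and the class name, method names, and sample titles are all hypothetical. It only mirrors the index → query → search → display flow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy in-memory index in plain Java; NOT Lucene's API, only the same four-step flow.
final class TinySearchDemo {
    private final List<String> docs = new ArrayList<>();                     // stored contents
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();  // term -> doc ids

    // 1. Index: split on non-letter/digit characters and record which documents contain each term.
    void addDocument(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // 2 + 3. Query and search: look up a single term and return the ids of matching documents.
    List<Integer> search(String queryTerm) {
        TreeSet<Integer> hits = postings.get(queryTerm.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // 4. Display: fetch the stored contents for a hit.
    String doc(int docId) {
        return docs.get(docId);
    }

    public static void main(String[] args) {
        TinySearchDemo index = new TinySearchDemo();
        index.addDocument("Lucene in Action");
        index.addDocument("Managing Gigabytes");
        index.addDocument("Lucene for Dummies");
        for (int id : index.search("lucene")) {
            System.out.println(id + ": " + index.doc(id));
        }
    }
}
```

Whitespace-style splitting like this would not work for Japanese text, which is why the later slides use JapaneseAnalyzer for tokenization.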

Page 10: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Evernote and LinkedIn are using Lucene

• With their own thin HTTP wrapper
• Presentations at Lucene Solr Revolution 2014

https://www.youtube.com/watch?v=drOmahIie6c
https://www.youtube.com/watch?v=8O7cF75intk

Page 11: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Build your own Search engine?

• Some companies are doing that

http://www.slideshare.net/lucidworks/galene-linkedins-search-architecture-presented-by-diego-buthay-sriram-sankar-linkedin/8

Page 12: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Iʼll join Lucene Solr Revolution 2016

Page 13: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Apache Lucene入門 (Introduction to Apache Lucene), in Japanese

http://rondhuit.com/lucene-for-bea-060710.pdf
http://www.amazon.co.jp/dp/4774127809

Page 14: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Lucene in Action

https://www.amazon.com/dp/1933988177

Page 15: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Uchida-sanʼs Blog in Japanese

http://mocobeta-backup.tumblr.com/post/54371099587/lucene-in-action

Page 16: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Uchida-san: Search Consultant at Rondhuit

Page 17: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Lucene in Action chap5: Term Vector (2) Calculate Document Similarity

http://mocobeta-backup.tumblr.com/post/49779999073/

Page 18: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Lucene in Action chap5: Term Vector (2) Calculate Document Similarity

• Just tried to run it on my local MacBook Air :)
• Created 2 classes
  • Indexer
    • Indexing some documents
  • CalculationSimilarityTester
    • Comparing 2 documents
    • Calculate cosine similarity
• Using Luke for browsing the index
  • https://github.com/DmitryKey/luke
  • Uchida-san is also a Luke committer

Page 19: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Lucene 6.0

• I had a Lucene 5.5 environment, but...
• Invalid directory at the location, check console for more information. Last exception:
• java.lang.IllegalArgumentException: Could not load codec 'Lucene60'. Did you forget to add lucene-backward-codecs.jar?

Page 20: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Lucene 6.0

• So I created a new Maven project
• pom.xml

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-kuromoji</artifactId>
  <version>6.0.0</version>
</dependency>

Page 21: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Indexer

public class Indexer {

    public static void main(String args[]) throws IOException {
        // Kuromoji-based analyzer for Japanese text
        Analyzer analyzer = new JapaneseAnalyzer();
        // ... (omitted) ...

        File[] files = new File("/Users/xxx/lucene_test/docs/").listFiles();
        for (File file : files) {
            Document doc = new Document();
            // ... (omitted) ...
            FieldType contentsType = new FieldType();
            contentsType.setStored(true);
            contentsType.setTokenized(true);
            contentsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            // store term vectors so per-document term frequencies can be read back later
            contentsType.setStoreTermVectors(true);
            // ... (omitted) ...
            doc.add(new Field("contents", sb.toString(), contentsType));
            writer.addDocument(doc);
        }
        writer.commit();
        writer.close();
    }
}

• Read file -> add Document -> commit

Page 22: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Indexer

• Files
  • Found examples on the internet :)
  • http://www.pahoo.org/e-soul/webtech/php06/php06-21-01.shtm

(Japanese sample documents: two slightly different definitions of PHP)

PHP: Hypertext Preprocessor(ピー・エイチ・ピー ハイパーテキスト プリプロセッサー)とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。

PHP(Hypertext Preprocessor;ピー・エイチ・ピー)とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。

Page 23: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Indexer

• Files
  • Found examples on the internet :)
  • http://www.fisproject.jp/2015/01/cosine_similarity/

A Cat sat on the mat.

Cats are sitting on the mat.

• Exactly the same

人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。

人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。

Page 24: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Indexer

• Run

Page 25: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Luke

• Index Browsing

Page 26: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Luke

• Index Browsing

Page 27: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Luke

• Index Browsing

$ mvn package
$ ./luke.sh

Page 28: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Luke

• Index Browsing

Page 29: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Luke

• Index Browsing

Page 30: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Calculate Document Similarity

• mocobeta/CalcCosineSimilarityTest.java
  • https://gist.github.com/mocobeta/5525864
  • Search documents from the index
  • TF-IDF from Term Vector

• TF-IDF
  • How important a word is to a document in a collection or corpus
  • TF: how frequently a term occurs in a document
  • IDF: a measure of the rareness of a term across the collection

• Get cosine similarity
  • The code returns the angle between the two vectors (arccos of the cosine), so a lower value means more similar
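As a minimal sketch of the TF-IDF weighting described above, the plain-Java class below computes TF as term count divided by document length and IDF as ln(N / df). This is one common variant and an assumption here; the exact formula used in the referenced gist is not shown on the slides. The class name, method names, and sample tokens are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java TF-IDF sketch; the weighting formula is an assumed common variant.
final class TfIdfDemo {

    // TF: how often the term occurs in the document, relative to the document length.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // IDF: ln(N / df); rarer terms (smaller document frequency) get a larger weight.
    static double idf(List<List<String>> corpus, String term) {
        long df = corpus.stream().filter(doc -> doc.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    // Document vector: term -> TF * IDF.
    static Map<String, Double> vector(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> vec = new HashMap<>();
        for (String term : doc) {
            vec.put(term, tf(doc, term) * idf(corpus, term));
        }
        return vec;
    }

    public static void main(String[] args) {
        // Hypothetical, already-tokenized documents.
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("cat", "sat", "mat"),
            Arrays.asList("cat", "sit", "mat"),
            Arrays.asList("dog", "bark"));
        System.out.println(vector(corpus.get(0), corpus));
    }
}
```

In the first document, "sat" occurs in only one of the three documents while "cat" occurs in two, so "sat" ends up with the larger weight.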

Page 31: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Calculate Document Similarity

public class CalcCosineSimilarityTester {

    public static void main(String args[]) throws IOException {
        // ... (omitted) ...
        // look up each document by its "path" field and build its TF-IDF vector
        TopDocs hits = searcher.search(new TermQuery(new Term("path", path1)), 1);
        int docId1 = hits.scoreDocs[0].doc;
        Map<String, Double> map1 = buildDocumentVector(docId1);

        hits = searcher.search(new TermQuery(new Term("path", path2)), 1);
        int docId2 = hits.scoreDocs[0].doc;
        Map<String, Double> map2 = buildDocumentVector(docId2);

        System.out.println(computeAngle(map1, map2));
    }

    // create HashMap (Key: keyword, Value: TF-IDF) for each document
    private Map<String, Double> buildDocumentVector(int docId) {
        // ... (omitted) ...
    }

    // calculate the angle between the two vectors (arccos of the cosine similarity)
    private double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
        // ... (omitted) ...
    }
}

Page 32: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Calculate Document Similarity

private Map<String, Double> buildDocumentVector(int docId) throws IOException {
    Terms vector = reader.getTermVector(docId, "contents");
    // ... (omitted) ...
    // get TF-IDF from Term Vector
    TermsEnum itr = vector.iterator();
    // ... (omitted) ...
    while ((ref = itr.next()) != null) {
        String term = ref.utf8ToString();
        TermFreq freq = new TermFreq(term, maxDoc);
        freq.setTc(itr.totalTermFreq());                         // term count in this document
        freq.setDf(reader.docFreq(new Term("contents", term)));  // document frequency in the index
        list.add(freq);
        tcSum += itr.totalTermFreq();
    }
    // Build HashMap Key: keyword, Value: TF-IDF
    Map<String, Double> docVector = new HashMap<String, Double>();
    for (TermFreq freq : list) {
        // ... (omitted) ...
    }
    return docVector;
}

Page 33: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Calculate Document Similarity

private double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
    double dotProduct = 0; // inner product
    for (String term : vec1.keySet()) {
        if (vec2.containsKey(term)) {
            dotProduct += vec1.get(term) * vec2.get(term);
        }
    }

    double denominator = getNorm(vec1) * getNorm(vec2);
    double ratio = dotProduct / denominator; // cosine value

    return Math.acos(ratio);
}

private double getNorm(Map<String, Double> vec) {
    double sumOfSquares = 0;
    for (Double val : vec.values()) {
        sumOfSquares += val * val;
    }
    return Math.sqrt(sumOfSquares);
}
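The angle computation above can be tried on its own with hand-made vectors. The sketch below is a self-contained variant, not the gist's code: the class name and sample weights are hypothetical, and it adds two guards that are assumptions on my part, clamping the cosine into [-1, 1] (rounding can push it slightly past 1, which would make Math.acos return NaN) and returning NaN for an all-zero vector.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of the angle between two sparse term-weight vectors.
final class AngleDemo {

    static double norm(Map<String, Double> vec) {
        double sumOfSquares = 0;
        for (double val : vec.values()) {
            sumOfSquares += val * val;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Returns the angle in radians: 0 for same direction, pi/2 for no shared terms.
    static double angle(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denominator = norm(a) * norm(b);
        if (denominator == 0) {
            return Double.NaN; // the angle is undefined for an all-zero vector
        }
        // Clamp so that rounding error cannot produce a cosine outside [-1, 1].
        double cosine = Math.max(-1.0, Math.min(1.0, dot / denominator));
        return Math.acos(cosine);
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = new HashMap<>();
        d1.put("cat", 0.4);
        d1.put("mat", 0.4);
        Map<String, Double> d2 = new HashMap<>();
        d2.put("dog", 0.7);
        System.out.println(angle(d1, d1)); // same vector: angle at or very near 0
        System.out.println(angle(d1, d2)); // no shared terms: angle of pi/2
    }
}
```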

Page 34: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Calculate Document Similarity

• Result
  • 0.5000430658877127

PHP: Hypertext Preprocessor(ピー・エイチ・ピー ハイパーテキスト プリプロセッサー)とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。

PHP(Hypertext Preprocessor;ピー・エイチ・ピー)とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。

Page 35: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Calculate Document Similarity

• Result
  • 1.2734113128621865

A Cat sat on the mat.

Cats are sitting on the mat.

Page 36: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Calculate Document Similarity

• Result
  • 0.0

人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。

人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。
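The three results are angles in radians, not cosine values. A small sketch (class and method names are hypothetical) converts the reported angles back to cosine similarity, where 1.0 means the same direction: the PHP pair comes out at roughly 0.88, the cat pair at roughly 0.29, and the identical pair at 1.0.

```java
// Converts the angles reported on the result slides back to cosine similarity.
final class AngleToCosine {

    static double cosineOf(double angleRadians) {
        return Math.cos(angleRadians);
    }

    public static void main(String[] args) {
        double[] reportedAngles = {0.5000430658877127, 1.2734113128621865, 0.0};
        for (double angle : reportedAngles) {
            System.out.printf("angle=%.4f rad (%.1f deg) -> cosine=%.4f%n",
                angle, Math.toDegrees(angle), cosineOf(angle));
        }
    }
}
```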

Page 37: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Lucene 6.0

• A bunch of changes...

Page 38: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Lucene 6.0

• N-best
  • LUCENE-6837: Add N-best output capability to JapaneseTokenizer

Page 39: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

N-best

• Contributed by Yahoo! Japan

http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest

Page 40: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

N-best

• Contributed by Yahoo! Japan

http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest

Page 41: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

N-best

• Contributed by Yahoo! Japan

Page 42: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Nihongo Muzukashii-ne… (Japanese is difficult, isn't it…)

• Need to analyze more or maintain dictionaries??

http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest

Page 43: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

Nihongo Muzukashii-ne… (Japanese is difficult, isn't it…)

• No hits for “一眼レフ” (single-lens reflex)?

http://blog.yoslab.com/entry/2014/09/12/005207

Page 44: Getting Started Japanese Search and Calculate Similarity with Apache Lucene

N-best

• Seems cool :)
• I'm going to try…

http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest