
Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
Association for Computational Linguistics, 2008

Presented by Kyung-Bin Lim, May 15, 2014


Outline

Introduction
Methodology
Discussion
Conclusion


Pairwise Similarity of Documents

PubMed: "More like this"
Similar blog posts
Google: "Similar pages"


Abstract Problem

Applications:
– Clustering
– "more-like-that" queries

[Figure: pairwise document similarity matrix; each cell holds a similarity score, e.g. 0.20, 0.54, 0.74]


Outline

Introduction
Methodology
Results
Conclusion


Trivial Solution

Load each vector O(N) times
O(N²) dot products

Goal: a scalable and efficient solution for large collections
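The trivial solution can be sketched as a nested loop over sparse document vectors (a minimal sketch with made-up toy weights, purely illustrative):

```python
# Brute-force pairwise similarity: every document vector is compared
# against every other, giving O(N^2) dot products, and each vector is
# loaded O(N) times. Toy data, not from the paper.

def dot(u, v):
    # Sparse dot product over the terms the two vectors share.
    return sum(w * v[t] for t, w in u.items() if t in v)

docs = {
    "d1": {"A": 2.0, "B": 1.0, "C": 1.0},
    "d2": {"B": 1.0, "D": 2.0},
    "d3": {"A": 1.0, "B": 2.0, "E": 1.0},
}

ids = sorted(docs)
scores = {}
for i, x in enumerate(ids):
    for y in ids[i + 1:]:          # each unordered pair exactly once
        scores[(x, y)] = dot(docs[x], docs[y])

print(scores)
# {('d1', 'd2'): 1.0, ('d1', 'd3'): 4.0, ('d2', 'd3'): 2.0}
```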


Better Solution

Load the weights for each term once
Each term contributes O(df_t²) partial scores
A term contributes only if it appears in both documents of a pair


Better Solution

A term contributes a partial score to each pair of documents that contains it.

The list of documents that contain a particular term is exactly its inverted-index postings list.

For example, if a term t1 appears in documents x, y, z, then t1 contributes to the pairs:

(x, y) (x, z) (y, z)
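A minimal illustration of how one postings list expands into document pairs, using the same x, y, z example:

```python
from itertools import combinations

# If term t1 appears in documents x, y, z, it contributes a partial
# score to every pair of documents drawn from its postings list.
postings = ["x", "y", "z"]
pairs = list(combinations(postings, 2))
print(pairs)  # [('x', 'y'), ('x', 'z'), ('y', 'z')]
```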


Algorithm


MapReduce Programming

A framework that supports distributed computing on clusters of computers
Introduced by Google in 2004
– Map step
– Reduce step
– Combine step (optional)
Applications
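The map/shuffle/reduce model can be simulated in a few lines; word count is the canonical example. This is a single-process sketch of the model, not Hadoop API code:

```python
from collections import defaultdict

# Single-process simulation of MapReduce: map emits (key, value)
# pairs, shuffle groups values by key, reduce aggregates each group.

def map_fn(doc):
    # Emit (word, 1) for every token in the document.
    for word in doc.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum all counts emitted for this word.
    return (key, sum(values))

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:                 # map phase
        for k, v in map_fn(item):
            groups[k].append(v)         # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce

counts = mapreduce(["A A B C", "B D D", "A B B E"], map_fn, reduce_fn)
print(counts)  # {'A': 3, 'B': 4, 'C': 1, 'D': 2, 'E': 1}
```

A combiner would apply `reduce_fn` to each mapper's local output before the shuffle, cutting the number of intermediate pairs.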


MapReduce Model


Computation Decomposition

map: process each term's postings list once (loading its weights once), emitting O(df_t²) partial scores per term; a term contributes only if it appears in both documents of a pair
reduce: sum the partial scores for each document pair


MapReduce Jobs

(1) Inverted Index Computation

(2) Pairwise Similarity


Job1: Inverted Index

Input documents:
d1: A A B C
d2: B D D
d3: A B B E

map emits (term, (doc, tf)) tuples:
d1 → (A,(d1,2)) (B,(d1,1)) (C,(d1,1))
d2 → (B,(d2,1)) (D,(d2,2))
d3 → (A,(d3,1)) (B,(d3,2)) (E,(d3,1))

shuffle and reduce group the tuples into one postings list per term:
(A, [(d1,2), (d3,1)])
(B, [(d1,1), (d2,1), (d3,2)])
(C, [(d1,1)])
(D, [(d2,2)])
(E, [(d3,1)])
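Job 1 can be sketched by simulating the map, shuffle, and reduce phases in one process on the three example documents:

```python
from collections import Counter, defaultdict

# Sketch of Job 1: each mapper counts term frequencies in one document
# and emits (term, (doc_id, tf)); the shuffle groups tuples by term;
# the reducer simply materializes each term's postings list.
docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}

groups = defaultdict(list)
for doc_id, text in docs.items():            # map
    for term, tf in Counter(text.split()).items():
        groups[term].append((doc_id, tf))    # shuffle: group by term

inverted_index = dict(groups)                # reduce (identity pass)
print(inverted_index["B"])  # [('d1', 1), ('d2', 1), ('d3', 2)]
```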


Job2: Pairwise Similarity

map takes each postings list and, for every pair of documents in it, emits the pair with the product of the two term weights:
(A, [(d1,2), (d3,1)]) → ((d1,d3), 2)
(B, [(d1,1), (d2,1), (d3,2)]) → ((d1,d2), 1) ((d1,d3), 2) ((d2,d3), 2)
(C), (D), (E) have single-document postings lists and emit nothing

shuffle groups the partial scores by document pair:
((d1,d2), [1])
((d1,d3), [2, 2])
((d2,d3), [2])

reduce sums them into the final similarity scores:
((d1,d2), 1)
((d1,d3), 4)
((d2,d3), 2)
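Job 2 can be sketched the same way, starting from the postings lists that Job 1 produced on the previous slide:

```python
from collections import defaultdict
from itertools import combinations

# Sketch of Job 2: each mapper takes one postings list and emits a
# partial score (the product of the two term weights) for every pair
# of documents sharing the term; the reducer sums partial scores.
inverted_index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

groups = defaultdict(list)
for term, postings in inverted_index.items():          # map
    for (x, wx), (y, wy) in combinations(postings, 2):
        groups[(x, y)].append(wx * wy)                 # shuffle by pair

similarities = {pair: sum(ws) for pair, ws in groups.items()}  # reduce
print(similarities)
# {('d1', 'd3'): 4, ('d1', 'd2'): 1, ('d2', 'd3'): 2}
```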


Implementation Issues

df-cut: drop the most common terms
– Intermediate tuples are dominated by very high-df terms
– Implemented a 99% cut
– Tradeoff: efficiency vs. effectiveness
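A df-cut can be sketched as a filter on the inverted index before Job 2. The exact threshold semantics here (keeping the lowest-df fraction of distinct terms) and the toy data are assumptions for illustration:

```python
# Illustrative df-cut: sort terms by document frequency and keep only
# the lowest-df fraction, dropping the very common terms that dominate
# the intermediate-pair count. Threshold semantics are an assumption.

def apply_df_cut(inverted_index, keep_fraction=0.99):
    terms = sorted(inverted_index, key=lambda t: len(inverted_index[t]))
    kept = terms[: int(len(terms) * keep_fraction)]
    return {t: inverted_index[t] for t in kept}

toy_index = {
    "the":        [("d1", 1), ("d2", 1), ("d3", 1), ("d4", 1)],  # df = 4
    "mapreduce":  [("d1", 1)],                                   # df = 1
    "similarity": [("d2", 1), ("d3", 1)],                        # df = 2
    "document":   [("d1", 1), ("d2", 1), ("d3", 1)],             # df = 3
}
trimmed = apply_df_cut(toy_index, keep_fraction=0.75)
print(sorted(trimmed))  # ['document', 'mapreduce', 'similarity']
```

Dropping "the" here removes 6 of the 10 intermediate pairs while losing almost no discriminative signal, which is the efficiency vs. effectiveness tradeoff the slide describes.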


Outline

Introduction
Methodology
Results
Conclusion


Experimental Setup

Hadoop 0.16.0
Cluster of 19 machines
– Each with two single-core processors
Aquaint-2 collection
– 2.5 GB of text
– 906k documents
Okapi BM25 term weights
Subsets of the collection


Running Time of Pairwise Similarity Comparisons

[Figure: computation time (minutes, 0–140) vs. corpus size (0–100%); fit with R² = 0.997]


Number of Intermediate Pairs

[Figure: intermediate pairs (billions, 0–9,000) vs. corpus size (0–100%), for df-cut at 99%, 99.9%, 99.99%, 99.999%, and no df-cut]


Outline

Introduction
Methodology
Results
Conclusion


Conclusion

Simple and efficient MapReduce solution
– About 2 hours for a ~1M-document collection
Effective linear-time-scaling approximation
– A 99.9% df-cut achieves 98% relative accuracy
– The df-cut controls the efficiency vs. effectiveness tradeoff