SpotSigs : Robust and Efficient Near Duplicate Detection in Large Web Collections

Post on 23-Feb-2016


Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Presenter: Tsai Tzung Ruei

Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke

SIGIR 2008

National Yunlin University of Science and Technology (國立雲林科技大學)

Outline

Motivation, Objective, Methodology, Experiments, Conclusion, Comments

Motivation

Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.

Objective

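Even when exact duplicates are avoided during the collection of Web archives, near duplicates frequently slip into the corpus. The objective is to detect these near duplicates robustly and efficiently in large Web collections.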

Methodology

SPOT SIGNATURE: EXTRACTION, MATCHING

[Figure: overview diagram with labels Web, Database, document]

Methodology

SPOT SIGNATURE EXTRACTION

A = {a_j(d_j, c_j)}: the set of antecedents a_j, each with a spot distance d_j and a chain length c_j

Example

Antecedents: a(1,2), an(1,2), the(1,2), and is(1,2)

“At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.”

Result

S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
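The extraction example above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the stopword list, the tokenizer, the function name `spot_signatures`, and the rule of dropping incomplete chains are not given on the slide, and the spot distance d_j is read here as "take every d_j-th non-stopword token after the antecedent".

```python
import re

# Assumed stopword list for illustration only; the slides do not give one.
STOPWORDS = {"a", "an", "the", "is", "to", "that", "at", "off", "for",
             "from", "on", "into", "of", "and", "against"}

def spot_signatures(text, antecedents, stopwords=STOPWORDS):
    """Return the spot signatures of text, in order of occurrence.

    antecedents maps each antecedent word a_j to its (d_j, c_j) pair:
    spot distance and chain length.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        distance, chain_length = antecedents[token]
        # Non-stopword tokens that follow this antecedent occurrence.
        following = [t for t in tokens[i + 1:] if t not in stopwords]
        # Take every distance-th of them, up to chain_length words.
        chain = following[distance - 1::distance][:chain_length]
        if len(chain) == chain_length:  # assumption: drop incomplete chains
            signatures.append(":".join([token] + chain))
    return signatures

text = ("At a rally to kick off a weeklong campaign for the South Carolina "
        "primary, Obama tried to set the record straight from an attack "
        "circulating widely on the Internet that is designed to play into "
        "prejudices against Muslims and fears of terrorism.")
antecedents = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}
print(spot_signatures(text, antecedents))
```

With these assumptions the sketch yields the same seven signatures as the slide, because stopwords such as "to", "that", and "is" are skipped when the chains are built.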

Methodology

SPOT SIGNATURE MATCHING

Jaccard Similarity for Sets

Generalization for Multi-Sets
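The formulas for this slide did not survive in the transcript. As a hedged sketch, the standard definitions are: for sets, sim(A, B) = |A ∩ B| / |A ∪ B|; for multi-sets, the sum over signatures of the minimum frequency divided by the sum of the maximum frequency. The function names and the sample data below are illustrative, not taken from the slides.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def multiset_jaccard(a, b):
    """Multi-set Jaccard: sum of min frequencies / sum of max frequencies.

    For Counters with positive counts, & keeps the minimum count per key
    and | keeps the maximum count per key.
    """
    denominator = sum((a | b).values())
    return sum((a & b).values()) / denominator if denominator else 0.0

# Illustrative data only.
print(jaccard({"x", "y", "z"}, {"y", "z", "w"}))
print(multiset_jaccard(Counter({"x": 2, "y": 1}),
                       Counter({"x": 1, "y": 1, "z": 1})))
```

Both calls give 0.5: the sets share 2 of 4 distinct elements, and the multi-sets have min-sum 2 and max-sum 4.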

Methodology

SPOT SIGNATURE MATCHING

[Figure: matching pipeline, with labels: spot signature; partition (three times); inverted index pruning; Jaccard similarity for sets]

Methodology

Optimal Partitioning
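The slide gives only the title, so the following is a sketch of one simple length-based partitioning, not necessarily the optimal scheme of the paper. It relies on the bound that Jaccard similarity can never exceed the ratio of the smaller to the larger signature count: two documents whose lengths differ by more than a factor of 1/τ cannot reach the threshold τ. The function names and the boundary rule are assumptions made for illustration.

```python
from bisect import bisect_right

def partition_boundaries(tau, max_length):
    """Lower bounds b_0 < b_1 < ... of consecutive length partitions.

    Partition k covers lengths b_k .. floor(b_k / tau), so any pair with
    length ratio >= tau lies in the same or in adjacent partitions.
    """
    bounds = [1]
    while bounds[-1] <= max_length:
        bounds.append(int(bounds[-1] / tau) + 1)
    return bounds

def partition_of(bounds, length):
    """Index of the partition whose length range contains length."""
    return bisect_right(bounds, length) - 1

bounds = partition_boundaries(tau=0.8, max_length=20)
print(bounds)
print([partition_of(bounds, n) for n in (12, 13, 14)])
```

With τ = 0.8, lengths 12 and 13 land in the same partition and length 14 in the next one, so a document only needs to be compared against its own partition and the neighboring one.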

Methodology

Inverted Index Pruning

Example

d1 = {s1:5, s2:4, s3:4}, with |d1| = 13
d2 = {s1:8, s2:4}, |d2| = 12
d3 = {s1:4, s2:5, s3:5}, |d3| = 14
τ = 0.8
δ1 = 0
δ2 = |d1| − |d3| = −1
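The example documents can be run through a basic inverted-index matcher. This is a simplified sketch: it generates candidates from shared signatures, applies a length-ratio filter, and then computes the exact multi-set Jaccard similarity. It does not implement the δ1/δ2 pruning bounds named on the slide, whose update rules are not spelled out in this transcript. All function names are illustrative.

```python
from collections import Counter, defaultdict

def multiset_jaccard(a, b):
    """Sum of min frequencies divided by sum of max frequencies."""
    denominator = sum((a | b).values())
    return sum((a & b).values()) / denominator if denominator else 0.0

def near_duplicates(docs, tau):
    """Return sorted (doc, other, similarity) triples with similarity >= tau."""
    # Inverted index: signature -> ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, signatures in docs.items():
        for signature in signatures:
            index[signature].add(doc_id)
    length = {doc_id: sum(sigs.values()) for doc_id, sigs in docs.items()}

    results = []
    for doc_id, signatures in docs.items():
        candidates = set()
        for signature in signatures:
            candidates |= index[signature]
        for other in candidates:
            if other <= doc_id:  # report each unordered pair once
                continue
            shorter, longer = sorted((length[doc_id], length[other]))
            if shorter / longer < tau:  # length bound: cannot reach tau
                continue
            similarity = multiset_jaccard(signatures, docs[other])
            if similarity >= tau:
                results.append((doc_id, other, similarity))
    return sorted(results)

docs = {
    "d1": Counter({"s1": 5, "s2": 4, "s3": 4}),
    "d2": Counter({"s1": 8, "s2": 4}),
    "d3": Counter({"s1": 4, "s2": 5, "s3": 5}),
}
print(near_duplicates(docs, tau=0.8))
```

For τ = 0.8 only the pair (d1, d3) qualifies: the minimum frequencies sum to 12 and the maximum frequencies to 15, giving 12/15 = 0.8. The pairs (d1, d2) and (d2, d3) score 9/16 and 8/18.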


Experiments

Gold Set of Near Duplicate News Articles: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing

TREC WT10g: SpotSigs vs. Hashing

Experiments

Gold Set of Near Duplicate News Articles

SpotSigs vs. Shingling

Choice of Spot Signatures

SpotSigs vs. Hashing

Experiments

TREC WT10g SpotSigs vs. Hashing

Conclusion

MAJOR CONTRIBUTION: SpotSigs provides both increased robustness of signatures and highly efficient deduplication compared to various state-of-the-art approaches.

FUTURE WORK: Future work will focus on efficient access to disk-based index structures, as well as generalizing the bounding approach toward other metrics such as Cosine.

Comments

Advantage: The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient.

Drawback: …..

Application: Information retrieval
