
Web People Search via Connection Analysis

Authors: Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, and Rabia Nuray-Turan
From: IEEE Trans. on Knowledge and Data Engineering, 2008
Presenter: 陳仲詠
Citations: 21 (Google Scholar)

Outline

1. Introduction
2. Overview of the approach
3. Generating a graph representation
4. Disambiguation algorithm
5. Interpreting clustering results
6. Related work
7. Experimental results
8. Conclusions and future work

Introduction (1/7)

• Searching for web pages related to a person accounts for more than 5 percent of the current Web searches [24].

• A search for a person, say “Andrew McCallum”, will return pages relevant to any person with the name Andrew McCallum.

[24] R. Guha and A. Garg, Disambiguating People in Search. Stanford Univ., 2004.


Introduction (2/7)

• Assume (for now) that for each such web page, the search engine could determine which real entity (i.e., which Andrew McCallum) the page refers to.

• This would enable clustered person search: the returned results are grouped into clusters, with each cluster associated with one real person.


Introduction (3/7)

• The user can home in on the cluster of interest and get all pages in that cluster.

• For example, only the pages associated with that particular Andrew McCallum.


Introduction (4/7)

• In reality, it is not obvious that this is indeed a better option than searching for people with ordinary keyword-based search.

• If each cluster identified by the search engine corresponded to a single person, then the cluster-based approach would be a good choice.


Introduction (5/7)

• The key issue is the quality of the clustering algorithm in disambiguating the web pages of different namesakes.


Introduction (6/7)

• 1. Develop a novel algorithm for disambiguating among people who share the same name.

• 2. Design a cluster-based people search approach based on the disambiguation algorithm.


Introduction (7/7)

• The main contributions of this paper are the following:

• A new approach for Web People Search that shows high-quality clustering.

• A thorough empirical evaluation of the proposed solution (Section 7), and

• A new study of the impact on search of the proposed approach (Section 7.3).


Overview of the approach (1/4)

• The processing of a user query consists of the following steps:

• 1. User input: a user submits a query.
• 2. Web page retrieval: the engine retrieves a fixed number (top-K) of relevant web pages.

Overview of the approach (2/4)

• 3. Preprocessing:
  – TF/IDF: noun-phrase identification.
  – Extraction: named entities (NEs) and Web-related information.
• 4. Graph creation: the entity-relationship (ER) graph is generated from the extracted data.

Overview of the approach (3/4)

• 5. Clustering: the result is a set of clusters of these pages, the aim being to group web pages by their association with real persons.

Overview of the approach (4/4)

• 6. Cluster processing:
  – Sketches: a set of keywords that represents the web pages within a cluster.
  – Cluster ranking.
  – Web page ranking.
• 7. Visualization of results.

Generating a graph representation (1/6)

• The following are extracted from the web pages: 1) entities, 2) relationships, 3) hyperlinks, and 4) e-mail addresses.

Generating a graph representation (2/6)

• For example, a person “John Smith” might be extracted from two different web pages.

[Figure: Doc1 and Doc2 both link to a single extracted “John Smith” node.]

• A single node is created regardless of whether the two pages refer to the same person or to two different people.

Generating a graph representation (3/6)

[Figure slides: step-by-step example of constructing the entity-relationship graph from the extracted entities.]

Generating a graph representation (4/6)

• The relationship edges are typed (a construction sketch follows below).

• Any hyperlinks and e-mail addresses extracted from the web page are handled in an analogous fashion.
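To make graph creation concrete, here is a minimal sketch, assuming networkx; the extraction output format, node naming, and edge-type labels are hypothetical illustrations, not the paper's:

```python
# Minimal ER-graph construction sketch (assumed format, not the paper's code).
import networkx as nx

def build_er_graph(pages):
    """pages: list of (page_id, extracted) pairs, where `extracted`
    is a list of (entity_name, entity_type) tuples."""
    G = nx.Graph()
    for page_id, extracted in pages:
        G.add_node(page_id, kind="page")
        for name, etype in extracted:
            # One node per distinct entity name, shared across pages,
            # regardless of whether the pages mean the same person.
            G.add_node(name, kind=etype)
            # A typed relationship edge between the page and the entity.
            G.add_edge(page_id, name, rel=f"mentions-{etype}")
    return G

G = build_er_graph([
    ("doc1", [("John Smith", "person"), ("UC Irvine", "org")]),
    ("doc2", [("John Smith", "person")]),
])
```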

Generating a graph representation (5/6)

• A hyperlink has the form dn. … .d2.d1/p1/p2/…

• For example, for the URL www.cs.umass.edu/~mccallum/ we have d3 = cs, d2 = umass, d1 = edu, and p1 = ~mccallum.
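A small sketch of this decomposition using Python's standard urllib; dropping the "www" prefix is an assumption:

```python
# Decompose a hyperlink into domain components d1, d2, ... (read right
# to left) and path components p1, p2, ...
from urllib.parse import urlparse

def decompose(url):
    parsed = urlparse(url if "//" in url else "//" + url)
    hosts = [h for h in parsed.netloc.split(".") if h != "www"]
    ds = list(reversed(hosts))                 # d1 = TLD, d2 = domain, ...
    ps = [p for p in parsed.path.split("/") if p]
    return ds, ps

ds, ps = decompose("www.cs.umass.edu/~mccallum/")
# ds == ['edu', 'umass', 'cs'], ps == ['~mccallum']
```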

Generating a graph representation (6/6)

[Figure slide: an example of the resulting entity-relationship graph.]

Disambiguation algorithm

• 1. Input: the entity-relationship graph.
• 2. A Correlation Clustering (CC) algorithm is used to cluster the pages.
• 3. The outcome is a set of clusters, each corresponding to one person.

Disambiguation algorithm: Correlation Clustering (1/3)

• CC has been applied in the past to group documents of the same topic and to other problems.

• It assumes that there is a similarity function s(u, v) learned on the past data.

• Each (u, v) edge is assigned a “+” (similar) or “-” (different) label, according to the similarity function s(u, v).


Disambiguation algorithm: Correlation Clustering (2/3)

• The goal is to find the partition of the graph into clusters that agrees the most with the assigned labels.

• CC does not take k (the number of resulting clusters) as an input parameter.

Disambiguation algorithm: Correlation Clustering (3/3)

• The goal of CC is formulated formally as:
  – maximize the agreement, or
  – minimize the disagreement with the edge labels.

• The CC problem is known to be NP-hard (an approximation is sketched below).
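Because exact CC is NP-hard, practical systems rely on approximations. For illustration only (this is not the paper's algorithm), the well-known randomized-pivot approximation can be sketched as:

```python
import random

def cc_pivot(nodes, positive):
    """Randomized-pivot correlation clustering. `positive` holds
    frozenset({u, v}) for every '+'-labeled edge; all other pairs
    are treated as '-'."""
    nodes = list(nodes)
    random.shuffle(nodes)
    clusters, used = [], set()
    for pivot in nodes:
        if pivot in used:
            continue
        # The pivot grabs every unused node it shares a '+' edge with.
        cluster = [pivot] + [v for v in nodes
                             if v not in used and v != pivot
                             and frozenset((pivot, v)) in positive]
        used.update(cluster)
        clusters.append(cluster)
    return clusters

print(cc_pivot("abcd", {frozenset("ab"), frozenset("bc")}))
```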

Disambiguation algorithm: Connection Strength (1/6)

• Use the notion of the Connection Strength c(u, v) between two objects u and v to define the similarity function s(u, v).

• The disambiguation algorithm is based on analyzing:
  – object features, and
  – the ER graph for the data set.

Disambiguation algorithm: Connection Strength (2/6)

• A path between u and v semantically captures interactions between them via intermediate entities.

• If the combined attraction of all these paths is sufficiently large, the objects are likely to be the same.


Disambiguation algorithm: Connection Strength (3/6)

• Analyzing paths: the assumption is that each path between two objects carries a certain degree of attraction.

Disambiguation algorithm: Connection Strength (4/6)

• The attraction between two nodes u and v via paths is measured using the connection strength measure c(u, v).

• It is defined as the sum of the attractions contributed by the individual paths:

    c(u, v) = Σ_{p ∈ Puv} w_p

Disambiguation algorithm: Connection Strength (5/6)

• Puv denotes the set of all L-short simple paths between u and v.
  – A path is L-short if its length does not exceed L, and simple if it does not contain duplicate nodes.

• wp denotes the weight contributed by path p.
  – The weight a path p contributes is derived from the type of that path.

Disambiguation algorithm: Connection Strength (6/6)

• Let Puv consist of c1 paths of type 1, c2 paths of type 2, …, and cn paths of type n. Then:

    c(u, v) = c1 w1 + c2 w2 + … + cn wn
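A sketch of computing c(u, v) by enumerating L-short simple paths, assuming a networkx graph whose edges carry a `rel` type (as in the construction sketch above); approximating a path's type by its sequence of edge types is a simplification, not the paper's exact typing:

```python
import networkx as nx

def connection_strength(G, u, v, L, type_weight):
    """c(u, v): sum of w_p over all L-short simple u-v paths.
    `type_weight` maps a path type (here, the tuple of edge types
    along the path) to its learned weight w."""
    c = 0.0
    for path in nx.all_simple_paths(G, u, v, cutoff=L):
        ptype = tuple(G[a][b]["rel"] for a, b in zip(path, path[1:]))
        c += type_weight.get(ptype, 0.0)
    return c

# With the toy graph built earlier (doc1 - John Smith - doc2):
# connection_strength(G, "doc1", "doc2", L=2,
#     type_weight={("mentions-person", "mentions-person"): 0.5})
```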

Disambiguation algorithm: Similarity Function (1/4)

• The goal is to design a powerful similarity function s(u, v) that would minimize mislabeling of the data.

• Design a flexible function s(u, v) that can automatically self-tune to the particular domain being processed.

Disambiguation algorithm: Similarity Function (2/4)

• The similarity function s(u, v) labels data by comparing the s(u, v) value against a threshold γ.

• The δ-band (“clear margin”) approach is used to label the edge (u, v) (see the sketch below).

• This avoids committing to a “+” or “-” decision when there is not enough evidence for it.
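A minimal sketch of this δ-band labeling rule:

```python
def label_edge(s_uv, gamma, delta):
    """Compare s(u, v) with the threshold gamma; abstain inside the
    'clear margin' band [gamma - delta, gamma + delta]."""
    if s_uv >= gamma + delta:
        return "+"
    if s_uv <= gamma - delta:
        return "-"
    return None   # not enough evidence either way
```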

Disambiguation algorithm: Similarity Function (3/4)

• The standard TF/IDF scheme is employed to compute the feature-based similarity f(u, v), using:
  – noun phrases, and
  – larger terms.

• The entire document corpus consists of K documents and N distinct terms T = {t1, t2, …, tN}.

Disambiguation algorithm: Similarity Function (4/4)

• Each document u is represented as an N-dimensional vector u = (wu1, wu2, …, wuN).

• wui is the TF/IDF weight of term ti in document u; f(u, v) is then computed as the similarity (e.g., cosine) of the two vectors.
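A self-contained sketch of one standard TF/IDF-plus-cosine variant (the paper's exact weighting scheme may differ):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, with w = tf * log(K / df) -- one common variant."""
    K = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(K / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """f(u, v) as the cosine of two sparse TF/IDF vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```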

Disambiguation algorithm: Training the Similarity Function (1/2)

• For each training edge (u, v), require that (5):
  – s(u, v) ≥ γ if u and v refer to the same person, and
  – s(u, v) ≤ γ otherwise.

• In practice, s(u, v) is unlikely to be perfect, which manifests itself as violations of the inequalities in (5) for some (u, v) edges.

• This is resolved by adding a slack variable to each inequality in (5).

Disambiguation algorithm: Training the Similarity Function (2/2)

• The task becomes solving the linear programming problem (6) to determine the optimal values for the path-type weights w1, w2, …, wn and the threshold γ (a sketch follows below).
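A sketch of such a training LP with scipy, under explicit assumptions: s(u, v) = Σi wi·ci(u, v) + f(u, v), the inequalities in (5) take the form given above, one non-negative slack variable is added per training edge, and the path-type weights are constrained non-negative:

```python
import numpy as np
from scipy.optimize import linprog

def train_weights(samples, n_types):
    """samples: list of (counts, f_uv, same) where `counts` gives the
    number of u-v paths of each type and `same` is the gold label.
    Minimizes the total slack; returns (w_1..w_n, gamma)."""
    m = len(samples)
    nvar = n_types + 1 + m                          # w's, gamma, slacks
    cost = np.zeros(nvar)
    cost[n_types + 1:] = 1.0                        # minimize sum of slacks
    A, b = [], []
    for j, (counts, f_uv, same) in enumerate(samples):
        row = np.zeros(nvar)
        sgn = 1.0 if same else -1.0
        row[:n_types] = -sgn * np.asarray(counts)   # -/+ sum_i w_i c_i
        row[n_types] = sgn                          # +/- gamma
        row[n_types + 1 + j] = -1.0                 # - slack_j
        A.append(row)
        b.append(sgn * f_uv)
    bounds = [(0, None)] * n_types + [(None, None)] + [(0, None)] * m
    res = linprog(cost, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.x[:n_types], res.x[n_types]
```

For a same-person pair the constraint row encodes Σi wi·ci + f(u, v) ≥ γ - ξ, and for a different-person pair it encodes Σi wi·ci + f(u, v) ≤ γ + ξ.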

Disambiguation algorithm: Choosing Negative Weight (1/7)

• A CC algorithm will assign an entity u to a cluster if the number of positive edges between u and the other entities in the cluster outnumbers that of the negative edges.

• That is, more than half (i.e., more than 50 percent) of the edges to the cluster must be positive.

Disambiguation algorithm: Choosing Negative Weight (2/7)

• Suppose instead that only 25 percent of positive edges should be sufficient to keep an entity in a cluster.

• Using the weight w+ = +1 for all positive edges and w- = -1/3 for all negative edges achieves the desired effect.

Disambiguation algorithm: Choosing Negative Weight (3/7)

• One solution for choosing a good value for the negative-edge weight w- is to learn it from past data.

• It should depend on the number of namesakes n in the top-k web pages:
  – If n = 1, choose w- = 0.
  – Then all pairs connected via positive edges will be merged.

Disambiguation algorithm: Choosing Negative Weight (4/7)

  – If n = k, it is best to choose w- = -1.
  – This would produce maximum negative evidence for pairs not to be merged.

• Hence, w- should be a function of n: w- = w-(n).

Disambiguation algorithm: Choosing Negative Weight (5/7)

• This observation raises two issues:
  – 1) n is not known to the algorithm beforehand;
  – 2) how should the w-(n) function be chosen?

Disambiguation algorithm: Choosing Negative Weight (6/7)

• Since n is not known, its estimate n̂ is computed by running the disambiguation algorithm once with a fixed value of w-.

• The algorithm outputs a certain number of clusters n̂, which is employed as an estimate of n.

Disambiguation algorithm: Choosing Negative Weight (7/7)

• The value of w-(n̂):
  – when n̂ < threshold, w-(n̂) = 0;
  – when n̂ > threshold, w-(n̂) = -1.

• This threshold is learned from the data (see the sketch below).
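The two-pass scheme of the last two slides in one small sketch; `cluster_fn`, which maps a choice of w- to the resulting clusters, is a hypothetical stand-in for one run of the disambiguation algorithm:

```python
def choose_negative_weight(cluster_fn, initial_w_neg, threshold):
    """First pass with a fixed w- estimates the number of namesakes;
    the learned step function then picks w-(n_hat) for the final run."""
    n_hat = len(cluster_fn(initial_w_neg))         # estimate n
    w_neg = 0.0 if n_hat < threshold else -1.0     # step function w-(n_hat)
    return cluster_fn(w_neg)
```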

A brief Summary


Interpreting Clustering Results (1/4)

• We now describe how these clusters are used to build people search.

• The goal is to provide the user with a set of clusters, each associated with a real person:
  – 1. Rank the clusters.
  – 2. Provide a summary description with each cluster.

Interpreting Clustering Results (2/4)

• Cluster rank:
  – select the highest-ranked page in the cluster.

• Cluster sketch:
  – the set of terms above a certain threshold is selected and used as a summary for the cluster (sketched below).
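A sketch of sketch generation: aggregate the TF/IDF vectors of a cluster's pages (e.g., from the tfidf_vectors sketch earlier) and keep the top-N terms; summing the per-page weights is an assumption, since the slide only specifies a threshold or top-N selection:

```python
def cluster_sketch(cluster_vecs, top_n=10):
    """Summarize a cluster by its highest-weighted terms, aggregated
    over the TF/IDF vectors of the pages in the cluster."""
    totals = {}
    for vec in cluster_vecs:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + w
    return sorted(totals, key=totals.get, reverse=True)[:top_n]
```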

Interpreting Clustering Results (3/4)

• Web page rank:
  – these pages are displayed according to their original search-engine order.

Interpreting Clustering Results (4/4)

• Affinity to cluster:
  – defined as the sum of the similarity values between the page p and each page v in the cluster C:

    aff(p, C) = Σ_{v ∈ C} s(p, v)

• The remaining pages are displayed as well, so the user has the option to get to these web pages too.
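The affinity formula as a one-function sketch, where `s` is the trained similarity function:

```python
def affinity(p, cluster, s):
    """aff(p, C): sum of similarities between page p and each page v in C."""
    return sum(s(p, v) for v in cluster)
```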

Experimental Results: Experimental Setup (1/8)

• The three data sets:
  – 1. WWW 2005 data set [8]: 12 different person names.
  – 2. WEPS data set, from the SemEval workshop [3], consisting of:
     • Trial: 9 person names.
     • Training: 49 person names.
     • Test: 30 person names.

[3] J. Artiles, J. Gonzalo, and S. Sekine, “The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task,” Proc. Int’l Workshop Semantic Evaluations (SemEval ’07), June 2007.
[8] R. Bekkerman and A. McCallum, “Disambiguating Web Appearances of People in a Social Network,” Proc. Int’l World Wide Web Conf. (WWW), 2005.

Experimental Results: Experimental Setup (2/8)

  – 3. Context data set:
     • built by issuing nine queries to Google, each in the form of a person name along with context keywords.
     • the top 100 returned web pages were gathered for each person.

Experimental Results: Experimental Setup (3/8)

• To obtain the “ground truth” for these data sets, the pages for each person name were assigned to distinct real persons by manual examination.

Experimental Results: Experimental Setup (4/8)

• The GATE [19] system was used for the extraction of NEs from the web pages in the data sets.

• To train the free parameters of the algorithm, leave-one-out cross-validation is applied on the WWW 2005, WEPS Trial, and Context data sets.

[19] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications,” Proc. Ann. Meeting of the Assoc. Computational Linguistics (ACL), 2002.

Experimental Results: Experimental Setup (5/8)

• Before the “ground truth” for the WEPS Test portion was released, the approach was tested on the WEPS Training set by twofold cross-validation.

Experimental Results: Experimental Setup (6/8)

• After the “ground truth” of the WEPS Test portion became available, the algorithm was trained on the whole WEPS Training portion and tested on the WEPS Test portion.

Experimental Results: Experimental Setup (7/8)

• Quality evaluation measures:
  – the B-cubed [6] and Fp measures (B-cubed is sketched below).

• Baseline method:
  – the Agglomerative Vector Space clustering algorithm with TF/IDF.
  – The threshold parameter for this method is trained in the same way.
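For reference, the standard B-cubed computation (generic code, not the paper's) can be sketched as:

```python
def b_cubed(pred, gold):
    """pred and gold map each web page to a cluster id. Returns
    (precision, recall, F1) under the standard B-cubed definition."""
    pages = list(gold)
    def overlap(p, a, b):
        ca = {q for q in pages if a[q] == a[p]}   # p's cluster under a
        cb = {q for q in pages if b[q] == b[p]}   # p's cluster under b
        return len(ca & cb) / len(ca)
    prec = sum(overlap(p, pred, gold) for p in pages) / len(pages)
    rec = sum(overlap(p, gold, pred) for p in pages) / len(pages)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```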

Experimental Results: Experimental Setup (8/8)

• Statistical significance test:
  – one-tailed paired t-test, with α = 0.05.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality: overall) (1/7)

[Table slide: overall disambiguation quality of the compared methods.]

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality: overall) (2/7)

* s(u, v) = c(u, v) represents the approach where only the connection strength is employed for disambiguation.
* It relies only on the extracted NEs and hyperlink information, and does not use TF/IDF.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality: overall) (3/7)

* With the analysis of the features of the web pages, f(u, v), in the form of their TF/IDF similarity.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality: overall) (4/7)

Picking w- according to the function w-(n̂) of the predicted number of namesakes:

  – gains a 7.8 percent improvement in B-cubed over the baseline (WWW 2005);
  – gets a 6.1 percent improvement (WEPS Training) and a 10.7 percent improvement (WEPS Test).

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality: overall) (5/7)

The results are also compared with the top runners in the WEPS challenge [3]. The first runner in the challenge reports 0.78 for Fp and 0.70 for the B-cubed measure.

[3] J. Artiles, J. Gonzalo, and S. Sekine, “The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task,” Proc. Int’l Workshop Semantic Evaluations (SemEval ’07), June 2007.

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality per namesake) (6/7)

The “#” field shows the number of namesakes for a particular name in the corresponding 100 web pages.

[4] J. Artiles, J. Gonzalo, and F. Verdejo, “A Testbed for People Searching Strategies in the WWW,” Proc. SIGIR, 2005. (citations: 39)

Testing Disambiguation Quality: Experiment 1 (Disambiguation quality per namesake) (7/7)

The table shows that the proposed approach outperforms that of [4] by 9.5 percent in terms of the Fp measure.

[4] J. Artiles, J. Gonzalo, and F. Verdejo, “A Testbed for People Searching Strategies in the WWW,” Proc. SIGIR, 2005. (citations: 39)

Testing Disambiguation Quality: Experiment 2 (Disambiguation quality: group identification)

The experiment uses the 1,085 web pages of the WWW 2005 data set. The task is to find the web pages related to the N intended people.

Testing Disambiguation Quality: Experiment 2 (Disambiguation quality: group identification)

The field “#W” in Table 3 is the number of to-be-found web pages related to the namesake of interest.
The field “#C” is the number of web pages found correctly.
The field “#I” is the number of pages found incorrectly in the resulting groups.

Testing Disambiguation Quality: Experiment 3 (Disambiguation quality: queries with context)

A data set was generated by querying Google with a person name plus context keyword(s) related to that person.

Nine different queries were used.

Testing Disambiguation Quality: Experiment 4 (Quality of generated cluster sketches)

The set of terms above a certain threshold (or the top-N terms) is selected and used as a summary for the cluster.

If the search is for UMass professor Andrew McCallum, his cluster can easily be identified by terms like “machine learning” and “artificial intelligence.”

Impact on Search

In the case of a traditional search interface, at each observation i, where i = 1, 2, …, K, the user looks at the sketch provided for the i-th returned web page.

Impact on Search

For the new interface, supported by cluster-based people search, the user first looks at the “people search” interface:

1. The user sequentially reads cluster sketches/descriptions until, on the m-th observation, the user finds the cluster of interest.
2. The user clicks on that cluster.
3. The system shows the original set of K web pages returned by the search engine.

Impact on Search

Measures:

The quality of the new and standard interfaces is compared using Precision, Recall, and the F-measure.

In general, the fewer observations needed in a given interface, the faster the user can find the related pages (a simplified model is sketched below).
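To illustrate the user model, a simplified sketch (a simplification for illustration, not the paper's exact protocol) of recall as a function of the number of observations under the two interfaces:

```python
def recall_curve_standard(relevant, ranked_pages):
    """Recall after each observation when the user scans results in
    the original engine order (standard interface)."""
    found, curve = 0, []
    for page in ranked_pages:
        found += page in relevant
        curve.append(found / len(relevant))
    return curve

def recall_curve_clustered(relevant, clusters, m):
    """Clustered interface, simplified: the first m observations are
    spent reading cluster sketches, then the user reads the pages of
    the chosen (m-th) cluster one by one."""
    curve = [0.0] * m          # no pages discovered while reading sketches
    found = 0
    for page in clusters[m - 1]:
        found += page in relevant
        curve.append(found / len(relevant))
    return curve
```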

Experiment 5 (Impact on search), Case 1: First-dominant cluster

  Observations needed                              Standard   New interface
  To discover 50 percent of the relevant pages        44           33
  To discover 90 percent of the relevant pages        92           55

Experiment 5 (Impact on search), Case 2: Regular cluster

  Observations needed                              Standard   New interface
  To discover 50 percent of the relevant pages        51           16
  To discover 90 percent of the relevant pages        79           17

The case considered is Andrew McCallum the customer-support person, whose cluster consists of three pages.

Experiment 5 (Impact on search), Case 3: Average

The average of the Recall, Precision, and F measures for search impact on the WWW 2005 data set.

Some of the person names have many namesakes.

The results show that, even with imperfect clustering, the curves for the new interface largely dominate those for the standard interface.

Experiment 5 (Impact on search): Impact on search with context

When context keywords are added, one might expect to see no difference between the new and the standard interface.

The query is “Andrew McCallum” music. The number of namesakes for that query is surprisingly large: 23.

The target is Andrew McCallum the UMass professor, who is interested in music.

Experiment 5 (Impact on search): Impact on search with context

A second target is Andrew McCallum the DJ/musician. In both cases, the new interface performs better than the standard one.

  Observations needed                                Standard   New interface
  To discover 90 percent of the professor's pages       90           60
  To discover 90 percent of the DJ's pages              90           20

Experiment 6 (efficiency)

Downloading and preprocessing take 3.82 seconds per web page.

The clustering algorithm itself executes in 4.7 seconds on average per queried name.

Conclusions and Future Work

• The paper attempts to answer the question of what maximum quality the approach can reach if it uses only the information stored in the top-k web pages being processed.

• Future work:
  1. Employ external data sources for disambiguation.
  2. Use more advanced extraction capabilities.
  3. Work on algorithms for generic entity search, where entities are not limited to people.

Related Work

• Disambiguation and entity resolution techniques are key to any Web people search application.

The differences between the disambiguation methodology in this paper and most related existing work are multi-level (see Table 1).

Level 1: Problem type.
Two common types of the disambiguation challenge: (fuzzy) Lookup [27], [28] and (fuzzy) Grouping [10], [13].

Level 2: Data with respect to G^L_uv.

* The methodology in this paper is based on analyzing G^L_uv.

* The majority of the existing techniques do not analyze G^L_uv.

Name co-occurrence.
[12] analyzes only co-occurrences of author names via publications, for a publication data set.

[12] I. Bhattacharya and L. Getoor, “Collective Entity Resolution in Relational Data,” Bull. IEEE CS Technical Committee Data Eng., vol. 29, no. 2, pp. 4-12, June 2006.

Name co-occurrence.
When analyzing authors A1 and A5, the approaches in [10], [11], and [13] would only be interested in author A3, a co-occurring author in publications P1 and P2, which are connected to A1 and A5, respectively.

[10] I. Bhattacharya and L. Getoor, “Iterative Record Linkage for Cleaning and Integration,” Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD), 2004.
[11] I. Bhattacharya and L. Getoor, “Relational Clustering for Multi-Type Entity Resolution,” Proc. Multi-Relational Data Mining Workshop (MRDM), 2005.
[13] I. Bhattacharya and L. Getoor, “A Latent Dirichlet Model for Unsupervised Entity Resolution,” Proc. SIAM Data Mining Conf. (SDM), 2006.

Name co-occurrence.
[12] would be interested only in the subgraph shown in Fig. 5. The methodology in this paper instead analyzes the whole G^L_uv.

[12] I. Bhattacharya and L. Getoor, “Collective Entity Resolution in Relational Data,” Bull. IEEE CS Technical Committee Data Eng., vol. 29, no. 2, pp. 4-12, June 2006.

Restrictions on types.
[12] understands only one type of relationship. The approach proposed here can analyze all types of relationships and entities.

* [26], [31], and [33] often still analyze just portions of G^L_uv.

* The adaptive approach in [33] analyzes G^2_uv; see Fig. 7.

[26] R. Holzer, B. Malin, and L. Sweeney, “Email Alias Detection Using Social Network Analysis,” Proc. ACM SIGKDD, 2005.
[31] B. Malin, “Unsupervised Name Disambiguation via Social Network Similarity,” Proc. Workshop Link Analysis, Counterterrorism, and Security, 2005.
[33] E. Minkov, W. Cohen, and A. Ng, “Contextual Search and Name Disambiguation in Email Using Graphs,” Proc. SIGIR, 2006.

* [31] simply looks at people and connects them via generic “are-related” edges.

[31] B. Malin, “Unsupervised Name Disambiguation via Social Network Similarity,” Proc. Workshop Link Analysis, Counterterrorism, and Security, 2005.

* Level 3: Analysis of G^L_uv.

* The methodology in this paper is based on analyzing the paths in Puv and building mathematical models for c(u, v).
* Existing work (e.g., [27], [28]) analyzes the direct neighbors, and [26] analyzes the shortest u-v path.

[26] R. Holzer, B. Malin, and L. Sweeney, “Email Alias Detection Using Social Network Analysis,” Proc. ACM SIGKDD, 2005.
[27] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph,” ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, June 2006.
[28] D.V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting Relationships for Domain-Independent Data Cleaning,” Proc. SIAM Int’l Conf. Data Mining (SDM ’05), Apr. 2005.

* Level 4: Way to use c(u, v).
* [10] and [11] employ agglomerative clustering.
* In [27], [28], the disambiguation problem is converted into an optimization problem, which is then solved iteratively.

[10] I. Bhattacharya and L. Getoor, “Iterative Record Linkage for Cleaning and Integration,” Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD), 2004.
[11] I. Bhattacharya and L. Getoor, “Relational Clustering for Multi-Type Entity Resolution,” Proc. Multi-Relational Data Mining Workshop (MRDM), 2005.
[27] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph,” ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, June 2006.
[28] D.V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting Relationships for Domain-Independent Data Cleaning,” Proc. SIAM Int’l Conf. Data Mining (SDM ’05), Apr. 2005.

* Level 5: Domain independence.
* Some of the existing techniques are applicable only to certain types of domains, or to just one domain.

Related Work: WSD (1/3)

• Word Sense Disambiguation:
  – determine the exact sense of an ambiguous word, given a list of word senses.
• Word Sense Discrimination:
  – determine which instances of the ambiguous word can be clustered as sharing the same meaning.

Related Work: WSD (2/3)

• External knowledge sources:
  – lexical knowledge associated with a dictionary, and WordNet.
• Approaches:
  – supervised
  – unsupervised

Related Work: WSD (3/3)

• One can view the ambiguous word as a reference and the word sense as an entity.

• The two instances of the WSD problem are then similar to the Lookup and Grouping instances of Entity Resolution/WePS.

Related Work: WePS (1/4)

• WePS can be implemented in two different settings:
  – Server-side setting: the disambiguation mechanism is integrated into the search engine directly.
  – Middleware approach: people-search capabilities are built on top of an existing search engine such as Google by “wrapping” the original engine.

Related Work: WePS (2/4)

• Clusty (http://www.clusty.com)
• Grokker (http://www.grokker.com)
• Kartoo (http://www.kartoo.com)

Related Work: WePS (3/4)

• ZoomInfo (http://www.zoominfo.com)

Related Work: WePS (4/4)

• But such systems have high cost and low scalability, because the person information in them is collected primarily manually.

• The proposed approach does not rely on any such pre-compiled knowledge, and thus scales to a search for any person on the Web.