Peters matthew periodictableseo
Uploaded by mattthemathman
“philadelphia phillies”
“Relevance” vs “Ranking”
Conceptually, “relevance” determination and “ranking” can be thought of as two different steps (even if they are implemented as one in a search engine):
1. Relevance
2. Ranking
Is this page relevant to “philadelphia phillies”?
query-body similarity: 0.74
query-title similarity: 0.8
query-H1 similarity: 1.0
etc …
Measuring query-document similarity
See “Introduction to Information Retrieval” by Manning et al.: http://nlp.stanford.edu/IR-book/
> 700 papers
Goal: given query + document string, compute “similarity”
Measuring query-document similarity
“philadelphia phillies”
Query Model: tokenization, normalization (stemming), query expansion, intent
Document Model: tokenization, normalization (stemming), vector space representation, language model
In this context “document” can also refer to title tag, meta description, H1, etc.
Scoring function -> 0.74
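As a rough illustration of the query model, document model, and scoring function fitting together, here is a minimal sketch. It assumes the simplest possible choices: lowercasing as the only normalization, raw term counts as the vector space representation, and cosine similarity as the scoring function. The function names and the page fields are invented for the example, and it is not expected to reproduce the 0.74 shown on the slide.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(query, document):
    """Cosine similarity between the raw term-count vectors of two strings."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(document))
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


# Hypothetical page fields, invented for illustration only.
query = "philadelphia phillies"
h1 = "Philadelphia Phillies"
title = "Philadelphia Phillies news and schedule"

print("query-H1 similarity:", cosine_similarity(query, h1))
print("query-title similarity:", cosine_similarity(query, title))
```

The same function can be applied to any field treated as a “document” (body, title tag, meta description, H1), which is how one page ends up with several similarity scores.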
Query representation
Language identification
Word segmentation (Japanese, Chinese)
Tokenization + normalization: {reviews, reviewer, reviewing} -> review
Query expansion
User intent (transactional, navigational, informational)
Local
Classification (images, video, news)
Topic Model (LDA)
Entity extraction
Spelling correction
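The tokenization + normalization step can be sketched with a toy suffix stripper. This is only an illustration of the idea: the suffix list is an assumption chosen so that the slide's example works, and it is not the Porter stemmer or any production normalizer.

```python
import re

# Toy suffix list, chosen only so the slide's example works. This is NOT the
# Porter stemmer; a real system would use a proper stemming algorithm.
SUFFIXES = ("ing", "er", "s")


def normalize(token):
    """Lowercase a token and strip one known suffix, keeping a stem of 3+ letters."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def query_terms(query):
    """Tokenize a query string and normalize each token."""
    return [normalize(t) for t in re.findall(r"[A-Za-z0-9]+", query)]


for word in ("reviews", "reviewer", "reviewing"):
    print(word, "->", normalize(word))
```

Under this sketch, “movie reviews” and “movie review” produce the same list of query terms, which is the effect normalization is meant to have.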
Document representation
TF-IDF
Language Model: P(optimization | search, engine) >> P(walking | search, engine)
Probability Ranking Principle: P(R = 1 | d, q) or P(R = 0 | d, q)
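To make the language model line concrete, the sketch below estimates P(word | previous two words) by counting trigrams. The three-sentence corpus and the function names are invented for the example, and the estimate is an unsmoothed maximum-likelihood one, so unseen continuations get probability zero.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count two-word contexts and the trigrams that extend them."""
    context_counts = Counter()
    trigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            context_counts[(a, b)] += 1
            trigram_counts[(a, b, c)] += 1
    return context_counts, trigram_counts


def conditional_probability(word, context, context_counts, trigram_counts):
    """Unsmoothed estimate of P(word | context) for a two-word context."""
    seen = context_counts[context]
    if seen == 0:
        return 0.0
    return trigram_counts[context + (word,)] / seen


# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "search engine optimization is a marketing discipline",
    "search engine optimization takes time",
    "a search engine ranks pages",
]
contexts, trigrams = train_trigrams(corpus)
context = ("search", "engine")
print(conditional_probability("optimization", context, contexts, trigrams))
print(conditional_probability("walking", context, contexts, trigrams))
```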
What are the characteristics of sites that rank highly?
14,000+ keywords
Top 50 results
600,000 URLs
Google-US, no personalization
March 2013
Mean Spearman Correlation
Remember: “correlation is not causation”
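A mean Spearman correlation can be computed by ranking a feature and the SERP positions within each keyword, correlating the ranks, and averaging across keywords. The sketch below is one way to do that in plain Python. It assumes positions are negated so that a feature that is higher for better-ranked pages gives a positive correlation; that sign convention, the function names, and the sample numbers are assumptions for illustration, not details taken from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(per_query):
    """per_query: list of (feature_values, serp_positions) pairs, one per keyword."""
    total = 0.0
    for features, positions in per_query:
        # Negate positions so that position 1 (best) gets the highest value.
        total += spearman(features, [-p for p in positions])
    return total / len(per_query)


# Hypothetical feature values for two keywords, three results each.
sample = [
    ([0.9, 0.7, 0.4], [1, 2, 3]),  # feature decreases down the SERP
    ([0.2, 0.5, 0.9], [1, 2, 3]),  # feature increases down the SERP
]
print(mean_spearman(sample))
```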
Which method performs best?
We tried a few different types of smoothing for the language model; Dirichlet worked best (Zhai and Lafferty, SIGIR 2001).
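For reference, Dirichlet smoothing estimates P(w | d) = (count(w, d) + mu * P(w | collection)) / (|d| + mu), and a document is scored by the sum of log P(w | d) over the query terms. The sketch below implements that formula on a toy collection. The value of mu, the function name, and the documents are assumptions for illustration; the talk does not state which settings were used.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_tokens, doc_tokens, collection_counts,
                             collection_length, mu=2000.0):
    """Log query likelihood under a Dirichlet-smoothed unigram document model."""
    doc_counts = Counter(doc_tokens)
    doc_length = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        p_collection = collection_counts.get(term, 0) / collection_length
        p_term = (doc_counts[term] + mu * p_collection) / (doc_length + mu)
        if p_term == 0.0:
            # Term never appears anywhere in the collection.
            return float("-inf")
        score += math.log(p_term)
    return score


# Hypothetical two-document collection, invented for illustration only.
doc_a = "philadelphia phillies baseball schedule".split()
doc_b = "movie reviews and movie trailers".split()
collection = doc_a + doc_b
collection_counts = Counter(collection)
query = ["philadelphia", "phillies"]

score_a = dirichlet_log_likelihood(query, doc_a, collection_counts, len(collection), mu=10.0)
score_b = dirichlet_log_likelihood(query, doc_b, collection_counts, len(collection), mu=10.0)
print(score_a, score_b)
```

Because of the collection term, a document that lacks a query word still gets a small non-zero probability, so it is ranked lower instead of being excluded outright.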
Impact of stemming
Porter stemmer provided a slight increase in correlations
These correlations are still relatively low compared to other factors
movie reviews
For each query: 500 pages (10% relevant, 90% irrelevant)
50 results + 450 random pages

Pages ranked by PA:
URL ID   PA   In SERP?
86       92   1
355      90   0
…        …    …
27       18   0

Pages ranked by Language Model:
URL ID   Language Model   In SERP?
213      0.97             1
156      0.95             1
…        …                …
355      0.06             0

Top 50 ranked
P@50 is the “Precision of the top 50 results”. It is the percentage of the 50 highest-scoring pages, ranked by PA or by Language Model, that are actually in the SERP.
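The P@50 definition can be sketched as a small function: sort the pages by score, keep the top k, and take the fraction that are in the SERP. The function name is hypothetical. The example uses only the rows visible in the tables with k=2, so its output illustrates the mechanics and says nothing about the study's actual P@50 values.

```python
def precision_at_k(scored_pages, k=50):
    """Fraction of the k highest-scoring pages that are in the SERP.

    scored_pages: list of (score, in_serp) pairs, where in_serp is 1 or 0.
    Divides by the number of pages actually kept, which is k when at
    least k pages are supplied.
    """
    top_k = sorted(scored_pages, key=lambda page: page[0], reverse=True)[:k]
    if not top_k:
        return 0.0
    return sum(in_serp for _, in_serp in top_k) / len(top_k)


# Only the rows visible on the slide; the elided rows are not filled in.
pa_rows = [(92, 1), (90, 0), (18, 0)]
language_model_rows = [(0.97, 1), (0.95, 1), (0.06, 0)]

print("PA, P@2:", precision_at_k(pa_rows, k=2))
print("Language Model, P@2:", precision_at_k(language_model_rows, k=2))
```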
Takeaways
Implication: Query-document similarity is based on decades of research. It’s immune to algorithm change.
Action item: With sophisticated query and document models, no need to optimize separately for similar words, e.g. “movie reviews” vs “movie review”.
Action item: Each page is relevant to many different keywords, so optimize each page for a broad set of related keywords, instead of a single keyword.
Use case: Content creation. What keywords will this new blog post target? Is it relevant to a set of queries?