Peters matthew periodictableseo

Modern On Page Factors, SMX Advanced. Matthew Peters, PhD [email protected] @mattthemathman


Transcript of Peters matthew periodictableseo

Page 1: Peters matthew periodictableseo

Modern On Page Factors

SMX Advanced

Matthew Peters, [email protected] @mattthemathman

Page 2: Peters matthew periodictableseo

“philadelphia phillies”

Page 6: Peters matthew periodictableseo

“Relevance” vs “Ranking”

Conceptually, “relevance” determination and “ranking” can be thought of as two different steps (even if they are implemented as one in a search engine):

1. Relevance

2. Ranking

Page 9: Peters matthew periodictableseo

Is this page relevant to “philadelphia phillies”?

query-body similarity: 0.74

query-title similarity: 0.8

query-H1 similarity: 1.0

etc …

Page 11: Peters matthew periodictableseo

Measuring query-document similarity

See “Introduction to Information Retrieval” by Manning et al.: http://nlp.stanford.edu/IR-book/

> 700 papers

Goal: given query + document string, compute “similarity”

Page 15: Peters matthew periodictableseo

Measuring query-document similarity

“philadelphia phillies”

Query Model

tokenization, normalization (stemming), query expansion, intent

Document Model

tokenization, normalization (stemming), vector space representation, language model

In this context “document” can also refer to title tag, meta description, H1, etc.

Scoring function

0.74

Page 18: Peters matthew periodictableseo

Query representation

Language identification

Word segmentation (Japanese, Chinese)

Tokenization + normalization: {reviews, reviewer, reviewing} -> review

Query expansion

User intent (transactional, navigational, informational)

Local

Classification (images, video, news)

Topic Model (LDA)

Entity extraction

Spelling correction
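
The tokenization + normalization step above can be sketched in a few lines. This is a toy illustration only: the suffix list and the function names (`tokenize`, `naive_stem`, `normalize`) are assumptions made for this example, and the suffix stripping is deliberately naive. It is not the Porter stemmer and will mis-stem many real words.

```python
import re

# Hypothetical suffix list, chosen only so the slide's example works.
# A real system would use a proper stemmer or lemmatizer instead.
SUFFIXES = ("ing", "er", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Tokenize the text, then stem every token."""
    return [naive_stem(token) for token in tokenize(text)]


print(normalize("Reviews, reviewer, reviewing"))  # ['review', 'review', 'review']
```

Because the query and the document pass through the same normalization, “movie reviews” and “movie review” end up as the same token sequence.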

Page 21: Peters matthew periodictableseo

Document representation

Probability Ranking Principle: P(R = 1 | d, q) or P(R = 0 | d, q)

TF-IDF

Language Model: P(optimization | search, engine) >> P(walking | search, engine)
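
As one concrete reading of the TF-IDF representation, the sketch below builds sparse TF-IDF vectors and scores a query against several page fields with cosine similarity. The field texts, the smoothed IDF formula, and all function names are assumptions for illustration; the talk does not specify which TF-IDF variant was used.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple whitespace tokenizer (no stemming)."""
    return text.lower().split()


def idf_table(corpus_tokens):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus_tokens)
    df = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms missing from the IDF table are ignored."""
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up field texts; each field is treated as its own "document".
fields = {
    "title": "philadelphia phillies official site",
    "h1": "philadelphia phillies",
    "body": "the phillies play baseball in philadelphia every summer",
}
idf = idf_table([tokenize(text) for text in fields.values()])
query_vec = tfidf_vector(tokenize("philadelphia phillies"), idf)

for name, text in fields.items():
    score = cosine(query_vec, tfidf_vector(tokenize(text), idf))
    print(f"query-{name} similarity: {score:.2f}")
```

The H1 field matches the query exactly, so its similarity is 1.0; the other fields score lower because they contain extra terms.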

Page 22: Peters matthew periodictableseo

Which method performs best?

What are the characteristics of sites that rank highly?

14,000+ keywords; top 50 results; 600,000 URLs; Google-US, no personalization; March 2013

Mean Spearman Correlation

Remember: “correlation is not causation”
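
A mean Spearman correlation can be computed as sketched below: rank the feature values and the SERP positions within each query, take the Pearson correlation of the ranks, then average over queries. The data and function names are made up for illustration, and the sign convention is an assumption: with position 1 as the best result, a helpful feature shows up as a negative correlation with position unless the sign is flipped.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(per_query):
    """Average the per-query Spearman correlations."""
    return mean(spearman(feature, position) for feature, position in per_query)


# Made-up (feature values, SERP positions) pairs for two queries.
per_query = [
    ([0.9, 0.7, 0.4, 0.1], [1, 2, 3, 4]),
    ([0.2, 0.8, 0.5, 0.6], [1, 2, 3, 4]),
]
print(mean_spearman(per_query))
```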

Page 23: Peters matthew periodictableseo

Which method performs best?

We tried a few different types of smoothing for the language model; Dirichlet smoothing worked best (Zhai and Lafferty, SIGIR 2001).
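
For reference, a minimal sketch of query-likelihood scoring with Dirichlet smoothing, where each term probability is (count in document + mu × collection probability) / (document length + mu). The toy documents, the `mu` values, and the function name are assumptions; this is not the code used in the study.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_tokens, doc_tokens, collection_counts, mu=2000.0):
    """Log P(query | document) under a Dirichlet-smoothed unigram model."""
    collection_len = sum(collection_counts.values())
    if collection_len == 0:
        return 0.0
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        p_collection = collection_counts.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # term never seen in the collection: skip it
        p_term = (doc_counts[term] + mu * p_collection) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Made-up two-document collection.
docs = {
    "seo": "search engine optimization tips for search engine rankings".split(),
    "hiking": "walking trails and hiking routes near the city".split(),
}
collection = Counter(term for tokens in docs.values() for term in tokens)
query = "search engine optimization".split()

# A small mu suits this tiny collection; it is a tunable parameter.
scores = {
    name: dirichlet_log_likelihood(query, tokens, collection, mu=10.0)
    for name, tokens in docs.items()
}
print(scores)
```

The smoothing term keeps the score finite for documents that lack a query term, while documents that actually contain the terms still score higher.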

Page 24: Peters matthew periodictableseo

Impact of stemming

Porter stemmer provided a slight increase in correlations

Page 25: Peters matthew periodictableseo

These correlations are still relatively low compared to other factors

Page 30: Peters matthew periodictableseo

Query: movie reviews

For each query: 500 pages (50 results + 450 random pages), 10% relevant, 90% irrelevant

URL ID    PA    In SERP?
86        92    1
355       90    0
…         …     …
27        18    0

URL ID    Language Model    In SERP?
213       0.97              1
156       0.95              1
…         …                 …
355       0.06              0

P@50 is the “Precision of the top 50 results”. It is the percentage of top 50 results by PA/Language Model that are actually in the SERP.

Top 50 ranked
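
The P@50 definition above can be sketched as a small function. The example uses only the rows visible in the tables (the elided rows are left out), so it evaluates P@2 rather than P@50; the function name and the dictionary layout are assumptions for illustration.

```python
def precision_at_k(scores, in_serp, k=50):
    """Fraction of the k highest-scoring URLs that are actually in the SERP.

    If fewer than k URLs are available, the fraction is taken over the URLs
    that exist.
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    if not top_k:
        return 0.0
    return sum(1 for url in top_k if url in in_serp) / len(top_k)


# Only the rows shown on the slide; URL IDs map to scores.
pa_scores = {86: 92, 355: 90, 27: 18}
lm_scores = {213: 0.97, 156: 0.95, 355: 0.06}
in_serp = {86, 213, 156}  # URL IDs marked "In SERP? = 1"

print(precision_at_k(pa_scores, in_serp, k=2))  # 0.5
print(precision_at_k(lm_scores, in_serp, k=2))  # 1.0
```

On these few rows, the language-model ranking puts two SERP pages at the top, while the PA ranking puts one SERP page and one random page there.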

Page 34: Peters matthew periodictableseo

Takeaways

Implication: Query-document similarity is based on decades of research. It’s immune to algorithm change.

Action item: With sophisticated query and document models, no need to optimize separately for similar words, e.g. “movie reviews” vs “movie review”.

Action item: Each page is relevant to many different keywords, so optimize each page for a broad set of related keywords, instead of a single keyword.

Use case: Content creation. What keywords will this new blog post target? Is it relevant to a set of queries?

Page 35: Peters matthew periodictableseo

Thanks for watching!

Matthew Peters

[email protected] @mattthemathman