
Modern On-Page Factors
SMX Advanced

Matthew Peters, PhD
matt@moz.com / @mattthemathman

“philadelphia phillies”

“Relevance” vs “Ranking”

Conceptually, “relevance” determination and “ranking” can be thought of as two different steps (even if they are implemented as one in a search engine):

1. Relevance
2. Ranking
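
A minimal sketch of this two-step view, assuming a simple term-overlap relevance test and a placeholder scoring function (neither is from the deck; real engines may fuse both steps into one pass):

    def is_relevant(query, doc):
        """Step 1 (relevance): does the document mention every query term?"""
        return all(term in doc.lower() for term in query.lower().split())

    def search(query, index, score):
        relevant = [doc for doc in index if is_relevant(query, doc)]
        # Step 2 (ranking): order the relevant set by a scoring function.
        return sorted(relevant, key=lambda doc: score(query, doc), reverse=True)

    index = ["Philadelphia Phillies schedule",
             "Phillies cheesesteak recipes",
             "Philadelphia Phillies roster and news"]
    print(search("philadelphia phillies", index,
                 score=lambda q, d: len(d)))  # placeholder score, not a real model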

Is this page relevant to “philadelphia phillies”?

query-body similarity: 0.74
query-title similarity: 0.8
query-H1 similarity: 1.0
etc …
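
The per-field scores above treat the body, title, and H1 as separate “documents”. A hedged sketch of that extraction step using BeautifulSoup (a library choice of this transcript, not the deck); the similarity scoring itself is sketched in the next section:

    from bs4 import BeautifulSoup

    html = """<html><head><title>Philadelphia Phillies tickets and news</title></head>
    <body><h1>Philadelphia Phillies</h1><p>Game recaps, scores, roster ...</p></body></html>"""

    soup = BeautifulSoup(html, "html.parser")
    fields = {
        "title": soup.title.get_text() if soup.title else "",
        "h1": soup.h1.get_text() if soup.h1 else "",
        "body": soup.get_text(" ", strip=True),
    }
    print(fields)  # each field can now be scored against the query separately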

Measuring query-document similarity

Goal: given a query + document string, compute “similarity”.

See “Introduction to Information Retrieval” by Manning et al.: http://nlp.stanford.edu/IR-book/
> 700 papers

Example: the query “philadelphia phillies” passes through a Query Model, the page through a Document Model, and a Scoring function combines the two into a similarity score (0.74 here).

Query Model: tokenization, normalization (stemming), query expansion, intent
Document Model: tokenization, normalization (stemming), vector space representation, language model

In this context “document” can also refer to the title tag, meta description, H1, etc.
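
A minimal sketch of one standard scoring function from the IR literature, cosine similarity over TF-IDF vectors (the deck names TF-IDF later but does not specify an implementation; scikit-learn and the example documents are assumptions here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "The Philadelphia Phillies are a professional baseball team ...",
        "A page about walking tours of historic Philadelphia ...",
    ]
    query = "philadelphia phillies"

    # Fit vocabulary and IDF weights on the documents, then project the
    # query into the same vector space.
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])

    # One score per document; higher means more similar to the query.
    print(cosine_similarity(query_vector, doc_vectors)[0])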

Query representation

Language identification
Word segmentation (Japanese, Chinese)
Tokenization + normalization: {reviews, reviewer, reviewing} -> review
Spelling correction
Query expansion
User intent (transactional, navigational, informational), local
Classification (images, video, news)
Topic Model (LDA)
Entity extraction
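
A quick sketch of the tokenization + normalization example above, using NLTK’s Porter stemmer (the stemmer evaluated later in the deck; NLTK itself is an assumption of this transcript):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["reviews", "reviewer", "reviewing"]:
        print(word, "->", stemmer.stem(word))
    # All three stem to "review", so they match the same normalized query term.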

Document representation

TF-IDF

Language Model: P(optimization | search, engine) >> P(walking | search, engine)

Probability Ranking Principle: P(R = 1 | d, q) or P(R = 0 | d, q)
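
A toy sketch of the language-model intuition on the slide: text about search engines makes “optimization” far more probable than “walking” after “engine”. This maximum-likelihood bigram model is illustrative only (the deck’s two-word conditioning and training data are not specified):

    from collections import Counter, defaultdict

    text = ("search engine optimization helps a search engine rank pages "
            "for search engine users").split()

    # Count how often each word follows each previous word.
    bigram_counts = defaultdict(Counter)
    for prev, word in zip(text, text[1:]):
        bigram_counts[prev][word] += 1

    def p_next(word, prev):
        """Maximum-likelihood P(word | prev); zero for unseen bigrams."""
        total = sum(bigram_counts[prev].values())
        return bigram_counts[prev][word] / total if total else 0.0

    print(p_next("optimization", "engine"))  # 1/3 in this toy corpus
    print(p_next("walking", "engine"))       # 0.0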

Which method performs best?

What are the characteristics of sites that rank highly?

14,000+ keywords
Top 50 results
600,000 URLs
Google-US, no personalization
March 2013

Mean Spearman Correlation

Remember: “correlation is not causation”
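
A hedged sketch of the study’s core measurement, the Spearman rank correlation between a factor and search position, computed per keyword with scipy (the numbers below are invented, not the study’s data):

    from scipy.stats import spearmanr

    # For one keyword: a factor's scores for the top 10 results, in SERP order.
    factor_scores = [0.97, 0.95, 0.91, 0.88, 0.90, 0.71, 0.65, 0.60, 0.55, 0.40]
    positions = list(range(1, 11))  # 1 = best position

    rho, p_value = spearmanr(factor_scores, positions)
    print(rho)  # negative here: higher scores sit at better (lower) positions

    # The study then averages this per-keyword rho over 14,000+ keywords.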

We tried a few different types of smoothing for the language model; Dirichlet smoothing worked best (Zhai and Lafferty, SIGIR 2001).
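
For reference, the standard Dirichlet-smoothed estimate from the cited paper (the deck names the method but does not show the formula):

    % Dirichlet smoothing (Zhai & Lafferty, SIGIR 2001):
    % tf(w, d) = count of w in document d, |d| = document length,
    % P(w | C) = collection-wide probability of w, mu = smoothing parameter.
    P(w \mid d) = \frac{\mathrm{tf}(w, d) + \mu \, P(w \mid C)}{\lvert d \rvert + \mu}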

Impact of stemming

The Porter stemmer provided a slight increase in correlations. These correlations are still relatively low compared to other factors.

Experiment: “movie reviews”

For each query: 500 pages
- 50 results (10% relevant)
- 450 random pages (90% irrelevant)

Top 50 ranked by PA:

URL ID | PA | In SERP?
86     | 92 | 1
355    | 90 | 0
…      | …  | …
27     | 18 | 0

Top 50 ranked by Language Model:

URL ID | Language Model | In SERP?
213    | 0.97           | 1
156    | 0.95           | 1
…      | …              | …
355    | 0.06           | 0

P@50 is the “Precision of the top 50 results”: the percentage of the top 50 results by PA / Language Model that are actually in the SERP.
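
A minimal sketch of the P@50 computation just described, with invented data standing in for the 500 pages per query:

    import random

    def precision_at_k(scored_pages, k=50):
        """scored_pages: (score, in_serp) pairs; in_serp is 1 if in the SERP."""
        top_k = sorted(scored_pages, key=lambda pair: pair[0], reverse=True)[:k]
        return sum(in_serp for _, in_serp in top_k) / k

    # Toy setup mirroring the slide: 50 SERP pages + 450 random pages,
    # scored by a hypothetical model that tends to score SERP pages higher.
    random.seed(0)
    pages = [(random.random() + 0.5 * in_serp, in_serp)
             for in_serp in [1] * 50 + [0] * 450]
    print(precision_at_k(pages, k=50))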

Takeaways

Implication: Query-document similarity is based on decades of research. It’s immune to algorithm change.

Action item: With sophisticated query and document models, there is no need to optimize separately for similar words, e.g. “movie reviews” vs “movie review”.

Action item: Each page is relevant to many different keywords, so optimize each page for a broad set of related keywords instead of a single keyword.

Use case: Content creation. What keywords will this new blog post target? Is it relevant to a set of queries?

Thanks for watching!

Matthew Peters
matt@moz.com / @mattthemathman