Peters matthew periodictableseo

Modern On Page Factors

SMX Advanced

Matthew Peters, [email protected] @mattthemathman

“philadelphia phillies”

“Relevance” vs “Ranking”

Conceptually, “relevance” determination and “ranking” can be thought of as two different steps (even if they are implemented as one in a search engine).

1. Relevance

2. Ranking

Is this page relevant to “philadelphia phillies”?


query-body similarity: 0.74

query-title similarity: 0.8

query-H1 similarity: 1.0

etc …

Measuring query-document similarity

Goal: given query + document string, compute “similarity”

See “Introduction to Information Retrieval” by Manning et al.: http://nlp.stanford.edu/IR-book/

> 700 papers

In this context “document” can also refer to title tag, meta description, H1, etc.
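
As a rough illustration of scoring one query against several page fields, the sketch below uses cosine similarity over raw term counts. The measure, the `cosine_similarity` helper, and the page fields are all assumptions made up for this example; the slides do not say which measure produced the numbers shown.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(query, document):
    """Cosine similarity between raw term-count vectors (no IDF weighting)."""
    q, d = Counter(tokens(query)), Counter(tokens(document))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical page fields, invented for illustration only.
page = {
    "title": "Philadelphia Phillies News and Scores",
    "h1": "Philadelphia Phillies",
    "body": "The Phillies play baseball in Philadelphia. Tickets and schedule.",
}

# One similarity score per field, for the same query.
scores = {
    field: cosine_similarity("philadelphia phillies", text)
    for field, text in page.items()
}
```

Each field is treated as its own small “document”, so the same function yields a query-title, query-H1, and query-body score.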

Query Model: tokenization, normalization (stemming), query expansion, intent

Document Model: tokenization, normalization (stemming), vector space representation, language model

Scoring function -> 0.74

Query representation

Language identification

Word segmentation (Japanese, Chinese)

Tokenization + normalization: {reviews, reviewer, reviewing} -> review

Spelling correction

Query expansion

User intent (transactional, navigational, informational)

Local

Classification (images, video, news)

Topic Model (LDA)

Entity extraction
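
The tokenization + normalization step can be sketched with a toy suffix stripper. This is not the Porter stemmer and not what any search engine runs; the `SUFFIXES` list and the `normalize` and `tokenize` helpers are hypothetical, chosen only so that the {reviews, reviewer, reviewing} -> review example works.

```python
import re

# Toy suffix list (an assumption for this sketch). A real system would use
# a proper stemmer such as Porter's.
SUFFIXES = ("ing", "ers", "er", "s")


def normalize(token):
    """Lowercase a token and strip the first matching suffix."""
    token = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(query):
    """Split a query into normalized tokens."""
    return [normalize(t) for t in re.findall(r"[A-Za-z0-9]+", query)]


# Both spellings of the query reduce to the same token list.
same = tokenize("movie reviews") == tokenize("Movie Review")
```

After normalization, “movie reviews” and “movie review” are the same query as far as the later steps are concerned.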

Document representation

TF-IDF

Language Model: P(optimization | search, engine) >> P(walking | search, engine)

Probability Ranking Principle: P(R = 1 | d, q) or P(R = 0 | d, q)
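
A minimal TF-IDF sketch, assuming raw term frequency, one common smoothed IDF variant, and cosine similarity as the scoring function. The slides do not specify the exact weighting used in the study, and the three-document corpus is made up.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(tokens(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    """Term frequency times IDF; terms unseen in the corpus are skipped."""
    tf = Counter(tokens(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Made-up corpus for illustration.
corpus = [
    "search engine optimization tips for search marketers",
    "walking trails near the city",
    "engine repair and maintenance",
]
idf = idf_table(corpus)
query = tfidf_vector("search engine optimization", idf)
scores = [cosine(query, tfidf_vector(doc, idf)) for doc in corpus]
```

The first document shares all three query terms and scores highest; the second shares none and scores zero. “engine” appears in two documents, so it carries less weight than “search” or “optimization”.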

Which method performs best?

What are the characteristics of sites that rank highly?

14,000+ keywords
Top 50 results
600,000 URLs
Google-US, no personalization
March 2013

Mean Spearman Correlation

Remember: “correlation is not causation”
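
For readers who want to see what a mean Spearman correlation involves, here is a standard-library sketch: compute Spearman's rho between a feature and SERP position for each keyword, then average over keywords. Negating the position so that a higher value means a better ranking is an assumption of this sketch, as are the function names and the sample numbers.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(per_query):
    """per_query: list of (feature_values, serp_positions), one per keyword."""
    # Positions are negated so that position 1 counts as the highest value
    # (an assumption of this sketch).
    rhos = [spearman(f, [-p for p in pos]) for f, pos in per_query]
    return sum(rhos) / len(rhos)


# Two made-up keywords with three results each.
example = mean_spearman([
    ([0.9, 0.5, 0.1], [1, 2, 3]),
    ([0.2, 0.8, 0.5], [1, 2, 3]),
])
```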

We tried a few different types of smoothing for the language model; Dirichlet worked best (Zhai and Lafferty, SIGIR 2001).
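
A sketch of query likelihood with Dirichlet smoothing, in which each term probability is estimated as (count in document + mu * P(term | collection)) / (document length + mu). The tiny corpus, the value of `mu`, and the choice to skip query terms never seen in the collection are assumptions for illustration, not the settings used in the study.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def dirichlet_log_likelihood(query, document, collection_counts, mu):
    """Log P(query | document) under a Dirichlet-smoothed unigram model."""
    doc_counts = Counter(tokens(document))
    doc_len = sum(doc_counts.values())
    collection_len = sum(collection_counts.values())
    score = 0.0
    for term in tokens(query):
        p_collection = collection_counts.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # unseen in the whole collection: skipped (assumption)
        score += math.log(
            (doc_counts[term] + mu * p_collection) / (doc_len + mu)
        )
    return score


# Made-up corpus; mu is a free parameter and 10.0 is arbitrary here.
corpus = [
    "movie reviews and ratings for new movie releases",
    "walking trails near the city park",
    "reviews of hiking boots",
]
collection_counts = Counter(t for doc in corpus for t in tokens(doc))
scores = [
    dirichlet_log_likelihood("movie reviews", doc, collection_counts, mu=10.0)
    for doc in corpus
]
```

Smoothing keeps a document that lacks one query term from receiving zero probability: the second document contains neither term, yet still gets a finite (lowest) score.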

Impact of stemming

Porter stemmer provided a slight increase in correlations

These correlations are still relatively low compared to other factors

movie reviews

For each query: 500 pages (50 results + 450 random pages), 10% relevant, 90% irrelevant

URL ID    PA    In SERP?
86        92    1
355       90    0
…         …     …
27        18    0

URL ID    Language Model    In SERP?
213       0.97              1
156       0.95              1
…         …                 …
355       0.06              0

P@50 is the “precision of the top 50 results”: the percentage of the 50 pages ranked highest by PA or by the language model that are actually in the SERP.
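
Precision at k can be sketched in a few lines. The `precision_at_k` name and the (score, in_serp) row layout are assumptions; the sample rows are the ones visible in the slide's tables, with the elided rows left out, so k is set to 2 rather than 50. With 50 of 500 pages relevant, a random ordering would give an expected P@50 of 10%.

```python
def precision_at_k(pages, k=50):
    """pages: list of (score, in_serp) pairs, where in_serp is 1 or 0.

    Returns the fraction of the k highest-scoring pages that are in the SERP.
    """
    top = sorted(pages, key=lambda p: p[0], reverse=True)[:k]
    return sum(in_serp for _, in_serp in top) / len(top) if top else 0.0


# Visible rows from the slide's tables (elided rows omitted).
pa_rows = [(92, 1), (90, 0), (18, 0)]
lm_rows = [(0.97, 1), (0.95, 1), (0.06, 0)]

pa_precision = precision_at_k(pa_rows, k=2)
lm_precision = precision_at_k(lm_rows, k=2)
```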

Takeaways

Implication: Query-document similarity is based on decades of research. It’s immune to algorithm changes.

Action item: Because search engines use sophisticated query and document models, there is no need to optimize separately for similar words, e.g. “movie reviews” vs. “movie review”.

Action item: Each page is relevant to many different keywords, so optimize each page for a broad set of related keywords, instead of a single keyword.

Use case: Content creation. What keywords will this new blog post target? Is it relevant to a set of queries?

Thanks for watching!

Matthew Peters

[email protected] @mattthemathman
