Information Retrieval Models



  • Information Retrieval Models

    PengBo, Oct 30, 2010

  • Review of the last lecture: Basic Index Techniques (inverted index, dictionary & postings); Scoring and Ranking (term weighting, tf·idf, the Vector Space Model, cosine similarity); IR evaluation (precision, recall, F, interpolation, MAP, interpolated AP).

  • Outline of this lecture, Information Retrieval Models: the Vector Space Model (VSM), Latent Semantic Indexing (LSI), and the Language Model (LM) approach.

  • Relevance Feedback, Query Expansion

  • Vector Space Model

  • Documents as vectors: each document is a vector whose j-th component is the log-scaled tf·idf weight of term j. So we have a vector space: terms are the axes and documents live in this space; even with stemming, it may have 20,000+ dimensions.

    Example tf·idf weights for six documents (rows are terms, columns are documents):

             D1     D2     D3     D4     D5     D6
            4.1    0.0    3.7    5.9    3.1    0.0
            4.5    4.5    0      0     11.6    0
            0      3.5    2.9    0      2.1    3.9
            0      3.1    5.1   12.8    0      0
            2.9    0      0      2.2    0      0
            7.1    0      0      0      4.4    3.8
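    A minimal sketch of how log-scaled tf·idf weights like these can be computed; the tiny corpus below is hypothetical, not the data behind the table above.

    import math
    from collections import Counter

    # Hypothetical toy corpus, only for illustrating the weighting scheme.
    docs = {
        "D1": "antony and brutus kill caesar caesar",
        "D2": "brutus praises caesar",
        "D3": "antony mourns caesar",
    }
    N = len(docs)

    # Term frequencies per document and document frequencies per term.
    tf = {d: Counter(text.split()) for d, text in docs.items()}
    df = Counter(t for counts in tf.values() for t in counts)

    def weight(t, d):
        # Log-scaled tf.idf: (1 + log10 tf) * log10(N / df).
        if tf[d][t] == 0:
            return 0.0
        return (1 + math.log10(tf[d][t])) * math.log10(N / df[t])

    vocab = sorted(df)
    for d in docs:
        print(d, [round(weight(t, d), 2) for t in vocab])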

  • Intuition. Postulate: documents that are close together in the vector space talk about the same things. This supports query-by-example: a free-text query is itself treated as a (short) document vector.

  • Cosine similarity: measure the closeness of documents d1 and d2 by the cosine of the angle between their vectors; length normalization makes the score independent of document length (see the sketch below).
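    A minimal sketch of cosine similarity with L2 length normalization; the two vectors are the D1 and D4 columns of the example table above.

    import math

    def l2_normalize(v):
        # Scale a vector to unit length; this maps it onto the unit sphere.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else v

    def cosine(d1, d2):
        # Cosine of the angle between two vectors = dot product of their unit vectors.
        u1, u2 = l2_normalize(d1), l2_normalize(d2)
        return sum(x * y for x, y in zip(u1, u2))

    doc1 = [4.1, 4.5, 0.0, 0.0, 2.9, 7.1]    # column D1 of the table above
    doc4 = [5.9, 0.0, 0.0, 12.8, 2.2, 0.0]   # column D4 of the table above
    print(cosine(doc1, doc4))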

  • Exercise #1: cosine similarity. Compute the cosine similarity between the query "digital cameras" and the document "digital cameras and video cameras", with N = 10,000,000. Use logarithmic term weighting (the wf columns) for both query and document, idf weighting for the query only, and cosine normalization for the document only. Treat "and" as a stop word.

  • Exercise #2: evaluation. A precision-recall graph plots precision against recall; the breakeven point is the point at which precision equals recall. What does the breakeven point imply about the sizes of the answer set and the relevant set?

  • Latent Semantic Model

  • Vector Space Model: pros. Automatic selection of index terms; partial matching of queries and documents (dealing with the case where no document contains all search terms); ranking according to similarity score (dealing with large result sets); term weighting schemes (improving retrieval performance); various extensions, such as document clustering and relevance feedback (modifying the query vector); a clean geometric foundation.

  • Problems with Lexical Semantics. Polysemy: a word can have a multitude of meanings, and the Vector Space Model cannot resolve this ambiguity.

    Synonymy: distinct terms can have identical or very similar meanings, and the Vector Space Model cannot capture these associations.

  • Issues in the VSM: terms are treated as independent axes, so related terms (synonyms, variant forms, etc.) contribute nothing to each other's matches. Can the term-document matrix itself be used to discover such relationships between terms?

  • Singular Value Decomposition: decompose the term-document matrix W as W = T Σ D^T, where r is the rank of W, Σ is the r×r diagonal matrix of singular values, and T and D have orthonormal columns (T^T T = I, D^T D = I). The columns of T are eigenvectors of W W^T and the columns of D are eigenvectors of W^T W.

  • Singular values give an ordering to the dimensions: the smallest singular values correspond largely to "noise", so dropping the low-value dimensions removes noise.

  • Low-rank Approximation: the full SVD factors the t×d matrix as W(t×d) = T(t×r) Σ(r×r) D^T(r×d). Keeping only the k largest singular values gives the rank-k approximation W'(t×d) = T_k(t×k) Σ_k(k×k) D_k^T(k×d).
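    A minimal numpy sketch of this factorization and its rank-k truncation, using the same 5×6 example matrix C that appears in the MATLAB notes at the end of this transcript (numpy returns D^T directly).

    import numpy as np

    # Example term-document matrix (the matrix C from the notes at the end).
    W = np.array([[1., 0., 1., 0., 0., 0.],
                  [0., 1., 0., 0., 0., 0.],
                  [1., 1., 0., 0., 0., 0.],
                  [1., 0., 0., 1., 1., 0.],
                  [0., 0., 0., 1., 0., 1.]])

    T, sigma, Dt = np.linalg.svd(W, full_matrices=False)     # W = T @ diag(sigma) @ Dt
    print(np.allclose(W, T @ np.diag(sigma) @ Dt))            # True: exact reconstruction

    k = 2                                                     # keep the k largest singular values
    W_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]           # rank-k approximation W'
    print(np.round(W_k, 2))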

  • Latent Semantic Indexing (LSI): perform a low-rank approximation of the term-document matrix (typical rank 100-300). General idea: map documents (and terms) to a low-dimensional representation; design the mapping so that the low-dimensional space reflects semantic associations (the latent semantic space); compute document similarity based on the inner product in this latent semantic space.

  • What it is: from the term-document matrix A of rank r, compute a rank-k approximation A_k; A_k places each term and each document in a k-dimensional space.
  • LSI term matrix T: the T matrix maps each term into the LSI space. In the original matrix a term is a d-dimensional vector (one component per document); in the reduced T it has only k dimensions. Dimensions of the reduced space group together terms that behave similarly: synonyms, contextually related words, variant endings.

  • Document matrix D: the D matrix maps each document into the LSI space, with the same reduced dimensionality as the term vectors in T; document-document similarity is then computed from the rows of D (the columns of D^T).

  • Retrieval with LSI: a query or a new document is not a column of W, so it must be "folded in" to the LSI space. Since W = T Σ D^T, a query vector q (in term space) maps to q_k = Σ^{-1} T^T q, or equivalently q_k^T = q^T T Σ^{-1}; that is, the folded-in document/query vector is the original vector multiplied by T Σ^{-1}.

    Similarity to the documents is then computed by dot products between q_k and the document vectors, i.e. the columns of D^T.
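    Continuing the numpy sketch above, a small illustration of folding a query into the reduced space and ranking documents by cosine similarity there; the query (touching terms 0 and 3 of the example matrix) is made up for illustration.

    import numpy as np

    W = np.array([[1., 0., 1., 0., 0., 0.],
                  [0., 1., 0., 0., 0., 0.],
                  [1., 1., 0., 0., 0., 0.],
                  [1., 0., 0., 1., 1., 0.],
                  [0., 0., 0., 1., 0., 1.]])
    T, sigma, Dt = np.linalg.svd(W, full_matrices=False)

    k = 2
    T_k, S_k, D_k = T[:, :k], np.diag(sigma[:k]), Dt[:k, :]   # columns of D_k = docs in LSI space

    # Fold a term-space query vector into the k-dimensional LSI space: q_k = S_k^-1 T_k^T q
    q = np.zeros(W.shape[0])
    q[0] = 1.0
    q[3] = 1.0                                                 # toy query touching terms 0 and 3
    q_k = np.linalg.inv(S_k) @ T_k.T @ q

    # Rank documents by cosine similarity to the folded-in query.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = [cos(q_k, D_k[:, j]) for j in range(D_k.shape[1])]
    print(sorted(range(len(scores)), key=lambda j: -scores[j]))   # document indices, best first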

  • Improved Retrieval with LSI: the reduced dimensions suppress noise; there is less need to stem terms (variants will co-occur) and less need for a stop list, though stop words still tend to concentrate in the first dimension.

  • Worked example (the matrices are shown on the original slides): C = T_r Σ_r D_r^T; setting all but the two largest singular values to zero gives Σ_2 and the reduced representation Σ_2 D_2^T. The same computation is carried out in MATLAB in the notes at the end of this transcript.

  • Example: mapping into a 2-dimensional space.

  • Latent Semantic Analysis. Latent semantic space: illustrating example (courtesy of Susan Dumais).

  • Empirical evidence: experiments on TREC 1/2/3 (Dumais). Precision at or above the median TREC precision; top scorer on almost 20% of TREC topics; slightly better on average than the straight vector space. Effect of dimensionality:

        Dimensions   Precision
        250          0.367
        300          0.371
        346          0.374

  • LSI has many other applications. In many settings we have a feature-object matrix and a low-rank approximation is useful; in text retrieval the terms are the features and the docs are the objects, but Latent Semantic Indexing applies equally to, e.g., opinions and users (a matrix of users' opinions). It is a powerful, general analytical technique.

  • Language Models

  • IR based on a Language Model (LM): a user with an information need poses a query by thinking of words that are likely to appear in relevant documents of the collection; the LM approach directly exploits that idea!

  • Formal Language (Model): a generative model generates the strings of a language (finite state machines, regular grammars, etc.). Example: the language (I wish)* contains "I wish", "I wish I wish", "I wish I wish I wish", "I wish I wish I wish I wish", ...

  • Stochastic Language Models: model the probability of generating strings in the language (commonly all strings over an alphabet Σ). Example model M:

        the     0.2
        a       0.1
        man     0.01
        woman   0.01
        said    0.03
        likes   0.02

    For s = "the man likes the woman": P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008.
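    A tiny sketch reproducing the slide's computation under a unigram model:

    # Unigram model M from the slide (only part of the vocabulary is listed).
    M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

    def string_prob(s, model):
        # P(s | M): product of the unigram probabilities of the tokens.
        p = 1.0
        for token in s.split():
            p *= model.get(token, 0.0)   # tokens outside the model get probability 0
        return p

    print(string_prob("the man likes the woman", M))   # 0.2*0.01*0.02*0.2*0.01 = 8e-08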

  • Stochastic Language Models: a model assigns a probability to generating any string. Two example models:

        word       Model M1   Model M2
        the        0.2        0.2
        class      0.01       0.0001
        sayst      0.0001     0.03
        pleaseth   0.0001     0.02
        yon        0.0001     0.1
        maiden     0.0005     0.01
        woman      0.01       0.0001

    For the sample string s on the slide, P(s | M2) > P(s | M1).

  • Stochastic Language Models: a model M defines a probability distribution over strings in a given language.

  • Unigram and higher-order models

    Unigram Language Models: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

    Bigram (generally, n-gram) Language Models: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t2) P(t4 | t3)

    Other Language Models: grammar-based models (PCFGs), etc. are probably not the first thing to try in IR; unigram models are easy and effective!

  • The fundamental problem of LMs: the model M is not observed directly; we only see a sample of text generated from M, and must estimate the model from that sample before we can score new observations.

  • Using Language Models in IR: rank documents by P(d | q). By Bayes' rule, P(d | q) = P(q | d) × P(d) / P(q). P(q) is the same for all documents, so it can be ignored. P(d), the prior, is often treated as the same for all d, but we could use criteria like authority, length, or genre. P(q | d) is the probability of q given d's model. This is a very general formal approach.

  • Language Models for IR. Language modeling approaches model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model (the multinomial approach).

  • Retrieval based on a probabilistic LM: treat each document as the basis of a language model; infer a language model for each document, estimate the probability of generating the query under each model, and rank the documents by that probability. A unigram model is usually used.

  • Query generation probability (1): rank a document d by P(Q | M_d), the probability of generating the query from the document's language model. With the maximum likelihood estimate, P(t | M_d) = tf(t, d) / dl(d), where tf(t, d) is the raw count of t in d and dl(d) is the total number of tokens in d.

    Unigram assumption: given a particular language model, the query terms occur independently, so P(Q | M_d) = Π_{t ∈ Q} P(t | M_d).

  • Insufficient data: a query term missing from the document gets zero probability, which zeroes out the whole product. The general approach is to estimate such a term from the whole collection: if tf(t, d) = 0, use P(t | M_c) = cf_t / cs,

    where cf_t is the raw count of term t in the collection and cs is the raw collection size (total number of tokens in the collection).

  • Insufficient data: zero probabilities spell disaster, so we smooth the probabilities: discount the nonzero probabilities and give some probability mass to unseen events. Many approaches exist: adding 1 (or a small constant) to counts, Dirichlet priors, discounting, and interpolation [see FSNLP ch. 6 if you want more]. A simple idea that works well in practice: use a mixture between the document multinomial and the collection multinomial distribution.

  • Mixture model: P(w | d) = λ P_mle(w | M_d) + (1 - λ) P_mle(w | M_c). A high value of λ makes the search more conjunctive-like; λ can also be tuned or made document-dependent (cf. Dirichlet priors or Witten-Bell smoothing).

  • Basic mixture model summary: the general formulation of the LM for IR is

    P(Q, d) = P(d) Π_{t ∈ Q} [ (1 - λ) P(t | M_c) + λ P(t | M_d) ]

    which mixes the general (collection) language model with the individual-document model.

  • Example. Document collection (2 documents):

        d1: Xerox reports a profit but revenue is down
        d2: Lucent narrows quarter loss but revenue decreases further

    Model: MLE unigram from the documents; λ = 1/2. Query: revenue down.

        P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
        P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256

    Ranking: d1 > d2 (see the sketch below).
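    A short sketch reproducing this worked example with the λ = 1/2 mixture of the document and collection unigram models:

    from collections import Counter

    docs = {
        "d1": "xerox reports a profit but revenue is down",
        "d2": "lucent narrows quarter loss but revenue decreases further",
    }
    lam = 0.5
    query = "revenue down"

    # Collection model: pooled counts over both documents (16 tokens in total).
    collection = Counter()
    for text in docs.values():
        collection.update(text.split())
    cs = sum(collection.values())

    def query_likelihood(q, d):
        # P(Q|d) with P(w|d) = lam * Pmle(w|Md) + (1 - lam) * Pmle(w|Mc).
        tokens = docs[d].split()
        tf = Counter(tokens)
        p = 1.0
        for w in q.split():
            p *= lam * tf[w] / len(tokens) + (1 - lam) * collection[w] / cs
        return p

    for d in docs:
        print(d, query_likelihood(query, d))   # d1: 3/256 ≈ 0.0117, d2: 1/256 ≈ 0.0039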

  • Alternative Models of Text Generation (diagram): a Searcher generates a Query through a Query Model, and a Writer generates a Doc through a Doc Model. Is this the same model?

  • Retrieval Using Language Models (diagram): the query model and the doc model can be related in three ways: (1) query likelihood, (2) document likelihood, and (3) model comparison.

  • Query Likelihood: rank by P(Q | D_m), the probability of the query under the document model. The main issue is estimating the document model, i.e. smoothing techniques instead of tf.idf weights (e.g. UMass, BBN, Twente, CMU systems); it is hard to incorporate relevance feedback, query expansion, or structured queries.

  • Document Likelihood: rank by the ratio P(D | R) / P(D | NR). P(w | R) is estimated by P(w | Q_m), where Q_m is the query or relevance model, and P(w | NR) is estimated by collection probabilities P(w). The relevance model treats the query as generated by a mixture of topic and background, and estimates the relevance model from related documents (query expansion); relevance feedback is easily incorporated. Good retrieval results (e.g. UMass at SIGIR 01), but it can be inconsistent with heterogeneous document collections.

  • Model Comparison: estimate both a query model and a document model, and rank documents by how close the two models are, e.g. by the KL divergence D(Q_m || D_m).
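    A minimal sketch of ranking by KL divergence between a query model and smoothed document models; the toy distributions are made up for illustration.

    import math

    def kl_divergence(p, q):
        # D(P || Q) = sum over w of P(w) * log(P(w) / Q(w)), taken over words with P(w) > 0.
        return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

    # Toy query model and two smoothed document models over the same vocabulary.
    query_model = {"revenue": 0.5, "down": 0.5, "loss": 0.0}
    doc_models = {
        "d1": {"revenue": 0.40, "down": 0.40, "loss": 0.20},
        "d2": {"revenue": 0.40, "down": 0.10, "loss": 0.50},
    }

    # Smaller divergence means the document model is closer to the query model.
    for d in sorted(doc_models, key=lambda d: kl_divergence(query_model, doc_models[d])):
        print(d, round(kl_divergence(query_model, doc_models[d]), 3))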

  • Language models: pro & con. A novel way of looking at the problem of text retrieval based on probabilistic language modeling: conceptually simple and explanatory; a formal mathematical model; natural use of collection statistics rather than heuristics (almost). LMs provide effective retrieval, and can be improved to the extent that the following conditions are met: our language models are accurate representations of the data, and users have some sense of term distribution.

  • Comparison with Vector Space: the LM approach is related to traditional tf.idf models: (unscaled) term frequency is directly in the model; the probabilities do length normalization of term frequencies; and the effect of mixing with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking.

  • Comparison with Vector Space. Similarities: term weights based on frequency; terms often used as if they were independent; inverse document/collection frequency used; some form of length normalization used. Differences: based on probability rather than similarity; intuitions are probabilistic rather than geometric; details of the use of document length and of term, document, and collection frequency differ.

  • Summary. Latent Semantic Indexing: singular value decomposition and matrix low-rank approximation. Language Model: a generative model with smoothed probabilities, e.g. the mixture model.

  • Resources. The Template Numerical Toolkit (TNT): http://math.nist.gov/tnt/documentation.html. The Lemur Toolkit for Language Modeling and Information Retrieval: http://www-2.cs.cmu.edu/~lemur/ (a CMU/UMass LM and IR system in C/C++, currently actively developed).

  • Thank You! Q&A

  • [1] IIR, Ch. 12 and Ch. 18. [2] A. Moffat, J. Zobel, and D. Hawking, "Recommended reading for IR research students", SIGIR Forum, vol. 39, pp. 3-14, 2005.

  • Exercise #2 (Evaluation), answer: the breakeven point is the point on the precision-recall curve where precision equals recall. Let R be the set of relevant documents, A the answer set, and Ra the relevant documents retrieved. Precision = |Ra|/|A| and recall = |Ra|/|R|, so precision = recall exactly when |A| = |R| (assuming |Ra| > 0): if |A| = |R| + k for some k > 0, then precision = |Ra|/(|R| + k) < |Ra|/|R| = recall, and that point is not a breakeven point. A small numeric check follows.
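    A tiny numeric check of the argument above; the counts are made up for illustration.

    def precision(ra, a):   # |Ra| / |A|
        return ra / a

    def recall(ra, r):      # |Ra| / |R|
        return ra / r

    ra = 30                 # relevant documents actually retrieved, |Ra|
    r = 50                  # total relevant documents, |R|
    for a in (40, 50, 60):  # candidate sizes of the answer set |A|
        print(a, precision(ra, a), recall(ra, r), precision(ra, a) == recall(ra, r))
    # Only |A| = |R| = 50 gives precision == recall, i.e. the breakeven point.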

  • Matrix Low-rank Approximation for LSI

  • Eigenvalues & Eigenvectors. For a square m×m matrix S, a (right) eigenvector v and its eigenvalue λ satisfy S v = λ v, with v ≠ 0.

    How many eigenvalues are there at most?

  • Matrix-vector multiplication: the example matrix S (shown on the slide) has eigenvalues 3, 2, 0 with corresponding eigenvectors v1, v2, v3. Any vector, such as the example x, can be viewed as a combination of the eigenvectors: x = 2v1 + 4v2 + 6v3.

  • Matrix-vector multiplication: thus a matrix-vector multiplication such as S x (S, x as in the previous slide) can be rewritten in terms of the eigenvalues/vectors:

    S x = S (2v1 + 4v2 + 6v3) = 2 S v1 + 4 S v2 + 6 S v3 = 2 λ1 v1 + 4 λ2 v2 + 6 λ3 v3

    Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors. Suggestion: the effect of small eigenvalues is small.

  • Eigenvalues & Eigenvectors

  • Example. Let S be the real, symmetric 2×2 matrix shown on the slide.

    Then the characteristic equation |S - λI| = 0 gives the eigenvalues.

    The eigenvalues are 1 and 3 (nonnegative, real), and the eigenvectors are orthogonal (and real). Plug these values back in and solve for the eigenvectors. A small numeric check follows.
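    A numpy check of this example; the 2×2 matrix below is an assumption (a standard real, symmetric example with eigenvalues 1 and 3), since the slide's matrix is not reproduced in the transcript.

    import numpy as np

    S = np.array([[2., 1.],
                  [1., 2.]])                # assumed example matrix: real and symmetric

    eigenvalues, Q = np.linalg.eigh(S)       # eigh handles symmetric matrices; Q is orthogonal
    print(eigenvalues)                       # [1. 3.]
    print(np.allclose(S, Q @ np.diag(eigenvalues) @ Q.T))   # S = Q Lambda Q^T -> True
    print(np.allclose(Q.T @ Q, np.eye(2)))                   # columns of Q are orthonormal -> True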

  • Eigen/diagonal Decomposition. Let S be a square matrix with m linearly independent eigenvectors. Theorem: there exists an eigen decomposition S = U Λ U^{-1}, where Λ is diagonal.

    Columns of U are the eigenvectors of S; diagonal elements of Λ are the eigenvalues of S.

  • Diagonal decomposition: why/how. Stack the eigenvectors as the columns of U; then S U = U Λ, or equivalently U^{-1} S U = Λ, and so S = U Λ U^{-1}.

  • Diagonal decomposition, example: recall the matrix S from the earlier example. Its eigenvectors form the columns of U; inverting gives U^{-1}, and then S = U Λ U^{-1} (recall that U U^{-1} = I).

  • Example continued: let's divide the columns of U (and correspondingly scale U^{-1}) by their lengths, so the eigenvectors have unit length. Then S = Q Λ Q^T, with Q^{-1} = Q^T. Why? Stay tuned.

  • Symmetric Eigen Decomposition. If S is a symmetric matrix, Theorem: there exists a (unique) eigen decomposition S = Q Λ Q^T, where Q is orthogonal: Q^{-1} = Q^T, the columns of Q are normalized eigenvectors, the columns are orthogonal, and everything is real.

  • Time out! What do these matrices have to do with text? Recall our m × n term-document matrices. But everything so far needs square matrices, so ...

  • Singular Value Decomposition: for an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) A = U Σ V^T. The columns of U are orthogonal eigenvectors of A A^T, the columns of V are orthogonal eigenvectors of A^T A, and the diagonal entries of Σ are the singular values σ_i.

  • Singular Value Decomposition: illustration of SVD dimensions and sparseness (figure on slide).

  • SVD example: the slide shows a small matrix A and its factors U, Σ, V^T. Typically, the singular values are arranged in decreasing order.

  • SVD can be used to compute optimal low-rank approximations. Approximation problem: find the matrix A_k of rank k that minimizes ||A - X||_F over all X of rank k.

    A_k and X are both m × n matrices. Typically we want k much smaller than r.

  • Solution via SVD (low-rank approximation): set the smallest r - k singular values to zero, i.e. A_k = U diag(σ_1, ..., σ_k, 0, ..., 0) V^T.

  • Approximation error: how good (bad) is this approximation? It is the best possible, as measured by the Frobenius norm of the error:

    min over rank-k X of ||A - X||_F = ||A - A_k||_F = sqrt(σ_{k+1}^2 + ... + σ_r^2)

    where the σ_i are ordered such that σ_i ≥ σ_{i+1}. This suggests why the Frobenius error drops as k is increased. A numeric check follows.
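    A numpy sketch checking this error formula on the same example matrix used earlier (any matrix would do):

    import numpy as np

    A = np.array([[1., 0., 1., 0., 0., 0.],
                  [0., 1., 0., 0., 0., 0.],
                  [1., 1., 0., 0., 0., 0.],
                  [1., 0., 0., 1., 1., 0.],
                  [0., 0., 0., 1., 0., 1.]])

    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    for k in range(1, len(sigma)):
        A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]        # best rank-k approximation
        frobenius_error = np.linalg.norm(A - A_k, "fro")
        predicted = np.sqrt(np.sum(sigma[k:] ** 2))            # sqrt of the sum of discarded sigma_i^2
        print(k, round(frobenius_error, 4), round(predicted, 4))   # the two values agree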

  • SVD low-rank approximation: whereas the term-doc matrix A may have m = 50,000 and n = 10 million (and rank close to 50,000), we can construct an approximation A_100 with rank 100. Of all rank-100 matrices, it has the lowest Frobenius error. Great, but why would we do this? Answer: Latent Semantic Indexing. [C. Eckart, G. Young, "The approximation of a matrix by another of lower rank", Psychometrika, 1, 211-218, 1936.]

  • Performing the maps: each row and column of A gets mapped into the k-dimensional LSI space by the SVD. A query q is also mapped into this space, by q_k = Σ_k^{-1} U_k^T q.

    Slide notes. Still a bag-of-words model; here log-scaled tf.idf is used. The values in the table are not calculated on true data (Df 24,100 1860 19600 1270).

    The vector space model: a vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm.

    This maps vectors onto the unit sphere.

    Some contents are from www.cs.umbc.edu/~nicholas/676/12LSI.ppt and IIR Ch18.

    Latent: to lie hidden.

    Indexing by Latent Semantic Analysis (1990), by Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, Journal of the American Society for Information Science. Ambiguity and association in natural language.

    Goals of LSI: similar terms map to similar locations in the low-dimensional space; noise reduction by dimension reduction. T maps terms into the LSI space; DT maps documents into the LSI space.

    Minimize the Frobenius norm of the matrix difference X = W - W_k (IIR 18.15).

    Latent Semantic Analysis via SVD. Stop words are used uniformly throughout the collection, so they tend to appear in the first dimension.

    QUESTION: In folding in, why does T^{-1} appear, given that S = T*Sigma*D'? What about D?

    ANSWER: You are right. We want T*sigma*qn = q, so qn = inv(T*sigma)*q, and inv(T*sigma) = inv(sigma)*inv(T) = inv(sigma)*T'. Sigma is diagonal, so sigma' = sigma and (inv(sigma))' = inv(sigma). For the SVD factors, T' = inv(T) and D' = inv(D) (you can check it; I didn't search for a proof yet). So the folded-in query is q'*T*inv(sigma), the transpose of inv(sigma)*T'*q, where q is a 1 x M term vector. Try this in MATLAB:

    >> C = [1 0 1 0 0 0; 0 1 0 0 0 0; 1 1 0 0 0 0; 1 0 0 1 1 0; 0 0 0 1 0 1];
    >> [t, s, d] = svd(C);

    s is a 5x6 matrix and d is a 6x6 matrix, but the rank is 5: the last column of s is all zero, so just delete it (and delete the last row of d' accordingly).

    Now set the low-rank approximation, k = 2:
    >> s(3,3) = 0; s(4,4) = 0; s(5,5) = 0;

    Get the new reduced document representation (sigma times d transposed):
    >> s * d'

    Close, principled analog to clustering methods. (IIR Ch12)

    FSNLP: Foundations of Statistical Natural Language Processing.

    The user has a document in mind, and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.

    A relevance model rather than a query model: the query is only a small sample; relevant documents provide a larger sample.

    Equivalent to the query-likelihood approach if a simple empirical distribution is used for the query model.

    A more general risk minimization framework has been proposed (Zhai and Lafferty 2001).
