Information Retrieval Models



  • Information Retrieval Models

    PengBo, Oct 30, 2010

  • Review of the last lecture: Basic Index Techniques (inverted index, dictionary & postings); Scoring and Ranking (term weighting, tf·idf, the Vector Space Model, cosine similarity); IR evaluation (precision, recall, F, interpolation, MAP, interpolated AP).

  • Outline of this lecture, Information Retrieval Models: the Vector Space Model (VSM), Latent Semantic Indexing (LSI), and the Language Model (LM) approach.

  • Relevance Feedback, Query Expansion

  • Vector Space Model

  • Documents as vectors: each document is a vector whose j-th component is the log-scaled tf·idf weight of term j. So we have a vector space: terms are the axes and documents live in this space; even with stemming, it may have 20,000+ dimensions.

    Example tf·idf weights for six documents (rows are terms, columns are documents):

             D1     D2     D3     D4     D5     D6
            4.1    0.0    3.7    5.9    3.1    0.0
            4.5    4.5    0      0     11.6    0
            0      3.5    2.9    0      2.1    3.9
            0      3.1    5.1   12.8    0      0
            2.9    0      0      2.2    0      0
            7.1    0      0      0      4.4    3.8
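    A minimal sketch of how log-scaled tf·idf weights like these can be computed; the tiny corpus below is hypothetical, not the data behind the table above.

    import math
    from collections import Counter

    # Hypothetical toy corpus, only for illustrating the weighting scheme.
    docs = {
        "D1": "antony and brutus kill caesar caesar",
        "D2": "brutus praises caesar",
        "D3": "antony mourns caesar",
    }
    N = len(docs)

    # Term frequencies per document and document frequencies per term.
    tf = {d: Counter(text.split()) for d, text in docs.items()}
    df = Counter(t for counts in tf.values() for t in counts)

    def weight(t, d):
        # Log-scaled tf.idf: (1 + log10 tf) * log10(N / df).
        if tf[d][t] == 0:
            return 0.0
        return (1 + math.log10(tf[d][t])) * math.log10(N / df[t])

    vocab = sorted(df)
    for d in docs:
        print(d, [round(weight(t, d), 2) for t in vocab])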

  • Intuition. Postulate: documents that are close together in the vector space talk about the same things. This supports query-by-example: a free-text query is itself treated as a (short) document vector.

  • Cosine similarity: measure the closeness of documents d1 and d2 by the cosine of the angle between their vectors; length normalization makes the score independent of document length (see the sketch below).
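    A minimal sketch of cosine similarity with L2 length normalization; the two vectors are the D1 and D4 columns of the example table above.

    import math

    def l2_normalize(v):
        # Scale a vector to unit length; this maps it onto the unit sphere.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else v

    def cosine(d1, d2):
        # Cosine of the angle between two vectors = dot product of their unit vectors.
        u1, u2 = l2_normalize(d1), l2_normalize(d2)
        return sum(x * y for x, y in zip(u1, u2))

    doc1 = [4.1, 4.5, 0.0, 0.0, 2.9, 7.1]    # column D1 of the table above
    doc4 = [5.9, 0.0, 0.0, 12.8, 2.2, 0.0]   # column D4 of the table above
    print(cosine(doc1, doc4))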

  • Exercise #1: cosine similarity. Compute the cosine similarity between the query "digital cameras" and the document "digital cameras and video cameras", with N = 10,000,000. Use logarithmic term weighting (the wf columns) for both query and document, idf weighting for the query only, and cosine normalization for the document only. Treat "and" as a stop word.

  • Exercise #2: evaluation. A precision-recall graph plots precision against recall; the breakeven point is the point at which precision equals recall. What does the breakeven point imply about the sizes of the answer set and the relevant set?

  • Latent Semantic Model

  • Vector Space Model: pros. Automatic selection of index terms; partial matching of queries and documents (dealing with the case where no document contains all search terms); ranking according to similarity score (dealing with large result sets); term weighting schemes (improving retrieval performance); various extensions, such as document clustering and relevance feedback (modifying the query vector); a clean geometric foundation.

  • Problems with Lexical Semantics. Polysemy: a word can have a multitude of meanings, and the Vector Space Model cannot resolve this ambiguity.

    Synonymy: distinct terms can have identical or very similar meanings, and the Vector Space Model cannot capture these associations.

  • Issues in the VSM: terms are treated as independent axes, so related terms (synonyms, variant forms, etc.) contribute nothing to each other's matches. Can the term-document matrix itself be used to discover such relationships between terms?

  • Singular Value Decomposition: decompose the term-document matrix W as W = T Σ D^T, where r is the rank of W, Σ is the r×r diagonal matrix of singular values, and T and D have orthonormal columns (T^T T = I, D^T D = I). The columns of T are eigenvectors of W W^T and the columns of D are eigenvectors of W^T W.

  • Singular values give an ordering to the dimensions: the smallest singular values correspond largely to "noise", so dropping the low-value dimensions removes noise.

  • Low-rank Approximation: the full SVD factors the t×d matrix as W(t×d) = T(t×r) Σ(r×r) D^T(r×d). Keeping only the k largest singular values gives the rank-k approximation W'(t×d) = T_k(t×k) Σ_k(k×k) D_k^T(k×d).
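    A minimal numpy sketch of this factorization and its rank-k truncation, using the same 5×6 example matrix C that appears in the MATLAB notes at the end of this transcript (numpy returns D^T directly).

    import numpy as np

    # Example term-document matrix (the matrix C from the notes at the end).
    W = np.array([[1., 0., 1., 0., 0., 0.],
                  [0., 1., 0., 0., 0., 0.],
                  [1., 1., 0., 0., 0., 0.],
                  [1., 0., 0., 1., 1., 0.],
                  [0., 0., 0., 1., 0., 1.]])

    T, sigma, Dt = np.linalg.svd(W, full_matrices=False)     # W = T @ diag(sigma) @ Dt
    print(np.allclose(W, T @ np.diag(sigma) @ Dt))            # True: exact reconstruction

    k = 2                                                     # keep the k largest singular values
    W_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]           # rank-k approximation W'
    print(np.round(W_k, 2))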

  • Latent Semantic Indexing (LSI): perform a low-rank approximation of the term-document matrix (typical rank 100-300). General idea: map documents (and terms) to a low-dimensional representation; design the mapping so that the low-dimensional space reflects semantic associations (the latent semantic space); compute document similarity based on the inner product in this latent semantic space.

  • What it is: from the term-document matrix A of rank r, compute a rank-k approximation A_k; A_k places each term and each document in a k-dimensional space.
  • LSI term matrix T: the T matrix maps each term into the LSI space. In the original matrix a term is a d-dimensional vector (one component per document); in the reduced T it has only k dimensions. Dimensions of the reduced space group together terms that behave similarly: synonyms, contextually related words, variant endings.

  • Document matrix D: the D matrix maps each document into the LSI space, with the same reduced dimensionality as the term vectors in T; document-document similarity is then computed from the rows of D (the columns of D^T).

  • Retrieval with LSI: a query or a new document is not a column of W, so it must be "folded in" to the LSI space. Since W = T Σ D^T, a query vector q (in term space) maps to q_k = Σ^{-1} T^T q, or equivalently q_k^T = q^T T Σ^{-1}; that is, the folded-in document/query vector is the original vector multiplied by T Σ^{-1}.

    Similarity to the documents is then computed by dot products between q_k and the document vectors, i.e. the columns of D^T.
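    Continuing the numpy sketch above, a small illustration of folding a query into the reduced space and ranking documents by cosine similarity there; the query (touching terms 0 and 3 of the example matrix) is made up for illustration.

    import numpy as np

    W = np.array([[1., 0., 1., 0., 0., 0.],
                  [0., 1., 0., 0., 0., 0.],
                  [1., 1., 0., 0., 0., 0.],
                  [1., 0., 0., 1., 1., 0.],
                  [0., 0., 0., 1., 0., 1.]])
    T, sigma, Dt = np.linalg.svd(W, full_matrices=False)

    k = 2
    T_k, S_k, D_k = T[:, :k], np.diag(sigma[:k]), Dt[:k, :]   # columns of D_k = docs in LSI space

    # Fold a term-space query vector into the k-dimensional LSI space: q_k = S_k^-1 T_k^T q
    q = np.zeros(W.shape[0])
    q[0] = 1.0
    q[3] = 1.0                                                 # toy query touching terms 0 and 3
    q_k = np.linalg.inv(S_k) @ T_k.T @ q

    # Rank documents by cosine similarity to the folded-in query.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = [cos(q_k, D_k[:, j]) for j in range(D_k.shape[1])]
    print(sorted(range(len(scores)), key=lambda j: -scores[j]))   # document indices, best first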

  • Improved Retrieval with LSI: the reduced dimensions suppress noise; there is less need to stem terms (variants will co-occur) and less need for a stop list, though stop words still tend to concentrate in the first dimension.

  • Worked example (the matrices are shown on the original slides): C = T_r Σ_r D_r^T; setting all but the two largest singular values to zero gives Σ_2 and the reduced representation Σ_2 D_2^T. The same computation is carried out in MATLAB in the notes at the end of this transcript.

  • Example: mapping into a 2-dimensional space.

  • Latent Semantic Analysis. Latent semantic space: illustrating example (courtesy of Susan Dumais).

  • Empirical evidence: experiments on TREC 1/2/3 (Dumais). Precision at or above the median TREC precision; top scorer on almost 20% of TREC topics; slightly better on average than the straight vector space. Effect of dimensionality:

        Dimensions   Precision
        250          0.367
        300          0.371
        346          0.374

  • LSI has many other applications. In many settings we have a feature-object matrix and a low-rank approximation is useful; in text retrieval the terms are the features and the docs are the objects, but Latent Semantic Indexing applies equally to, e.g., opinions and users (a matrix of users' opinions). It is a powerful, general analytical technique.

  • Language Models

  • IR based on a Language Model (LM): a user with an information need poses a query by thinking of words that are likely to appear in relevant documents of the collection; the LM approach directly exploits that idea!

  • Formal Language (Model): a generative model generates the strings of a language (finite state machines, regular grammars, etc.). Example: the language (I wish)* contains "I wish", "I wish I wish", "I wish I wish I wish", "I wish I wish I wish I wish", ...

  • Stochastic Language Models: model the probability of generating strings in the language (commonly all strings over an alphabet Σ). Example model M:

        the     0.2
        a       0.1
        man     0.01
        woman   0.01
        said    0.03
        likes   0.02

    For s = "the man likes the woman": P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008.
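    A tiny sketch reproducing the slide's computation under a unigram model:

    # Unigram model M from the slide (only part of the vocabulary is listed).
    M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

    def string_prob(s, model):
        # P(s | M): product of the unigram probabilities of the tokens.
        p = 1.0
        for token in s.split():
            p *= model.get(token, 0.0)   # tokens outside the model get probability 0
        return p

    print(string_prob("the man likes the woman", M))   # 0.2*0.01*0.02*0.2*0.01 = 8e-08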

  • Stochastic Language Models: a model assigns a probability to generating any string. Two example models:

        word       Model M1   Model M2
        the        0.2        0.2
        class      0.01       0.0001
        sayst      0.0001     0.03
        pleaseth   0.0001     0.02
        yon        0.0001     0.1
        maiden     0.0005     0.01
        woman      0.01       0.0001

    For the sample string s on the slide, P(s | M2) > P(s | M1).

  • Stochastic Language Models: a model M defines a probability distribution over strings in a given language.

  • Unigram and higher-order models

    Unigram Language Models: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

    Bigram (generally, n-gram) Language Models: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t2) P(t4 | t3)

    Other Language Models: grammar-based models (PCFGs), etc. are probably not the first thing to try in IR; unigram models are easy and effective!

  • The fundamental problem of LMs: the model M is not observed directly; we only see a sample of text generated from M, and must estimate the model from that sample before we can score new observations.

  • Using Language Models in IR: rank documents by P(d | q). By Bayes' rule, P(d | q) = P(q | d) × P(d) / P(q). P(q) is the same for all documents, so it can be ignored. P(d), the prior, is often treated as the same for all d, but we could use criteria like authority, length, or genre. P(q | d) is the probability of q given d's model. This is a very general formal approach.

  • Language Models for IR. Language modeling approaches model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model (the multinomial approach).

  • Retrieval based on a probabilistic LM: treat each document as the basis of a language model; infer a language model for each document, estimate the probability of generating the query under each model, and rank the documents by that probability. A unigram model is usually used.

  • Query generation probability (1): rank a document d by P(Q | M_d), the probability of generating the query from the document's language model. With the maximum likelihood estimate, P(t | M_d) = tf(t, d) / dl(d), where tf(t, d) is the raw count of t in d and dl(d) is the total number of tokens in d.

    Unigram assumption: given a particular language model, the query terms occur independently, so P(Q | M_d) = Π_{t ∈ Q} P(t | M_d).

  • Insufficient data: a query term missing from the document gets zero probability, which zeroes out the whole product. The general approach is to estimate such a term from the whole collection: if tf(t, d) = 0, use P(t | M_c) = cf_t / cs,

    where cf_t is the raw count of term t in the collection and cs is the raw collection size (total number of tokens in the collection).

  • Insufficient data: zero probabilities spell disaster, so we smooth the probabilities: discount the nonzero probabilities and give some probability mass to unseen events. Many approaches exist: adding 1 (or a small constant) to counts, Dirichlet priors, discounting, and interpolation [see FSNLP ch. 6 if you want more]. A simple idea that works well in practice: use a mixture between the document multinomial and the collection multinomial distribution.

  • Mixture model: P(w | d) = λ P_mle(w | M_d) + (1 - λ) P_mle(w | M_c). A high value of λ makes the search more conjunctive-like; λ can also be tuned or made document-dependent (cf. Dirichlet priors or Witten-Bell smoothing).

  • Basic mixture model summary: the general formulation of the LM for IR is

    P(Q, d) = P(d) Π_{t ∈ Q} [ (1 - λ) P(t | M_c) + λ P(t | M_d) ]

    which mixes the general (collection) language model with the individual-document model.

  • Example. Document collection (2 documents):

        d1: Xerox reports a profit but revenue is down
        d2: Lucent narrows quarter loss but revenue decreases further

    Model: MLE unigram from the documents; λ = 1/2. Query: revenue down.

        P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
        P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256

    Ranking: d1 > d2 (see the sketch below).
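    A short sketch reproducing this worked example with the λ = 1/2 mixture of the document and collection unigram models:

    from collections import Counter

    docs = {
        "d1": "xerox reports a profit but revenue is down",
        "d2": "lucent narrows quarter loss but revenue decreases further",
    }
    lam = 0.5
    query = "revenue down"

    # Collection model: pooled counts over both documents (16 tokens in total).
    collection = Counter()
    for text in docs.values():
        collection.update(text.split())
    cs = sum(collection.values())

    def query_likelihood(q, d):
        # P(Q|d) with P(w|d) = lam * Pmle(w|Md) + (1 - lam) * Pmle(w|Mc).
        tokens = docs[d].split()
        tf = Counter(tokens)
        p = 1.0
        for w in q.split():
            p *= lam * tf[w] / len(tokens) + (1 - lam) * collection[w] / cs
        return p

    for d in docs:
        print(d, query_likelihood(query, d))   # d1: 3/256 ≈ 0.0117, d2: 1/256 ≈ 0.0039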

  • Alternative Models of Text Generation (diagram): a Searcher generates a Query through a Query Model, and a Writer generates a Doc through a Doc Model. Is this the same model?

  • Retrieval Using Language Models (diagram): the query model and the doc model can be related in three ways: (1) query likelihood, (2) document likelihood, and (3) model comparison.

  • Query Likelihood: rank by P(Q | D_m), the probability of the query under the document model. The main issue is estimating the document model, i.e. smoothing techniques instead of tf.idf weights (e.g. UMass, BBN, Twente, CMU systems); it is hard to incorporate relevance feedback, query expansion, or structured queries.

  • Document Likelihood: rank by the ratio P(D | R) / P(D | NR). P(w | R) is estimated by P(w | Q_m), where Q_m is the query or relevance model, and P(w | NR) is estimated by collection probabilities P(w). The relevance model treats the query as generated by a mixture of topic and background, and estimates the relevance model from related documents (query expansion); relevance feedback is easily incorporated. Good retrieval results (e.g. UMass at SIGIR 01), but it can be inconsistent with heterogeneous document collections.

  • Model Comparison: estimate both a query model and a document model, and rank documents by how close the two models are, e.g. by the KL divergence D(Q_m || D_m).
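    A minimal sketch of ranking by KL divergence between a query model and smoothed document models; the toy distributions are made up for illustration.

    import math

    def kl_divergence(p, q):
        # D(P || Q) = sum over w of P(w) * log(P(w) / Q(w)), taken over words with P(w) > 0.
        return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

    # Toy query model and two smoothed document models over the same vocabulary.
    query_model = {"revenue": 0.5, "down": 0.5, "loss": 0.0}
    doc_models = {
        "d1": {"revenue": 0.40, "down": 0.40, "loss": 0.20},
        "d2": {"revenue": 0.40, "down": 0.10, "loss": 0.50},
    }

    # Smaller divergence means the document model is closer to the query model.
    for d in sorted(doc_models, key=lambda d: kl_divergence(query_model, doc_models[d])):
        print(d, round(kl_divergence(query_model, doc_models[d]), 3))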

  • Language models: pro & con. A novel way of looking at the problem of text retrieval based on probabilistic language modeling: conceptually simple and explanatory; a formal mathematical model; natural use of collection statistics rather than heuristics (almost). LMs provide effective retrieval, and can be improved to the extent that the following conditions are met: our language models are accurate representations of the data, and users have some sense of term distribution.

  • Comparison with Vector Space: the LM approach is related to traditional tf.idf models: (unscaled) term frequency is directly in the model; the probabilities do length normalization of term frequencies; and the effect of mixing with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking.

  • Comparison with Vector Space. Similarities: term weights based on frequency; terms often used as if they were independent; inverse document/collection frequency used; some form of length normalization used. Differences: based on probability rather than similarity; intuitions are probabilistic rather than geometric; details of the use of document length and of term, document, and collection frequency differ.

  • Summary. Latent Semantic Indexing: singular value decomposition and matrix low-rank approximation. Language Model: a generative model with smoothed probabilities, e.g. the mixture model.

  • Resources. The Template Numerical Toolkit (TNT): http://math.nist.gov/tnt/documentation.html. The Lemur Toolkit for Language Modeling and Information Retrieval: http://www-2.cs.cmu.edu/~lemur/ (a CMU/UMass LM and IR system in C/C++, currently actively developed).

  • Thank You! Q&A

  • [1] IIR, Ch. 12 and Ch. 18. [2] A. Moffat, J. Zobel, and D. Hawking, "Recommended reading for IR research students", SIGIR Forum, vol. 39, pp. 3-14, 2005.

  • Exercise #2 (Evaluation), answer: the breakeven point is the point on the precision-recall curve where precision equals recall. Let R be the set of relevant documents, A the answer set, and Ra the relevant documents retrieved. Precision = |Ra|/|A| and recall = |Ra|/|R|, so precision = recall exactly when |A| = |R| (assuming |Ra| > 0): if |A| = |R| + k for some k > 0, then precision = |Ra|/(|R| + k) < |Ra|/|R| = recall, and that point is not a breakeven point. A small numeric check follows.
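    A tiny numeric check of the argument above; the counts are made up for illustration.

    def precision(ra, a):   # |Ra| / |A|
        return ra / a

    def recall(ra, r):      # |Ra| / |R|
        return ra / r

    ra = 30                 # relevant documents actually retrieved, |Ra|
    r = 50                  # total relevant documents, |R|
    for a in (40, 50, 60):  # candidate sizes of the answer set |A|
        print(a, precision(ra, a), recall(ra, r), precision(ra, a) == recall(ra, r))
    # Only |A| = |R| = 50 gives precision == recall, i.e. the breakeven point.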

  • Matrix Low-rank Approximation for LSI

  • Eigenvalues & Eigenvectors. For a square m×m matrix S, a (right) eigenvector v and its eigenvalue λ satisfy S v = λ v, with v ≠ 0.

    How many eigenvalues are there at most?

  • Matrix-vector multiplication: the example matrix S (shown on the slide) has eigenvalues 3, 2, 0 with corresponding eigenvectors v1, v2, v3. Any vector, such as the example x, can be viewed as a combination of the eigenvectors: x = 2v1 + 4v2 + 6v3.

  • Matrix-vector multiplication: thus a matrix-vector multiplication such as S x (S, x as in the previous slide) can be rewritten in terms of the eigenvalues/vectors:

    S x = S (2v1 + 4v2 + 6v3) = 2 S v1 + 4 S v2 + 6 S v3 = 2 λ1 v1 + 4 λ2 v2 + 6 λ3 v3

    Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors. Suggestion: the effect of small eigenvalues is small.

  • Eigenvalues & Eigenvectors

  • Example. Let S be the real, symmetric 2×2 matrix shown on the slide.

    Then the characteristic equation |S - λI| = 0 gives the eigenvalues.

    The eigenvalues are 1 and 3 (nonnegative, real), and the eigenvectors are orthogonal (and real). Plug these values back in and solve for the eigenvectors. A small numeric check follows.
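    A numpy check of this example; the 2×2 matrix below is an assumption (a standard real, symmetric example with eigenvalues 1 and 3), since the slide's matrix is not reproduced in the transcript.

    import numpy as np

    S = np.array([[2., 1.],
                  [1., 2.]])                # assumed example matrix: real and symmetric

    eigenvalues, Q = np.linalg.eigh(S)       # eigh handles symmetric matrices; Q is orthogonal
    print(eigenvalues)                       # [1. 3.]
    print(np.allclose(S, Q @ np.diag(eigenvalues) @ Q.T))   # S = Q Lambda Q^T -> True
    print(np.allclose(Q.T @ Q, np.eye(2)))                   # columns of Q are orthonormal -> True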

  • Eigen/diagonal Decomposition. Let S be a square matrix with m linearly independent eigenvectors. Theorem: there exists an eigen decomposition S = U Λ U^{-1}, where Λ is diagonal.

    Columns of U are the eigenvectors of S; diagonal elements of Λ are the eigenvalues of S.

  • Diagonal decomposition: why/how. Stack the eigenvectors as the columns of U; then S U = U Λ, or equivalently U^{-1} S U = Λ, and so S = U Λ U^{-1}.

  • Diagonal decomposition, example: recall the matrix S from the earlier example. Its eigenvectors form the columns of U; inverting gives U^{-1}, and then S = U Λ U^{-1} (recall that U U^{-1} = I).

  • Example continued: let's divide the columns of U (and correspondingly scale U^{-1}) by their lengths, so the eigenvectors have unit length. Then S = Q Λ Q^T, with Q^{-1} = Q^T. Why? Stay tuned.

  • Symmetric Eigen Decomposition. If S is a symmetric matrix, Theorem: there exists a (unique) eigen decomposition S = Q Λ Q^T, where Q is orthogonal: Q^{-1} = Q^T, the columns of Q are normalized eigenvectors, the columns are orthogonal, and everything is real.

  • Time out! What do these matrices have to do with text? Recall our m × n term-document matrices. But everything so far needs square matrices, so ...

  • Singular Value Decomposition: for an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) A = U Σ V^T. The columns of U are orthogonal eigenvectors of A A^T, the columns of V are orthogonal eigenvectors of A^T A, and the diagonal entries of Σ are the singular values σ_i.

  • Singular Value Decomposition: illustration of SVD dimensions and sparseness (figure on slide).

  • SVD example: the slide shows a small matrix A and its factors U, Σ, V^T. Typically, the singular values are arranged in decreasing order.

  • SVD can be used to compute optimal low-rank approximations. Approximation problem: find the matrix A_k of rank k that minimizes ||A - X||_F over all X of rank k.

    A_k and X are both m × n matrices. Typically we want k much smaller than r.

  • Solution via SVD (low-rank approximation): set the smallest r - k singular values to zero, i.e. A_k = U diag(σ_1, ..., σ_k, 0, ..., 0) V^T.

  • Approximation error: how good (bad) is this approximation? It is the best possible, as measured by the Frobenius norm of the error:

    min over rank-k X of ||A - X||_F = ||A - A_k||_F = sqrt(σ_{k+1}^2 + ... + σ_r^2)

    where the σ_i are ordered such that σ_i ≥ σ_{i+1}. This suggests why the Frobenius error drops as k is increased. A numeric check follows.
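    A numpy sketch checking this error formula on the same example matrix used earlier (any matrix would do):

    import numpy as np

    A = np.array([[1., 0., 1., 0., 0., 0.],
                  [0., 1., 0., 0., 0., 0.],
                  [1., 1., 0., 0., 0., 0.],
                  [1., 0., 0., 1., 1., 0.],
                  [0., 0., 0., 1., 0., 1.]])

    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    for k in range(1, len(sigma)):
        A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]        # best rank-k approximation
        frobenius_error = np.linalg.norm(A - A_k, "fro")
        predicted = np.sqrt(np.sum(sigma[k:] ** 2))            # sqrt of the sum of discarded sigma_i^2
        print(k, round(frobenius_error, 4), round(predicted, 4))   # the two values agree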

  • SVD low-rank approximation: whereas the term-doc matrix A may have m = 50,000 and n = 10 million (and rank close to 50,000), we can construct an approximation A_100 with rank 100. Of all rank-100 matrices, it has the lowest Frobenius error. Great, but why would we do this? Answer: Latent Semantic Indexing. [C. Eckart, G. Young, "The approximation of a matrix by another of lower rank", Psychometrika, 1, 211-218, 1936.]

  • Performing the maps: each row and column of A gets mapped into the k-dimensional LSI space by the SVD. A query q is also mapped into this space, by q_k = Σ_k^{-1} U_k^T q.

    Slide notes. Still a bag-of-words model; here log-scaled tf.idf is used. The values in the table are not calculated on true data (Df 24,100 1860 19600 1270).

    The vector space model: a vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm.

    This maps vectors onto the unit sphere.

    Some contents are from www.cs.umbc.edu/~nicholas/676/12LSI.ppt and IIR Ch18.

    Latent: to lie hidden.

    Indexing by Latent Semantic Analysis (1990), by Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, Journal of the American Society for Information Science. Ambiguity and association in natural language.

    Goals of LSI: similar terms map to similar locations in the low-dimensional space; noise reduction by dimension reduction. T maps terms into the LSI space; DT maps documents into the LSI space.

    Minimize the Frobenius norm of the matrix difference X = W - W_k (IIR 18.15).

    Latent Semantic Analysis via SVD. Stop words are used uniformly throughout the collection, so they tend to appear in the first dimension.

    QUESTION: In folding in, why does T^{-1} appear, given that S = T*Sigma*D'? What about D?

    ANSWER: You are right. We want T*sigma*qn = q, so qn = inv(T*sigma)*q, and inv(T*sigma) = inv(sigma)*inv(T) = inv(sigma)*T'. Sigma is diagonal, so sigma' = sigma and (inv(sigma))' = inv(sigma). For the SVD factors, T' = inv(T) and D' = inv(D) (you can check it; I didn't search for a proof yet). So the folded-in query is q'*T*inv(sigma), the transpose of inv(sigma)*T'*q, where q is a 1 x M term vector. Try this in MATLAB:

    >> C = [1 0 1 0 0 0; 0 1 0 0 0 0; 1 1 0 0 0 0; 1 0 0 1 1 0; 0 0 0 1 0 1];
    >> [t, s, d] = svd(C);

    s is a 5x6 matrix and d is a 6x6 matrix, but the rank is 5: the last column of s is all zero, so just delete it (and delete the last row of d' accordingly).

    Now set the low-rank approximation, k = 2:
    >> s(3,3) = 0; s(4,4) = 0; s(5,5) = 0;

    Get the new reduced document representation (sigma times d transposed):
    >> s * d'

    Close, principled analog to clustering methods. (IIR Ch12)

    FSNLP: Foundations of Statistical Natural Language Processing.

    The user has a document in mind, and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.

    A relevance model rather than a query model: the query is only a small sample; relevant documents provide a larger sample.

    Equivalent to the query-likelihood approach if a simple empirical distribution is used for the query model.

    A more general risk minimization framework has been proposed (Zhai and Lafferty 2001).
