10-1 Vocab of Terms
Alan Nochenson, IST 511
10/1/2012
Outline: Motivation, Real-world example, Techniques
Techniques: Tokenization, Stop words, Normalization, Stemming/lemmatization
Using a variety of techniques, we want to improve IR systems so that they “understand” more of what we want from a query
E.g. when searching for a paper about Facebook, the following queries should all return the paper: The facebook, facebook, face-book
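Damerau–Levenshtein distance is the minimum number of operations needed to turn one word into another: insert, delete, change, or swap (two adjacent characters)
adidas = adiidas = adifas (each is distance 1 from adidas) But: cat != rat != hat (also distance 1 from each other)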
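The distance on this slide can be sketched in a few lines of Python. This is a minimal sketch of the restricted variant (often called optimal string alignment), which counts the same four operations; the function name `osa_distance` is an assumption made for illustration, not something from the lecture.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = number of operations needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # change (or keep a match)
            )
            # swap two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


for pair in [("adidas", "adiidas"), ("adidas", "adifas"), ("cat", "rat")]:
    print(pair, osa_distance(*pair))
```

All three pairs come out at distance 1, which is the slide's point: the distance alone cannot tell a misspelling of one word from a different word.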
Breaking up sentences into tokens according to a variety of rules
Split on non-alphanumeric characters?
Good: The dog ran to the park
Bad: Ms. O’Hannety went to O’Flaggerty’s pub (Ms, O, Hannety, went, to, O, Flaggerty, s, pub)
Split on spaces?
Bad: San Francisco is a great city. (the place name is split into San and Francisco)
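The two splitting rules above can be tried directly. The following is a minimal sketch using only the standard library; the function names are illustrative assumptions, and "alphanumeric" is taken to mean ASCII letters and digits.

```python
import re


def split_on_non_alphanumeric(text: str) -> list[str]:
    # Keep maximal runs of ASCII letters and digits; everything else separates tokens.
    return re.findall(r"[A-Za-z0-9]+", text)


def split_on_space(text: str) -> list[str]:
    # Split on runs of whitespace; punctuation stays attached to the tokens.
    return text.split()


print(split_on_non_alphanumeric("The dog ran to the park"))
print(split_on_non_alphanumeric("Ms. O’Hannety went to O’Flaggerty’s pub"))
print(split_on_space("San Francisco is a great city."))
```

The second call reproduces the token list from the slide (Ms, O, Hannety, ...), and the third shows the place name ending up as two separate tokens.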
E.g. the German compound Lebensversicherungsgesellschaftsangestellter = life insurance company employee
Would not get split by any of the previously mentioned methods
Drop common ‘useless’ words. But how useless are they? (“President of the USA”)
Including them is not a big problem, space- or time-wise
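Stop word removal is easy to sketch. The stop list below is a tiny hypothetical one chosen for illustration; real systems use longer lists, or none at all.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "and", "in"}


def remove_stop_words(tokens):
    # Compare case-insensitively, but keep the surviving tokens unchanged.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("President of the USA".split()))
```

With this list the phrase is reduced to President and USA, so the exact wording of the query can no longer be matched as a phrase.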
What I did at Amazon (codenamed BrandSims normalization)
Maps words/phrases that are semantically related to each other, so they can refer to the same content
E.g. Alan went to the store = Alan go store
Accents: mainly dropped, since they were not always supported
This is problematic, since in certain languages accents are critical to understanding
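One common way to drop accents is sketched below with the standard `unicodedata` module: decompose each character, then discard the combining marks. The function name `strip_accents` is an assumption made for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("résumé"))
# peña and pena are different Spanish words, but they collapse to the same string.
print(strip_accents("peña") == strip_accents("pena"))
```

The second line illustrates the problem named above: once accents are gone, two distinct words become indistinguishable.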
Capitalization: standardize to all caps or all lowercase (more common)
Everywhere in the sentence? Bad: We went to the White House
A better solution is to lowercase only at the beginning of a sentence and in titles
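The difference between the two case rules can be sketched as follows. This is a rough heuristic under stated assumptions: sentences are assumed to end with `.`, `!` or `?` followed by whitespace, titles are not handled, and the function names are illustrative.

```python
import re


def lowercase_all(text: str) -> str:
    # Lowercase everywhere: "White House" becomes "white house".
    return text.lower()


def lowercase_sentence_starts(text: str) -> str:
    # Lowercase only the capital letter that opens each sentence,
    # so capitals in the middle of a sentence (e.g. names) survive.
    def lower_first(match):
        return match.group(1) + match.group(2).lower()

    return re.sub(r"(^|[.!?]\s+)([A-Z])", lower_first, text)


print(lowercase_all("We went to the White House"))
print(lowercase_sentence_starts("We went to the White House. It was big."))
```

The heuristic still misfires when a sentence starts with a proper noun or when an abbreviation such as "Ms." is followed by a name, which is part of the tradeoff.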
More complicated than previous normalization techniques
Goal is to remove things like tense, number, possession from strings
Stemming: chop off the end of the word
Con: crude and sometimes ineffective
Pro: fast and no overhead
E.g. cookies -> cooki, cup -> c
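A deliberately crude suffix stripper shows both the speed and the crudeness. This is not Porter's algorithm; the suffix list and the minimum stem length are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order (so "es" is tried before "s").
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Strip the first matching suffix, leaving at least two characters.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


for word in ["cookies", "walked", "bus"]:
    print(word, "->", crude_stem(word))
```

There is no dictionary and no context, only string operations, which is why it is fast and why it produces non-words such as cooki and bu.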
Lemmatization: use a vocabulary and morphological (structural) analysis [which may or may not help much]
Recognize context in a sentence (saw would become see if used as a verb, not a noun)
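Lemmatization can be sketched as a lookup keyed on both the word and its part of speech. The tiny lexicon and the tag names below are hypothetical; a real lemmatizer needs a full vocabulary and a tagger to supply the part of speech.

```python
# Hypothetical lexicon: (word, part of speech) -> dictionary form.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("went", "VERB"): "go",
    ("cookies", "NOUN"): "cookie",
}


def lemmatize(word: str, pos: str) -> str:
    # Fall back to the lowercased word when the lexicon has no entry.
    return LEMMAS.get((word.lower(), pos), word.lower())


print(lemmatize("saw", "VERB"))
print(lemmatize("saw", "NOUN"))
```

The same surface form maps to different lemmas depending on its role in the sentence, which a suffix-chopping stemmer cannot do.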
Porter’s algorithm:
Understand the type of queries that will be submitted
It is all about tradeoffs between precision and recall
These techniques can be used differently depending on the context.
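The precision/recall tradeoff can be made concrete with a small calculation. The document IDs and result sets below are made up for illustration; only the two formulas (hits divided by retrieved, hits divided by relevant) are standard.

```python
def precision_recall(retrieved: set, relevant: set):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four documents are actually relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}
strict_match = {"d1", "d2"}                           # little normalization
loose_match = {"d1", "d2", "d3", "d5", "d6", "d7"}    # aggressive normalization

print(precision_recall(strict_match, relevant))
print(precision_recall(loose_match, relevant))
```

In this made-up case looser matching finds more of the relevant documents (recall rises from 0.5 to 0.75) but also returns more irrelevant ones (precision falls from 1.0 to 0.5).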