Text mining Pre-processing
Text Mining
Barbara Barbosa @bahbbc
BankFacil
26th February 2016
What is it?
The process of deriving information from text. It usually requires preprocessing the input data.
Learning problem
Figure: Flow chart of learning problem
Corpus
A corpus is a set of n documents. Each document is defined as a set of m terms (stems, words, or sets of words).
Here the corpus is all the text posted by clients on BankFacil’s Facebook page (https://www.facebook.com/bankfacil)
You can check the code in R - http://bit.ly/1XQ0mWw
Tokenizing - Lexical Analysis
• Convert to lower case
• Remove punctuation
• Remove numbers
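The three steps above can be sketched in Python as follows. This is an illustrative sketch only, not the R code linked earlier; the function name `tokenize` and the sample sentence are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase, drop digits and punctuation, then split on whitespace."""
    text = text.lower()
    # Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # Remove punctuation: anything that is not a letter or whitespace.
    # Accented letters such as "á" count as letters and are kept.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return text.split()

print(tokenize("Olá! Preciso de 2 empréstimos, pode ser?"))
# ['olá', 'preciso', 'de', 'empréstimos', 'pode', 'ser']
```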
StopWords
Stopwords¹ are words that do not help characterize the content of a text.
Removing them can reduce the size of texts by 30% to 50%.
¹ Portuguese stopwords are available at: http://snowball.tartarus.org/algorithms/portuguese/stop.txt
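A minimal Python sketch of stopword removal, assuming the text has already been tokenized. The set `STOPWORDS` is a small hand-picked sample for illustration, not the full Snowball list, and the function name and token list are hypothetical.

```python
# Hand-picked sample of Portuguese stopwords, for illustration only.
# The full list at the Snowball URL has many more entries.
STOPWORDS = {"de", "a", "o", "que", "e", "do", "da", "em", "um", "para"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = ["preciso", "de", "um", "empréstimo", "para", "a", "casa"]
filtered = remove_stopwords(tokens)
print(filtered)                    # ['preciso', 'empréstimo', 'casa']
print(len(tokens), len(filtered))  # 7 3
```

In this toy sentence 4 of 7 tokens are dropped; the reduction on a real corpus depends on the text and on the list used.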
Stemming
Figure:
Some experiments show a 5% reduction from the original document size.
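To illustrate the idea of stemming, here is a deliberately naive suffix-stripping sketch in Python. It is not the Snowball Portuguese stemmer or any other published algorithm: the suffix list, the minimum stem length, and the function name are all assumptions chosen for the example.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first.
# Real stemmers use much richer, language-specific rules.
SUFFIXES = ("amentos", "amento", "ações", "ação", "mente",
            "ando", "ar", "os", "as", "s", "o", "a")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["financiamento", "financiamentos", "financiar"]:
    print(w, "->", naive_stem(w))
# All three words map to the same stem, "financi".
```

Mapping inflected forms to one stem reduces the number of distinct terms in the corpus.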
Vector Space Model
• Binary
• Frequency
• tf-idf
• tf-idf normalized
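The first two weighting schemes can be sketched in Python as below. The toy documents and function names are assumptions for illustration; each document becomes one vector with one position per vocabulary term.

```python
from collections import Counter

# Toy corpus: each document is a list of already preprocessed terms.
docs = [
    ["emprestimo", "casa", "emprestimo"],
    ["casa", "juros"],
]

# Vocabulary: all distinct terms in the corpus, in a fixed (sorted) order.
vocab = sorted({term for doc in docs for term in doc})

def frequency_vector(doc, vocab):
    """Number of times each vocabulary term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def binary_vector(doc, vocab):
    """1 if the vocabulary term occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]

freq = [frequency_vector(d, vocab) for d in docs]
binary = [binary_vector(d, vocab) for d in docs]
print(vocab)   # ['casa', 'emprestimo', 'juros']
print(freq)    # [[1, 2, 0], [1, 0, 1]]
print(binary)  # [[1, 1, 0], [1, 0, 1]]
```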
TF-IDF
TF-IDF (Term Frequency - Inverse Document Frequency)
\[
\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{Tr(t_k)} \qquad (1)
\]
• |Tr| - total number of documents in the corpus Tr
• #(t_k, d_j) - number of times t_k occurs in d_j
• Tr(t_k) - number of documents in Tr in which t_k appears
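A minimal Python sketch of equation (1), under two assumptions the slide does not state: the logarithm is natural, and "normalized" tf-idf means dividing each document vector by its Euclidean length. The function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows) where rows[j][k] = #(t_k, d_j) * log(|Tr| / Tr(t_k))."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)  # |Tr|
    # Tr(t_k): number of documents in which each term appears.
    doc_freq = Counter(t for d in docs for t in set(d))
    rows = []
    for d in docs:
        counts = Counter(d)  # #(t_k, d_j)
        rows.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows

def l2_normalize(row):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row] if norm > 0 else row

docs = [["emprestimo", "casa", "emprestimo"], ["casa", "juros"]]
vocab, weights = tfidf_matrix(docs)
print(vocab)
print(weights)
print([l2_normalize(r) for r in weights])
```

Note that "casa" appears in every document, so log(2/2) = 0 and its weight is zero in both vectors: terms found everywhere carry no discriminating information.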
Zipf’s law
Zipf’s law states that, given some corpus, the frequency of any word is inversely proportional to its rank in the frequency table.
More about Zipf’s law
https://www.youtube.com/watch?v=fCn8zs912OE
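One way to check Zipf’s law on a corpus is to build the rank-frequency table and look at the product rank × frequency, which should stay roughly constant if frequency is inversely proportional to rank. The Python sketch below uses a made-up token list; the function name is hypothetical.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency) tuples, most frequent term first."""
    counts = Counter(tokens)
    return [(rank, term, freq)
            for rank, (term, freq) in enumerate(counts.most_common(), start=1)]

# Made-up tokens, chosen so the frequencies are 6, 3, 2, 1.
tokens = "a a a a a a b b b c c d".split()
for rank, term, freq in rank_frequency(tokens):
    print(rank, term, freq, rank * freq)
# 1 a 6 6
# 2 b 3 6
# 3 c 2 6
# 4 d 1 4
```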
Bibliography
Based on slides from Prof. Sarajane Marques Peres in the Data Mining course