Text mining Pre-processing

Text Mining

Barbara Barbosa @bahbbc

BankFacil

26th February 2016

What is it?

The process of deriving information from text. It usually requires preprocessing the input data.

Learning problem

Figure: Flow chart of the learning problem

Corpus

A corpus is a set of n documents. Each document is defined as a set of m terms (stems, words, or groups of words).

The corpus here is all the text posted by clients on BankFacil’s Facebook page (https://www.facebook.com/bankfacil).

The R code is available at http://bit.ly/1XQ0mWw

Tokenizing - Lexical Analysis

• Convert to lower case

• Remove punctuation

• Remove numbers
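
The three steps above can be sketched in a few lines. The deck's own code is in R; this is an illustrative Python version, and the function name `tokenize` and the sample sentence are assumptions made for the example, not taken from the original code.

```python
import re

def tokenize(text):
    """Lowercase the text, drop digits and punctuation, split on whitespace."""
    text = text.lower()                      # convert to lower case
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = re.sub(r"[^\w\s]|_", " ", text)   # remove punctuation (and underscores)
    return text.split()

# Made-up sample message, only to show the call
print(tokenize("Olá! Preciso de R$ 5.000 em 12 vezes, é possível?"))
```

Accented letters survive because Python's `\w` matches Unicode word characters, which matters for a Portuguese corpus.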

StopWords

Stopwords [1] are words that contribute little to characterizing the content of a text.

Removing them can reduce the size of texts by 30% to 50%.

[1] Portuguese stopwords are available at: http://snowball.tartarus.org/algorithms/portuguese/stop.txt
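
A minimal sketch of stopword filtering, in Python rather than the deck's R. The `STOPWORDS` set below is a hypothetical handful of Portuguese function words chosen for illustration; a real run would load the full list from the file referenced above. The sample tokens are also made up.

```python
# Hypothetical mini stopword list (illustration only, not the full Snowball list)
STOPWORDS = {"de", "a", "o", "que", "e", "em", "um", "para", "com"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = ["preciso", "de", "um", "empréstimo", "para", "a", "casa"]
filtered = remove_stopwords(tokens)
print(filtered)
# Share of tokens removed in this toy sentence
print(f"removed {1 - len(filtered) / len(tokens):.0%} of tokens")
```

Using a set makes each membership check constant time, which helps when the corpus is large.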

Stemming

Some experiments show a 5% reduction relative to the original document size.
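
To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the Snowball or RSLP algorithm, and the suffix list and the `min_stem` threshold are assumptions made for this example. It only shows how different surface forms can collapse into one term.

```python
# Hypothetical suffix list, sorted so longer suffixes are tried first
SUFFIXES = sorted(
    ["mente", "ções", "ção", "ando", "endo", "ar", "er", "ir", "os", "as", "s"],
    key=len,
    reverse=True,
)

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["pagar", "pagando", "empréstimos", "mar"]:
    print(w, "->", naive_stem(w))
```

A production pipeline would use an established stemmer for Portuguese instead of a hand-written list like this one.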

Vector Space Model

• Binary

• Frequency

• tf-idf

• Normalized tf-idf
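
The first two weightings can be sketched as follows, assuming each document is already a list of preprocessed tokens. The two toy documents and the function names are invented for the example.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (made-up data)
docs = [
    ["emprestimo", "casa", "emprestimo"],
    ["casa", "carro"],
]

# One vector dimension per distinct term, in a fixed order
vocabulary = sorted({term for doc in docs for term in doc})

def frequency_vector(doc, vocabulary):
    """Component i is the number of times vocabulary[i] occurs in doc."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

def binary_vector(doc, vocabulary):
    """Component i is 1 if vocabulary[i] occurs in doc, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]

print(vocabulary)
for doc in docs:
    print(frequency_vector(doc, vocabulary), binary_vector(doc, vocabulary))
```

Every document maps to a vector of the same length, so documents can then be compared or fed to a learning algorithm.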

TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency)

\[
\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#Tr(t_k)} \tag{1}
\]

• Tr: the set of documents (the corpus); |Tr| is the total number of documents

• #(t_k, d_j): the number of times t_k occurs in d_j

• #Tr(t_k): the number of documents in Tr in which t_k appears
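
Formula (1) can be computed directly from token counts. The sketch below is in Python and makes two assumptions the slides do not state: the logarithm is natural, and "normalized" tf-idf means dividing each document vector by its Euclidean (L2) norm. The function names and the toy corpus are invented for the example.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, following formula (1)."""
    n_docs = len(corpus)                 # total number of documents
    doc_freq = Counter()                 # number of documents containing each term
    for doc in corpus:
        doc_freq.update(set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)            # occurrences of each term in this document
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights

def l2_normalize(vec):
    """Scale a weight dict to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: (w / norm if norm else 0.0) for t, w in vec.items()}

corpus = [["emprestimo", "casa", "emprestimo"], ["casa", "carro"]]
for vec in tfidf(corpus):
    print(vec, l2_normalize(vec))
```

A term that appears in every document gets weight 0, since the logarithm of 1 is 0. That is the sense in which tf-idf discounts terms that do not discriminate between documents.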

Luhn’s experiment

Figure: Luhn’s experiment

Zipf’s law

Zipf’s law states that, given some corpus, the frequency of any word is inversely proportional to its rank in the frequency table.

More about Zipf’s law

https://www.youtube.com/watch?v=fCn8zs912OE
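
One way to check this on a corpus is to build the rank-frequency table and look at the product rank × frequency, which stays roughly constant if the law holds. The Python sketch below uses made-up counts constructed to follow the law exactly; real text only approximates it. The function name is an assumption.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, breaking ties alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]

# Made-up token counts, chosen so that rank * frequency is constant
tokens = ["de"] * 12 + ["que"] * 6 + ["casa"] * 4 + ["carro"] * 3
for row in rank_frequency(tokens):
    print(row)
```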

Bibliography

Based on slides by Prof. Sarajane Marques Peres from the Data Mining course.
