Is That A Duplicate Quora Question?
Abhishek Thakur (@abhi1thakur)
About Me
➢ I’m a data scientist
➢ I like:
○ scikit-learn
○ keras
○ xgboost
○ python
➢ I don’t like:
○ errrR
○ excel
I like big data and
I cannot lie
The Problem
➢ ~13 million questions (as of March 2017)
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
➢ First public data release: 24th January, 2017
Duplicate Questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…

➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?

➢ Why did Trump win the Presidency?
➢ How did Donald Trump win the 2016 Presidential Election?
Non-Duplicate Questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?

➢ Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic?
➢ What mistakes are made when depicting hacking in "Mr. Robot" compared to real-life cybersecurity breaches or just a regular use of technologies?

➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big e-commerce website?
The Data
➢ 400,000+ pairs of questions
➢ Initially the data was very skewed
➢ Negative samples drawn from related questions
➢ Not the real distribution on Quora’s website
➢ Noise exists (as usual)
https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
The Data
➢ 255,045 negative samples (non-duplicates)
➢ 149,306 positive samples (duplicates)
➢ ~37% positive samples
The Data
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623
➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169
Basic Feature Engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
Basic Feature Engineering
➢ Basic feature set: fs-1

# assumes a pandas DataFrame `data` with question1/question2 columns
data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
data['len_q2'] = data.question2.apply(lambda x: len(str(x)))
data['diff_len'] = data.len_q1 - data.len_q2
# character length without spaces
data['len_char_q1'] = data.question1.apply(lambda x: len(str(x).replace(' ', '')))
data['len_char_q2'] = data.question2.apply(lambda x: len(str(x).replace(' ', '')))
data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split()))
data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split()))
# number of lowercase tokens the two questions share
data['common_words'] = data.apply(
    lambda x: len(set(str(x['question1']).lower().split())
                  .intersection(set(str(x['question2']).lower().split()))),
    axis=1)
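A quick usage check of the block above on an illustrative pair (toy data, not from the dataset; only the common_words feature is recomputed here):

import pandas as pd

data = pd.DataFrame({
    'question1': ["Why did Trump win the Presidency?"],
    'question2': ["How did Donald Trump win the 2016 Presidential Election?"],
})
data['common_words'] = data.apply(
    lambda x: len(set(str(x['question1']).lower().split())
                  .intersection(set(str(x['question2']).lower().split()))),
    axis=1)
print(data['common_words'].iloc[0])  # 4: "did", "trump", "win", "the"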
Fuzzy Features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into the other (a minimal sketch follows this list)
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
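To make the edit-distance idea concrete, a minimal sketch (not the library’s implementation): turning "kitten" into "sitting" needs 3 operations.

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (classic DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3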
Fuzzy Features
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
➢ etc. etc. etc.
https://github.com/seatgeek/fuzzywuzzy
Fuzzy Features
➢ Fuzzy feature set: fs-2

from fuzzywuzzy import fuzz

data['fuzz_qratio'] = data.apply(
    lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(
    lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(
    lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(
    lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_sort_ratio'] = data.apply(
    lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(
    lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(
    lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
TF-IDF
➢ TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
➢ IDF(t) = log(total number of documents / number of documents containing term t)
➢ TF-IDF(t) = TF(t) * IDF(t)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=3, max_features=None,
                        strip_accents='unicode', analyzer='word',
                        token_pattern=r'\w{1,}',
                        ngram_range=(1, 2), use_idf=1, smooth_idf=1,
                        sublinear_tf=1, stop_words='english')
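Fitting and applying the vectorizer (a sketch; fitting on the concatenation of both question columns is an assumption, the slides don’t show this step; `tfidf` and `data` come from the blocks above):

import pandas as pd

corpus = pd.Series(data.question1.astype(str).tolist() +
                   data.question2.astype(str).tolist())
tfidf.fit(corpus)
q1_tfidf = tfidf.transform(data.question1.astype(str))
q2_tfidf = tfidf.transform(data.question2.astype(str))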
SVD
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components

from sklearn import decomposition

svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)
A Combination of TF-IDF & SVD
➢ TF-IDF features: fs3-1, fs3-2
➢ TF-IDF + SVD features: fs3-3, fs3-4, fs3-5 (one possible combination is sketched below)
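The five fs3 variants appear only as diagrams in the deck; a minimal sketch of one plausible combination, assuming the per-question TF-IDF matrices from the previous sketch and the 120-component SVD above:

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# stack per-question TF-IDF vectors side by side (fs3-1/fs3-2 style)
x_tfidf = sparse.hstack((q1_tfidf, q2_tfidf)).tocsr()

# compress with SVD for the TF-IDF + SVD variants (fs3-3 ... fs3-5 style)
svd = TruncatedSVD(n_components=120)
x_svd = svd.fit_transform(x_tfidf)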
Word2Vec Features
➢ A multi-dimensional vector for every word in the vocabulary
➢ Often yields useful semantic insights
➢ Very popular in natural language processing tasks
➢ Google News vectors, 300d
Word2Vec Features
➢ Representing words
➢ Representing sentences

import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# `model`: gensim KeyedVectors holding the Google News vectors (loaded elsewhere)
def sent2vec(s):
    """Sentence vector: sum of word vectors, L2-normalized."""
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        if w in model:          # skip out-of-vocabulary words
            M.append(model[w])
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
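Applying sent2vec to both columns gives one vector per question (a usage sketch; the variable names are my own):

w2v_q1 = np.array([sent2vec(q) for q in data.question1])
w2v_q2 = np.array([sent2vec(q) for q in data.question2])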
W2V Features: Distances
➢ Cosine distance
➢ Manhattan distance (also known as cityblock distance)
➢ Canberra distance
➢ Minkowski distance
➢ Braycurtis distance
➢ WMD (Word Mover’s Distance); see the sketch after this list
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K. (2015). From Word Embeddings to Document Distances.
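WMD is available directly on gensim word vectors; a sketch assuming the Google News vectors mentioned earlier (the file path is an assumption):

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'data/GoogleNews-vectors-negative300.bin.gz', binary=True)

# wmdistance takes tokenized documents; it requires the pyemd package
data['wmd'] = data.apply(
    lambda x: model.wmdistance(str(x['question1']).lower().split(),
                               str(x['question2']).lower().split()),
    axis=1)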
W2V Features: Skew
➢ Skew = 0 for a normal distribution
➢ Skew > 0: more weight in the right tail (the distribution’s longer tail extends to the right)
W2V Features: Kurtosis
➢ 4th central moment divided by the square of the variance
➢ Types:
○ Pearson
○ Fisher: subtract 3.0 from the result so that a normal distribution scores 0
W2V Features
➢ Word2Vec feature set: fs-4
➢ Distances from scipy.spatial.distance: cosine, manhattan (cityblock), jaccard, canberra, minkowski, euclidean, braycurtis
➢ Statistics from scipy.stats: skew, kurtosis
➢ Assembled in the sketch below
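A sketch assembling fs-4 from the scipy functions above, using the `w2v_q1` / `w2v_q2` matrices built with sent2vec (the exact feature code is not in the slides):

from scipy.spatial.distance import (braycurtis, canberra, cityblock, cosine,
                                    euclidean, jaccard, minkowski)
from scipy.stats import kurtosis, skew

pairs = list(zip(w2v_q1, w2v_q2))
data['cosine_distance'] = [cosine(a, b) for a, b in pairs]
data['cityblock_distance'] = [cityblock(a, b) for a, b in pairs]
data['jaccard_distance'] = [jaccard(a, b) for a, b in pairs]
data['canberra_distance'] = [canberra(a, b) for a, b in pairs]
data['euclidean_distance'] = [euclidean(a, b) for a, b in pairs]
data['minkowski_distance'] = [minkowski(a, b, 3) for a, b in pairs]
data['braycurtis_distance'] = [braycurtis(a, b) for a, b in pairs]
data['skew_q1'] = [skew(v) for v in w2v_q1]
data['skew_q2'] = [skew(v) for v in w2v_q2]
data['kurtosis_q1'] = [kurtosis(v) for v in w2v_q1]
data['kurtosis_q2'] = [kurtosis(v) for v in w2v_q2]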
Raw Word2Vec Vectors
➢ Raw W2V feature set: fs-5
https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne
Features Snapshot
Machine Learning Models
Machine Learning Models
➢ Logistic regression
➢ XGBoost
➢ 5-fold cross-validation (setup sketched below)
➢ Accuracy as the comparison metric (also precision + recall)
➢ Why accuracy?
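A minimal sketch of this evaluation setup; the feature subset, hyperparameters, and variable names are illustrative assumptions, not from the slides:

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# assumed: X stacks handcrafted features built above, y is the label column
X = data[['len_q1', 'len_q2', 'diff_len', 'common_words', 'fuzz_qratio']].values
y = data.is_duplicate.values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lr = LogisticRegression()
print('LR accuracy: ', cross_val_score(lr, X, y, cv=skf, scoring='accuracy').mean())

clf = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
print('XGB accuracy:', cross_val_score(clf, X, y, cv=skf, scoring='accuracy').mean())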
Results
Deep Learning
LSTM
➢ Long short-term memory
➢ A type of RNN
➢ Learns long-term dependencies
➢ Used two LSTM layers
1D CNN
➢ One-dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:

# naive temporal convolution of a signal x (length sample_length)
# with a kernel h (length kernel_length)
y = [0.0] * sample_length
for i in range(sample_length):
    for j in range(kernel_length):
        if i - j >= 0:              # guard the boundary
            y[i] += x[i - j] * h[j]
Embedding Layers
➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
Time Distributed Dense Layer
➢ TimeDistributed wrapper around a dense layer
➢ TimeDistributed applies the layer to every temporal slice of the input
➢ Followed by a Lambda layer
➢ Implements the “translation” layer used by Stephen Merity (keras snli model)
from keras.models import Sequential
from keras.layers import Embedding, TimeDistributed, Dense, Lambda
from keras import backend as K

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
# sum over the time axis -> one 300d vector per question
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
GloVe Embeddings
➢ Count-based model
➢ Dimensionality reduction on the co-occurrence counts matrix
➢ word-context matrix -> word-feature matrix
➢ Common Crawl
○ 840B tokens, 2.2M vocab, 300d vectors
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
Basis of Deep Learning Model
➢ keras_snli model: https://github.com/Smerity/keras_snli
Before Training DeepNets
➢ Tokenize data
➢ Convert text data to sequences

from keras.preprocessing import sequence, text

tk = text.Tokenizer(nb_words=200000)   # Keras 1.x argument name
max_len = 40
tk.fit_on_texts(list(data.question1.values.astype(str)) +
                list(data.question2.values.astype(str)))
x1 = tk.texts_to_sequences(data.question1.values.astype(str))
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)
word_index = tk.word_index
Before Training DeepNets
➢ Initialize GloVe embeddings

import numpy as np
from tqdm import tqdm

embeddings_index = {}
f = open('data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
Before Training DeepNets
➢ Create the embedding matrix

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
Final Deep Learning Model
Model 1 and Model 2
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

# model2 is an identical tower for the second question
model2 = Sequential()
model2.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Final Deep Learning Model
Model 3 and Model 4

from keras.layers import Convolution1D, Dropout
from keras.layers.normalization import BatchNormalization

# nb_filter and filter_length are hyperparameters set elsewhere in the full code
model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model3.add(Convolution1D(nb_filter=nb_filter,        # Keras 1.x argument names
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))
model3.add(Dropout(0.2))
# ... (further layers elided on the slide) ...
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
Final Deep Learning Model
Model 5 and Model 6
from keras.layers import LSTM

model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40,
                     dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))   # Keras 1.x dropout arguments

model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40,
                     dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
Final Deep Learning Model
Merged Model
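The merged model appears only as a diagram in the deck; a sketch in the same Keras 1.x style as the other blocks, following the keras_snli pattern (layer width, dropout, and training settings are assumptions):

from keras.models import Sequential
from keras.layers import Merge, Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization

merged_model = Sequential()
# concatenate the six towers defined above (Keras 1.x Merge layer)
merged_model.add(Merge([model1, model2, model3, model4, model5, model6],
                       mode='concat'))
merged_model.add(BatchNormalization())
merged_model.add(Dense(300, activation='relu'))   # width assumed
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())
merged_model.add(Dense(1))
merged_model.add(Activation('sigmoid'))
merged_model.compile(loss='binary_crossentropy', optimizer='adam',
                     metrics=['accuracy'])
# odd-numbered towers read question1, even-numbered read question2
merged_model.fit([x1, x2, x1, x2, x1, x2], y=data.is_duplicate.values,
                 batch_size=384, nb_epoch=200, validation_split=0.1)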
Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
Combined Results
The deep network was trained on an NVIDIA Titan X, taking approximately 300 seconds per epoch and 10-15 hours to train in total. It achieved an accuracy of 0.848 (~0.85).
Improving Further
➢ Cleaning the text data, e.g. correcting misspellings
➢ POS tagging
➢ Entity recognition
➢ Combining the deepnet with traditional ML models
Timeline
➢ 24 Jan, 2017: Quora dataset release
➢ 27 Feb, 2017: My model + writeup release
➢ 16 Mar, 2017: Kaggle competition on the same data begins
➢ 7 Jun, 2017: Competition ends
Frustration Plot: frustration vs. time in the competition
Conclusion & References
➢ The deepnet gives a near state-of-the-art result
➢ BiMPM model accuracy: 88%
Some references:
➢ Zhiguo Wang, Wael Hamza and Radu Florian. "Bilateral Multi-Perspective Matching for Natural Language Sentences" (BiMPM).
➢ Matthew Honnibal. "Deep text-pair classification with Quora's 2017 question dataset," 13 February 2017. Retrieved from https://explosion.ai/blog/quora-deep-text-pair-classification
➢ Bradley Pallen's work: https://github.com/bradleypallen/keras-quora-question-pairs
Thank you! Questions / Comments?
Code: bit.ly/quoraduplicates
Get in touch:
➢ E-mail: [email protected]
➢ LinkedIn: bit.ly/thakurabhishek
➢ Kaggle: kaggle.com/abhishek
➢ Twitter: @abhi1thakur
If everything fails, use XGBoost