Canonical Formatted Address Data
Concept to clean noisy address data and correct spelling errors
© msg systems ag, August 2014
The challenge is to clean noisy data from free texts in table format. This is still a matter of research, but there are some approaches that can lead to the desired outcome.

Therefore we build a Language Model that "understands" the language used, based on probabilities. It is used to correct misspelled inputs and bring them into a canonical format. The Language Model consists of a Lucene word index and a corresponding MultiMap. To improve performance, Bloom Filters and caches are considered useful.
Additionally, we use a classifier to strip all unneeded descriptions from the address information texts.

The NeeText instances hold valuable information about the texts. NEE stands for Native Expression Enhancement, where "native" refers to any natural language.
• Take all free-text columns into consideration
• Texts often contain street names along with human-readable descriptions
• The challenge is to strip those human-readable descriptions to get a canonical format suitable for Business Intelligence classification tasks
• This is like computing a location's reverse-hash: identical fingerprints for nearby places from differing texts
Raw Input Data
Regular Expressions
• Verbatim: the word as it appears in the text, /^\S{2,}$/
Must have at least two characters
• Normalized: /^[a-z]{2,}$/
The verbatim word in lower case, with language-specific replacements applied and any other special characters and numbers dropped
Language-specific replacements may be:
• ñ → n (Spanish)
• ç → c (French)
• é → ee (French)
• à → aa (French)
• ø → o (Norwegian, Danish, etc.)
• ä → ae (German)
• ß → ss (German)
• Stemmed: /^[a-z]+$/
The Porter Stemmer algorithm applied to the normalized word; guarantees faster word lookups from the MultiMap
Lifecycle Of Words
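The normalization step above can be sketched in a few lines of Python (the original stack is Java; the function name and the exact replacement table are taken from the slides, extend as needed):

```python
import re

# Language-specific replacements from the table above (extend as needed)
REPLACEMENTS = {
    "ñ": "n", "ç": "c", "é": "ee", "à": "aa",
    "ø": "o", "ä": "ae", "ß": "ss",
}

VERBATIM_RE = re.compile(r"^\S{2,}$")       # word as it appears in the text
NORMALIZED_RE = re.compile(r"^[a-z]{2,}$")  # lower case, letters only

def normalize(word):
    """Lower-case, apply replacements, drop remaining special chars and digits."""
    word = word.lower()
    for src, dst in REPLACEMENTS.items():
        word = word.replace(src, dst)
    return re.sub(r"[^a-z]", "", word)

print(normalize("Straße"))   # strasse
print(normalize("Señor42"))  # senor
```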
Character-level sequence learning
• Training a Hidden Markov Model to evaluate words as unigrams
• Transition probabilities for each character: every red dot is a character; the single yellow dot indicates the word's end, so word lengths are trained as well
• The transition probability matrix of the HMM is a 27-by-27 matrix (|a-z| plus the word's end)
• Both transition and initial probabilities are trained
• Guarantees identical results across runs as long as the model is immutable: a good choice for a canonical format
• Does not guarantee the correctly spelled word by dictionary standards (but the "correct-looking" one)
Sequence Model
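A minimal Python sketch of this character-level unigram model (function names are illustrative; as on the slide, both initial and transition probabilities are trained, and an end symbol captures word length):

```python
import math
from collections import defaultdict

END = "$"  # stand-in for the 27th state: the word's end

def train(words):
    """Count character transitions (plus word end) and initial characters,
    then normalise the counts into probabilities."""
    init, trans = defaultdict(float), defaultdict(lambda: defaultdict(float))
    for w in words:
        init[w[0]] += 1
        for a, b in zip(w, w[1:] + END):
            trans[a][b] += 1
    total = sum(init.values())
    init = {c: n / total for c, n in init.items()}
    trans = {a: {b: n / sum(row.values()) for b, n in row.items()}
             for a, row in trans.items()}
    return init, trans

def log_likelihood(word, init, trans, floor=1e-9):
    """Higher score = more 'correct-looking' in the trained language."""
    p = math.log(init.get(word[0], floor))
    for a, b in zip(word, word[1:] + END):
        p += math.log(trans.get(a, {}).get(b, floor))
    return p

init, trans = train(["strasse", "street", "strand"])
```

As the slides note, this scores the "correct-looking" word, not necessarily the dictionary-correct one.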
MultiMap<StemmedWord, PriorityQueue<NeeText>>
• Allows fast word lookups
• Counts each word's occurrences
• Each parsed NeeText instance knows:
the verbatim word,
the normalized word (lower case, with special characters like ñ, ç, é, à, ø, ä, etc. replaced),
the stemmed word, and
the likelihood of being a correct term in this language (from the HMM model)
MultiMap Word Index
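In Python, the MultiMap and its NeeText payload might look roughly like this (a plain list plus `max` stands in for Java's PriorityQueue; field names follow the slide):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class NeeText:
    verbatim: str      # word as it appears in the text
    normalized: str    # lower case, special characters replaced
    stemmed: str       # Porter-stemmed normalized word
    likelihood: float  # probability of being a correct term (from the HMM)

index = defaultdict(list)  # MultiMap<StemmedWord, PriorityQueue<NeeText>>
counts = defaultdict(int)  # occurrences per stemmed word

def add(text):
    counts[text.stemmed] += 1
    index[text.stemmed].append(text)

def best(stem):
    """Best-scored NeeText for a stemmed word (None if unknown)."""
    return max(index.get(stem, []), key=lambda t: t.likelihood, default=None)
```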
• Allows fuzzy word lookups
• Scores the best approximate match
• Points to the parsed NeeTexts in the MultiMap to score words by occurrence counts
Lucene Word Index
• In the end, compact both word indices by filtering out outliers
• Improves word lookup times and shrinks the indices' sizes
Filtering / Compaction
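A sketch of this compaction step (the outlier threshold is an assumption; the slides do not specify one):

```python
def compact(counts, min_count=2):
    """Drop rare outliers so lookups stay fast and the index stays small."""
    return {word: n for word, n in counts.items() if n >= min_count}

print(compact({"harvard": 5, "jarvert": 1}))  # {'harvard': 5}
```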
• Consists of:
the HMM model,
the MultiMap,
the Lucene word index (referencing the MultiMap for in-depth information),
Bloom Filters, and
Guava Caches
• The Language Model can be trained on unigrams from Peter Norvig / Google, the address dataset itself, or any text
• The language of the training data only has to match the native language of the test / production data set
Language Model
Train the Language Model
• Mappers: default / identity mappers
• 1 Reducer: trains words as unigrams into the Hidden Markov Model,
feeds corrected words into the Lucene Word Index with references to the MultiMap entries,
compacts the Language Model, and
emits the Language Model
MapReduce
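The shape of this training job can be simulated without Hadoop: identity mappers pass records through and a single reducer aggregates (here simple unigram counts stand in for the full HMM and index training):

```python
from collections import Counter

def identity_map(record):
    yield record  # mappers emit the address texts unchanged

def reduce_train(records):
    """Single reducer: train unigram counts over all records."""
    model = Counter()
    for text in records:
        model.update(text.lower().split())
    return model

records = ["Harvard Yard", "harvard square"]
model = reduce_train(r for rec in records for r in identity_map(rec))
print(model["harvard"])  # 2
```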
• Stanford Named Entity Recognizer (NER), from the Natural Language Processing group at Stanford ("The Farm")
• Supervised learning, so the approach will need some training data
Classify Address Information
• Picks only the relevant address information and discards the rest
Classify Address Information
Train the Markov Random Field Classifier
• Mappers: default / identity mappers
• 1 Reducer: the Markov Random Field learns the relevant address information from the training dataset and emits the MRF classifier
MapReduce
• Look up approximate strings from the index to find the best match for a correction
• Looking up the misspelled word "jarvert", for example, produces a scoring over candidate matches
Fuzzy Queries
• Take the best match from the custom scoring: "harvard" here
• Replace the misspelled word "jarvert" with "harvard"
• Increase the occurrence count for "harvard" in the backlog buffer
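A stand-in for this fuzzy lookup using Python's difflib rather than a Lucene index (the vocabulary and cutoff here are illustrative assumptions):

```python
import difflib

vocabulary = ["harvard", "street", "square"]  # toy word index

def correct(word, cutoff=0.5):
    """Return the best approximate match, or the word itself if none scores."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("jarvert"))  # harvard
```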
• Allows overall model improvements
• Hardens the model on any text it is used on; avoids creating a Language Model from a vast data set up front, since the model grows with usage
• The bare Language Model is immutable, as the outcome may otherwise be unstable because MapReduce may alter the scheduling
• Recently gathered, previously unknown words may be merged into or replace the existing Language Model
Backlog Buffer
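The backlog buffer's contract can be sketched as follows: counts are gathered during a run while the model stays immutable, and the merge happens only afterwards (names are illustrative):

```python
from collections import Counter

model_counts = Counter({"harvard": 10})  # part of the immutable Language Model
backlog = Counter()                      # gathered during the run

def record(word):
    backlog[word] += 1  # the model itself is not touched mid-run

def merge():
    """Fold the backlog into the model once the run has finished."""
    model_counts.update(backlog)
    backlog.clear()
```

Keeping the merge outside the run is what keeps results deterministic even when MapReduce reorders the work.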
Using the Language Model & MRF classifier
• Mappers: fix spelling errors using the Language Model
Take the Language Model and the MRF classifier from the distributed cache
Assign each word a probability of being correct
Look up the word in the word index and take the better scored of the two
Use the MRF classifier to detect descriptions and remove them
Emit key (city) and value (cleaned data)
• Reducers, by city (optional): fix outliers in the cleaned data and in the geo coordinates
Emit the cleaned data
MapReduce
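The mapper's flow might be sketched like this, where `clean_word` and `is_description` are hypothetical stand-ins for the Language Model lookup and the MRF classifier:

```python
def clean_word(word):
    return {"jarvert": "harvard"}.get(word, word)    # toy spelling fix

def is_description(word):
    return word in {"near", "the", "old", "bridge"}  # toy description filter

def map_record(city, text):
    """Fix spellings, strip descriptions, emit (city, cleaned text)."""
    words = [clean_word(w) for w in text.lower().split()]
    return city, " ".join(w for w in words if not is_description(w))

print(map_record("Cambridge", "jarvert yard near the old bridge"))
# ('Cambridge', 'harvard yard')
```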
• Porter Stemmer: stems words to improve word lookups and correction; only stemmed words are kept in the indices
• MultiMap<StemmedWord, PriorityQueue<NeeText>>: allows fast word lookups and counts each word's occurrences
• Hidden Markov Model: trains a model to evaluate strings; assigns each word a probability of being correct and improves scoring by taking the most likely string as the canonical format
• Markov Random Field: builds a classifier to remove descriptions from the address columns; open-source implementation from Stanford University
Our Approach: Statistical Modelling
• Lucene Index: an Approximate String Matching approach; fuzzy queries look up misspelled words
• Hadoop: two one-off MapReduce jobs to build the statistical models (the Language Model and the MRF classifier), and one MapReduce job to correct spelling errors and strip descriptions in addresses (here the Reduce phase is optional)
• Backlog Buffer: helps improve the Language Model for future usage
Our Approach: Techniques
• Bloom Filters: a probabilistic data structure to avoid costly fuzzy queries
Bloom Filters are used in databases such as Oracle & Cassandra, which open files from disk only if the desired information might be in there
False negatives are impossible; false positives can occur with any probability greater than zero
• Guava Caches: avoid doing costly computations over and over again
• MultiMap<StemmedWord, PriorityQueue<NeeText>>: allows fast word lookups, counts each word's occurrences, and holds parsed & compacted NeeText instances only
Our Approach: Performance Improvements
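The Bloom Filter mentioned above can be sketched in a few lines (the bit-array size and hash count are illustrative choices; the real system uses a library implementation):

```python
import hashlib

SIZE, HASHES = 1024, 3
bits = bytearray(SIZE)

def _positions(word):
    for i in range(HASHES):
        digest = hashlib.sha256(f"{i}:{word}".encode()).hexdigest()
        yield int(digest, 16) % SIZE

def add(word):
    for p in _positions(word):
        bits[p] = 1

def might_contain(word):
    """False negatives are impossible; false positives are possible but rare."""
    return all(bits[p] for p in _positions(word))

add("harvard")
print(might_contain("harvard"))  # True
```

The costly fuzzy query is only issued when `might_contain` answers True.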
• String Similarity Measures: e.g. Levenshtein, Block Distance, Euclidean Distance, etc.
Allow comparison of two string candidates; the HMM model may score the better match of the two
But these are one-on-one comparisons only, so finding approximate matches is expensive
• String Kernels: also allow comparison of two string candidates and can extract the two words' common stem
Same caveat: one-on-one comparison only, so finding approximate matches is expensive
• Support Vector Machines: treat each word as a class; can determine which "correct" word a misspelled one belongs to
But the expected number of possible classes is far too high
Additional Approaches
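For reference, the Levenshtein distance named above as a standard dynamic-programming one-on-one comparison, which is exactly why it is expensive for approximate matching at scale:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("jarvert", "harvard"))  # 3
```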
After all the effort:
• Normalizing & stemming each word
• Correcting spelling errors
• Classifying the relevant street name parts
the result will look like …
Final Result
• The problem is still a topic of research in Artificial Intelligence
Research
Thank you for your attention
Daniel Schulz
GB L / Data Scientist
Telefon: +49 151 613 500 15
www.msg-systems.com