Panama Papers Neo4j Budapest Meetup
Uploaded by: janos-szendi-varga
This will be about the technology…
PANAMA PAPERS AND GRAPHS
SOURCE
2.6 TB of data: relational databases, emails, various bank documents and company records connected to the 215,000 offshore companies that were clients of the Panamanian legal services firm Mossack Fonseca between 1977 and 2015.
THE PROCESS
1. Acquire documents
2. Classify documents
   a. Scan / OCR: Tesseract
   b. Extract document metadata: Apache Tika https://tika.apache.org
3. Whiteboard domain
   a. Determine entities and their relationships
   b. Determine potential entity and relationship properties
   c. Determine sources for those entities and their properties
4. Work out analyzers, rules, parsers and named entity recognition for documents: Apache Solr, Blacklight http://projectblacklight.org, Nuix https://www.nuix.com
5. Parse and store document metadata and document and entity relationships: Talend http://www.talend.com
   a. Parse by author, named entities, dates, sources and classification
6. Infer entity relationships
7. Compute similarities, transitive cover and triangles
8. Analyze data using graph queries and visualizations: Neo4j, Linkurious http://linkurio.us
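Step 7 mentions triangles, that is, three entities that are all directly linked to each other. The following is only a minimal sketch of that idea in plain Python, not the tooling used in the investigation. The edge list and every identifier in it (`officer_a`, `company_x`, `address_1`, `officer_b`) are invented for illustration.

```python
from itertools import combinations

# Hypothetical toy data: undirected links between entities (not taken from the leak).
edges = [
    ("officer_a", "company_x"),
    ("officer_a", "address_1"),
    ("company_x", "address_1"),
    ("officer_b", "company_x"),
]


def build_adjacency(edges):
    """Map every node to the set of nodes it is directly linked to."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def find_triangles(edges):
    """Return every set of three mutually linked nodes, each as a sorted tuple."""
    adjacency = build_adjacency(edges)
    triangles = set()
    for node, neighbours in adjacency.items():
        # Two neighbours of the same node close a triangle if they are linked too.
        for first, second in combinations(sorted(neighbours), 2):
            if second in adjacency[first]:
                triangles.add(tuple(sorted((node, first, second))))
    return sorted(triangles)


triangles = find_triangles(edges)
print(triangles)
```

In this toy graph the only triangle is the officer, the company and the address that are all connected to each other; `officer_b` is linked to the company alone and so is not part of one.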
ENTITIES
• Clients
• Companies
• Addresses
• Officers (both natural people and companies)
RELATIONSHIPS
• (:Officer)-[:is officer of]->(:Company)
• (:Officer)-[:registered address]->(:Address)
• (:Client)-[:registered]->(:Company)
• (:Officer)-[:has similar name and address]->(:Officer)
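The last relationship, has similar name and address, is an inferred one: it is not stored in the source documents but derived by comparing officer records. The slides do not say how the comparison was done, so the sketch below is an assumption. It uses the fuzzy string matching of `difflib.SequenceMatcher` from the Python standard library, and the records, the `similar_officer_pairs` helper and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical officer records; the names and addresses are invented.
officers = [
    {"id": 1, "name": "John A. Smith", "address": "12 Main Street, Panama City"},
    {"id": 2, "name": "John Smith", "address": "12 Main St, Panama City"},
    {"id": 3, "name": "Maria Lopez", "address": "5 Harbour Road, Tortola"},
]


def similarity(left, right):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def similar_officer_pairs(officers, threshold=0.8):
    """Propose a relationship when both the name and the address are similar.

    The threshold is an assumed value and would need tuning on real data.
    """
    pairs = []
    for first, second in combinations(officers, 2):
        name_score = similarity(first["name"], second["name"])
        address_score = similarity(first["address"], second["address"])
        if name_score >= threshold and address_score >= threshold:
            pairs.append((first["id"], "has similar name and address", second["id"]))
    return pairs


pairs = similar_officer_pairs(officers)
print(pairs)
```

Requiring both scores to pass the threshold keeps two different people with a common name from being linked on the name alone. Comparing every pair is quadratic in the number of officers, so a real data set would need some form of pre-grouping first.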
GRAPH MODEL
FLEXIBLE DATA MODEL
New entities:
• Documents: E-Mail, PDF, Contract, DB-Record, …
• Money Flow: Accounts / Banks / Intermediaries
New relationships:
• Family / business ties
• Conversations
• Peer Groups / Rings
• Similar Roles
• Mentions / Topic-Of
• Money Flow
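The point of a flexible model is that new kinds of nodes and relationships can be added next to the existing ones without redesigning a schema. The toy in-memory property graph below is a rough sketch of that idea only; it is not Neo4j or its API. The `PropertyGraph` class, the node ids and the property values are all hypothetical.

```python
class PropertyGraph:
    """A toy property graph: labelled nodes plus typed, directed relationships."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, label, **properties):
        # Any label is accepted; no schema has to be declared up front.
        self.nodes[node_id] = {"label": label, **properties}

    def add_relationship(self, start, rel_type, end, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.relationships.append((start, rel_type, end, properties))

    def neighbours(self, node_id, rel_type):
        """Nodes reached from node_id by following rel_type in its direction."""
        return [
            end
            for start, kind, end, _ in self.relationships
            if start == node_id and kind == rel_type
        ]


graph = PropertyGraph()

# The initial model: officers and companies.
graph.add_node("o1", "Officer", name="Example Officer")
graph.add_node("c1", "Company", name="Example Holdings Ltd.")
graph.add_relationship("o1", "is officer of", "c1")

# A later extension: a Document node and a Mentions relationship are added
# without touching the existing nodes or relationships.
graph.add_node("d1", "Document", kind="E-Mail")
graph.add_relationship("d1", "Mentions", "o1")

print(graph.neighbours("d1", "Mentions"))
```

The existing officer and company data stays untouched when the document is added, which is the property the slide is pointing at.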
DISCOVERY
Once the database was set up, it was a simple matter to install and configure Linkurious to essentially provide a GUI (graphical user interface) atop the database. Having the visual depiction of the graph of names and addresses was critical in making sense of the data, especially for non-technical reporters.