Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177...

48
Corporate Residence Fraud Detection 指指指指 指指指指指 指指 指指指 R16034177 指指指 R16034208

Transcript of Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177...

Page 1: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Corporate Residence Fraud Detection

指導老師:徐立群教授學生:陳威翰 R16034177   莊詠絮 R16034208

Page 2: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

GUIDLINE

1. Introduction2. Literature Overview3. Data4. Methods5. Results and Discussion6. Conclusion

Page 3: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

1. Introduction

Page 4: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.
Page 5: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

1. Introduction

• Falsifying or withholding information in order to limit the amount of tax liability .

• Fiscal fraud exists in several forms: direct (income and corporate tax) indirect (VAT) taxes.

Page 6: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

1. Introduction(cont.)

• In Belgium, fiscal fraud is acknowledged as a significant problem.

fraud

30billion

6%GDP

EURO2700

Page 7: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

1. Introduction(cont.)

• Type of fraud: Companies deceitfully attempt to place their residency in a low-tax country in order to avoid paying the higher taxes of their real location.

• Data: Structrued data Transactional data

Page 8: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2. LITERATURE OVERVIEW

Page 9: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2.1 The Importance of Fraud Detection

• Abuse of the tax system is a very costly fraud type:

Fraud losses

Cuts budgets Tax ↑

• Benefit: Not only is there the direct impact of recovering parts of the loss of capital, increased effectiveness can also lead to enhanced deterrence[1]

Page 10: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2.2 Data Mining for Fraud Detection(cont.)

telecommunicationsbank false insurance

Page 11: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2.2 Data Mining for Fraud Detection(cont.)

• High dimensionality of the transactional data, →aggregation over the transactional data.

• One way to do so is to create transaction aggregates for each user account that characterize the typical legitimate behaviour of the user.

Page 12: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2.2 Data Mining for Fraud Detection(cont.)

• Deriving RFM (Recency, Frequency, Monetary Value) attributes from the original features over a period of time.

• Aggregating the transactions creates new structured data and loses the fine-grained information that is included in the transactions .

Page 13: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2.3 Domain Challenges

Data scarcity• Fraud data are usually highly unbalanced.• Non-fraudulent> Fraudulent• Limited resources and the very expensive

labeling procedure further bias the class balance.

• Very little structured data is available on the foreign companies.

Page 14: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2.3 Domain Challenges(cont.)

Volume, variety and velocity• Every quarter, the government receives

millions of tax data entries.• Fraudsters are known to change the way in

which they commit fraud in progressively more creative and covert ways to evade the detection systems in place.

Page 15: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

2.3 Domain Challenges(cont.)

Comprehensibility

• As each investigator develops his/her own expertise on tax fraud, this expertise can conflict with the predictions.

Page 16: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Transaction Data

Structured Data

3. DATA

Page 17: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

3. DATA

• Invoicing records between 2,745,478 Belgian companies 873,640 foreign companies.

• Transaction data:1. Incoming invoice2. Outgoing invoice

Datasets(invoice logs)

Incoming invoices

outgoing invoices

Incoming+outgoing

Page 18: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Transaction data

matrix Bipartite graph

Page 19: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

matrix

Xi,j=0101010101

Row i = foreign companyColumn j = resident company

ConnectionY:1N:0

F1 F2

B1

B2

Page 20: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Bipartite graph

• Red square: Fraudulent foreign companies

• Grey nodes: Belgian companies

Many of the fraudulent foreign companies are connected to the same Belgian companies.

Page 21: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

`

Page 22: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Structure data

Belgian companies

• Geographical location• Industry type• Start-up data. etc

Foreign companies

• Country(located)• Target label

Page 23: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.
Page 24: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Structure data(cont.)

• Auc and lift curve

AUC ↑ → Better

Page 25: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

4. METHODS

Page 26: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

4.1 Structured Data

• Interested in Predicting whether or not a foreign company is fraudulent, based on the aggregate, structured information of the associated resident companies.

• To deal with the many-to-one variables, we encode them in the structured data via a “weight-of-evidence” encoding.

structured variable(31)

• Location• Main activity code • legal construct type

Page 27: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

4.1 Structured Data(cont.)

• Once the features have been engineered into a structured input vector, we train an SVM with a linear kernel.

Page 28: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

4.2 Transactional Data

• The transactional data can also be represented by vectors:

For each of the n foreign companies associations with companies in Belgium.

Each of the m Belgian companies is represented by a feature.Value of feature in the foreign company’s m-

dimensional vector : 1→connection 0→otherwise.

Page 29: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

4.2 Transactional Data(cont.)

• Two main approaches:

Propositional learners(SVMs, Naive Bayes)

• on the huge, sparse matrix representation

Relational learning/inference

• on the graph representation

Page 30: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Propositional learners

• Method 1: Gather all of the data in a big matrix and apply SVM

Data size ↑ → Time ↑ Class imbalance → NOT perform well Poor performance → Low AUC and lift values of

this approach (SVMT)

Page 31: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Propositional learners(cont.)

• Improvement: Train the SVM on a balanced subset of the data. By equally weighing the number of positive and

negative examples SVM learns to put equal importance on each

of the classes and performs much better (SVMT (50-50)).

Page 32: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Propositional learners(cont.)

• Method 2: Apply a binary Bernoulli Naive Bayes (NB)

• Uses the same input vectors x, but makes an estimate based on the MAP likelihood estimation of a probability parameter for each of the features.

Page 33: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Propositional learners(cont.)

• Fraud is encoded by class label C• C = 1: fraudulence company; C=0 non-fraudulence• xi,j indicate whether a transaction was made from foreign

company j to resident company i• All features are assumed to be conditionally independent of

each other• The NB modeling procedure does not suffer from the class

skew problems of the SVM.

Page 34: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Relational learners

• The idea is to project the bigraph into a unigraph in which foreign companies are connected

• Based on shared Belgian company connections and then apply a relational learner.

BC FCTransactional logs

Page 35: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Relational learners(cont.)

• Equation (1) presents the weighted-voted Relational Neighbor(wvRN) inference method

• The class probability of a node in the graph (a foreign company) is equal to the weighted average probabilities of all of its neighbors

• (j N∈ (xi)).

Page 36: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

Relational learners(cont.)

• Equation (2) shows that the weighting (the similarity between two top nodes) was chosen as a sum over the tanh of the inverse of the degrees of the shared nodes.

Page 37: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

4.3 Stacked model

• Build a model that incorporates all of the available information.

• Efficacy: model incorporates more information than we did before, which should result in a lower modeling bias.

Page 38: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5. RESULTS AND DISCUSSION

Page 39: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5.1 Results

Page 40: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5.1 Results(cont.)

Page 41: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5.1 Results(cont.)

• If one is interested in a global ranking method, the stacked model would be the best design choice.

• The models based on transactional data are better suited for detecting the most likely frauds.

Page 42: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5.2 Comprehensibility

• The auditors need to understand the exact reasons why classification models make particular decisions.

• Cases (even if they be few) where the model makes an obvious wrong decision can create disillusionment with the system and reluctance to use it.

• it is essential that the decisions made by the predictive model can be explained.

Page 43: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5.2 Comprehensibility(cont.)

• Global explanations provide improved understanding of the complete model, and its performance over the entire space of possible instances.

• Instance-level explanations on the other hand provide explanations for the model’s prediction regarding a particular instance.

Page 44: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5.3 Deployment

• The interaction between the two worlds [academia and government] has proven very valuable.

• Other countries are now visiting Belgium to see how the Social Intelligence and Investigation Service and the Special Tax Inspection service apply this technique.

Page 45: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

5.3 Deployment(cont.)

• During deployment, the system has to deal with large volumes of heterogeneous data and with new data arriving every quarter.

• The stacked model approach specifically deals with the variety of the data by combining the transactional data from invoices with structural data from tax declarations.

• The need to retrain the model frequently is facilitated by the scalability of the underlying (naive Bayes and wvRN) methods.

Page 46: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

6. CONCLUSION

• In this paper is the first data-mining-based method for building a system for detecting corporate residence fraud.

• The system is based on transactional and structured data, which is gathered by the Belgian government.

• The success of such a detection system in practice depends on a combination of factors, including efficiency, efficacy and comprehensibility.

Page 47: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

6. CONCLUSION(cont.)

• An important part of our research was to evaluate how one can cope with these conflicting requirements.

• It is important to continue to stress the importance of deploying counter-fraud measures for the social good of countries.

Page 48: Corporate Residence Fraud Detection 指導老師:徐立群教授 學生:陳威翰 R16034177 莊詠絮 R16034208.

• Thank you for your listening!