Fraud Detection Using Data Mining Techniques: Applications in the Motor Insurance Industry

Hian Chye Koh and Gabriel Gervais
SIM University (Singapore)
School of Business
535A Clementi Road
Singapore, 599490
Tel: (65) 6248-9644 Fax: (65) 6462-4377
Email: .sg

Abstract

Fraud costs stakeholders (e.g., victims, merchants and insurance companies) billions of dollars worldwide, and effective fraud detection is the key to preventing it. This paper examines fraud detection within a data mining framework by first discussing the general approaches to fraud detection and then focusing on particular data mining techniques that can be applied to improve it. Finally, this paper illustrates the special case of fraud detection in the motor insurance industry, where the identification of illicit activities can be particularly challenging because of the nature of motor insurance fraud.

Keywords: data mining; fraud detection; motor insurance.

Introduction

Fraud has serious implications in business. Take, for example, the case of credit card fraud, which includes stolen cards, counterfeit cards and compromised accounts (e.g., application fraud and skimming). As reported in Wikipedia (“Credit card fraud,” 2009), the cost of credit card fraud in 2006 was 7 cents per 100 dollars of transactions. Given the huge volume of annual credit card transactions, this translated into fraudulent activities amounting to billions of dollars worldwide. Accordingly, fraud detection has important applications. In particular, effective fraud detection can contribute to fraud prevention.

This paper has two objectives. The first objective is to examine the use of data mining techniques in fraud detection. Within the data mining framework, fraud detection can be done using the clustering approach, the expectations approach or the predictive modelling approach. The second objective is to focus on motor insurance fraud and illustrate fraud detection in this area.

While credit card fraud is often self-reported (e.g., credit cardholders will quickly find out and report fraudulent transactions made with their credit cards), motor insurance fraud is much harder to pin down (e.g., deliberate “accidents” and inflated claims such as personal injury and unnecessary or excessive repairs). This difficulty is often compounded by possible collusion among different parties (e.g., insurance policyholders and car workshops). Hence, the issues facing fraud detection in motor insurance can be very challenging.

The paper is organised into the following sections. The next (second) section reviews the literature in data mining, fraud detection and motor insurance fraud. The third section discusses the research methodology, including fraud detection approaches and sample data. The fourth section presents the findings and implications. The illustrations focus on two datasets relating to repairs and claims, respectively. The clustering and expectations approaches are applied. Finally, the concluding section summarises the study and highlights the limitations and future directions.

It is hoped that this exploratory paper can make a contribution to the fraud detection and data mining literature.

Literature Review

The term “data mining” is not new in that it has been used for a long time to denote the idea of unscientific “fishing” or “dredging” of data in data analysis. That is, if an analyst is searching for a particular conclusion, then there is a good chance that this conclusion can be “found” by repeatedly analysing the data in various ways, including inappropriate ways. For a long time, the term “data mining” has had a negative connotation.

“Data mining” as used today, however, refers to an entirely different concept from that of unscientific data fishing or dredging. The new concept of data mining can be considered a recently developed methodology and technology, coming into prominence only in 1994 (Trybula, 1997).

Judging from the number of its definitions in the literature (for example, Hormozi & Giles (2004) found six different ones), data mining appears to be a discipline whose domain is still evolving. Yet, upon a closer inspection of the literature published to date, data mining researchers, practitioners and users do concur on its key aims and characteristics. As a data analysis tool, data mining aims to uncover previously unknown trends and patterns or establish relationships in large datasets so as to help decision makers make better decisions. It fulfils this purpose by employing statistical methods such as cluster and logistic analyses, but also by using data analysis methods borrowed from other disciplines (e.g. neural networks in artificial intelligence and decision trees in machine learning).

Data mining has been used by both the public and private sectors, and increasingly so. For instance, governments find it useful in ensuring corporate governance compliance (Songini, 2004), fighting money-laundering activities (Zengan & Mao, 2007) and supporting counter-terrorism activities (Baesens, Mues, Martens, & Vanthienen, 2009). In the private sector, companies use data mining in business forecasting, marketing (e.g. market segmentation, advertising campaign optimisation and customer churn reduction; see Agosta, 2004), customer relationship management (e.g. customer acquisition and retention; see Hormozi & Giles, 2004) as well as in corporate governance (Ata & Seyreck, 2009; Volonino, Gessner, & Kermis, 2004; Johnson, 2004).

However, because data mining excels at anomaly detection in large datasets (a particularly useful feature for unearthing outliers), financial institutions are increasingly relying on it for risk management (e.g. credit scoring and bankruptcy prediction) (Sinha & Zhao, 2008) and fraud detection and prevention (O'Flaherty, 2005). This is done either internally against computer misuse (Heatley & Otto, 1998) and accounting fraud (Jans, Lybaert, & Vanhoof, 2010; Dubinsky & Warner, 2008) or externally to combat credit/debit card fraud (Hand, Whitrow, Adams, Juszczak, & Weston, 2008; Huber, 2004) or insurance claim fraud (Rejesus, 2004).

Alongside health and medical insurance fraud, motor insurance fraud represents a significant and very costly problem for many stakeholders, not only the insurance companies. According to the Coalition Against Insurance Fraud, a US-based not-for-profit organisation representing the interests of a number of American insurance companies, federal and state government authorities as well as some consumer associations, motor insurance fraud falls into three categories: (1) underwriting fraud, where dishonest drivers try to lower motor insurance premiums by lying on their insurance applications or renewals; (2) staged car accidents; and (3) fraudulent and abusive car accident injury claims “which added $4.8 billion to $6.8 billion in excess payments to auto injury claims in 2007” (“Go figure: Fraud data - Auto insurance,” n.d.). Recent studies in motor insurance fraud include Viaene, Derrig, & Dedene (2005), Agyemang et al. (2006), Takeuchi & Yamanishi (2006), Eberle & Holder (2007), Deshmeh & Rahmati (2008) and Lian, Lida, Ying, & Lee (2009). They involve the use of Bayesian learning neural networks, numeric and symbolic outlier mining techniques, time-series analyses, graphs, association analyses and cluster-based outlier detection, respectively.

For example, Eberle & Holder (2007) presented graph-based approaches to uncovering anomalies and developed three algorithms to discover particular anomalous types. They validated all three approaches using synthetic data and found that the algorithms were able to detect the anomalies with very high detection rates and minimal false positives. They also validated the algorithms, with very good results, using real-world cargo data and actual fraud scenarios injected into the dataset.

Deshmeh & Rahmati (2008) addressed the problem of detecting anomalies in horizontally distributed data. They trained local predictors and extracted association rules using the difference between predicted and actual values on a context dataset. These association rules are used to represent normal and anomalous behaviours, while a final set of learners uses these representations to detect anomalies.

Further, in Lian et al. (2009), outlier detection was applied to detect observations that were grossly different from or inconsistent with the remaining observations in the dataset. Traditionally, outliers are considered as single points. However, many abnormal events have both temporal and spatial locality and might form small clusters that also need to be deemed outliers. In this context, Lian et al. (2009) presented a new definition and detection algorithm for outliers: cluster-based outliers, a notion that is meaningful and gives importance to local data behaviour.

Research Methodology

This section discusses fraud detection approaches in general and the analyses performed in the study in particular. It also discusses the data used in the illustrations. For confidentiality reasons, no real cases are incorporated in the datasets. Instead, the datasets are simulated based on patterns found in motor insurance data.

Fraud Detection Approaches

Generally, fraud detection can be done using the: (1) clustering approach, (2) expectations approach, and (3) predictive modelling approach. While the first two approaches highlight suspicious cases for further fraud investigation, the last approach directly predicts the probability of fraud.

The clustering approach focuses on “normal” patterns/clusters and searches for deviations from the “norm”. These deviations flag suspicious cases that may be further investigated for fraud. They indicate outliers only and not necessarily fraud cases.

The expectations approach, on the other hand, focuses on what the value should be (the expected value) and compares it with what the value is (the actual value). Large deviations are suspicious.
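As a minimal sketch of the expectations approach, assuming the expected values are already available from some model, the following Python snippet computes the difference between actual and expected values and flags the large ones. The function name `flag_large_deviations`, the sample figures and the threshold are hypothetical and chosen purely for illustration; they are not taken from the study's data.

```python
from typing import List, Sequence


def flag_large_deviations(actual: Sequence[float],
                          expected: Sequence[float],
                          threshold: float) -> List[int]:
    """Return positions where actual exceeds expected by more than threshold."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    return [i for i, (a, e) in enumerate(zip(actual, expected))
            if a - e > threshold]


# Hypothetical repair costs (S$): actual amounts and model-generated expectations.
actual_cost = [4200.0, 15800.0, 3100.0, 52000.0]
expected_cost = [4000.0, 15000.0, 3500.0, 12000.0]

# Only the last case exceeds its expectation by more than S$30,000.
suspicious = flag_large_deviations(actual_cost, expected_cost, threshold=30000.0)
print(suspicious)  # [3]
```

A flagged position only marks a case for further investigation; it is not evidence of fraud in itself.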

This approach requires a predictive model that generates the expectations.

Finally, the predictive modelling approach constructs a predictive model that predicts the probability of fraud. Such a model attempts to differentiate fraud from non-fraud cases and hence requires data from both categories. This data requirement may be difficult to satisfy in some types of fraud (e.g., motor insurance fraud). In particular, the fraud data may not be sufficient because there may not be many cases of confirmed fraud, relative to non-fraud cases.

This research employs only the clustering and expectations approaches. Further, data mining techniques such as outlier clustering (similar to the proprietary IBM SPSS TwoStep clustering) and decision trees are used to generate the clustering results and expected/predicted values. The data mining software IBM SPSS Modeler (previously called SPSS Clementine) is used in this study.

Sample Data

Two motor insurance datasets are used in this study. The first dataset comprises repairs data with the following inputs: (1) age of driver, (2) gender of driver, (3) claim type, (4) number of injuries, (5) excess amount category, (6) repair workshop, (7) odometer reading, (8) brand of car, (9) year of manufacture, (10) initial estimate of repair cost, and (11) final amount of repair cost. There are a total of 15,000 observations in the repairs dataset.

The second dataset comprises claims data and has 50,000 observations. It captures the following inputs: (1) age of insured, (2) gender of insured, (3) marital status of insured, (4) occupation of insured, (5) nationality of insured, (6) policy type, (7) policyholder type, (8) number of policy renewals, (9) type of accident, (10) type of damage or injury, (11) property or body damage, (12) insurance coverage under claim, (13) whether own damage or third party claim, (14) whether claimant is policyholder, (15) claim amount, and (16) paid out amount.

Although many inputs are captured in both datasets, not all the inputs are used in the fraud detection analyses. In particular, inputs with a substantial proportion of missing values are not included.

Findings and Implications

The analyses performed can be grouped according to the datasets on which they are performed, namely the repairs dataset and the claims dataset.

Repairs Dataset

Two models were constructed on the repairs dataset. The first model used the expectations approach to estimate/compute the expected repair cost, while the second model looked at the difference between the final amount and the initial estimate of the repair cost (i.e., diff = final amount - initial estimate).

Repairs Model 1

In this model, the following inputs were included in the analysis: (1) age of driver, (2) gender of driver, (3) claim type, (4) number of injuries, (5) excess amount category, (6) repair workshop, (7) odometer reading, (8) brand of car, and (9) year of manufacture. The output was the final amount of repair cost. Regression analysis, neural network, CHAID and CART (the last two being decision trees) were employed to construct the prediction model.

Figure 1 shows the accuracy and hit rates of the models. Based on these results, the neural network performed the best, followed closely by the CHAID and CART models. The regression model did not perform well in terms of predicting the final amount of repair cost. Given the advantages of decision trees (e.g., ease of interpretation and deployment), the CHAID decision tree was selected for fraud detection.

-Insert Figure 1 about here-

The CHAID decision tree indicates that the following inputs are significantly associated with the repair cost: (1) brand of car, (2) year of manufacture, and (3) repair workshop. In particular, luxury cars and newer cars show a higher level of repair costs. Certain repair workshops tend to charge more too. Figure 2 presents the CHAID decision tree.

-Insert Figure 2 about here-

To flag suspicious cases, the difference between the final amount of repair cost and the expected/predicted repair cost (as generated by the CHAID decision tree) was computed. Figure 3 shows the observations with high differences. In particular, 46 observations have differences greater than S$30,000. These are the flagged suspicious fraud cases.

-Insert Figure 3 about here-

Repairs Model 2

The second model looks at the difference between the final amount and the initial estimate of the repair cost (i.e., diff = final amount - initial estimate). The difference is dichotomised such that “2” represents a difference of more than S$1,000. A larger amount shows a greater difference between the expected repair cost (based on the initial estimate) and the actual repair cost (based on the final amount). This model highlights the “risk” factors that may need special consideration or scrutiny and identifies large differences that warrant future fraud investigation. The results show repair workshop to be the most significant “risk” factor.

Claims Dataset

Two models were also constructed on the claims dataset. The first model used the “anomaly” node in IBM SPSS Modeler to cluster the observations and identify outliers. The second model was constructed to compute the expected/predicted paid out amount. The difference between the actual paid out amount and the expected/predicted paid out amount can be used to flag suspicious cases. The larger the difference, the more suspicious the observation is of fraud.

Claims Model 1

In this model, the following inputs were used in the analysis: (1) age of insured, (2) gender of insured, (3) marital status of insured, (4) occupation of insured, (5) nationality of insured, (6) policy type, (7) policyholder type, (8) number of policy renewals, (9) type of accident, (10) type of damage or injury, (11) property or body damage, (12) insurance coverage under claim, (13) whether own damage or third party claim, (14) whether claimant is policyholder, (15) claim amount, and (16) paid out amount. The “anomaly” node in IBM SPSS Modeler first clusters the observations and then computes the anomaly index for each observation. Observations with an anomaly index greater than the threshold are flagged as outliers (or anomalies). These observations are suspicious cases that should be further investigated for fraud.

The anomaly results are shown in Figure 4. For this application, an anomaly index threshold of 4.0 was used, resulting in 139 flagged suspicious cases that warrant further fraud investigation. There were five natural clusters in the claims dataset.

-Insert Figure 4 about here-

Claims Model 2

In this model, the expectations approach is used to estimate/compute the expected total amount paid out. All the inputs in the claims dataset were included in the modelling except the claim amount and the paid out amount (which is the output). The CHAID decision tree appears to offer the best compromise between the accuracy and hit rates. The results indicate that the three most significant inputs associated with the paid out amount are: (1) type of accident, (2) type of damage or injury, and (3) insurance coverage under claim. In particular, single vehicle accidents seem to attract the highest claims.

To flag suspicious cases, the difference between the actual paid out amount and the expected/predicted paid out amount (as generated by the CHAID decision tree) was computed. Figure 5 shows the observations with high differences. In particular, 69 observations have differences greater than S$60,000. These are the flagged suspicious fraud cases.
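The cluster-then-score idea behind Claims Model 1 can be illustrated with a small Python sketch. It is a simplified stand-in and not the algorithm implemented in the IBM SPSS Modeler “anomaly” node: here the index of an observation is assumed to be its distance to its own cluster centroid divided by the average such distance within that cluster. The cluster labels are taken as given, and the function name `anomaly_indices` and all data values are hypothetical; only the threshold of 4.0 comes from the text.

```python
import math
from typing import Dict, List, Sequence, Tuple


def anomaly_indices(points: Sequence[Tuple[float, ...]],
                    labels: Sequence[int]) -> List[float]:
    """Distance to own cluster centroid divided by the cluster's mean distance."""
    # Group observation positions by their (pre-assigned) cluster label.
    members: Dict[int, List[int]] = {}
    for i, label in enumerate(labels):
        members.setdefault(label, []).append(i)

    indices = [0.0] * len(points)
    for idxs in members.values():
        dim = len(points[idxs[0]])
        centroid = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        dists = {
            i: math.sqrt(sum((points[i][d] - centroid[d]) ** 2 for d in range(dim)))
            for i in idxs
        }
        mean_dist = sum(dists.values()) / len(idxs)
        for i in idxs:
            # A cluster with no spread has no outliers; score its members as average.
            indices[i] = dists[i] / mean_dist if mean_dist > 0 else 1.0
    return indices


# Hypothetical (claim amount, paid out amount) pairs in S$ thousands.
cluster_0 = [(4, 3), (5, 4), (6, 5), (4, 4), (5, 5), (6, 4), (5, 3), (4, 5), (6, 3)]
outlier = [(60, 55)]                      # assigned to cluster 0 but far from it
cluster_1 = [(20, 18), (21, 19), (22, 20)]

points = cluster_0 + outlier + cluster_1
labels = [0] * 10 + [1] * 3

scores = anomaly_indices(points, labels)
flagged = [i for i, s in enumerate(scores) if s > 4.0]  # threshold used in the text
print(flagged)  # [9]
```

Because the score is relative to each cluster's own spread, a tight cluster and a loose cluster can share the same threshold, which is one plausible reason for expressing the cut-off as an index rather than as a raw distance.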
