Fraud Detection Using Data Mining Techniques: Applications in the Motor Insurance Industry

Hian Chye Koh and Gabriel Gervais
SIM University (Singapore)
School of Business
535A Clementi Road
Singapore, 599490
Tel: (65) 6248-9644 Fax: (65) 6462-4377
Email: .sg

Abstract

Fraud costs stakeholders (e.g., victims, merchants and insurance companies) billions of dollars worldwide, and effective fraud detection is the key to preventing it. This paper examines fraud detection within a data mining framework by first discussing the general approaches to fraud detection and then focusing on particular data mining techniques that can be applied to improve it. Finally, this paper illustrates the special case of fraud detection in the motor insurance industry, where the identification of illicit activities can be particularly challenging because of the nature of motor insurance fraud.

Keywords: data mining; fraud detection; motor insurance.

Introduction

Fraud has serious implications in business. Take, for example, the case of credit card fraud, which includes stolen cards, counterfeit cards and compromised accounts (e.g., application fraud and skimming). As reported in Wikipedia (“Credit card fraud,” 2009), the cost of credit card fraud in 2006 was 7 cents per 100 dollars of transactions. Given the huge volume of annual credit card transactions, this translated into fraudulent activities amounting to billions of dollars worldwide. Accordingly, fraud detection has important applications. In particular, effective fraud detection can contribute to fraud prevention.

This paper has two objectives. The first objective is to examine the use of data mining techniques in fraud detection. Within the data mining framework, fraud detection can be done using the clustering approach, the expectations approach or the predictive modelling approach. The second objective is to focus on motor insurance fraud and illustrate fraud detection in this area.

While credit card fraud is often self-reported (e.g., credit cardholders will quickly find out and report fraudulent transactions made with their credit cards), motor insurance fraud is much more difficult to get a handle on (e.g., deliberate “accidents”, inflated claims such as personal injury, and unnecessary or excessive repairs). This difficulty is often compounded by possible collusion among different parties (e.g., insurance policyholders and car workshops). Hence, the issues facing fraud detection in motor insurance can be very challenging.

The paper is organised into the following sections. The next (second) section reviews the literature on data mining, fraud detection and motor insurance fraud. The third section discusses the research methodology, including fraud detection approaches and sample data. The fourth section presents the findings and implications. The illustrations focus on two datasets relating to repairs and claims, respectively. The clustering and expectations approaches are applied. Finally, the concluding section summarises the study and highlights the limitations and future directions.

It is hoped that this exploratory paper can make a contribution to the fraud detection and data mining literature.

Literature Review

The term “data mining” is not new in that it has been used for a long time to denote the idea of unscientific “fishing” or “dredging” of data in data analysis. That is, if an analyst is searching for a particular conclusion, then there is a good chance that this conclusion can be “found” by repeatedly analysing the data in various ways, including inappropriate ways. For a long time, the term “data mining” has had a negative connotation.

“Data mining” as used today, however, refers to an entirely different concept from that of unscientific data fishing or dredging. The new concept of data mining can be considered a recently developed methodology and technology, coming into prominence only in 1994 (Trybula, 1997).

Judging from the number of its definitions in the literature (for example, Hormozi & Giles (2004) found six different ones), data mining appears to be a discipline whose domain is still evolving. Yet, upon closer inspection of the literature published to date, data mining researchers, practitioners and users do concur on its key aims and characteristics. As a data analysis tool, data mining aims to uncover previously unknown trends and patterns or establish relationships in large datasets so as to help decision makers make better decisions. It fulfils this purpose not only by employing statistical methods such as cluster and logistic analyses but also by using data analysis methods borrowed from other disciplines (e.g., neural networks in artificial intelligence and decision trees in machine learning).

Data mining has been used by both the public and private sectors, and increasingly so. For instance, governments find it useful in ensuring corporate governance compliance (Songini, 2004), fighting money-laundering activities (Zengan & Mao, 2007) and supporting counter-terrorism activities (Baesens, Mues, Martens, & Vanthienen, 2009). In the private sector, companies use data mining in business forecasting, marketing (e.g., market segmentation, advertising campaign optimisation and customer churn reduction; see Agosta, 2004), customer relationship management (e.g., customer acquisition and retention; see Hormozi & Giles, 2004) as well as in corporate governance (Ata & Seyreck, 2009; Volonino, Gessner, & Kermis, 2004; Johnson, 2004).

Because data mining excels at anomaly detection in large datasets (a particularly useful feature for unearthing outliers), financial institutions are increasingly relying on it for risk management (e.g., credit scoring and bankruptcy prediction) (Sinha & Zhao, 2008) and fraud detection and prevention (O'Flaherty, 2005). This is done either internally against computer misuse (Heatley & Otto, 1998) and accounting fraud (Jans, Lybaert, & Vanhoof, 2010; Dubinsky & Warner, 2008) or externally to combat credit/debit card fraud (Hand, Whitrow, Adams, Juszczak, & Weston, 2008; Huber, 2004) or insurance claim fraud (Rejesus, 2004).

Alongside health and medical insurance fraud, motor insurance fraud represents a significant and very costly problem for many stakeholders, not only the insurance companies. According to the Coalition Against Insurance Fraud, a US-based not-for-profit organisation representing the interests of a number of American insurance companies, federal and state government authorities as well as some consumer associations, motor insurance fraud falls into three categories: (1) underwriting fraud, where dishonest drivers try to lower motor insurance premiums by lying on their insurance applications or renewals; (2) staged car accidents; and (3) fraudulent and abusive car accident injury claims “which added $4.8 billion to $6.8 billion in excess payments to auto injury claims in 2007” (“Go figure: Fraud data - auto insurance,” n.d.).

Recent studies in motor insurance fraud include Viaene, Derrig, & Dedene (2005), Agyemang et al. (2006), Takeuchi & Yamanishi (2006), Eberle & Holder (2007), Deshmeh & Rahmati (2008) and Lian, Lida, Ying, & Lee (2009). They involve the use of Bayesian learning neural networks, numeric and symbolic outlier mining techniques, time-series analyses, graphs, association analyses and cluster-based outlier detection, respectively.

For example, Eberle & Holder (2007) presented graph-based approaches to uncovering anomalies and developed three algorithms to discover particular anomalous types. They validated all three approaches using synthetic data and found that the algorithms were able to detect the anomalies with very high detection rates and minimal false positives. They also validated the algorithms using real-world cargo data and actual fraud scenarios injected into the dataset, with very good results.

Deshmeh & Rahmati (2008) addressed the problem of detecting anomalies in horizontally distributed data. They trained local predictors and extracted association rules using the difference between predicted and actual values on a context dataset. These association rules are used to represent normal and anomalous behaviours, while a final set of learners uses these representations to detect anomalies.

Further, in Lian et al. (2009), outlier detection was applied to detect observations that were grossly different from or inconsistent with the remaining observations in the dataset. Traditionally, outliers are considered as single points. However, many abnormal events have both temporal and spatial locality and might form small clusters that also need to be deemed outliers. In this context, Lian et al. (2009) presented a new definition and detection algorithm for outliers: cluster-based outliers, a definition that is meaningful and gives importance to local data behaviour.

Research Methodology

This section discusses fraud detection approaches in general and the analyses performed in the study in particular. It also discusses the data used in the illustrations. For confidentiality reasons, no real cases are incorporated in the datasets. Instead, the datasets are simulated based on patterns found in motor insurance data.

Fraud Detection Approaches

Generally, fraud detection can be done using: (1) the clustering approach, (2) the expectations approach, and (3) the predictive modelling approach. While the first two approaches highlight suspicious cases for further fraud investigation, the last approach directly predicts the probability of fraud.

The clustering approach focuses on “normal” patterns/clusters and searches for deviations from the “norm”. These deviations flag suspicious cases that may be further investigated for fraud. They indicate outliers only and not necessarily fraud cases. On the other hand, the expectations approach focuses on what the value should be (the expected value) and compares it with what the value is (the actual value). Large deviations are suspicious.
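
The comparison of expected and actual values can be illustrated with a minimal Python sketch. It is not the model used in this study: as a stand-in for a predictive model, it assumes the expected value is simply the mean within a group, and the field names (`workshop`, `cost`), the data and the threshold are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def expected_by_group(records, key, value):
    """Stand-in predictive model (an assumption, not the study's model):
    the expected value is the mean of `value` within each `key` group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {group: mean(values) for group, values in groups.items()}


def flag_large_deviations(records, key, value, threshold):
    """Return (record, deviation) pairs where actual - expected > threshold."""
    expected = expected_by_group(records, key, value)
    flagged = []
    for record in records:
        deviation = record[value] - expected[record[key]]
        if deviation > threshold:
            flagged.append((record, deviation))
    return flagged


# Hypothetical toy data; costs are in arbitrary currency units.
claims = [
    {"id": 1, "workshop": "A", "cost": 1000.0},
    {"id": 2, "workshop": "A", "cost": 1200.0},
    {"id": 3, "workshop": "A", "cost": 1100.0},
    {"id": 4, "workshop": "A", "cost": 9000.0},
    {"id": 5, "workshop": "B", "cost": 2000.0},
    {"id": 6, "workshop": "B", "cost": 2200.0},
]

suspicious = flag_large_deviations(claims, "workshop", "cost", threshold=3000.0)
for record, deviation in suspicious:
    print(record["id"], round(deviation, 2))
```

Note that a flagged case is only a large deviation from the expectation, not confirmed fraud. In this toy version the extreme value also inflates its own group mean, so a more careful expectation would be estimated without the observation being scored.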
The expectations approach requires a predictive model that generates the expectations.

Finally, the predictive modelling approach constructs a predictive model that predicts the probability of fraud. Such a model attempts to differentiate fraud from non-fraud cases and hence requires data from both categories. This data requirement may be difficult to satisfy in some types of fraud (e.g., motor insurance fraud). In particular, the fraud data may not be sufficient because there may not be many cases of confirmed fraud, relative to non-fraud cases.

This research employs only the clustering and expectations approaches. Further, data mining techniques such as outlier clustering (similar to IBM-SPSS's proprietary TwoStep clustering) and decision trees are used to generate the clustering results and expected/predicted values. The data mining software IBM-SPSS Modeller (previously called SPSS Clementine) is used in this study.

Sample Data

Two motor insurance datasets are used in this study. The first dataset comprises repairs data with the following inputs: (1) age of driver, (2) gender of driver, (3) claim type, (4) number of injuries, (5) excess amount category, (6) repair workshop, (7) odometer reading, (8) brand of car, (9) year of manufacture, (10) initial estimate of repair cost, and (11) final amount of repair cost. There are a total of 15,000 observations in the repairs dataset.

The second dataset comprises claims data and has 50,000 observations. It captures the following inputs: (1) age of insured, (2) gender of insured, (3) marital status of insured, (4) occupation of insured, (5) nationality of insured, (6) policy type, (7) policyholder type, (8) number of policy renewals, (9) type of accident, (10) type of damage or injury, (11) property or body damage, (12) insurance coverage under claim, (13) whether own damage or third party claim, (14) whether claimant is policyholder, (15) claim amount, and (16) paid out amount.

Although many inputs are captured in both datasets, not all the inputs are used in the fraud detection analyses. In particular, inputs that contain substantial missing values are not included.

Findings and Implications

The analyses performed can be grouped according to the datasets on which they are performed, namely the repairs dataset and the claims dataset.

Repairs Dataset

Two models were constructed on the repairs dataset. The first model used the expectations approach to estimate/compute the expected repair cost, while the second model looked at the difference between the final amount and the initial estimate of the repair cost (i.e., diff = final amount - initial estimate).

Repairs Model 1

In this model, the following inputs were included in the analysis: (1) age of driver, (2) gender of driver, (3) claim type, (4) number of injuries, (5) excess amount category, (6) repair workshop, (7) odometer reading, (8) brand of car, and (9) year of manufacture. The output was the final amount of repair cost. Regression analysis, neural network, CHAID and CART (the last two being decision trees) were employed to construct the prediction model.

Figure 1 shows the accuracy and hit rates of the models. Based on these results, the neural network performed the best, followed closely by the CHAID and CART models. The regression model did not perform well in terms of predicting the final amount of repair cost. Given the advantages of decision trees (e.g., ease of interpretation and deployment), the CHAID decision tree was selected for fraud detection.

- Insert Figure 1 about here -

The CHAID decision tree indicates that the following inputs are significantly associated with the repair cost: (1) brand of car, (2) year of manufacture, and (3) repair workshop. In particular, luxury cars and newer cars show a higher level of repair costs. Certain repair workshops tend to charge more too. Figure 2 presents the CHAID decision tree.

- Insert Figure 2 about here -

To flag suspicious cases, the difference between the final amount of repair cost and the expected/predicted repair cost (as generated by the CHAID decision tree) was computed. Figure 3 shows the observations with high differences. In particular, 46 observations have differences greater than S$30,000. These are the flagged suspicious fraud cases.

- Insert Figure 3 about here -

Repairs Model 2

The second model looks at the difference between the final amount and the initial estimate of the repair cost (i.e., diff = final amount - initial estimate). The difference is dichotomised such that “2” represents a difference of more than S$1,000. A larger amount shows a greater difference between the expected repair cost (based on the initial estimate) and the actual repair cost (based on the final amount). This model highlights the “risk” factors that may need special consideration or scrutiny and identifies large differences that warrant further fraud investigation. The results show repair workshop to be the most significant “risk” factor.

Claims Dataset

Two models were also constructed on the claims dataset. The first model used the “Anomaly” node in IBM-SPSS Modeller to cluster the observations and identify outliers. The second model was constructed to compute the expected/predicted paid out amount. The difference between the actual paid out amount and the expected/predicted paid out amount can be used to flag suspicious cases. The larger the difference, the more suspicious the observation is of fraud.

Claims Model 1

In this model, the following inputs were used in the analysis: (1) age of insured, (2) gender of insured, (3) marital status of insured, (4) occupation of insured, (5) nationality of insured, (6) policy type, (7) policyholder type, (8) number of policy renewals, (9) type of accident, (10) type of damage or injury, (11) property or body damage, (12) insurance coverage under claim, (13) whether own damage or third party claim, (14) whether claimant is policyholder, (15) claim amount, and (16) paid out amount. The “Anomaly” node in IBM-SPSS Modeller first clusters the observations and then computes the anomaly index for each observation. Observations with an anomaly index greater than the threshold are flagged as outliers (or anomalies). These observations are suspicious cases that should be further investigated for fraud.

The anomaly results are shown in Figure 4. For this application, an anomaly index threshold of 4.0 was used, resulting in 139 flagged suspicious cases that warrant further fraud investigation. There were five natural clusters in the claims dataset.

- Insert Figure 4 about here -

Claims Model 2

In this model, the expectations approach is used to estimate/compute the expected total amount paid out. All the inputs in the claims dataset were included in the modelling except the claim amount and the paid out amount (the latter being the output). The CHAID decision tree appears to be the best compromise between accuracy and hit rate. The results indicate that the three most significant inputs associated with the paid out amount are: (1) type of accident, (2) type of damage or injury, and (3) insurance coverage under claim. In particular, single vehicle accidents seem to attract the highest claims.

To flag suspicious cases, the difference between the actual paid out amount and the expected/predicted paid out amount (as generated by the CHAID decision tree) was computed. Figure 5 shows the observations with high differences. In particular, 69 observations have differences greater than S$60,000. These are the flagged suspicious fraud
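
The cluster-then-score idea described for Claims Model 1 can be illustrated with a simplified Python sketch. It is not the algorithm of the IBM-SPSS Modeller “Anomaly” node: it assumes cluster labels are already available, uses numeric inputs only, and defines a hypothetical index as an observation's distance to its cluster centroid divided by the average such distance in that cluster. The data and the threshold of 2.0 are made up for illustration.

```python
from collections import defaultdict
from math import dist
from statistics import mean


def anomaly_indices(points, labels):
    """Simplified, hypothetical anomaly index: distance of each point to
    its cluster centroid, divided by the mean distance in that cluster."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    # Centroid of each cluster: coordinate-wise mean of its points.
    centroids = {
        label: tuple(mean(coords) for coords in zip(*members))
        for label, members in clusters.items()
    }
    # Average distance to the centroid within each cluster.
    avg_distance = {
        label: mean([dist(p, centroids[label]) for p in members])
        for label, members in clusters.items()
    }

    indices = []
    for point, label in zip(points, labels):
        d = dist(point, centroids[label])
        indices.append(d / avg_distance[label] if avg_distance[label] > 0 else 0.0)
    return indices


# Hypothetical toy data: two clusters, with one far-off point in cluster "a".
points = [
    (0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (11.0, 11.0),
    (20.0, 20.0), (20.0, 22.0), (22.0, 20.0), (22.0, 22.0),
]
labels = ["a", "a", "a", "a", "a", "b", "b", "b", "b"]

indices = anomaly_indices(points, labels)
threshold = 2.0  # toy value chosen for this small example
flagged = [i for i, index in enumerate(indices) if index > threshold]
print(flagged)
```

An index near 1 means an observation is about as far from its cluster centre as a typical member, and larger values mean it is unusually far. As in the study, flagged observations are outliers that warrant investigation, not confirmed fraud cases.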