2022-6-23 (page 1 of 41)

Concept description: generate descriptions for characterization and comparison of the data
- the simplest kind of descriptive data mining
- sometimes called class description when the concept to be described refers to a class of objects
- Characterization: provides a concise and succinct summarization of the given collection of data
- Comparison (discrimination): provides descriptions comparing two or more collections of data

Both characterization and discrimination are based on data generalization and summarization.
- Data generalization: a process which abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels
- Data generalization approaches:
  - data cube approach
  - attribute-oriented induction approach

Data cube approach:
- The data for analysis are stored in a multidimensional database, or data cube.
- Generalization and specialization can be performed on a data cube by roll-up and drill-down.
- This is not an approach for concept description, only for data generalization.
- Limitations:
  - most commercial data cube implementations confine the types of dimensions to simple nonnumeric data and the types of measures to simple aggregated numeric values
  - concept hierarchies can be automatically generated from numeric data to form numeric dimensions; however, this is a result of recent data mining research and is not available in most commercial systems
  - the approach cannot tell which dimensions should be used and what levels the generalization should reach

OLAP
- restricted to certain kinds of attributes and measure types
- user-controlled process

Concept description
- can handle complex data types of the attributes and their aggregations
- a more automated process

Attribute-oriented induction (AOI):
- proposed in 1989: Y. Cai, N. Cercone, and J. Han, KDD Workshop at IJCAI-89
- in its initial proposal, AOI is a relational database query-oriented, generalization-based, online data analysis technique
- now data cube and offline precomputation can also be used
- can be used for both characterization and discrimination
- general idea:
  - collect the task-relevant data
  - perform generalization by attribute removal or attribute generalization
  - apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  - interactive presentation with users

- Data focusing: the specification of task-relevant data, whose result is the initial relation
- Data generalization
  - attribute removal: if there is a large set of distinct values for an attribute, but either (1) there is no generalization operator on the attribute, or (2) its higher level concepts are expressed in terms of other attributes
  - attribute generalization: if there is a large set of distinct values for an attribute, and there exists a set of generalization operators on the attribute
- Presentation

The control of how high an attribute should be generalized is quite subjective. The control of this process is called attribute generalization control.
- Attribute generalization threshold control
  - if the number of distinct values in an attribute is greater than the threshold, then further attribute removal or generalization should be performed
  - either set one generalization threshold for all the attributes, or set one threshold for each attribute
  - typically ranging from 2 to 8
- Generalized relation threshold control
  - if the number of distinct tuples in a generalized relation is greater than the threshold, then further generalization should be performed
  - set a threshold for the size of the generalized relation
  - typically ranging from 10 to 30
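
The removal, generalization, and merge steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `attribute_oriented_induction`, the sample tuples, and the one-level `value -> parent` dictionaries standing in for concept hierarchies are all hypothetical, and only the attribute generalization threshold is applied (the generalized relation threshold is left out).

```python
from collections import Counter


def attribute_oriented_induction(rows, attributes, hierarchies, threshold):
    """Generalize `rows` attribute by attribute, then merge identical tuples.

    hierarchies maps an attribute name to a dict {value: parent_value};
    the mappings are assumed to be acyclic (hypothetical representation).
    """
    rows = [dict(r) for r in rows]  # work on a copy of the initial relation
    kept = list(attributes)
    for attr in attributes:
        # Keep climbing while the attribute has too many distinct values.
        while len({r[attr] for r in rows}) > threshold:
            parent = hierarchies.get(attr)
            current = [r[attr] for r in rows]
            climbed = [parent.get(v, v) for v in current] if parent else current
            if climbed == current:
                # No generalization operator (or already at the top): remove.
                kept.remove(attr)
                break
            for r, v in zip(rows, climbed):
                r[attr] = v
    # Merge identical generalized tuples and accumulate their counts.
    relation = Counter(tuple(r[a] for a in kept) for r in rows)
    return kept, relation


# Hypothetical initial relation and concept hierarchies.
students = [
    {"name": "Jim", "major": "CS", "birth_place": "Vancouver"},
    {"name": "Scott", "major": "CS", "birth_place": "Montreal"},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle"},
    {"name": "Ann", "major": "Math", "birth_place": "Toronto"},
]
hierarchies = {
    "major": {"CS": "Engineering", "Physics": "Science", "Math": "Science"},
    "birth_place": {"Vancouver": "Canada", "Montreal": "Canada",
                    "Toronto": "Canada", "Seattle": "Foreign"},
}

kept, relation = attribute_oriented_induction(
    students, ["name", "major", "birth_place"], hierarchies, threshold=2)
print(kept)
for generalized_tuple, count in relation.items():
    print(generalized_tuple, count)
```

In this sketch `name` has many distinct values and no hierarchy, so it is removed, while `major` and `birth_place` are generalized one level before identical tuples are merged.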

[Tables: initial relation; prime generalized relation]

- InitialRel: query processing of task-relevant data, deriving the initial relation
- PreGen: based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize?
- PrimeGen: based on the PreGen plan, perform generalization to the appropriate level to derive a "prime generalized relation", accumulating the counts
- Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations

Construct a data cube on-the-fly if either the task-relevant data set is too specific to match any predefined data cube, or it is not very large
- benefit: facilitates efficient drill-down analysis
- cost: increases response time because of the computation of the data cube
- balance: compute a cube-structured "subprime" relation in which each dimension of the generalized relation is a few levels deeper than the level of the prime relation

Use a predefined data cube if the granularity of the task-relevant data can match that of the predefined data cube and the set of task-relevant data is quite large
- benefit: facilitates attribute analysis, attribute-oriented induction, slicing and dicing, drill-down, and roll-up
- cost: the cost of cube computation, and the nontrivial storage overhead

[Figure: generalized relation, sales in 1997]
[Figure: crosstab, sales in 1997]
[Figure: bar chart, sales in 1997]
[Figure: pie chart, sales in 1997]
[Figure: 3-D cube]

Quantitative characteristic rule

$$\forall X,\ item(X)=\text{"computer"} \Rightarrow (location(X)=\text{"Asia"})\,[t:25.00\%] \lor (location(X)=\text{"Europe"})\,[t:30.00\%] \lor (location(X)=\text{"North\_America"})\,[t:45.00\%]$$

- a logic rule that is associated with quantitative information is called a quantitative rule
- the general form of a quantitative characteristic rule is

$$\forall X,\ target\_class(X) \Rightarrow condition_1(X)\,[t:w_1] \lor \cdots \lor condition_n(X)\,[t:w_n]$$

  where the t-weight describes the typicality of each disjunct in the rule:

$$t\_weight = \frac{count(q_n)}{\sum_{t=1}^{N} count(q_t)}$$

  Here count(q_n) is the count of the generalized tuple behind the n-th disjunct, and the sum runs over all N generalized tuples of the target class.
- a characteristic rule is a necessary condition of the target class

It is difficult for users to determine which dimensions should be included.
- attribute relevance analysis is used to
  - filter out statistically irrelevant or weakly relevant attributes
  - retain or even rank the most relevant attributes
- class characterization which includes the analysis of attribute/dimension relevance is called analytical characterization
- class comparison which includes such analysis is called analytical comparison
- even within the same dimension, different levels of concepts may have different powers for distinguishing a class from others

Intuitively, an attribute is considered highly relevant to a given class if it is likely that the values of the attribute may be used to distinguish the class from others.
- Data collection: collect data for both target class and contrasting class
  - for class characterization, the contrasting class is taken to be the set of comparable data, i.e. data sharing similar attributes, of other classes in the database
- Analytical generalization: perform attribute removal and attribute generalization based on the set of provided attribute analytical thresholds
- Relevance analysis: sort and then select the most relevant attributes

The general idea behind attribute relevance analysis is to compute some measure which is used to quantify the relevance of an attribute with respect to a given class.
- popular measures include:
  - information gain
  - gain ratio
  - gini index
  - uncertainty
  - correlation coefficients

Information gain:
- S: training set
- S_i: training instances of class C_i (i = 1, …, m)
- a_j: values of attribute A (j = 1, …, v)
- the information needed to correctly classify the training set is (each symbol also standing for the number of instances in the set)

$$I(S_1, S_2, \ldots, S_m) = -\sum_{i=1}^{m} \frac{S_i}{S}\log_2\frac{S_i}{S}$$

- suppose attribute A is selected to partition the training set into the subsets S_1, S_2, …, S_v; then the entropy of A, i.e. the information needed to classify all the instances in those subsets, is

$$Ent(A) = -\sum_{j=1}^{v} \frac{S_j}{S} \sum_{i=1}^{m} \frac{S_{ij}}{S_j}\log_2\frac{S_{ij}}{S_j}$$

  where S_ij is the instances of class C_i that are covered by S_j
- then the information gain of selecting A is

$$Gain(A) = I(S_1, S_2, \ldots, S_m) - Ent(A)$$

- the bigger the information gain, the more relevant the attribute A

Target class: Graduate students (total count = 120)

| gender | major | birth_country | age_range | gpa | count |
|---|---|---|---|---|---|
| M | Science | Canada | 20-25 | Very_good | 16 |
| F | Science | Foreign | 25-30 | Excellent | 22 |
| M | Engineering | Foreign | 25-30 | Excellent | 18 |
| F | Science | Foreign | 25-30 | Excellent | 25 |
| M | Science | Canada | 20-25 | Excellent | 21 |
| F | Engineering | Canada | 20-25 | Excellent | 18 |

Contrasting class: Undergraduate students (total count = 130)

| gender | major | birth_country | age_range | gpa | count |
|---|---|---|---|---|---|
| M | Science | Foreign | <20 | Very_good | 18 |
| F | Business | Canada | <20 | Fair | 20 |
| M | Business | Canada | <20 | Fair | 22 |
| F | Science | Canada | 20-25 | Fair | 24 |
| M | Engineering | Foreign | 20-25 | Very_good | 22 |
| F | Engineering | Canada | <20 | Excellent | 24 |

The same twelve tuples are then shown as a single relation with the columns gender, major, birth_country, age_range, gpa, class, and count, where the class column is labeled GS for the six target class rows and S for the six contrasting class rows.

- the information needed to correctly classify the training set is

$$I(S_1, S_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$$

- suppose attribute major is selected to partition the training set
  - for major = "Science": S_11 = 84, S_21 = 42

$$I(S_{11}, S_{21}) = -\frac{84}{126}\log_2\frac{84}{126} - \frac{42}{126}\log_2\frac{42}{126} = 0.9183$$

  - for major = "Engineering": S_12 = 36, S_22 = 46

$$I(S_{12}, S_{22}) = -\frac{36}{82}\log_2\frac{36}{82} - \frac{46}{82}\log_2\frac{46}{82} = 0.9892$$

  - for major = "Business": S_13 = 0, S_23 = 42

$$I(S_{13}, S_{23}) = 0$$

- then the entropy of major is

$$Ent(major) = \frac{126}{250} I(S_{11}, S_{21}) + \frac{82}{250} I(S_{12}, S_{22}) + \frac{42}{250} I(S_{13}, S_{23}) = 0.7873$$

- then the information gain of major is $Gain(major) = I(S_1, S_2) - Ent(major) = 0.9988 - 0.7873 = 0.2115$
- now suppose we use an attribute relevance threshold of 0.1:
  - gender and birth_country are removed as weakly relevant attributes
  - major, gpa, and age_range are kept as strongly relevant attributes
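
The calculation for major can be checked numerically with a short Python sketch. The helper names `info` and `gain` are hypothetical; the class counts per major value are read off the two tables above (graduate count first, undergraduate count second). Because the slide subtracts rounded intermediate values, the computed gain should land close to the quoted 0.2115 but may differ in the last digit.

```python
from math import log2


def info(counts):
    """Expected information I(s_1, ..., s_m) for a list of class counts."""
    total = sum(counts)
    # Zero counts contribute nothing (0 * log 0 is taken as 0).
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain(class_totals, partition):
    """Information gain of an attribute.

    partition maps each attribute value to its per-class counts.
    """
    total = sum(class_totals)
    entropy = sum(sum(sub) / total * info(sub) for sub in partition.values())
    return info(class_totals) - entropy


# (graduate, undergraduate) counts for each value of major.
major = {
    "Science": (84, 42),
    "Engineering": (36, 46),
    "Business": (0, 42),
}

print("I(120, 130) =", info((120, 130)))
print("Gain(major) =", gain((120, 130), major))
```

The same `gain` function applies to any other attribute once its per-value class counts are tabulated in the same way.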

We can also get the information gain of the other attributes; together with major, the values are:

$$Gain(major) = I(S_1, S_2) - Ent(major) = 0.2115$$
$$Gain(gender) = 0.0003$$
$$Gain(birth\_country) = 0.0407$$
$$Gain(gpa) = 0.4490$$
$$Gain(age\_range) = 0.5971$$

- most of the techniques developed for characterization can also be used in comparison
- data generalization (including attribute removal and attribute generalization) should be performed synchronously among all the classes
  - e.g. comparing sales in China on Nov. 9 with sales in USA in year 2000 is almost meaningless
- however, the user can overwrite such a synchronous comparison with his own choices
  - e.g. the user may want to compare sales in Shanghai with sales in Vietnam

- Data collection: collect data for both target class and contrasting class
- Relevance analysis: use attribute relevance analysis for analytical class comparison
- Synchronous generalization: controlled by user-specified dimension thresholds
- Drill-down, roll-up and other OLAP operations: adjust the level of abstraction for the resulting description
- Presentation: in the same forms as those for characterization, except the rule form

Prime generalized relation for the target class: Graduate students

| birth_country | age_range | gpa | count % |
|---|---|---|---|
| Canada | 20-25 | Good | 5.53 |
| Canada | 25-30 | Good | 2.32 |
| Canada | Over_30 | Very_good | 5.86 |
| Other | Over_30 | Excellent | 4.68 |

Prime generalized relation for the contrasting class: Undergraduate students

| birth_country | age_range | gpa | count % |
|---|---|---|---|
| Canada | 15-20 | Fair | 5.53 |
| Canada | 15-20 | Good | 4.53 |
| Canada | 25-30 | Good | 5.02 |
| Other | Over_30 | Excellent | 0.68 |

The general form of a quantitative discriminant rule is

$$\forall X,\ target\_class(X) \Leftarrow condition_1(X)\,[d:w_1] \lor \cdots \lor condition_n(X)\,[d:w_n]$$

where the d-weight describes the discriminability of each disjunct in the rule:

$$d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{N} count(q_a \in C_i)}$$

- a discriminant rule is a sufficient condition of the target class
- example:

| status | birth_country | age_range | gpa | count |
|---|---|---|---|---|
| Graduate | Canada | 25-30 | Good | 90 |
| Undergraduate | Canada | 25-30 | Good | 210 |

$$\forall X,\ graduate\_student(X) \Leftarrow birth\_country(X)=\text{"Canada"} \land age\_range(X)=\text{"25-30"} \land gpa(X)=\text{"good"}\,[d:30\%]$$

  The d-weight follows from the counts: 90 / (90 + 210) = 30%.

Characteristic rule
- necessary condition of the target class
- $t\_weight_1 + \cdots + t\_weight_n = 100\%$

Discriminant rule
- sufficient condition of the target class
- $condport_1 \cdot d\_weight_1 + \cdots + condport_n \cdot d\_weight_n = tclass\_port$, where condport_i is the portion of the instances covered by the i-th antecedent of the rule, and tclass_port is the portion of the instances belonging to the target class

A quantitative description rule can also be expressed as a crosstab:

$$\forall X,\ target\_class(X) \Leftrightarrow condition_1(X)\,[t:w_1, d:w'_1] \lor \cdots \lor condition_n(X)\,[t:w_n, d:w'_n]$$

- example:

$$\forall X,\ Europe(X) \Leftrightarrow (item(X)=\text{"TV"})\,[t:25\%, d:40\%] \lor (item(X)=\text{"computer"})\,[t:75\%, d:30\%]$$

Motivation: to better understand the data: central tendency, dispersion
- dispersion: the degree to which numeric data tend to spread
- measures for central tendency:
  - mean
  - median
  - mode
  - midrange
- measures for dispersion:
  - quartiles
  - variance
  - standard deviation

- Median: middle value if odd number of values, or average of the middle two values otherwise (holistic)
- Mode: value that occurs most frequently in the data set (holistic)
  - unimodal, bimodal, trimodal, multimodal, no mode
- Midrange: the average of the min and max values (algebraic)
- Mean (algebraic):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

- Weighted arithmetic mean (algebraic):

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Quartiles
- Quartiles (holistic): Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range (holistic) …
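
The central tendency and quartile measures listed above map onto Python's standard `statistics` module, as in the sketch below. The sample values and weights are hypothetical, and `weighted_mean` and `midrange` are small helpers written for this example. The inter-quartile range is taken here as Q3 minus Q1; note that `statistics.quantiles` uses its default "exclusive" interpolation method, which is only one of several quartile conventions.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def midrange(values):
    """Average of the minimum and maximum values."""
    return (min(values) + max(values)) / 2


data = [2, 4, 4, 5, 7, 9, 11]        # hypothetical sample
weights = [1, 1, 1, 2, 2, 1, 1]      # hypothetical weights

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)   # list, so multimodal data is handled
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("mean:", mean)
print("weighted mean:", weighted_mean(data, weights))
print("median:", median)
print("mode(s):", modes)
print("midrange:", midrange(data))
print("Q1, Q3, IQR:", q1, q3, iqr)
```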