Recap

Today's topics
- Feature selection for text classification
- Measuring classification performance
- Nearest neighbor categorization

Feature Selection: Why?
- Text collections have a large number of features
  - 10,000 to 1,000,000 unique words, and more
- Make using a particular classifier feasible
  - Some classifiers can't deal with 100,000s of features
- Reduce training time
  - Training time for some methods is quadratic or worse in the number of features (e.g., logistic regression)
- Improve generalization
  - Eliminate noise features
  - Avoid overfitting

Recap: Feature Reduction
- Standard ways of reducing feature space for text
  - Stemming: laugh, laughs, laughing, laughed -> laugh
  - Stop word removal: e.g., eliminate all prepositions
  - Conversion to lower case
  - Tokenization: break on all special characters, e.g., fire-fighter -> fire, fighter

Feature Selection
- Yang and Pedersen 1997
- Comparison of different selection criteria
  - DF: document frequency
  - IG: information gain
  - MI: mutual information
  - CHI: chi-square
- Common strategy
  - Compute statistic for each term
  - Keep the n terms with the highest value of this statistic

Information Gain

(Pointwise) Mutual Information

Chi-Square

                                        Term present   Term absent
Document belongs to category                 A              B
Document does not belong to category         C              D

$\chi^2 = \frac{N (AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)}$

- Use either the maximum or the average $\chi^2$
- Value for complete independence?

Document Frequency
- Number of documents a term occurs in
- Is sometimes used for eliminating both very frequent and very infrequent terms
- How is the document frequency measure different from the other 3 measures?

Yang & Pedersen: Experiments
- Two classification methods
  - kNN (k nearest neighbors; more later)
  - Linear Least Squares Fit: a regression method
- Collections
  - Reuters-22173: 92 categories, 16,000 unique terms
  - Ohsumed (subset of Medline): 14,000 categories, 72,000 unique terms
- ltc term weighting
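
The chi-square criterion compared in these experiments, together with the "compute a statistic per term, keep the top n" strategy, can be sketched in a few lines. This is only an illustration under stated assumptions, not Yang and Pedersen's code: the function names `chi_square` and `select_top_terms` and the toy counts are hypothetical, N is taken to be A + B + C + D, and a term with an empty row or column is scored 0 by convention.

```python
def chi_square(a, b, c, d):
    """Chi-square for one term/category pair from the 2x2 counts.

    a: docs in the category that contain the term
    b: docs in the category that do not contain the term
    c: docs outside the category that contain the term
    d: docs outside the category that do not contain the term
    """
    n = a + b + c + d  # assumption: N is the total number of documents
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # Assumed convention: an empty row or column carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_terms(counts, n_terms):
    """Keep the n_terms terms with the highest chi-square score."""
    scores = {term: chi_square(*abcd) for term, abcd in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_terms]


# Hypothetical counts (A, B, C, D) for one category in 1,000 documents.
toy_counts = {
    "wheat": (40, 10, 5, 945),   # concentrated in the category
    "year": (5, 45, 95, 855),    # spread independently of the category
    "the": (50, 0, 950, 0),      # occurs in every document
}
top = select_top_terms(toy_counts, 1)
```

With these counts, `year` scores exactly 0 because AD = BC, which is the value the statistic takes when term and category are completely independent.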
Experimental procedure:
- Choose feature set size
- Preprocess collection, discarding non-selected features / words
- Apply term weighting -> feature vector for each document
- Train classifier on training set
- Evaluate classifier on test set

Discussion
- You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
- In fact, performance increases with fewer features for IG, DF, and CHI.
- Mutual information is very sensitive to small counts.
- IG does best with the smallest number of features.
- Document frequency is close to optimal. By far the simplest feature selection method.
- Similar results for LLSF (regression).

Results
- Why is selecting common terms a good strategy?

IG, DF, CHI Are Correlated

Information Gain vs Mutual Information
- Information gain is similar to MI for random variables
  - Independence?
- In contrast, pointwise MI ignores non-occurrence of terms
  - E.g., for complete dependence, you get $P(AB) / (P(A) P(B)) = 1/P(A)$, which is larger for rare terms than for frequent terms
- Yang & Pedersen: pointwise MI favors rare terms

Feature Selection: Other Considerations
- Generic vs class-specific
  - Completely generic (class-independent)
  - Separate feature set for each class
  - Mixed (a la Yang & Pedersen)
- Maintainability over time
  - Is aggressive feature selection good or bad for robustness over time?
- Ideal: optimal features selected as part of training

Yang & Pedersen: Limitations
- Don't look at class-specific feature selection
- Don't look at methods that can't handle high-dimensional spaces
- Evaluate category ranking (as opposed to classification accuracy)

Feature Selection: Other Methods
- Stepwise term selection
  - Forward
  - Backward
  - Expensive: need to do $n^2$ iterations of training
- Term clustering
- Dimension reduction: PCA / SVD

Word Rep. vs. Dimension Reduction
- Word representations: one dimension for each word (binary, count, or weight)
- Dimension reduction: each dimension is a unique linear combination of all words (linear case)
- Dimension reduction is good for generic topics ("politics"), bad for specific classes ("ruanda"). Why?
- SVD/PCA computationally expensive
- Higher complexity in implementation
- No clear examples of higher performance through dimension reduction

Measuring Classification: Figures of Merit
- Accuracy of classification
  - Main evaluation criterion in academia
  - More in a moment
- Speed of training statistical classifier
- Speed of classification (docs/hour)
  - No big differences for most algorithms
  - Exceptions: kNN, complex preprocessing
    requirements
- Effort in creating training set (human hours/topic)
  - More on this in Lecture 9 (Active Learning)

Measures of Accuracy
- Error rate
  - Not a good measure for small classes. Why?
- Precision/recall for classification decisions
- F1 measure: $1/F_1 = \frac{1}{2}(1/P + 1/R)$
- Breakeven point
- Correct estimate of size of category
  - Why is this different?
- Precision/recall for ranking classes
- Stability over time / concept drift
- Utility

Precision/Recall for Ranking Classes
- Example: "Bad wheat harvest in Turkey"
- True categories
  - Wheat
  - Turkey
- Ranked category list
  - 0.9: turkey
  - 0.7: poultry
  - 0.5: armenia
  - 0.4: barley
  - 0.3: georgia
- Precision at 5: 0.2 (1 of the 5 ranked categories is correct), Recall at 5: 0.5 (1 of the 2 true categories is found)
- Consider problems with many categories (>10)
- Use method returning scores comparable across categories (not: Naive Bayes)
- Rank categories and compute average precision / recall (or other measure characterizing the precision/recall curve)
- Good measure for interactive support of human categorization
- Useless for an "autonomous" system (e.g., a filter on a stream of newswire stories)

Concept Drift
- Categories change over time
- Example: "president of the united states"
  - 1999: clinton is great feature
  - 2002: clinton is bad feature
- One measure of a text classification
  system is how well it protects against concept drift.
- Feature selection: good or bad to protect against concept drift?

Micro- vs. Macro-Averaging
- If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes, compute contingency table, evaluate.

Micro- vs. Macro-Averaging: Example

Class 1            Truth: yes   Truth: no
Classifier: yes        10           10
Classifier: no         10          970

Class 2            Truth: yes   Truth: no
Classifier: yes        90           10
Classifier: no         10          890

Micro.Av. Table    Truth: yes   Truth: no
Classifier: yes       100           20
Classifier: no         20         1860

- Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
- Microaveraged precision: 100/120 = 0.83
- Why this difference?

Reuters 1
- Newswire text
- Statistics (vary according to version used)
  - Training set: 9,610
  - Test set: 3,662
  - 50% of documents have no category assigned
  - Average document length: 90.6
  - Number of classes: 92
  - Example classes: currency exchange, wheat, gold
  - Max classes assigned: 14
  - Average number of classes assigned: 1.24 for docs with at least one category
- Only about 10 out of 92 categories are large
- Microaveraging measures performance on large categories.

Factors Affecting Measures
- Variability of data
  - Document size/length
  - Quality/style of authorship
  - Uniformity of vocabulary
- Variability of "truth" / gold standard
  - Need definitive judgement on which topic(s) a doc belongs to
    - Usually human
  - Ideally: consistent judgements

Accuracy measurement
- Confusion matrix
[Figure: confusion matrix with axes "Actual topic" and "Topic assigned by classifier"; one entry, 53, is highlighted]
- This (i, j) entry $c_{ij} = 53$ means 53 of the docs actually in topic i were put in topic j by the classifier.

Confusion matrix
- Function of classifier, topics and test docs.
- For a perfect classifier, all off-diagonal entries should be zero.
- For a perfect classifier, if there are n docs in category j then entry (j,j) should be n.
- Straightforward when there is 1 category per document.
- Can be extended to n categories per document.

Confusion measures (1 class / doc)
- Recall: fraction of docs in topic i classified correctly: $c_{ii} / \sum_j c_{ij}$
- Precision: fraction of docs assigned topic i that are actually about topic i: $c_{ii} / \sum_j c_{ji}$
- "Correct
  rate" (1 - error rate): fraction of docs classified correctly: $\sum_i c_{ii} / \sum_{i,j} c_{ij}$, where $c_{ij}$ is the (i, j) entry of the confusion matrix

Integrated Evaluation/Optimization
- Principled approach to training
  - Optimize the measure that performance is measured with
- s: vector of classifier decisions, z: vector of true classes
- h(s,z) = cost of making decisions s for true assignments z

Utility / Cost
- One cost function h is based on the contingency table.
- Assume identical cost for all false positives etc.
- Cost $= \lambda_{11} A + \lambda_{12} B + \lambda_{21} C + \lambda_{22} D$
- For this cost, we get the following optimality criterion

                   Truth: yes                        Truth: no
Classifier: yes    Cost: $\lambda_{11}$, Count: A    Cost: $\lambda_{12}$, Count: B
Classifier: no     Cost: $\lambda_{21}$, Count: C    Cost: $\lambda_{22}$, Count: D

Utility / Cost

                   Truth: yes        Truth: no
Classifier: yes    $\lambda_{11}$    $\lambda_{12}$
Classifier: no     $\lambda_{21}$    $\lambda_{22}$

- Most common cost: 1 for error, 0 for correct. $p_i > ?$
- Product cross-sale: high cost for false positive, low cost for false negative.
- Patent search: low cost for false positive, high cost for false negative.

Are All Optimal Rules of the Form p > θ?
- In the above examples, all you need to do is estimate probability of class membership.
- Can all problems be solved like this?
- No!
- Probability is often not sufficient
- User decision depends on the distribution of relevance
- Example: information filter for terrorism

Naive Bayes

Vector Space Classification
Nearest Neighbor Classification

Recall Vector Space Representation
- Each doc j is a vector, one component for each term (= word).
- Normalize to unit length.
- Have a vector space
  - terms are axes
  - n docs live in this space
  - even with stemming, may have 10,000+ dimensions, or even 1,000,000+

Classification Using Vector Spaces
- Each training doc is a point (vector) labeled by its topic (= class)
- Hypothesis: docs of the same topic form a contiguous region of space
- Define surfaces to delineate topics in space

Topics in a vector space
[Figure: regions labeled Government, Science, Arts]

Given a test doc
- Figure out which region it lies in
- Assign
  corresponding class

Test doc = Government
[Figure: the test doc falls in the Government region; regions labeled Government, Science, Arts]

Binary Classification
- Consider 2 class problems
- How do we define (and find) the separating surface?
- How do we test which region a test doc is in?

Separation by Hyperplanes
- Assume linear separability for now:
  - in 2 dimensions, can separate by a line
  - in higher dimensions, need hyperplanes
- Can find separating hyperplane by linear programming (e.g. perceptron):
  - separator can be expressed as ax + by = c

Linear programming / Perceptron
- Find a, b, c such that
  - $ax + by \ge c$ for red points
  - $ax + by \le c$ for green points.

Relationship to Naive Bayes?
- Find a, b, c such that
  - $ax + by \ge c$ for red points
  - $ax + by \le c$ for green points.

Linear Classifiers
- Many common text classifiers are linear classifiers
- Despite this similarity, large performance differences
  - For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
  - What to do for non-separable problems?

Which hyperplane?
- In general, lots of possible solutions for a, b, c.

Support Vector Machine (SVM)
[Figure labels: support vectors; maximize margin]
- Quadratic programming problem
- The decision function is fully specified by a subset of training samples, the support vectors.
- Text classification method du jour
- Topic of lecture 9

Category: Interest
Example SVM features

  w_i    t_i             w_i    t_i
 0.70    prime          -0.71   dlrs
 0.67    rate           -0.35   world
 0.63    interest       -0.33   sees
 0.60    rates          -0.25   year
 0.46    discount       -0.24   group
 0.43    bundesbank     -0.24   dlr
 0.43    baker          -0.24   january

More Than Two Classes
- Any-of or multiclass classification
  - For n classes, decompose into n binary problems
- One-of classification: each document belongs to exactly one class
  - How do we compose separating surfaces into regions?
- Centroid classification
- K nearest neighbor classification

Composing Surfaces: Issues?

Separating Multiple Topics
- Build a separator between each topic and
  its complementary set (docs from all other topics).
- Given test doc, evaluate it for membership in each topic.
- Declare membership in topics
  - One-of classification: for class with maximum score/confidence/probability
  - Multiclass classification: for classes above threshold

Negative examples
- Formulate as above, except negative examples for a topic are added to its complementary set.
[Figure labels: positive examples; negative examples]

Centroid Classification
- Given training docs for a topic, compute their centroid
- Now have a centroid for each topic
- Given query doc, assign to topic whose centroid is nearest.
- Exercise: Compare to Rocchio

Example
[Figure: regions labeled Government, Science, Arts]

k Nearest Neighbor Classification
- To classify document d into class c
  - Define k-neighborhood N as k nearest neighbors of d
  - Count number of documents l in N that belong to c
  - Estimate P(c|d) as l/k

Cover and Hart 1967
- Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate.
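
The k nearest neighbor steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: documents are assumed to be sparse term-weight dictionaries, nearness is assumed to be cosine similarity on unit-length vectors (the slides only say vectors are normalized to unit length), and the names `normalize`, `dot`, `knn_class_probabilities` and the toy training set are hypothetical.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse term-weight vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def dot(u, v):
    """Dot product of two sparse vectors (cosine if both are unit length)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_class_probabilities(doc, training, k):
    """Estimate P(c|d) as l/k over the k most similar training docs.

    training is a list of (vector, label) pairs.
    """
    d = normalize(doc)
    ranked = sorted(training, key=lambda ex: dot(d, normalize(ex[0])), reverse=True)
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    return {label: count / len(neighbors) for label, count in votes.items()}


# Hypothetical labeled training docs for the three example topics.
training = [
    ({"tax": 1, "vote": 1}, "Government"),
    ({"vote": 2}, "Government"),
    ({"atom": 1, "lab": 1}, "Science"),
    ({"paint": 1}, "Arts"),
]
probs = knn_class_probabilities({"vote": 1, "tax": 1, "lab": 1}, training, k=3)
best = max(probs, key=probs.get)
```

Here the three nearest neighbors are the two Government docs and the Science doc, so the estimate is 2/3 for Government and 1/3 for Science. Every test document is compared against the whole training set, which is why classification speed was listed earlier as an exception for kNN.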