Recap
- Today's topics:
  - Feature selection for text classification
  - Measuring classification performance
  - Nearest neighbor categorization

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more.
- Make using a particular classifier feasible: some classifiers can't deal with 100,000s of features.
- Reduce training time: training time for some methods is quadratic or worse in the number of features (e.g., logistic regression).
- Improve generalization: eliminate noise features, avoid overfitting.

Recap: Feature Reduction
- Standard ways of reducing the feature space for text:
  - Stemming: laugh, laughs, laughing, laughed -> laugh
  - Stop word removal: e.g., eliminate all prepositions
  - Conversion to lower case
  - Tokenization: break on all special characters: fire-fighter -> fire, fighter

Feature Selection (Yang and Pedersen 1997)
- Comparison of different selection criteria:
  - DF: document frequency
  - IG: information gain
  - MI: mutual information
  - CHI: chi-square
- Common strategy: compute the statistic for each term, then keep the n terms with the highest value of that statistic.

Information Gain
(Pointwise) Mutual Information

Chi-Square
- Contingency table for a term and a category:

                                           Term present   Term absent
  Document belongs to category                  A              B
  Document does not belong to category          C              D

- X^2 = N (AD - BC)^2 / ((A + B)(A + C)(B + D)(C + D))
- Use either the maximum or the average X^2 over categories.
- Value for complete independence?

Document Frequency
- Number of documents a term occurs in.
- Sometimes used for eliminating both very frequent and very infrequent terms.
- How is the document frequency measure different from the other 3 measures?
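As a concrete illustration of the statistic above (my own sketch, not part of the original slides; function names are made up), X^2 can be computed directly from the A/B/C/D counts, and document frequency falls out of the same table:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for a term/category 2x2 contingency table.

    A: docs in the category containing the term
    B: docs in the category not containing the term
    C: docs outside the category containing the term
    D: docs outside the category not containing the term
    """
    N = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    if denom == 0:          # degenerate table (term or class absent everywhere)
        return 0.0
    return N * (A * D - B * C) ** 2 / denom


def document_frequency(A, C):
    """Document frequency is just A + C: the number of docs the term occurs in."""
    return A + C


# Complete independence (the term occurs at the same rate inside and
# outside the category) gives X^2 = 0.
print(chi_square(10, 90, 10, 90))        # 0.0
print(document_frequency(10, 10))        # 20
```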
Yang & Pedersen: Experiments
- Two classification methods:
  - kNN (k nearest neighbors; more later)
  - Linear Least Squares Fit (a regression method)
- Collections:
  - Reuters-22173: 92 categories, 16,000 unique terms
  - Ohsumed (a subset of MEDLINE): 14,000 categories, 72,000 unique terms
- ltc term weighting

Yang & Pedersen: Experiments (procedure)
- Choose the feature set size.
- Preprocess the collection, discarding non-selected features / words.
- Apply term weighting -> feature vector for each document.
- Train the classifier on the training set.
- Evaluate the classifier on the test set.
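A minimal sketch of this experimental pipeline (my own illustration, not from the slides; it uses document frequency as the selection criterion, an ltc-style weighting, and made-up toy documents):

```python
import math
from collections import Counter


def select_features(train_docs, n):
    """Keep the n terms with the highest document frequency (the simplest
    criterion from Yang & Pedersen; IG/CHI/MI can be swapped in the same way)."""
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))
    return {t for t, _ in df.most_common(n)}, df


def ltc_vector(doc, vocab, df, num_docs):
    """'ltc'-style weighting: log tf x idf, cosine-normalized,
    restricted to the selected vocabulary."""
    tf = Counter(t for t in doc if t in vocab)
    w = {t: (1 + math.log(c)) * math.log(num_docs / df[t]) for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}


# Tiny illustration (documents are pre-tokenized lists of terms).
train = [["wheat", "harvest", "price"], ["gold", "price", "rise"], ["wheat", "export"]]
vocab, df = select_features(train, n=3)
vectors = [ltc_vector(d, vocab, df, len(train)) for d in train]
print(vocab)
print(vectors[0])
```

Training and evaluating the classifier on these vectors would follow as in the procedure above.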
Discussion
- You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
- In fact, performance increases with fewer features for IG, DF, and CHI.
- Mutual information is very sensitive to small counts.
- IG does best with the smallest number of features.
- Document frequency is close to optimal, and by far the simplest feature selection method.
- Similar results for LLSF (regression).

Results
- Why is selecting common terms a good strategy?
- IG, DF, and CHI are correlated.

Information Gain vs Mutual Information
- Information gain is similar to MI for random variables. Independence?
- In contrast, pointwise MI ignores non-occurrence of terms.
- E.g., for complete dependence you get P(A, B) / (P(A) P(B)) = 1 / P(A), which is larger for rare terms than for frequent terms.
- Yang & Pedersen: pointwise MI favors rare terms.
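The 1/P(A) effect is easy to check numerically. A small sketch (my own, estimating pointwise MI from the same A/B/C/D counts as above) shows that under complete dependence the rarer term gets the higher score:

```python
import math


def pointwise_mi(A, B, C, D):
    """Pointwise mutual information between 'term present' and 'in category',
    estimated from the same A/B/C/D contingency counts as chi-square."""
    N = A + B + C + D
    p_tc = A / N                      # P(term, category)
    p_t = (A + C) / N                 # P(term)
    p_c = (A + B) / N                 # P(category)
    return math.log(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")


# Complete dependence (the term occurs exactly in the category's docs):
# MI = log(1 / P(term)), so the rarer term scores higher.
print(pointwise_mi(A=2,   B=0, C=0, D=998))   # rare term, about 6.2
print(pointwise_mi(A=200, B=0, C=0, D=800))   # frequent term, about 1.6
```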
Feature Selection: Other Considerations
- Generic vs class-specific:
  - Completely generic (class-independent)
  - A separate feature set for each class
  - Mixed (a la Yang & Pedersen)
- Maintainability over time: is aggressive feature selection good or bad for robustness over time?
- Ideal: optimal features selected as part of training.

Yang & Pedersen: Limitations
- Don't look at class-specific feature selection.
- Don't look at methods that can't handle high-dimensional spaces.
- Evaluate category ranking (as opposed to classification accuracy).

Feature Selection: Other Methods
- Stepwise term selection (forward, backward): expensive, needs on the order of n^2 iterations of training.
- Term clustering.
- Dimension reduction: PCA / SVD.

Word Rep. vs. Dimension Reduction
- Word representations: one dimension for each word (binary, count, or weight).
- Dimension reduction: each dimension is a unique linear combination of all words (linear case).
- Dimension reduction is good for generic topics ("politics"), bad for specific classes ("ruanda"). Why?
- SVD/PCA is computationally expensive and more complex to implement.
- No clear examples of higher performance through dimension reduction.

Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia; more in a moment.
- Speed of training the statistical classifier.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions: kNN, complex preprocessing requirements.
- Effort in creating the training set (human hours/topic): more on this in Lecture 9 (Active Learning).

Measures of Accuracy
- Error rate: not a good measure for small classes. Why?
- Precision/recall for classification decisions.
- F1 measure: 1/F1 = 1/2 (1/P + 1/R), the harmonic mean of precision and recall.
- Breakeven point.
- Correct estimate of the size of a category. Why is this different?
- Precision/recall for ranking classes.
- Stability over time / concept drift.
- Utility.
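A small sketch of these decision-based measures (my own illustration; the counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from per-class decision counts.
    F1 is the harmonic mean: 1/F1 = 0.5 * (1/P + 1/R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Example: 10 true positives, 10 false positives, 10 false negatives.
print(precision_recall_f1(10, 10, 10))   # (0.5, 0.5, 0.5)
```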
Precision/Recall for Ranking Classes
- Example: "Bad wheat harvest in Turkey"
- True categories: wheat, turkey
- Ranked category list: 0.9 turkey, 0.7 poultry, 0.5 armenia, 0.4 barley, 0.3 georgia
- Precision at 5: 0.2 (1 of the top 5 is a true category); Recall at 5: 0.5 (1 of the 2 true categories appears in the top 5)

Precision/Recall for Ranking Classes
- Consider problems with many categories (> 10).
- Use a method that returns scores comparable across categories (not: Naive Bayes).
- Rank the categories and compute average precision/recall (or another measure characterizing the precision/recall curve).
- A good measure for interactive support of human categorization.
- Useless for an "autonomous" system (e.g., a filter on a stream of newswire stories).
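A sketch of precision/recall at rank k applied to the example above (my own code, not from the slides):

```python
def precision_recall_at_k(ranked, true_categories, k):
    """Precision and recall at rank k for a ranked list of category labels."""
    top_k = ranked[:k]
    hits = sum(1 for c in top_k if c in true_categories)
    return hits / k, hits / len(true_categories)


# "Bad wheat harvest in Turkey": true categories are wheat and turkey.
ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]
print(precision_recall_at_k(ranked, {"wheat", "turkey"}, k=5))   # (0.2, 0.5)
```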
Concept Drift
- Categories change over time.
- Example: "president of the united states"
  - 1999: clinton is a great feature
  - 2002: clinton is a bad feature
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection: good or bad for protecting against concept drift?

Micro- vs. Macro-Averaging
- If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes into one contingency table, then evaluate.

Micro- vs. Macro-Averaging: Example

  Class 1            Truth: yes   Truth: no
  Classifier: yes        10           10
  Classifier: no         10          970

  Class 2            Truth: yes   Truth: no
  Classifier: yes        90           10
  Classifier: no         10          890

  Micro.Av. Table    Truth: yes   Truth: no
  Classifier: yes       100           20
  Classifier: no         20         1860

- Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
- Microaveraged precision: 100 / 120 = 0.83
- Why this difference?
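The difference is easy to reproduce from the two class tables (my own sketch; only the true-positive and false-positive counts are needed for precision):

```python
def macro_micro_precision(tables):
    """tables: list of (tp, fp) pairs, one per class.
    Macro: average the per-class precisions.
    Micro: pool the counts into one table, then compute precision once."""
    per_class = [tp / (tp + fp) for tp, fp in tables]
    macro = sum(per_class) / len(per_class)
    tp_total = sum(tp for tp, _ in tables)
    fp_total = sum(fp for _, fp in tables)
    micro = tp_total / (tp_total + fp_total)
    return macro, micro


# The two classes from the example: (tp=10, fp=10) and (tp=90, fp=10).
print(macro_micro_precision([(10, 10), (90, 10)]))   # (0.7, 0.8333...)
```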
Reuters 1
- Newswire text. Statistics (vary according to the version used):
  - Training set: 9,610 documents
  - Test set: 3,662 documents
  - 50% of documents have no category assigned
  - Average document length: 90.6
  - Number of classes: 92 (example classes: currency exchange, wheat, gold)
  - Max classes assigned: 14
  - Average number of classes assigned: 1.24 for docs with at least one category

Reuters 1
- Only about 10 out of 92 categories are large.
- Microaveraging measures performance on the large categories.

Factors Affecting Measures
- Variability of the data: document size/length, quality/style of authorship, uniformity of vocabulary.
- Variability of the "truth" / gold standard: need a definitive judgement on which topic(s) a doc belongs to; usually human; ideally consistent judgements.

Accuracy Measurement: Confusion Matrix
- Rows: actual topic; columns: topic assigned by the classifier.
- An (i, j) entry of, say, 53 means that 53 of the docs actually in topic i were put in topic j by the classifier.

Confusion Matrix
- A function of the classifier, the topics, and the test docs.
- For a perfect classifier, all off-diagonal entries should be zero.
- For a perfect classifier, if there are n docs in category j, then entry (j, j) should be n.
- Straightforward when there is 1 category per document; can be extended to n categories per document.

Confusion Measures (1 class / doc)
- Recall: the fraction of docs in topic i classified correctly.
- Precision: the fraction of docs assigned topic i that are actually about topic i.
- "Correct rate" (1 - error rate): the fraction of docs classified correctly.
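A sketch of these measures computed from a confusion matrix with one category per document (my own illustration; the 3-topic matrix below is made up, apart from reusing the 53 entry mentioned above):

```python
def per_class_measures(confusion):
    """confusion[i][j] = number of docs actually in topic i assigned to topic j
    (1 category per document). Returns per-topic recall and precision,
    plus the overall correct rate."""
    n = len(confusion)
    row_sums = [sum(confusion[i]) for i in range(n)]                            # docs truly in i
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]       # docs assigned to j
    recall = [confusion[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
    precision = [confusion[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
    correct_rate = sum(confusion[i][i] for i in range(n)) / sum(row_sums)
    return recall, precision, correct_rate


conf = [[53, 2, 5],
        [3, 40, 7],
        [4, 6, 80]]
print(per_class_measures(conf))
```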
Integrated Evaluation/Optimization
- A principled approach to training: optimize the measure that performance is measured with.
- s: vector of classifier decisions; z: vector of true classes.
- h(s, z) = cost of making decisions s for true assignments z.

Utility / Cost
- One cost function h is based on the contingency table; assume identical cost for all false positives, etc.
- Cost C = lambda_11 * A + lambda_12 * B + lambda_21 * C + lambda_22 * D

                     Truth: yes                  Truth: no
  Classifier: yes    cost lambda_11, count A     cost lambda_12, count B
  Classifier: no     cost lambda_21, count C     cost lambda_22, count D

- For this cost, we get an optimality criterion: decide "yes" whenever the expected cost of "yes" is lower than that of "no", i.e., a threshold on P(class | doc) determined by the lambdas.
- Most common cost: 1 for an error, 0 for a correct decision (the threshold is then P(class | doc) > 0.5).
- Product cross-sale: high cost for a false positive, low cost for a false negative.
- Patent search: low cost for a false positive, high cost for a false negative.
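A sketch of the resulting decision rule, assuming the classifier supplies an estimate of P(class | doc); this is my own illustration, with lambda arguments that mirror the cost table above and defaults encoding the most common 0/1 cost:

```python
def decide(p_yes, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Say 'yes' iff the expected cost of 'yes' is at most the expected cost
    of 'no', given the estimated probability p_yes = P(class | doc).
    Defaults are 1 for an error, 0 for a correct decision, which reduces
    to the threshold p_yes > 0.5."""
    cost_yes = l11 * p_yes + l12 * (1 - p_yes)   # true-positive cost + false-positive cost
    cost_no = l21 * p_yes + l22 * (1 - p_yes)    # false-negative cost + true-negative cost
    return cost_yes <= cost_no


print(decide(0.6))               # True  (0/1 cost, threshold 0.5)
print(decide(0.6, l12=5.0))      # False (false positives very costly,
                                 #        as in the product cross-sale case)
```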
Are All Optimal Rules of the Form "probability > threshold"?
- In the above examples, all you need to do is estimate the probability of class membership.
- Can all problems be solved like this? No!
- Probability is often not sufficient: the user's decision depends on the distribution of relevance.
- Example: an information filter for terrorism.

Naive Bayes; Vector Space Classification; Nearest Neighbor Classification

Recall: Vector Space Representation
- Each doc j is a vector, with one component for each term (= word).
- Normalize to unit length.
- We have a vector space: terms are axes; n docs live in this space; even with stemming, it may have 10,000+ dimensions, or even 1,000,000+.
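A minimal sketch of this representation (my own code; documents are pre-tokenized, and raw term counts stand in for whatever weighting is actually used):

```python
import math
from collections import Counter


def unit_vector(tokens):
    """Term-count vector for one document, normalized to unit length,
    so the dot product of two documents is their cosine similarity."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine(u, v):
    """Dot product of two sparse unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


d1 = unit_vector("tax cut tax bill".split())
d2 = unit_vector("tax rise vote".split())
print(cosine(d1, d2))
```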
Classification Using Vector Spaces
- Each training doc is a point (vector), labeled by its topic (= class).
- Hypothesis: docs of the same topic form a contiguous region of space.
- Define surfaces to delineate topics in space.

Topics in a Vector Space
- (Figure: regions labeled Government, Science, Arts.)
- Given a test doc: figure out which region it lies in, and assign the corresponding class.
- (Figure: test doc = Government.)

Binary Classification
- Consider 2-class problems.
- How do we define (and find) the separating surface?
- How do we test which region a test doc is in?

Separation by Hyperplanes
- Assume linear separability for now: in 2 dimensions we can separate by a line; in higher dimensions we need hyperplanes.
- Can find a separating hyperplane by linear programming (e.g., the perceptron): the separator can be expressed as ax + by = c.

Linear Programming / Perceptron
- Find a, b, c such that ax + by >= c for red points and ax + by <= c for green points.
- Relationship to Naive Bayes?
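A minimal 2-D perceptron sketch for finding such an a, b, c (my own illustration; the points, labels, and learning rate are made up, and linear separability is assumed):

```python
def train_perceptron(points, epochs=100, lr=0.1):
    """Find a, b, c such that a*x + b*y >= c for points labeled +1
    and a*x + b*y <= c for points labeled -1 (assumes linear separability)."""
    a = b = c = 0.0
    for _ in range(epochs):
        updated = False
        for (x, y), label in points:
            if label * (a * x + b * y - c) <= 0:   # point on the wrong side
                a += lr * label * x
                b += lr * label * y
                c -= lr * label
                updated = True
        if not updated:          # converged: every point is on its own side
            break
    return a, b, c


# Toy 2-D data: "red" points (+1) up-right, "green" points (-1) down-left.
data = [((2, 2), 1), ((3, 1), 1), ((0, 0), -1), ((1, 0), -1)]
print(train_perceptron(data))
```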
Linear Classifiers
- Many common text classifiers are linear classifiers.
- Despite this similarity, there are large performance differences.
- For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
- What to do for non-separable problems?

Which Hyperplane?
- In general, there are lots of possible solutions for a, b, c.

Support Vector Machine (SVM)
- Support vectors; maximize the margin.
- A quadratic programming problem.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- The text classification method du jour; topic of lecture 9.

Example SVM Features (category: Interest)

  weight w_i   term t_i        weight w_i   term t_i
   0.70        prime            -0.71       dlrs
   0.67        rate             -0.35       world
   0.63        interest         -0.33       sees
   0.60        rates            -0.25       year
   0.46        discount         -0.24       group
   0.43        bundesbank       -0.24       dlr
   0.43        baker            -0.24       january
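For illustration only, a linear classifier with weights like those above scores a document by summing the weights of the terms it contains; the 0.0 decision threshold here is my own assumption, not something stated in the slides:

```python
# Term weights copied from the "Interest" example above.
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43, "baker": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33,
    "year": -0.25, "group": -0.24, "dlr": -0.24, "january": -0.24,
}


def score(tokens, weights):
    """Sum of the weights of the terms that occur in the document."""
    return sum(weights.get(t, 0.0) for t in tokens)


doc = "the bundesbank raised its discount rate".split()
s = score(doc, weights)
print(s, "-> interest" if s > 0.0 else "-> not interest")
```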
More Than Two Classes
- Any-of (multiclass) classification: for n classes, decompose into n binary problems.
- One-of classification: each document belongs to exactly one class. How do we compose separating surfaces into regions?
- Centroid classification.
- k nearest neighbor classification.
- Composing surfaces: issues?

Separating Multiple Topics
- Build a separator between each topic and its complementary set (docs from all other topics).
- Given a test doc, evaluate it for membership in each topic.
- Declare membership in topics:
  - One-of classification: the class with the maximum score/confidence/probability.
  - Multiclass (any-of) classification: all classes above a threshold.

Negative Examples
- Formulate as above, except that negative examples for a topic are added to its complementary set.
- (Figure: positive examples, negative examples.)

Centroid Classification
- Given the training docs for a topic, compute their centroid.
- Now we have a centroid for each topic.
- Given a query doc, assign it to the topic whose centroid is nearest.
- Exercise: compare to Rocchio.
- (Figure: example with Government, Science, Arts.)
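A minimal sketch of centroid classification (my own code; the topics and training documents are made up):

```python
import math
from collections import Counter, defaultdict


def unit_vector(tokens):
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}


def centroid(vectors):
    """Average the unit document vectors of one topic."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def classify(doc_vec, centroids):
    """Assign the topic whose centroid is nearest (highest cosine)."""
    def sim(c):
        return sum(w * c.get(t, 0.0) for t, w in doc_vec.items())
    return max(centroids, key=lambda topic: sim(centroids[topic]))


train = {
    "Government": ["tax cut vote", "senate passes bill"],
    "Science":    ["gene study published", "new telescope data"],
}
centroids = {topic: centroid([unit_vector(d.split()) for d in docs])
             for topic, docs in train.items()}
print(classify(unit_vector("senate vote on tax".split()), centroids))  # Government
```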
k Nearest Neighbor Classification
- To classify document d into class c:
  - Define the k-neighborhood N as the k nearest neighbors of d.
  - Count the number of documents l in N that belong to c.
  - Estimate P(c|d) as l/k.
- Cover and Hart 1967: asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate.
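A minimal kNN sketch along these lines (my own code; cosine similarity over unit-length count vectors is assumed as the distance, and the toy training set is made up):

```python
import math
from collections import Counter


def unit_vector(tokens):
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}


def knn_estimate(doc, training, k):
    """Estimate P(c|d) as l/k, where l is the number of the k nearest
    (most cosine-similar) training documents that belong to class c."""
    d = unit_vector(doc.split())

    def cosine(v):
        return sum(w * v.get(t, 0.0) for t, w in d.items())

    ranked = sorted(training, key=lambda tc: cosine(unit_vector(tc[0].split())),
                    reverse=True)
    labels = [c for _, c in ranked[:k]]
    return {c: labels.count(c) / k for c in set(labels)}


training = [("wheat harvest falls", "wheat"), ("wheat exports rise", "wheat"),
            ("gold price climbs", "gold"), ("gold mine opens", "gold")]
print(knn_estimate("poor wheat harvest", training, k=3))   # wheat ~0.67, gold ~0.33
```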