Recap: Today's Topics
- Feature selection for text classification
- Measuring classification performance
- Nearest neighbor categorization

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words and more.
- Make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features.
- Reduce training time: training time for some methods is quadratic or worse in the number of features (e.g., logistic regression).
- Improve generalization: eliminate noise features and avoid overfitting.

Recap: Feature Reduction
Standard ways of reducing the feature space for text:
- Stemming: laugh, laughs, laughing, laughed -> laugh
- Stop word removal: e.g., eliminate all prepositions
- Conversion to lower case
- Tokenization: break on all special characters: fire-fighter -> fire, fighter

Feature Selection (Yang and Pedersen 1997)
Comparison of different selection criteria:
- DF: document frequency
- IG: information gain
- MI: mutual information
- CHI: chi-square
Common strategy: compute the statistic for each term, then keep the n terms with the highest values of this statistic.

Information Gain
- How much does knowing whether the term occurs reduce uncertainty about the category? (Compared with pointwise MI later in this lecture.)

(Pointwise) Mutual Information
- MI(t, c) = log [ P(t, c) / (P(t) P(c)) ]

Chi-Square
Contingency table for a term and a category:

                                          Term present   Term absent
  Document belongs to category                 A              B
  Document does not belong to category         C              D

  X^2 = N (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D))

- Use either the maximum or the average X^2 across categories.
- What is its value for complete independence?
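A minimal sketch (not from the slides) of chi-square term scoring using the A/B/C/D counts above; the document representation (token sets plus one label per document) and all names are illustrative.

```python
def chi_square(A, B, C, D):
    """X^2 = N (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D)) for one term/category pair."""
    N = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

def chi_square_scores(docs, labels, category):
    """docs: list of token sets; labels: one category label per document."""
    scores = {}
    for term in set().union(*docs):
        A = sum(1 for d, l in zip(docs, labels) if term in d and l == category)
        B = sum(1 for d, l in zip(docs, labels) if term not in d and l == category)
        C = sum(1 for d, l in zip(docs, labels) if term in d and l != category)
        D = sum(1 for d, l in zip(docs, labels) if term not in d and l != category)
        scores[term] = chi_square(A, B, C, D)
    return scores

# Common strategy from the slides: keep the n terms with the highest scores.
# top_n = sorted(scores, key=scores.get, reverse=True)[:n]
```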

Document Frequency
- The number of documents a term occurs in.
- Sometimes used for eliminating both very frequent and very infrequent terms.
- How is the document frequency measure different from the other three measures?
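A short sketch of document-frequency-based selection as described above (drop very rare terms, keep the n most frequent of the rest); the min_df cutoff and all names are illustrative choices, not from the slides.

```python
def document_frequency(docs):
    """docs: list of token sets. Returns term -> number of documents containing it."""
    df = {}
    for d in docs:
        for term in d:
            df[term] = df.get(term, 0) + 1
    return df

def select_by_df(docs, n, min_df=2):
    """Drop very infrequent terms, then keep the n terms with the highest DF."""
    df = document_frequency(docs)
    kept = {t: c for t, c in df.items() if c >= min_df}
    return sorted(kept, key=kept.get, reverse=True)[:n]
```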

Yang & Pedersen: Experiments
Two classification methods:
- kNN (k nearest neighbors; more later)
- Linear Least Squares Fit (a regression method)
Collections:
- Reuters-22173: 92 categories, 16,000 unique terms
- Ohsumed (a subset of Medline): 14,000 categories, 72,000 unique terms
- ltc term weighting

Yang & Pedersen: Experiments (continued)
- Choose the feature set size.
- Preprocess the collection, discarding non-selected features/words.
- Apply term weighting -> a feature vector for each document.
- Train the classifier on the training set.
- Evaluate the classifier on the test set.

Discussion
- You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
- In fact, performance increases with fewer features for IG, DF, and CHI.
- Mutual information is very sensitive to small counts.
- IG does best with the smallest number of features.
- Document frequency is close to optimal, and by far the simplest feature selection method.
- Similar results for LLSF (regression).

Results
- Why is selecting common terms a good strategy?
- IG, DF, and CHI are correlated.

Information Gain vs. Mutual Information
- Information gain is similar to MI for random variables.
- Independence?
- In contrast, pointwise MI ignores non-occurrence of terms.
- E.g., for complete dependence you get P(AB) / (P(A) P(B)) = 1/P(A), which is larger for rare terms than for frequent terms.
- Yang & Pedersen: pointwise MI favors rare terms.
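A tiny numeric illustration of the point above: under complete dependence (the term occurs in exactly the category's documents), pointwise MI reduces to log(1/P(term)), so the rarest terms get the largest scores. The probabilities below are made up.

```python
import math

def pointwise_mi(p_term_and_class, p_term, p_class):
    """log P(t, c) / (P(t) P(c))"""
    return math.log(p_term_and_class / (p_term * p_class))

# Complete dependence: the term appears in exactly the documents of the class,
# so P(t, c) = P(t) = P(c) and MI = log(1 / P(t)).
for p in (0.001, 0.01, 0.1):          # rare ... frequent term
    print(p, pointwise_mi(p, p, p))   # the rare term gets the largest score
```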

Feature Selection: Other Considerations
- Generic vs. class-specific:
  - Completely generic (class-independent)
  - A separate feature set for each class
  - Mixed (a la Yang & Pedersen)
- Maintainability over time: is aggressive feature selection good or bad for robustness over time?
- Ideal: optimal features selected as part of training.

Yang & Pedersen: Limitations
- Don't look at class-specific feature selection.
- Don't look at methods that can't handle high-dimensional spaces.
- Evaluate category ranking (as opposed to classification accuracy).

Feature Selection: Other Methods
- Stepwise term selection (forward, backward); expensive: needs n^2 iterations of training.
- Term clustering.
- Dimension reduction: PCA / SVD.

Word Rep. vs. Dimension Reduction
- Word representations: one dimension for each word (binary, count, or weight).
- Dimension reduction: each dimension is a unique linear combination of all words (in the linear case).
- Dimension reduction is good for generic topics ("politics"), bad for specific classes ("ruanda"). Why?
- SVD/PCA is computationally expensive.
- Higher complexity in implementation.
- No clear examples of higher performance through dimension reduction.
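As a concrete illustration of the SVD route (a hedged sketch, not from the slides; assumes scikit-learn, and the corpus and component count are placeholders): project word-weight vectors into a low-dimensional space in which each new dimension is a linear combination of all words.

```python
# Assumes scikit-learn; corpus and n_components are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["interest rates rise", "wheat harvest in turkey",
          "central bank cuts discount rate"]
X = TfidfVectorizer().fit_transform(corpus)   # one dimension per word
svd = TruncatedSVD(n_components=2)            # each new dimension is a linear
Z = svd.fit_transform(X)                      # combination of all words
print(Z.shape)                                # (3, 2)
```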

Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia; more in a moment.
- Speed of training the statistical classifier.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions: kNN, complex preprocessing requirements.
- Effort in creating the training set (human hours/topic); more on this in Lecture 9 (Active Learning).

Measures of Accuracy
- Error rate: not a good measure for small classes. Why?
- Precision/recall for classification decisions.
- F1 measure: 1/F1 = (1/2)(1/P + 1/R).
- Breakeven point.
- Correct estimate of the size of the category: why is this different?
- Precision/recall for ranking classes.
- Stability over time / concept drift.
- Utility.
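A small sketch of the decision-level measures listed above, computed from true/false positive and negative counts; the function names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean: 1/F1 = (1/P + 1/R) / 2)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Error rate over all decisions, for comparison: misleading for small classes,
# because predicting "no" everywhere already scores well.
def error_rate(tp, fp, fn, tn):
    return (fp + fn) / (tp + fp + fn + tn)
```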

Precision/Recall for Ranking Classes
Example document: "Bad wheat harvest in Turkey"
- True categories: wheat, turkey
- Ranked category list: 0.9: turkey, 0.7: poultry, 0.5: armenia, 0.4: barley, 0.3: georgia
- Precision at 5: 0.2 (one of the five ranked categories is correct); recall at 5: 0.5 (one of the two true categories is retrieved).
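A small sketch reproducing the precision-at-k and recall-at-k computation for the example above; the names are illustrative.

```python
def precision_recall_at_k(ranked, relevant, k):
    """ranked: categories ordered by score; relevant: set of true categories."""
    top_k = ranked[:k]
    hits = sum(1 for c in top_k if c in relevant)
    return hits / k, hits / len(relevant)

ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]
print(precision_recall_at_k(ranked, {"wheat", "turkey"}, k=5))  # (0.2, 0.5)
```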

Precision/Recall for Ranking Classes (continued)
- Consider problems with many categories (> 10).
- Use a method that returns scores comparable across categories (not: Naive Bayes).
- Rank the categories and compute average precision/recall (or another measure characterizing the precision/recall curve).
- A good measure for interactive support of human categorization.
- Useless for an "autonomous" system (e.g., a filter on a stream of newswire stories).

Concept Drift
- Categories change over time.
- Example: "president of the united states"
  - 1999: clinton is a great feature
  - 2002: clinton is a bad feature
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection: good or bad for protecting against concept drift?

Micro- vs. Macro-Averaging
- If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes, compute one contingency table, evaluate it.

Micro- vs. Macro-Averaging: Example

Class 1:
                     Truth: yes   Truth: no
  Classifier: yes        10           10
  Classifier: no         10          970

Class 2:
                     Truth: yes   Truth: no
  Classifier: yes        90           10
  Classifier: no         10          890

Microaveraged (pooled) table:
                     Truth: yes   Truth: no
  Classifier: yes       100           20
  Classifier: no         20         1860

- Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
- Microaveraged precision: 100/120 = 0.83
- Why this difference?
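A minimal sketch of the two averaging schemes for precision, using the per-class contingency counts from the example above; the data structure is illustrative.

```python
# Per-class contingency counts (tp, fp) taken from the example tables above.
classes = {
    "class1": {"tp": 10, "fp": 10},
    "class2": {"tp": 90, "fp": 10},
}

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# Macroaveraging: per-class precision, then the unweighted mean.
macro = sum(precision(c["tp"], c["fp"]) for c in classes.values()) / len(classes)

# Microaveraging: pool the counts into one table, then compute precision once.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro = precision(tp, fp)

print(macro, micro)  # 0.7 and 100/120 = 0.833...
```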

Reuters-1
Newswire text. Statistics (they vary according to the version used):
- Training set: 9,610 documents
- Test set: 3,662 documents
- 50% of documents have no category assigned
- Average document length: 90.6
- Number of classes: 92
- Example classes: currency exchange, wheat, gold
- Maximum number of classes assigned: 14
- Average number of classes assigned: 1.24 for documents with at least one category

Reuters-1 (continued)
- Only about 10 of the 92 categories are large.
- Microaveraging measures performance on the large categories.

Factors Affecting Measures
- Variability of the data: document size/length, quality/style of authorship, uniformity of vocabulary.
- Variability of the "truth" / gold standard: we need a definitive judgement on which topic(s) a document belongs to; usually human; ideally, consistent judgements.

Accuracy Measurement: Confusion Matrix
(Figure: a confusion matrix with rows = actual topic and columns = topic assigned by the classifier; the highlighted entry 53 at (i, j) means that 53 of the documents actually in topic i were put in topic j by the classifier.)

Confusion Matrix (properties)
- A function of the classifier, the topics, and the test documents.
- For a perfect classifier, all off-diagonal entries should be zero.
- For a perfect classifier, if there are n documents in category j, then entry (j, j) should be n.
- Straightforward when there is one category per document.
- Can be extended to n categories per document.

Confusion Measures (1 class per document)
- Recall for topic i: the fraction of documents in topic i classified correctly (row i: c_ii / sum_j c_ij).
- Precision for topic i: the fraction of documents assigned topic i that are actually about topic i (column i: c_ii / sum_j c_ji).
- "Correct rate" (1 - error rate): the fraction of all documents classified correctly (sum_i c_ii / sum_ij c_ij).
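A short sketch computing these per-topic measures from a confusion matrix stored as a nested dict, with rows as the actual topic and columns as the assigned topic, matching the convention above; the counts and topic names are made up.

```python
# confusion[actual][assigned] = number of documents; a made-up 2-topic example.
confusion = {
    "wheat": {"wheat": 53, "gold": 7},
    "gold":  {"wheat": 4,  "gold": 36},
}
topics = list(confusion)

def recall(topic):      # fraction of docs actually in `topic` classified correctly
    row = confusion[topic]
    return row[topic] / sum(row.values())

def precision(topic):   # fraction of docs assigned `topic` that are actually about it
    col = sum(confusion[t][topic] for t in topics)
    return confusion[topic][topic] / col

def correct_rate():     # 1 - error rate: fraction of all docs classified correctly
    total = sum(sum(row.values()) for row in confusion.values())
    return sum(confusion[t][t] for t in topics) / total

print(recall("wheat"), precision("wheat"), correct_rate())
```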

Integrated Evaluation/Optimization
- A principled approach to training: optimize the measure that performance is measured with.
- s: the vector of classifier decisions; z: the vector of true classes.
- h(s, z) = the cost of making decisions s for true assignments z.

Utility / Cost
- One cost function h is based on the contingency table.
- Assume identical cost for all false positives, etc.:

                     Truth: yes                  Truth: no
  Classifier: yes    cost lambda_11, count A     cost lambda_12, count B
  Classifier: no     cost lambda_21, count C     cost lambda_22, count D

- Cost C = lambda_11*A + lambda_12*B + lambda_21*C + lambda_22*D
- For this cost we get an optimality criterion in terms of the estimated probability of class membership (a decision-rule sketch follows after the next slide).

Utility / Cost (continued)
- Most common cost: 1 for an error, 0 for a correct decision.
- Product cross-sale: high cost for a false positive, low cost for a false negative.
- Patent search: low cost for a false positive, high cost for a false negative.
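A sketch (my phrasing, not the slides' formula) of turning such a cost table into a decision rule: estimate P(class | doc) and pick the action with the lower expected cost. The default lambdas encode the "most common cost" above.

```python
def decide(p_class, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Return True ("classifier: yes") if the expected cost of saying yes
    is lower than the expected cost of saying no, given P(class | doc)."""
    cost_yes = l11 * p_class + l12 * (1 - p_class)
    cost_no  = l21 * p_class + l22 * (1 - p_class)
    return cost_yes < cost_no

# With the most common cost (1 for an error, 0 for a correct decision)
# this reduces to the familiar threshold P(class | doc) > 0.5.
print(decide(0.6), decide(0.4))  # True False
```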

Are All Optimal Rules of the Form p > theta?
- In the above examples, all you need to do is estimate the probability of class membership.
- Can all problems be solved like this? No!
- Probability is often not sufficient.
- The user's decision depends on the distribution of relevance.
- Example: an information filter for terrorism.

Naive Bayes, Vector Space Classification, Nearest Neighbor Classification

Recall: Vector Space Representation
- Each document j is a vector, with one component for each term (= word).
- Normalize each vector to unit length.
- This gives a vector space: the terms are the axes and the n documents live in this space; even with stemming, it may have 10,000+ dimensions, or even 1,000,000+.
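A minimal sketch of this representation, assuming numpy: rows are term-count vectors normalized to unit Euclidean length. The tiny vocabulary and counts are made up.

```python
import numpy as np

# Rows = documents, columns = terms (a made-up 3-term vocabulary).
counts = np.array([[2.0, 1.0, 0.0],
                   [0.0, 3.0, 3.0]])

norms = np.linalg.norm(counts, axis=1, keepdims=True)
docs = counts / norms                 # every row now has Euclidean length 1
print(np.linalg.norm(docs, axis=1))   # [1. 1.]
```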

Classification Using Vector Spaces
- Each training document is a point (vector) labeled by its topic (= class).
- Hypothesis: documents of the same topic form a contiguous region of the space.
- Define surfaces to delineate topics in the space.

Topics in a Vector Space
(Figure: regions labeled Government, Science, Arts.)

Given a test document:
- Figure out which region it lies in.
- Assign the corresponding class.
(Figure: test doc = Government.)

Binary Classification
- Consider 2-class problems.
- How do we define (and find) the separating surface?
- How do we test which region a test document is in?

Separation by Hyperplanes
- Assume linear separability for now: in 2 dimensions we can separate by a line; in higher dimensions we need hyperplanes.
- We can find a separating hyperplane by linear programming (e.g., with the perceptron); in 2-D the separator can be expressed as ax + by = c.

Linear Programming / Perceptron
- Find a, b, c such that ax + by >= c for the red points and ax + by <= c for the green points.
- Relationship to Naive Bayes?
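A minimal perceptron sketch (an illustration, not the slides' algorithm statement) that finds such an a, b, c for linearly separable 2-D points; the toy data and number of passes are made up.

```python
# Toy 2-D points: label +1 ("red") should satisfy a*x + b*y >= c,
# label -1 ("green") the opposite side of the line.
points = [((2.0, 3.0), 1), ((3.0, 2.5), 1), ((-1.0, -2.0), -1), ((-2.0, -0.5), -1)]

a, b, c = 0.0, 0.0, 0.0
for _ in range(100):                   # a few passes over the data
    for (x, y), label in points:
        score = a * x + b * y - c
        if label * score <= 0:         # misclassified (or on the boundary): update
            a += label * x
            b += label * y
            c -= label                 # the threshold moves opposite to the weights

print(a, b, c)                         # one separating hyperplane a*x + b*y = c
```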

Linear Classifiers
- Many common text classifiers are linear classifiers.
- Despite this similarity, there are large performance differences.
- For separable problems there is an infinite number of separating hyperplanes. Which one do you choose?
- What to do for non-separable problems?

Which Hyperplane?
- In general, there are lots of possible solutions for a, b, c.

Support Vector Machine (SVM)
- Support vectors; maximize the margin.
- A quadratic programming problem.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- The text classification method du jour.
- Topic of Lecture 9.

Example SVM Features (Category: Interest)
  weight  term            weight  term
   0.70   prime           -0.71   dlrs
   0.67   rate            -0.35   world
   0.63   interest        -0.33   sees
   0.60   rates           -0.25   year
   0.46   discount        -0.24   group
   0.43   bundesbank      -0.24   dlr
   0.43   baker           -0.24   january
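As a hedged illustration of where per-term weights like these come from (assuming scikit-learn; the tiny corpus, labels, and category are placeholders, not the Reuters data): train a linear SVM on tf-idf vectors and read the learned term weights from coef_.

```python
# Assumes scikit-learn; the corpus, labels, and category name are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

corpus = ["prime rate raised", "discount rate cut by bundesbank",
          "wheat harvest report", "gold and dlrs fall"]
labels = [1, 1, 0, 0]                      # 1 = "interest", 0 = everything else

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
clf = LinearSVC().fit(X, labels)

# Highest positive weights are the strongest indicators for the category.
weights = sorted(zip(clf.coef_[0], vec.get_feature_names_out()), reverse=True)
print(weights[:5])
```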

More Than Two Classes
- Any-of or multiclass classification: for n classes, decompose into n binary problems.
- One-of classification: each document belongs to exactly one class.
- How do we compose the separating surfaces into regions?
- Centroid classification; k nearest neighbor classification.

Composing Surfaces: Issues?
(Figure.)

Separating Multiple Topics
- Build a separator between each topic and its complementary set (the documents from all other topics).
- Given a test document, evaluate it for membership in each topic.
- Declare membership in topics (a one-vs-rest sketch follows after the next slide):
  - One-of classification: the class with the maximum score/confidence/probability.
  - Multiclass classification: all classes above a threshold.

Negative Examples
- Formulate as above, except that negative examples for a topic are added to its complementary set.
(Figure: positive examples, negative examples.)
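A sketch of the one-vs-rest decision step just described, assuming per-topic binary scorers have already been trained; the scorer interface and threshold are illustrative.

```python
def assign_topics(doc, scorers, threshold=0.0):
    """scorers: dict mapping topic -> function(doc) -> score.
    Returns (one_of_choice, any_of_choices) as described above."""
    scores = {topic: f(doc) for topic, f in scorers.items()}
    one_of = max(scores, key=scores.get)                       # maximum score
    any_of = [t for t, s in scores.items() if s > threshold]   # above threshold
    return one_of, any_of
```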

Centroid Classification
- Given the training documents for a topic, compute their centroid.
- We now have one centroid per topic.
- Given a query document, assign it to the topic whose centroid is nearest.
- Exercise: compare to Rocchio.

Example
(Figure: Government, Science, Arts.)
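A minimal nearest-centroid sketch, assuming numpy; "nearest" is taken as Euclidean distance, and the tiny document vectors and topic names are made up.

```python
import numpy as np

def train_centroids(docs, labels):
    """docs: array of shape (n_docs, n_terms); returns topic -> centroid vector."""
    return {t: docs[np.array(labels) == t].mean(axis=0) for t in set(labels)}

def classify(doc, centroids):
    """Assign the document to the topic whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda t: np.linalg.norm(doc - centroids[t]))

# Tiny made-up example: 2 topics in a 3-term space.
docs = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])
labels = ["Government", "Government", "Science", "Science"]
centroids = train_centroids(docs, labels)
print(classify(np.array([0.8, 0.2, 0.0]), centroids))  # -> "Government"
```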

k Nearest Neighbor Classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number l of documents in N that belong to c.
- Estimate P(c|d) as l/k.
Cover and Hart (1967): asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate.
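A minimal kNN sketch following the steps above, assuming numpy and unit-length document vectors so that the dot product serves as the similarity; all names and shapes are illustrative.

```python
import numpy as np

def knn_estimate(d, train_docs, train_labels, c, k):
    """Estimate P(c|d) as l/k, where l counts neighbors of d that belong to c."""
    sims = train_docs @ d                    # cosine similarity for unit vectors
    neighbors = np.argsort(sims)[-k:]        # indices of the k nearest neighbors
    l = sum(1 for i in neighbors if train_labels[i] == c)
    return l / k

def knn_classify(d, train_docs, train_labels, k):
    """Pick the class with the highest estimated P(c|d)."""
    classes = set(train_labels)
    return max(classes, key=lambda c: knn_estimate(d, train_docs, train_labels, c, k))
```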
