




Sergios Theodoridis, Konstantinos Koutroumbas
A Course on PATTERN RECOGNITION (Version 2)

PATTERN RECOGNITION

Typical application areas: machine vision, character recognition (OCR), computer-aided diagnosis, speech recognition, face recognition, biometrics, image database retrieval, data mining, bioinformatics.

The task: assign unknown objects (patterns) to the correct class. This is known as classification.

Features: measurable quantities obtained from the patterns; the classification task is based on their respective values.

Feature vectors: a number of features x1, ..., xl constitute the feature vector x = [x1, ..., xl]^T. Feature vectors are treated as random vectors.

The classifier consists of a set of functions whose values, computed at x, determine the class to which the corresponding pattern belongs. Classification system overview: sensor, feature generation, feature selection, classifier design, system evaluation.

Supervised versus unsupervised pattern recognition, the two major directions:
- Supervised: patterns whose class is known a priori are used for training.
- Unsupervised: the number of classes is (in general) unknown and no training patterns are available.

CLASSIFIERS BASED ON BAYES DECISION THEORY

Feature vectors are statistical in nature. Assign the pattern represented by feature vector x to the most probable of the M available classes ω1, ..., ωM; that is, choose the class for which P(ωi | x) is maximum.
Computation of the a-posteriori probabilities: assume that the a-priori probabilities P(ω1), P(ω2) and the class-conditional pdfs p(x | ωi) are known; p(x | ωi) is also known as the likelihood of ωi with respect to x.

The Bayes rule (M = 2):
P(ωi | x) = p(x | ωi) P(ωi) / p(x), where p(x) = Σ_{i=1}^{2} p(x | ωi) P(ωi).

The Bayes classification rule (for two classes, M = 2): given x, classify it to ω1 if P(ω1 | x) > P(ω2 | x), otherwise to ω2. Equivalently: classify x to ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2). For equiprobable classes the test becomes a direct comparison of the likelihoods p(x | ω1) and p(x | ω2).

Equivalently, in words: divide the feature space into two regions, R1 and R2; points in R1 are assigned to ω1 and points in R2 to ω2. The probability of error equals the total shaded area under the two curves in the corresponding figure. The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability.
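As a quick illustration of the rule, here is a minimal sketch of a two-class Bayes classifier in one dimension; the Gaussian class-conditional densities and the priors are assumptions chosen for illustration, not values from the slides.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|w1), p(x|w2) and priors P(w1), P(w2)
p_x_w1 = norm(loc=0.0, scale=1.0)   # assumed N(0, 1)
p_x_w2 = norm(loc=2.0, scale=1.0)   # assumed N(2, 1)
P_w1, P_w2 = 0.5, 0.5               # equiprobable classes

def bayes_classify(x):
    """Assign x to the class with the larger p(x|wi) * P(wi)."""
    post1 = p_x_w1.pdf(x) * P_w1
    post2 = p_x_w2.pdf(x) * P_w2
    return 1 if post1 > post2 else 2

print(bayes_classify(0.3))   # -> 1 (closer to the mean of class 1)
print(bayes_classify(1.7))   # -> 2
```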
Indeed, moving the threshold away from the Bayes value makes the total shaded area (the error probability) INCREASE by the extra "grey" area in the figure.

The Bayes classification rule for many (M > 2) classes: given x, classify it to ωi if P(ωi | x) > P(ωj | x) for every j ≠ i. Such a choice also minimizes the classification error probability.

Minimizing the average risk: for each wrong decision a penalty term is assigned, since some decisions are more sensitive than others. For M = 2, define the loss matrix
L = [[λ11, λ12], [λ21, λ22]],
where λki is the penalty term for deciding class ωi although the pattern belongs to ωk.

Risk with respect to ω1: r1 = λ11 ∫_{R1} p(x | ω1) dx + λ12 ∫_{R2} p(x | ω1) dx.
Risk with respect to ω2: r2 = λ21 ∫_{R1} p(x | ω2) dx + λ22 ∫_{R2} p(x | ω2) dx.
Average risk: r = r1 P(ω1) + r2 P(ω2), i.e. the probabilities of wrong decisions weighted by the penalty terms.

Choose R1 and R2 so that r is minimized. Then assign x to ω1 if
λ11 p(x | ω1) P(ω1) + λ21 p(x | ω2) P(ω2) < λ12 p(x | ω1) P(ω1) + λ22 p(x | ω2) P(ω2).
Equivalently, assign x to ω1 (ω2) if
l12 ≡ p(x | ω1) / p(x | ω2) > (<) [P(ω2) (λ21 − λ22)] / [P(ω1) (λ12 − λ11)],
where l12 is the likelihood ratio.

If the classes are equiprobable and λ11 = λ22 = 0, the test reduces to comparing the likelihood ratio with λ21/λ12. In the accompanying two-class Gaussian example, the threshold that minimises r therefore moves to the left of the minimum-error-probability threshold (why?).
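The minimum-risk rule can be sketched in the same way; the loss matrix values below are illustrative assumptions (λ21 > λ12, so misclassifying an ω2 pattern is penalised more heavily), reusing the hypothetical densities from the previous sketch.

```python
import numpy as np
from scipy.stats import norm

# Same hypothetical densities as before; the loss matrix entries are made up.
p_x_w1 = norm(0.0, 1.0)
p_x_w2 = norm(2.0, 1.0)
P_w1, P_w2 = 0.5, 0.5
# lam[k-1, i-1] = penalty for deciding w_i when the true class is w_k;
# here misclassifying a w2 pattern costs twice as much as misclassifying a w1 pattern.
lam = np.array([[0.0, 1.0],
                [2.0, 0.0]])

def min_risk_classify(x):
    """Likelihood-ratio test: decide w1 if l12 > P(w2)(lam21-lam22) / (P(w1)(lam12-lam11))."""
    l12 = p_x_w1.pdf(x) / p_x_w2.pdf(x)
    threshold = (P_w2 * (lam[1, 0] - lam[1, 1])) / (P_w1 * (lam[0, 1] - lam[0, 0]))
    return 1 if l12 > threshold else 2

# With lam21 > lam12 the decision boundary shifts towards the mean of class w1.
print(min_risk_classify(0.9))   # -> 2
```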
DISCRIMINANT FUNCTIONS AND DECISION SURFACES

If the regions Ri, Rj are contiguous, then g(x) ≡ P(ωi | x) − P(ωj | x) = 0 defines the surface separating them. On one side g(x) is positive (+), on the other it is negative (−); it is known as the decision surface.

If f(·) is monotonic, the rule remains the same if we use gi(x) ≡ f(P(ωi | x)); gi(x) is a discriminant function. In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.

BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian pdf:
p(x | ωi) = (2π)^(−l/2) |Σi|^(−1/2) exp(−(1/2)(x − μi)^T Σi^(−1) (x − μi)),
where μi = E[x | ωi] is the mean vector and Σi = E[(x − μi)(x − μi)^T | ωi] is the l×l covariance matrix of class ωi.

Since ln(·) is monotonic, define
gi(x) = ln(p(x | ωi) P(ωi)) = −(1/2)(x − μi)^T Σi^(−1) (x − μi) + ln P(ωi) + ci.

That is, gi(x) is quadratic in x and the decision surfaces gi(x) − gj(x) = 0 are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.
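A compact sketch of the Gaussian discriminant gi(x), with made-up means, covariance matrices and priors for a two-class, two-dimensional problem.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) - 0.5 ln|Sigma| + ln P(w_i) (constants dropped)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Illustrative two-class problem in 2-D (all numbers are assumptions).
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.8]])]
priors = [0.5, 0.5]

x = np.array([1.0, 2.0])
scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("assigned class:", int(np.argmax(scores)) + 1)
```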
Decision hyperplanes: if ALL the covariance matrices are the same (Σi = Σ), the quadratic terms are not of interest; they are identical in every gi(x) and are not involved in the comparisons. Then, equivalently, we can write
gi(x) = wi^T x + wi0, with wi = Σ^(−1) μi and wi0 = ln P(ωi) − (1/2) μi^T Σ^(−1) μi,
so the discriminant functions are LINEAR.

Let, in addition, Σ = σ² I. Then the decision hyperplane gij(x) = gi(x) − gj(x) = 0 passes through a point on the segment joining μi and μj and is orthogonal to μi − μj. For a nondiagonal Σ, the decision hyperplane is instead orthogonal to Σ^(−1)(μi − μj).
Minimum distance classifiers: for equiprobable classes sharing the same covariance matrix, maximising gi(x) amounts to assigning x to the class whose mean lies at the smaller distance from x.
- Euclidean distance (Σ = σ² I): assign x to the ωi with the smaller dE = ||x − μi||.
- Mahalanobis distance (nondiagonal Σ): assign x to the ωi with the smaller dm = ((x − μi)^T Σ^(−1) (x − μi))^(1/2).
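A short sketch contrasting the Euclidean and Mahalanobis minimum-distance rules; the class means and the common covariance matrix are placeholders chosen so that the two rules can disagree.

```python
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]   # assumed class means
Sigma = np.array([[2.0, 0.9], [0.9, 1.0]])            # assumed common covariance matrix

def euclidean_class(x):
    return int(np.argmin([np.linalg.norm(x - mu) for mu in mus])) + 1

def mahalanobis_class(x):
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]   # squared Mahalanobis distances
    return int(np.argmin(d2)) + 1

x = np.array([1.9, 2.0])
print(euclidean_class(x), mahalanobis_class(x))   # here the two rules disagree (2 vs 1)
```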
ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood: given N independent samples X = {x1, ..., xN} drawn from p(x; θ), the ML estimate maximizes the likelihood,
θ_ML = arg max_θ Π_{k=1}^{N} p(xk; θ),
or equivalently the log-likelihood Σ_k ln p(xk; θ). The ML estimator is asymptotically unbiased and consistent.

Maximum A-posteriori Probability (MAP) estimation: in the ML method θ was considered an unknown but fixed parameter. Here we shall look at θ as a random vector described by a pdf p(θ), assumed to be known. Given X, compute the maximum of p(θ | X); from the Bayes theorem, p(θ | X) ∝ p(X | θ) p(θ). The method: θ_MAP = arg max_θ p(X | θ) p(θ).

Bayesian Inference: instead of a point estimate, compute the whole posterior p(θ | X) and then p(x | X) = ∫ p(x | θ) p(θ | X) dθ. For the Gaussian case the posterior is a sequence of Gaussians that becomes increasingly peaked as N → ∞.

Maximum Entropy: the entropy is H = −∫ p(x) ln p(x) dx; the ME estimate is the pdf that maximizes H subject to the available constraints.
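As a sketch of ML estimation: for a Gaussian, the ML estimates of the mean and variance are the sample mean and the 1/N sample variance. The data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 1.5, 2.0
X = rng.normal(true_mu, true_sigma, size=1000)   # synthetic 1-D Gaussian samples

mu_ml = X.mean()                                  # ML estimate of the mean
var_ml = ((X - mu_ml) ** 2).mean()                # ML estimate of the variance (divides by N, not N-1)
print(mu_ml, np.sqrt(var_ml))                     # close to (1.5, 2.0) for large N
```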
Example: p(x) is nonzero in the interval [x1, x2] and zero otherwise. Compute the ME pdf subject to the constraint ∫_{x1}^{x2} p(x) dx = 1. Using Lagrange multipliers, the solution is the uniform pdf p(x) = 1/(x2 − x1) inside the interval.

Mixture Models: assume parametric modeling, i.e.
p(x) = Σ_{j=1}^{J} p(x | j) Pj, with Σ_j Pj = 1,
and the goal is to estimate the unknown parameters given a set X = {x1, ..., xN}. Why not ML, as before? This is a nonlinear problem due to the missing label information; it is a typical problem with an incomplete data set.
The Expectation-Maximisation (EM) algorithm. General formulation: the complete data samples y, governed by p(y; θ), are not observed directly; we observe x = g(y), a many-to-one transformation. Let L(θ) = Σ_k ln p(yk; θ) be the complete-data log-likelihood. What we need is to maximize it, but the yk are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of θ.

The algorithm:
- E-step: Q(θ; θ(t)) = E[Σ_k ln p(yk; θ) | X; θ(t)].
- M-step: θ(t+1) = arg max_θ Q(θ; θ(t)).

Application to the mixture modeling problem: the complete data are the pairs (xk, jk), where jk is the unobserved mixture label; the observed data are the xk alone. Assuming mutual independence of the samples, the unknown parameters are the component parameters and the mixing probabilities Pj. The E-step computes the posterior probabilities of the labels given the current parameters, and the M-step re-estimates the parameters from them.
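A minimal EM sketch for a two-component one-dimensional Gaussian mixture; the synthetic data, the initial guesses and the number of iterations are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Synthetic 1-D data drawn from two Gaussians (the "incomplete" data: labels discarded).
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

# Initial guesses (assumed)
P = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior probability that each sample came from each component
    resp = np.vstack([P[j] * norm.pdf(X, mu[j], sigma[j]) for j in range(2)])
    resp /= resp.sum(axis=0, keepdims=True)
    # M-step: re-estimate mixing weights, means and standard deviations
    Nj = resp.sum(axis=1)
    P = Nj / len(X)
    mu = (resp * X).sum(axis=1) / Nj
    sigma = np.sqrt((resp * (X - mu[:, None]) ** 2).sum(axis=1) / Nj)

print(P, mu, sigma)   # should approach roughly (0.3, 0.7), (-2, 3), (1, 1.5)
```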
NONPARAMETRIC ESTIMATION

Parzen windows: divide the multidimensional space into hypercubes of side h. Define φ(x) so that φ(x) = 1 if |xj| ≤ 1/2 for every component and 0 otherwise; that is, it is 1 inside a unit-side hypercube centered at 0. The resulting estimate (Parzen windows, kernels, potential functions) is
p̂(x) = (1 / (N h^l)) Σ_{i=1}^{N} φ((xi − x) / h).

Mean value: E[p̂(x)] tends to p(x) as h → 0, hence the estimate is unbiased in the limit. Variance: the smaller the h, the higher the variance. (The plots for h = 0.1, N = 1000; h = 0.8, N = 1000; and h = 0.1, N = 10000 illustrate this: the higher the N, the better the accuracy.) If h → 0 while N h^l → ∞ as N → ∞, the estimate is asymptotically unbiased. The method for classification: use the resulting estimates p̂(x | ωi) in place of the true densities in the Bayes rule.
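A sketch of the one-dimensional Parzen estimate with the box (hypercube) kernel; the sample size, the evaluation point and the values of h are illustrative.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """p_hat(x) = (1/(N*h)) * sum_i phi((x_i - x)/h), with phi = 1 inside a unit-width box."""
    u = (samples - x) / h
    inside = np.abs(u) <= 0.5          # box kernel centered at 0
    return inside.sum() / (len(samples) * h)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, 1000)   # synthetic data from N(0, 1)
for h in (0.1, 0.8):
    print(h, parzen_estimate(0.0, samples, h))   # both near the true value 1/sqrt(2*pi) ~ 0.399
```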
THE CURSE OF DIMENSIONALITY

In all the methods so far, we saw that the higher the number of points N, the better the resulting estimate. If in the one-dimensional space an interval is adequately filled with N points (for good estimation), in the two-dimensional space the corresponding square will require N² points and in the l-dimensional space the l-dimensional cube will require N^l points. The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.

NAIVE BAYES CLASSIFIER

Let x ∈ R^l, and let the goal be to estimate p(x | ωi), i = 1, 2, ..., M. For a "good" estimate of each pdf one would need, say, N^l points. Assume x1, x2, ..., xl to be mutually independent. Then
p(x | ωi) = Π_{j=1}^{l} p(xj | ωi).
In this case one would require, roughly, N points for each one-dimensional pdf; thus a number of points of the order lN would suffice. It turns out that the Naive Bayes classifier works reasonably well even in cases that violate the independence assumption.
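A minimal Gaussian Naive Bayes sketch that factorizes p(x | ωi) into per-feature one-dimensional Gaussians; the training data are synthetic placeholders.

```python
import numpy as np
from scipy.stats import norm

def fit_naive_bayes(X, y):
    """Per-class, per-feature Gaussian parameters plus class priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict(params, x):
    scores = {}
    for c, (mu, sd, prior) in params.items():
        # log p(x|wi) = sum_j log p(x_j|wi) under the independence assumption
        scores[c] = np.log(prior) + norm.logpdf(x, mu, sd).sum()
    return max(scores, key=scores.get)

# Placeholder data: 2 classes, 3 features.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(predict(fit_naive_bayes(X, y), np.array([1.8, 2.1, 1.9])))   # -> 1
```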
K NEAREST NEIGHBOR DENSITY ESTIMATION

In Parzen windows the volume is constant and the number of points falling inside it varies from point to point. Now: keep the number of points k constant and leave the volume V(x) to vary, so that p̂(x) = k / (N V(x)).

The Nearest Neighbor rule: choose k out of the N training vectors by identifying the k nearest ones to x; out of these k, identify the number ki that belong to class ωi and assign x to the class with the maximum ki. The simplest version is k = 1. For large N this is not bad: it can be shown that, if PB is the optimal Bayesian error probability, the asymptotic error PNN of the 1-NN rule satisfies
PB ≤ PNN ≤ PB (2 − (M / (M − 1)) PB) ≤ 2 PB.
For small PB: PNN ≈ 2 PB and P3NN ≈ PB + 3 PB².

For k = 1 the training vectors induce a Voronoi tessellation of the space: each cell Ri = {x : d(x, xi) < d(x, xj), j ≠ i} is assigned the class of its training vector xi.
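A brute-force k-nearest-neighbor classification sketch; the training data and the value of k are placeholders.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the class most common among its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                     # indices of the k closest vectors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=3))   # -> 1
```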
BAYESIAN NETWORKS

Bayes probability chain rule:
p(x1, x2, ..., xl) = p(xl | xl−1, ..., x1) p(xl−1 | xl−2, ..., x1) ··· p(x2 | x1) p(x1).
Assume now that the conditional dependence for each xi is limited to a subset Ai of the features appearing in each of the product terms; that is,
p(x1, x2, ..., xl) = p(x1) Π_{i=2}^{l} p(xi | Ai), where Ai ⊆ {xi−1, xi−2, ..., x1}.
For example, if l = 6, we could assume that each variable depends on only one or two of the others, as in the graphical example described below. The above is a generalization of the Naive Bayes; for the Naive Bayes the assumption is Ai = ∅ for i = 1, 2, ..., l.

A graphical way to portray conditional dependencies is a directed graph. In the example graph: x6 is conditionally dependent on x4, x5; x5 on x4; x4 on x1, x2; x3 on x2; and x1, x2 are conditionally independent of the other variables. For this case,
p(x1, ..., x6) = p(x6 | x4, x5) p(x5 | x4) p(x4 | x1, x2) p(x3 | x2) p(x1) p(x2).

Bayesian Networks, definition: a Bayesian network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), p(xi | Ai), where xi is the variable associated with the node and Ai is the set of its parents in the graph. A Bayesian network is specified by:
- the marginal probabilities of its root nodes;
- the conditional probabilities of the non-root nodes, given their parents, for ALL possible value combinations.

An example of a Bayesian network from the medical applications field models conditional dependencies for smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root-node) and the conditional (non-root-node) probabilities. Training: once a topology is given, the probabilities are estimated via the training data set.
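A tiny sketch of evaluating the joint probability of the six-variable example by multiplying root marginals and node conditionals; all table entries are invented for illustration.

```python
# Joint probability for the example DAG: x1, x2 roots; x3<-x2; x4<-x1,x2; x5<-x4; x6<-x4,x5.
# All variables are binary; the probability tables below are made-up numbers.
P_x1 = {1: 0.6, 0: 0.4}
P_x2 = {1: 0.3, 0: 0.7}
P_x3_given_x2  = {(1, 1): 0.8, (1, 0): 0.2}                                   # key: (x3, x2)
P_x4_given_x12 = {(1, 1, 1): 0.9, (1, 1, 0): 0.5, (1, 0, 1): 0.4, (1, 0, 0): 0.1}  # key: (x4, x1, x2)
P_x5_given_x4  = {(1, 1): 0.7, (1, 0): 0.2}                                   # key: (x5, x4)
P_x6_given_x45 = {(1, 1, 1): 0.95, (1, 1, 0): 0.6, (1, 0, 1): 0.5, (1, 0, 0): 0.05}  # key: (x6, x4, x5)

def p1(table, *key):
    """P(var = v | parents) for binary variables: stored value for v = 1, complement for v = 0."""
    v, parents = key[0], key[1:]
    p = table[(1, *parents)] if parents else table[1]
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5, x6):
    return (p1(P_x1, x1) * p1(P_x2, x2) * p1(P_x3_given_x2, x3, x2)
            * p1(P_x4_given_x12, x4, x1, x2) * p1(P_x5_given_x4, x5, x4)
            * p1(P_x6_given_x45, x6, x4, x5))

print(joint(1, 0, 0, 1, 1, 1))   # probability of one particular configuration
```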