
Note: this courseware (presentation slides, examples, code, question bank, and audio) is the intellectual property of Xiaoxiang Academy (小象学院); it is intended for good-faith learners within this course only and may not be distributed outside the course, pirated, or copied. All legal rights are reserved.

Machine Learning Part 4: Classical Machine Learning Models

Zengchang Qin (Ph.D.)

Decision Tree Learning

Play-Tennis Problem

From T. Mitchell's book [3] we can find a tree to represent the Play-Tennis data, with the decisions "Yes" and "No" at the leaves.

[3] T. Mitchell (1997), Machine Learning, McGraw Hill.

Impurity

Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity.

Multi-dimensional Attributes (Features)

Shannon's solution follows from the fundamental properties of information:

1. I(p) is anti-monotonic in p: increases and decreases in the probability of an event produce decreases and increases in information, respectively.
2. I(p) ≥ 0: information is a non-negative quantity.
3. I(1) = 0: events that always occur do not communicate information.
4. I(p1 · p2) = I(p1) + I(p2): information due to independent events is additive.
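Properties 1–4 are satisfied by I(p) = −log p (unique up to the base of the logarithm), whose expected value over a distribution is the Shannon entropy. A small runnable sketch (the function names are illustrative, not from the courseware):

```python
import math

def self_information(p: float) -> float:
    """I(p) = -log2(p): less probable events carry more information."""
    return -math.log2(p)

def entropy(probs: list[float]) -> float:
    """Shannon entropy: the expected self-information of a distribution."""
    return sum(p * self_information(p) for p in probs if p > 0)

# Property 3: a certain event communicates no information.
print(self_information(1.0) == 0.0)
# Property 4: information from independent events is additive.
print(self_information(0.5 * 0.25) == self_information(0.5) + self_information(0.25))
# A fair coin carries 1 bit of entropy; a biased coin carries less.
print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))
```

Entropy defined this way is the impurity measure behind the information gain discussed on the following slides.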

Information Gain

(figure omitted)

Sub-Trees

(figure omitted)

Partition

(figure omitted)

General Way of Building Trees

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

Issues:
- Determine how to split the records:
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting.
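The greedy strategy can be sketched as a short recursive procedure. This is an illustrative outline, not the course's implementation: it handles categorical attributes only, scores candidate splits with the weighted Gini impurity introduced later in the deck, and stops when a node is pure or no attributes remain:

```python
def gini(labels):
    """Node impurity: 1 - sum_j p_j^2 over the class frequencies."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build_tree(rows, labels, attrs):
    """Greedy recursive splitting: pick the attribute whose partition has
    the lowest weighted impurity; stop at pure nodes or when attrs run out."""
    if len(set(labels)) == 1 or not attrs:
        return max(set(labels), key=labels.count)  # leaf: majority class

    def split_score(attr):
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[attr], []).append(y)
        return sum(len(ys) / len(labels) * gini(ys) for ys in parts.values())

    best = min(attrs, key=split_score)
    children = {}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     [a for a in attrs if a != best])
    return {"attr": best, "children": children}

rows = [{"Refund": "Yes", "Status": "Single"}, {"Refund": "Yes", "Status": "Married"},
        {"Refund": "No", "Status": "Single"}, {"Refund": "No", "Status": "Married"}]
labels = ["No", "No", "Yes", "Yes"]
tree = build_tree(rows, labels, ["Refund", "Status"])
print(tree)  # splits on Refund, since that split yields two pure children
```

The tiny four-record example is made up for illustration; on it, splitting on Refund gives two pure children (weighted Gini 0), so the greedy criterion picks it over Status.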

社区用户整理TidRefundMaritalStatusTaxableeCheat1YesSingle125KNo2NoMarried100KNo3NoSingle70KNo4YesMarried120KNo5NoDivorced95KYes6NoMarried60KNo7YesDivorced220KNo8NoSingle85KYes9NoMarried75KNo10NoSingle90KYes0RefundMarStTaxIncYESNONONOYesNoMarriedSingle,

Divorced<

80K>

80KSplitting

AttributesTraining

aModel: Decision

TreeAttribut本e教Tywpwwe.社区用户整理本由

Attribute Types

Splitting depends on the attribute type:
- Nominal
- Ordinal
- Continuous

and on the number of ways to split:
- 2-way split
- Multi-way split

Sub-Trees

(figure omitted)

Splitting an Ordinal Attribute

Multi-way split: use as many partitions as there are distinct values.

    Size → {Small}, {Medium}, {Large}

Binary split: divide the values into two subsets; we need to find the optimal partitioning.

    Size → {Medium, Large} | {Small}   or   {Small, Medium} | {Large}

What about this split?

    Size → {Small, Large} | {Medium}
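A categorical attribute with k distinct values admits 2^(k-1) − 1 candidate binary partitions, which is why the optimal one must be searched for. A sketch (illustrative, not course code) that enumerates them:

```python
from itertools import combinations

def binary_partitions(values):
    """All ways to divide a value set into two non-empty subsets."""
    vals = sorted(values)
    out = []
    # Fix vals[0] in the left subset to avoid counting mirror images twice.
    rest = vals[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {vals[0], *combo}
            right = set(vals) - left
            if right:
                out.append((left, right))
    return out

for left, right in binary_partitions({"Small", "Medium", "Large"}):
    print(left, "|", right)  # 2^(3-1) - 1 = 3 partitions for three values
```

For an ordinal attribute such as Size, only order-preserving partitions are admissible, which is what makes the {Small, Large} | {Medium} split questionable.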

Discretization

Different ways of handling a continuous attribute:

- Discretization to form an ordinal categorical attribute:
  - Static: discretize once at the beginning.
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
- Binary decision: (A < v) or (A ≥ v):
  - Consider all possible splits and find the best cut.
  - Can be more compute-intensive.
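The two bucketing schemes can be sketched in a few lines (illustrative helpers, not the course's code), using the Taxable Income values from the training table:

```python
def equal_interval_bins(values, k):
    """Static discretization: k buckets of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp the maximum value into the last bucket.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """k buckets holding (roughly) the same number of points (percentiles)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
print(equal_interval_bins(incomes, 2))   # nine values in bucket 0, only 220 in bucket 1
print(equal_frequency_bins(incomes, 2))  # five values per bucket
```

Equal-interval buckets are skewed by the outlier 220, while equal-frequency buckets stay balanced; that is the usual argument for percentile-based discretization.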

社区用户整理Gi

ndexj(NOTE:

p(

j

|

t)

is

the

relative

frequency

of

class

j

atnode

t).um

(1

-

1/nc)

when

records

are

equally

distributed

among

allclasses,

implying

least

interesting

informationMinimum

(0.0)

when

all

records

belong

to

one

class,

implying

mostinteresting

informationGi ndex

for

a

given

node

t

:GINI

(t)

1

[

p(

j

|

t)]2C10C26Gini=0.000C12C24Gini=0.444C13C23Gini=0.500C11C25Gini=0.278本由

社区用户整理本由
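The formula and the four example nodes can be checked directly (a sketch; class counts stand in for the node's records):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, with p(j|t) taken from raw class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The four examples from the slide (C1/C2 counts out of six records):
for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1}, C2={c2}: Gini={gini([c1, c2]):.3f}")
```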

Detailed Calculation

    GINI(t) = 1 − Σ_j [p(j | t)]²

C1 = 0, C2 = 6:
    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
    Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1 = 1, C2 = 5:
    P(C1) = 1/6,  P(C2) = 5/6
    Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 = 2, C2 = 4:
    P(C1) = 2/6,  P(C2) = 4/6
    Gini = 1 − (2/6)² − (4/6)² = 0.444

Gini Split – Looks Familiar?

Used in CART, SLIQ, SPRINT. When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and n = number of records at node p.

Gini Split for a Continuous Attribute

For efficient computation, for each attribute:
- Sort the attribute on its values.
- Linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index.

Sorted values of Taxable Income with class labels:

    Cheat:           No   No   No   Yes  Yes  Yes  No   No   No   No
    Taxable Income:  60   70   75   85   90   95   100  120  125  220

Candidate split positions and their Gini values:

    Split position:  55    65    72    80    87    92    97    110   122   172   230
    Gini:            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at position 97, with the minimum Gini of 0.300.
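The sort-and-scan procedure can be sketched as follows (illustrative, not the course's code); on the slide's data it reproduces the minimum weighted Gini of 0.300 at the cut between 95 and 100 (the slide's split position 97):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 from raw class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_threshold(values, labels):
    """Sort once, then scan candidate midpoints, keeping running class
    counts on each side and the weighted Gini of every candidate split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_yes = sum(1 for _, y in pairs if y == "Yes")
    best = (float("inf"), None)
    left_yes = 0
    for i in range(n - 1):
        if pairs[i][1] == "Yes":
            left_yes += 1
        left_n, right_n = i + 1, n - i - 1
        score = (left_n / n) * gini([left_yes, left_n - left_yes]) \
              + (right_n / n) * gini([total_yes - left_yes,
                                      right_n - (total_yes - left_yes)])
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        best = min(best, (score, threshold))
    return best

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_threshold(income, cheat))  # minimum weighted Gini near 0.300, cut at 97.5
```

This implementation only tries midpoints between consecutive sorted values, so it skips the slide's endpoint candidates 55 and 230, which are never minima here.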

Misclassification Error

Classification error at a node t:

    Error(t) = 1 − max_i P(i | t)

For the example counts:

    C1 = 0, C2 = 6:  Error = 1 − max(0/6, 6/6) = 0.000
    C1 = 1, C2 = 5:  Error = 1 − max(1/6, 5/6) = 1/6 ≈ 0.167
    C1 = 2, C2 = 4:  Error = 1 − max(2/6, 4/6) = 1/3 ≈ 0.333
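A sketch of the error computation on the same class counts (illustrative code, not from the courseware):

```python
def misclassification_error(counts):
    """Error(t) = 1 - max_i P(i|t): the fraction of records a
    majority-class leaf at node t would label incorrectly."""
    n = sum(counts)
    return 1.0 - max(counts) / n

for c1, c2 in [(0, 6), (1, 5), (2, 4)]:
    print(f"C1={c1}, C2={c2}: Error={misclassification_error([c1, c2]):.3f}")
```

Like Gini, the error is 0 for a pure node and largest for a balanced one, but it is less sensitive to changes in the class distribution, which is why Gini and entropy are usually preferred for choosing splits.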
