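This courseware includes slides, examples, code, question banks, and audio. 小象学院 holds the full intellectual property rights. It is provided only to bona fide learners for use within this course and must not be distributed to any third party outside the course. No other person or organization may pirate, copy, or imitate its content or ideas, and all rights to pursue violators by legal means are reserved.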

Machine Learning Part 4: Classical Machine Learning Models
Zengchang Qin (Ph.D.)

Decision Tree Learning

Play-Tennis Problem

The Play-Tennis data comes from T. Mitchell's book [3]. We can find a tree that represents the data, with the “Yes” and “No” decisions given by the leaves.

[3] T. Mitchell (1997), Machine Learning, McGraw Hill.

Impurity

Greedy approach:
- Nodes with a homogeneous class distribution are preferred.
- This requires a measure of node impurity.

Multi-dimensional Attributes (Features)

Shannon's solution follows from the fundamental properties of information:
1. I(p) is anti-monotonic in p: increases and decreases in the probability of an event produce decreases and increases in information, respectively.
2. I(p) ≥ 0: information is a non-negative quantity.
3. I(1) = 0: events that always occur do not communicate information.
4. I(p1 · p2) = I(p1) + I(p2): information due to independent events is additive.
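
The four properties listed above can be checked numerically. The sketch below assumes the usual form of Shannon's solution, I(p) = log2(1/p), measured in bits; the function name `information` is only illustrative.

```python
import math

def information(p):
    """Information content, in bits, of an event with probability p (0 < p <= 1)."""
    return math.log2(1.0 / p)

# 1. Anti-monotonic: a less probable event carries more information.
rare, common = information(0.1), information(0.9)

# 2. Non-negative for every valid probability.
non_negative = all(information(p) >= 0 for p in (0.01, 0.25, 0.5, 1.0))

# 3. A certain event carries no information.
certain = information(1.0)

# 4. Additive for independent events: I(p1 * p2) = I(p1) + I(p2).
joint = information(0.5 * 0.25)
separate = information(0.5) + information(0.25)

print(rare > common, non_negative, certain, joint, separate)
```

With base-2 logarithms, an event of probability 1/2 carries exactly one bit, so the joint event with probability 1/8 carries three bits.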

Information Gain
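
A minimal sketch of information gain, assuming the standard definition: the entropy of the parent node minus the weighted entropy of its children. The class counts are tallied from the Refund and Cheat columns of the ten-record training table in this deck (Refund = Yes: 0 Yes, 3 No; Refund = No: 3 Yes, 4 No). The function names are illustrative.

```python
import math

def entropy(counts):
    """Entropy, in bits, of a node given its per-class record counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Counts are [Cheat = Yes, Cheat = No].
parent = [3, 7]
children = [[0, 3], [3, 4]]  # Refund = Yes, Refund = No
gain = information_gain(parent, children)
print(f"Gain(Refund) = {gain:.3f}")
```

A pure child such as Refund = Yes contributes zero entropy, so all of the remaining uncertainty comes from the Refund = No branch.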

Sub-Trees

Partition

General Way of Building Trees

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

Issues:
- Determine how to split the records.
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting.
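
The greedy strategy can be sketched as a short recursive procedure. This is an illustration under stated assumptions, not the course's implementation: it uses multi-way splits on categorical attributes only, picks the attribute with the lowest weighted Gini index, and stops when a node is pure or no attributes remain. The records are the Refund, Marital Status, and Cheat columns of the training table in this deck (Taxable Income is left out to keep the sketch categorical); all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows, attr):
    """Group rows by the value of one attribute (multi-way split)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups

def split_quality(rows, attr, target):
    """Weighted Gini index of the children produced by splitting on attr."""
    n = len(rows)
    return sum(len(g) / n * gini([r[target] for r in g])
               for g in partition(rows, attr).values())

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Stop splitting: the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: the attribute whose split has the lowest impurity.
    best = min(attrs, key=lambda a: split_quality(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {value: build_tree(group, rest, target)
                   for value, group in partition(rows, best).items()}}

raw = [("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
       ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
       ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
       ("No", "Single", "Yes")]
rows = [dict(zip(("Refund", "MarSt", "Cheat"), r)) for r in raw]
tree = build_tree(rows, ["Refund", "MarSt"], "Cheat")
print(tree)
```

With only these two attributes the greedy choice at the root is MarSt (weighted Gini 0.300 versus 0.343 for Refund), so the result differs from a tree that tests Refund first. Several different trees can fit the same data.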

Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

Attribute Types

Depends on attribute types:
- Nominal
- Ordinal
- Continuous

Depends on number of ways to split:
- 2-way split
- Multi-way split

Sub-Trees

Multi-way split: use as many partitions as distinct values.
  Size: Small | Medium | Large

Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size: {Medium, Large} | {Small}
  or Size: {Small, Medium} | {Large}

What about this split?
  Size: {Small, Large} | {Medium}

Splitting

Discretization

Different ways of handling continuous attributes:
- Discretization to form an ordinal categorical attribute
  - Static: discretize once at the beginning.
  - Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
- Binary decision: (A < v) or (A ≥ v)
  - Consider all possible splits and find the best cut.
  - Can be more compute intensive.
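
Equal interval and equal frequency bucketing can be compared on the Taxable Income values (in thousands) from the training table in this deck. This is a minimal sketch; the function names and the bucket counts are illustrative choices.

```python
def equal_width_edges(values, k):
    """Interior cut points of k buckets that each span the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_bins(values, k):
    """Sort the values and cut them into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
edges = equal_width_edges(incomes, 4)
bins = equal_frequency_bins(incomes, 5)
print(edges)
print(bins)
```

Because one income (220) lies far from the rest, equal interval buckets leave most records in the lowest range, while equal frequency buckets keep two records in each bucket.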

Gini Index

Gini index for a given node t:

$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$

(NOTE: p(j | t) is the relative frequency of class j at node t.)

- Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying least interesting information.
- Minimum (0.0) when all records belong to one class, implying most interesting information.

C1  C2  Gini
0   6   0.000
1   5   0.278
2   4   0.444
3   3   0.500

Detailed Calculation

$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$

Node with C1 = 0, C2 = 6:
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

Node with C1 = 1, C2 = 5:
  P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

Node with C1 = 2, C2 = 4:
  P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
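
The same calculation can be written as a short function. This sketch takes the per-class record counts of a node and reproduces the values from the slides; the function name is illustrative.

```python
def gini(counts):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# (C1, C2) record counts for the example nodes.
nodes = [(0, 6), (1, 5), (2, 4), (3, 3)]
values = [gini(node) for node in nodes]
for (c1, c2), g in zip(nodes, values):
    print(f"C1={c1}, C2={c2}: Gini = {g:.3f}")
```

The evenly split node (3, 3) reaches the two-class maximum of 1 - 1/2 = 0.5, and the pure node (0, 6) reaches the minimum of 0.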

Used in CART, SLIQ, SPRINT.

When a node p is split into k partitions (children), the quality of the split is computed as

$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$

where n_i = number of records at child i, and n = number of records at node p.

Gini Split - Looks Familiar?

For efficient computation, for each attribute:
- Sort the attribute on values.
- Linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index.

Sorted values (Taxable Income) and class labels (Cheat):

Taxable Income  60  70  75  85   90   95   100  120  125  220
Cheat           No  No  No  Yes  Yes  Yes  No   No   No   No

Split positions, class counts on each side, and Gini index:

Split position  Yes (<=)  Yes (>)  No (<=)  No (>)  Gini
55              0         3        0        7       0.420
65              0         3        1        6       0.400
72              0         3        2        5       0.375
80              0         3        3        4       0.343
87              1         2        3        4       0.417
92              2         1        3        4       0.400
97              3         0        3        4       0.300
110             3         0        4        3       0.343
122             3         0        5        2       0.375
172             3         0        6        1       0.400
230             3         0        7        0       0.420

Gini Split
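
The sorted scan can be sketched as follows. The records are sorted once, and a single pass over the candidate split positions moves records from the right-hand counts to the left-hand counts, computing the weighted Gini index of the two children at each position. The Taxable Income values, Cheat labels, and candidate positions are the ones listed on the slide; the function names are illustrative.

```python
from collections import Counter

def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0  # convention for an empty partition
    return 1.0 - sum((c / n) ** 2 for c in counts)

def scan_splits(values, labels, positions):
    """Weighted Gini index of the split (A <= v) versus (A > v) for each position v."""
    records = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left, right = Counter(), Counter(labels)
    n, i, results = len(records), 0, []
    for pos in sorted(positions):
        # Move every record with value <= pos to the left side (count-matrix update).
        while i < n and records[i][0] <= pos:
            left[records[i][1]] += 1
            right[records[i][1]] -= 1
            i += 1
        g = (i / n) * gini([left[c] for c in classes]) \
            + ((n - i) / n) * gini([right[c] for c in classes])
        results.append((pos, g))
    return results

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
positions = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]

results = scan_splits(income, cheat, positions)
best_position, best_gini = min(results, key=lambda r: r[1])
print(f"best split: Taxable Income <= {best_position}, Gini = {best_gini:.3f}")
```

The scan selects the split at 97, where the weighted Gini index is 0.300, matching the lowest entry in the Gini row of the table.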

Misclassification Error

$Error(t) = 1 - \max_i p(i \mid t)$

Example nodes (class counts):

C1  C2
0   6
2   4
1   5
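
A minimal sketch of the misclassification error for the three example nodes. It assumes the standard definition, Error(t) = 1 - max_i p(i | t): the fraction of records at node t that do not belong to the majority class. The function name is illustrative.

```python
def classification_error(counts):
    """Misclassification error of a node: 1 minus the majority-class frequency."""
    n = sum(counts)
    return 1.0 - max(counts) / n

# (C1, C2) record counts for the example nodes.
nodes = [(0, 6), (2, 4), (1, 5)]
errors = [classification_error(node) for node in nodes]
for (c1, c2), e in zip(nodes, errors):
    print(f"C1={c1}, C2={c2}: Error = {e:.3f}")
```

Like the Gini index, this measure is 0 for a pure node and grows as the classes become more evenly mixed.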