DMML, 21 Jan 2020
Constructing decision trees
- Building smallest tree is NP-complete
- Greedy heuristic: maximize improvement in purity

Measuring impurity:
- Misclassification rate
- Entropy ("Information gain")
- Gini Index
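A quick sketch of these three measures on a list of class labels (plain Python; the helper names are my own):

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of rows in each class."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def misclassification_rate(labels):
    """1 - fraction of the majority class."""
    return 1.0 - max(class_probabilities(labels))

def entropy(labels):
    """-sum(p_i * log2(p_i)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def gini_index(labels):
    """1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

labels = ["yes", "yes", "yes", "no"]
print(misclassification_rate(labels))  # 0.25
print(entropy(labels))                 # ~0.811
print(gini_index(labels))              # 0.375
```

All three are zero for a pure node and largest when the classes are evenly mixed.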
Problem with using information gain directly:
Attribute = Aadhaar no / Passport No / Serial No / ...
Suppose we pick Aadhaar as next question:
(diagram: D splits into D1, D2, ..., DN, one singleton subset per row)
Highest possible information gain, but totally useless!
Penalize attributes with too many possible values:
- Compute entropy (Gini index) of the attribute itself!
Aadhaar: N values, each appearing once in the table, so p_i = 1/N for each value
entropy = -Σ p_i log p_i = -Σ (1/N) log(1/N) = -log(1/N) = log N
information-gain(Ai) = improvement in purity if we choose Ai
entropy(Ai) = entropy of the attribute Ai itself

information-gain-ratio(Ai) = information-gain(Ai) / entropy(Ai)
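A sketch of this, reusing entropy() from the sketch above; information_gain and gain_ratio are my own names, and a split is passed as one list of labels per attribute value:

```python
import math

def information_gain(labels, subsets):
    """Drop in entropy when the rows with these labels are
    partitioned into subsets (one list of labels per attribute value)."""
    n = len(labels)
    after = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - after

def gain_ratio(labels, subsets):
    """information-gain(Ai) / entropy(Ai): divide the raw gain by the
    entropy of the attribute's own value distribution."""
    n = len(labels)
    split_entropy = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets)
    return information_gain(labels, subsets) / split_entropy

labels     = ["yes", "yes", "no", "no"]
id_split   = [["yes"], ["yes"], ["no"], ["no"]]  # Aadhaar-like: one row per value
good_split = [["yes", "yes"], ["no", "no"]]
print(gain_ratio(labels, id_split))    # 0.5  (gain 1.0 / attribute entropy log2(4) = 2.0)
print(gain_ratio(labels, good_split))  # 1.0  (gain 1.0 / attribute entropy 1.0)
```

Both splits have the same raw gain of 1.0 bit, but the ratio penalizes the ID-like attribute, exactly as intended.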
Two well-known implementations of decision trees:
- CART: Classification & Regression Tree (Gini Index), Breiman et al.
- C4.5: Quinlan (Entropy)
Continuous attributes? e.g. "salary", "cost"
- Maybe we know lower & upper bounds for Ai, and at some granularity
- Typical question should be "Is Ai ≤ v?" How do we choose v?
All we know about Ai is what we see in the table: ≤ N values across N rows, v1 < v2 < ... < vN
All threshold values between consecutive vj and vj+1 are equivalent
(diagram: v1, v2, ..., vN on a line, with midpoints m1, ..., mN-1 between them)
- Try each midpoint mj = (vj + vj+1)/2, evaluate purity gain, keep the best
- Should we choose midpoints mj or actual vj?
Better to use actual vj's:
- for interpretability of the model
- not all values are possible
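A sketch of this search, trying each actual value vj as the threshold and keeping the best purity gain (reuses information_gain from above; best_threshold is a made-up name):

```python
def best_threshold(values, labels):
    """Try each actual value vj of the attribute as the threshold in
    'Ai <= v?' and keep the one with the best purity gain."""
    best_v, best_gain = None, -1.0
    for v in sorted(set(values))[:-1]:  # the largest value splits nothing off
        left  = [lab for x, lab in zip(values, labels) if x <= v]
        right = [lab for x, lab in zip(values, labels) if x > v]
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

salary = [20, 25, 40, 60, 80]
label  = ["no", "no", "yes", "yes", "yes"]
print(best_threshold(salary, label))   # (25, ~0.971): ask "salary <= 25?"
```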
CART - Regression
- Produce a number as an answer (typically, fit a function f(x1, ..., xn) to data)
- Training data has a numeric value as target
Regression tree: like a classification tree
- Variance is a good error metric
(diagram: at a leaf with m rows, predict the mean salary of those rows)
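A sketch of this idea, assuming the usual CART-style scoring: a leaf's error is the variance of its targets, and a split is judged by the weighted variance it removes (helper names are mine):

```python
def variance(ys):
    """Mean squared deviation from the mean: the error of a leaf
    that predicts the mean for all of its m rows."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(ys, left, right):
    """How much a binary split lowers the weighted variance."""
    n = len(ys)
    return (variance(ys)
            - len(left) / n * variance(left)
            - len(right) / n * variance(right))

salaries = [20.0, 22.0, 21.0, 80.0, 78.0]
print(variance(salaries))                     # large: one leaf fits badly
print(variance_reduction(salaries,
                         [20.0, 22.0, 21.0],  # left leaf predicts 21.0
                         [80.0, 78.0]))       # right leaf predicts 79.0
```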
CART - Classification And Regression Tree (only asks binary questions)
How good is the classifier?
- What is the correct measure?
- On what data can we compute the measure?
  Need data for which the correct answer is known
  Only labelled data we have is training data
Withhold some training data for testing
- Typically 70% to build model, 30% to test
  (Be careful to select the 30% "randomly")
- Build a model on training set
- Evaluate on test set
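A minimal sketch of the 70/30 holdout split (plain Python; the shuffle is what makes the withheld 30% "random"):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle, then withhold the last test_fraction of rows for testing."""
    rows = rows[:]                 # leave the caller's list untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]  # (training set, test set)

train, test = train_test_split(list(range(10)))
print(len(train), len(test))       # 7 3
```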
Sometimes training data is too sparse to "waste"
Cross Validation
- Choose different subsets as test data
- Repeatedly build models
10-fold cross validation:
- 90% train, 10% test
- Covers whole data across 10 iterations
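A sketch of 10-fold cross validation as described here; build_model and score are hypothetical stand-ins for the model-building and evaluation steps:

```python
import random

def cross_validate(rows, build_model, score, k=10, seed=0):
    """Each fold is the test set once; the other k-1 folds
    (90% of the data for k=10) train the model."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]     # k near-equal folds
    scores = []
    for i in range(k):
        test  = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(score(build_model(train), test))
    return sum(scores) / k                     # average test score
```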
If results are good, go back and build the model on the full data