ID3 Algorithm – R Programming
School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics
HW07 Lecture 07 Classification
Questions
R exercise for Decision Tree, section 7_1
Explain how the Random Forest algorithm works
Iris dataset with Decision Tree vs. Random Forest
R exercise for Naïve Bayes, section 7_2
Analyze classifier performance, section 7_3
Redo the calculations for ID3 and Naïve Bayes for the Golf dataset
HW07 Q1 Apply the ID3 algorithm to demonstrate the decision tree for the following dataset
http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
Select | Size   | Color | Shape
yes    | medium | blue  | brick
yes    | small  | red   | sphere
yes    | large  | green | pillar
yes    | large  | green | sphere
no     | small  | red   | wedge
no     | large  | red   | wedge
no     | large  | red   | pillar
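ID3 picks, at each node, the attribute whose split maximizes information gain, gain(S, A) = H(S) − Σ_v (|S_v|/|S|) · H(S_v). A minimal R sketch of that calculation for the table above (the entropy and info_gain helpers and the shapes data frame are written here for illustration; they are not part of the course code):

entropy <- function(labels) {
  # Shannon entropy of a label vector, in bits
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

info_gain <- function(data, attribute, target = "Select") {
  # H(S) minus the size-weighted entropy of each partition S_v
  h_s <- entropy(data[[target]])
  splits <- split(data[[target]], data[[attribute]])
  h_s - sum(sapply(splits, function(s) length(s) / nrow(data) * entropy(s)))
}

shapes <- data.frame(
  Select = c("yes","yes","yes","yes","no","no","no"),
  Size   = c("medium","small","large","large","small","large","large"),
  Color  = c("blue","red","green","green","red","red","red"),
  Shape  = c("brick","sphere","pillar","sphere","wedge","wedge","pillar")
)

sapply(c("Size","Color","Shape"), function(a) info_gain(shapes, a))

Shape should come out with the largest gain, so ID3 places it at the root; the subtrees are then built by recursing on each partition.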
HW07 Q2
Analyze the R code in section 7_1 to create the decision tree classifier for the dataset bank-sample.csv
Create and explain all plots and results
# install packages rpart, rpart.plot
# put this code into an RStudio source file and execute lines via Ctrl+Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
## drop a few columns to simplify the tree
drops <- c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
summary(banktrain)
# Make a simple decision tree by only keeping the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing +
               loan + contact + poutcome,
             method="class",
             data=banktrain,
             control=rpart.control(minsplit=1),
             parms=list(split="information"))
summary(fit)
# Plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)
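Beyond the plot, the fitted tree can be sanity-checked by scoring records with predict(); a small sketch (the pred_class name is illustrative, not from the lecture code):

# Score the training set with the fitted tree and tabulate
# predicted vs. actual labels (assumes fit and banktrain from above)
pred_class <- predict(fit, banktrain, type="class")
table(predicted=pred_class, actual=banktrain$subscribed)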
HW07 Q3
Explain how the Random Forest algorithm works
http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics
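In brief: a random forest grows many decision trees, each trained on a bootstrap sample of the data and restricted to a random subset of features at each split, and it classifies new records by majority vote across the trees. A minimal sketch using the randomForest package on the built-in iris data (the parameter values are illustrative, not prescribed by the assignment):

# install.packages("randomForest")
library(randomForest)
set.seed(42)                     # make the bootstrap samples reproducible
rf <- randomForest(Species ~ ., data=iris,
                   ntree=500,    # number of bootstrapped trees
                   mtry=2,       # features considered at each split
                   importance=TRUE)
print(rf)                        # out-of-bag (OOB) error and confusion matrix
importance(rf)                   # per-variable importance measures

The out-of-bag error printed by the model is an internal estimate of test error: each tree is evaluated on the records left out of its own bootstrap sample.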
HW07 Q4 Using the Iris Dataset
Use the Decision Tree classifier and Random Forest
Attributes: sepal length, sepal width, petal length, petal width
All flowers contain a sepal and a petal
The dataset holds three iris species (Setosa, Versicolor, Virginica) with differing measurements
R.A. Fisher, 1936
HW07 Q4 Using the Iris Dataset
Decision Tree applied to the Iris dataset:
https://rpubs.com/abhaypadda/k-nn-decision-tree-on-IRIS-dataset or
https://davetang.org/muse/2013/03/12/building-a-classification-tree-in-r/
What are the disadvantages of decision trees?
https://www.quora.com/What-are-the-disadvantages-of-using-a-decision-tree-for-classification
Random Forest applied to the Iris dataset, compared against the decision tree (a side-by-side sketch follows this list):
https://rpubs.com/rpadebet/269829
http://rischanlab.github.io/RandomForest.html
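A minimal side-by-side sketch on iris, assuming a simple random train/test split (the split sizes and object names are illustrative):

# Decision tree vs. random forest on iris with a held-out test set
library(rpart)
library(randomForest)
set.seed(1)
idx   <- sample(nrow(iris), 100)   # 100 training rows, 50 held out
train <- iris[idx, ]
test  <- iris[-idx, ]
tree <- rpart(Species ~ ., data=train, method="class")
rf   <- randomForest(Species ~ ., data=train, ntree=500)
mean(predict(tree, test, type="class") == test$Species)   # tree accuracy
mean(predict(rf, test) == test$Species)                   # forest accuracy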
HW07 Q5 Section 7.2 Naïve Bayes in R
Get the data and the e1071 package
sample <- read.table("sample1.csv", header=TRUE, sep=",")
traindata <- as.data.frame(sample[1:14,])
testdata <- as.data.frame(sample[15,])
traindata  # lists the training data
testdata   # lists the test data; no Enrolls variable
install.packages("e1071", dep = TRUE)
library(e1071)  # contains the naiveBayes function
model <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata)
model    # prints the model: class priors and conditional probabilities
results <- predict(model, testdata)
results  # provides the test prediction
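Under the hood, naiveBayes scores each class as P(class) · Π_i P(x_i | class) and normalizes. The sketch below recomputes that product by hand from the model's stored tables; it assumes, as in the lecture example, that all four predictors are categorical (the indexing into model$tables is illustrative and worth verifying against str(model)):

# Recompute the naive Bayes posterior for the single test record
prior <- model$apriori / sum(model$apriori)   # P(class) from class counts
score <- sapply(names(prior), function(cls) {
  lik <- sapply(c("Age","Income","JobSatisfaction","Desire"), function(v)
    model$tables[[v]][cls, as.character(testdata[[v]])])  # P(x_i | class)
  prior[cls] * prod(lik)
})
score / sum(score)   # normalized posterior probabilities per class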
7.3 Classifier Performance
# install some packages
install.packages("ROCR")
library(ROCR)
library(e1071)   # provides naiveBayes(), used below
# training set
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
# drop a few columns
drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
# testing set
banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
banktest <- banktest[, !(names(banktest) %in% drops)]
# build the naïve Bayes classifier
nb_model <- naiveBayes(subscribed ~ ., data=banktrain)
# perform prediction on the testing set
nb_prediction <- predict(nb_model,
                         # remove column "subscribed"
                         banktest[, -ncol(banktest)],
                         type="raw")
score <- nb_prediction[, c("yes")]
actual_class <- banktest$subscribed == "yes"
pred <- prediction(score, actual_class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, lwd=2, xlab="False Positive Rate (FPR)",
     ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
## corresponding AUC score
auc <- performance(pred, "auc")
auc <- unlist(slot(auc, "y.values"))
auc
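The ROC curve sweeps a decision threshold across these scores; fixing one threshold (0.5 below, an arbitrary illustrative choice) collapses the same scores into a single confusion matrix:

# Confusion matrix at a fixed 0.5 threshold (assumes score and
# actual_class from above; the threshold is illustrative)
predicted_class <- score >= 0.5
table(predicted=predicted_class, actual=actual_class)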
7.3 Diagnostics of Classifiers
We cover three classifiers:
Logistic regression, decision trees, and naïve Bayes
Tools to evaluate classifier performance:
Confusion matrix
7.3 Diagnostics of Classifiers
Bank marketing example
Training set of 2000 records
Test set of 100 records, evaluated with the confusion-matrix diagnostics sketched below
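From confusion-matrix counts, the standard diagnostics follow directly; a minimal sketch (the TP/FP/TN/FN values are placeholders summing to a 100-record test set, not the bank example's actual numbers):

# Classifier diagnostics from confusion-matrix counts
# (placeholder counts, not the actual bank-marketing results)
TP <- 3; FP <- 2; TN <- 90; FN <- 5
accuracy  <- (TP + TN) / (TP + TN + FP + FN)
tpr       <- TP / (TP + FN)    # true positive rate (recall)
fpr       <- FP / (FP + TN)    # false positive rate
precision <- TP / (TP + FP)
c(accuracy=accuracy, TPR=tpr, FPR=fpr, precision=precision)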
HW07 Q7 Review the calculations for the ID3 and Naïve Bayes algorithms
Record | Outlook  | Temperature | Humidity | Windy | Play Golf
0      | Rainy    | Hot         | High     | False | No
1      | Rainy    | Hot         | High     | True  | No
2      | Overcast | Hot         | High     | False | Yes
3      | Sunny    | Mild        | High     | False | Yes
4      | Sunny    | Cool        | Normal   | False | Yes
5      | Sunny    | Cool        | Normal   | True  | No
6      | Overcast | Cool        | Normal   | True  | Yes
7      | Rainy    | Mild        | High     | False | No
8      | Rainy    | Cool        | Normal   | False | Yes
9      | Sunny    | Mild        | Normal   | False | Yes
10     | Rainy    | Mild        | Normal   | True  | Yes
11     | Overcast | Mild        | High     | True  | Yes
12     | Overcast | Hot         | Normal   | False | Yes
13     | Sunny    | Mild        | High     | True  | No
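As a starting point for the redo, here is ID3's root-node information gain for OUTLOOK computed in R from the table's counts (a sketch; the other attributes, and the Naïve Bayes class posteriors, follow the same pattern):

# Information gain of OUTLOOK on the golf table (9 Yes / 5 No overall)
entropy <- function(counts) {
  p <- counts / sum(counts)
  -sum(p[p > 0] * log2(p[p > 0]))
}
h_play <- entropy(c(9, 5))
# OUTLOOK partitions: Rainy 2 Yes/3 No, Overcast 4 Yes/0 No, Sunny 3 Yes/2 No
h_cond <- 5/14 * entropy(c(2, 3)) +
          4/14 * entropy(c(4, 0)) +
          5/14 * entropy(c(3, 2))
h_play - h_cond   # information gain, about 0.247 bits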
Questions?