Supervised: Classification Models
Examples of supervised data mining
- In supervised data mining, the target variable is identified
- The goal is to predict or explain a certain outcome
- Common applications:
- prediction models
- classification models
Classification model: response variable is categorical
- generalized linear models, CART, gradient boosting, neural networks, deep learning
- example: problems whose response takes values in {A, B, C}, {dog, cat}, or {0, 1}.
- a logistic regression model can perform this classification.
Example: Predicting a binary purchase action using the Purchase dataset
Suppose we need to predict whether a customer will purchase a product based on factors such as their age, gender, browsing history, and the time of day they visit the website.
Splitting the data into training and testing sets
library(readr)
Purchase <- read_csv("examples/Purchase.csv")
# View(Purchase)
sample_index <- sample(nrow(Purchase), nrow(Purchase) * 0.80)  # 80% for training
train <- Purchase[sample_index, ]
test <- Purchase[-sample_index, ]
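For a reproducible split, fix the random seed before sampling (a minimal sketch; the seed value 123 is arbitrary):
set.seed(123)  # any fixed value makes the 80/20 split reproducible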
Fitting a logistic regression model
The estimated coefficients of a logistic model do not allow us to directly determine the partial effect of a predictor variable on the probability. It is preferable to interpret logistic regression coefficients in terms of odds rather than probabilities. Odds are defined as the ratio of the probability of success to the probability of failure.
Let \(p = P(y = 1)\), with \(p \in [0, 1]\).
Odds = \(\frac{p}{1-p}\), between 0 and infinity.
After fitting, the estimated odds can be expressed as \(\frac{\hat{p}}{1-\hat{p}} = \exp(b_0 + b_1 x_1)\).
- The natural log of the odds (the logit) is a linear function: \(\log(\text{odds}) = b_0 + b_1 x_1\).
- So \(b_1 \times 100\) is the approximate percentage change in the odds when the predictor variable increases by one unit.
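A quick numeric check of this approximation, using a hypothetical coefficient \(b_1 = 0.05\):
b1 <- 0.05   # hypothetical value, not from a fitted model
exp(b1)      # 1.0513: the odds are multiplied by about 1.051, i.e.,
             # roughly a 5.1% increase, close to b1 * 100 = 5%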
# Fitting a logistic regression model
model <- glm(Purchase ~ Age + Browsing_History +
               Gender + Time_Of_Day,
             data = train,
             family = 'binomial')
summary(model)
Call:
glm(formula = Purchase ~ Age + Browsing_History + Gender + Time_Of_Day,
family = "binomial", data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.160626 0.338652 -6.380 1.77e-10 ***
Age 0.035148 0.006814 5.158 2.50e-07 ***
Browsing_History 2.048101 0.168325 12.168 < 2e-16 ***
GenderMale -0.043162 0.162931 -0.265 0.7911
Time_Of_DayEvening 0.357064 0.203671 1.753 0.0796 .
Time_Of_DayMorning -0.171330 0.194250 -0.882 0.3778
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1099.77 on 799 degrees of freedom
Residual deviance: 898.41 on 794 degrees of freedom
AIC: 910.41
Number of Fisher Scoring iterations: 4
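To read these estimates on the odds scale, exponentiate the coefficients; for example, each additional year of Age multiplies the odds of purchase by \(\exp(0.035) \approx 1.036\):
# Odds ratios: exponentiated coefficients of the fitted model
exp(coef(model))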
Making predictions
train_pred <- predict(model, newdata = train, type = 'response')
test_pred <- predict(model, newdata = test, type = 'response')
What about the purchase probability of a 35-year-old female customer who browsed the product in the evening?
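One way to answer this is with predict() on a one-row data frame (a sketch; the 0/1 coding of Browsing_History and the factor labels 'Female' and 'Evening' are assumptions inferred from the coefficient names in the summary above):
# Hypothetical new customer; variable codings are assumed, not confirmed
new_customer <- data.frame(Age = 35,
                           Browsing_History = 1,     # assumed: 1 = has browsed
                           Gender = 'Female',        # baseline level, per GenderMale
                           Time_Of_Day = 'Evening')
predict(model, newdata = new_customer, type = 'response')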
Evaluating the model
How should I evaluate a logistic regression model?
# Evaluating the accuracy
train_accuracy <- mean((train_pred > 0.5) == train$Purchase)
print(paste('Train_Accuracy:', train_accuracy))
[1] "Train_Accuracy: 0.725"
test_accuracy <- mean((test_pred > 0.5) == test$Purchase)
print(paste('Test_Accuracy:', test_accuracy))
[1] "Test_Accuracy: 0.69"
# Calculating AUC for training and testing sets
library(pROC)
train_auc <- roc(train$Purchase, train_pred)
test_auc <- roc(test$Purchase, test_pred)
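The roc() objects store the full ROC curves; pROC's auc() extracts the scalar AUC values, and plot() draws the curve:
# Extracting the area under each ROC curve
auc(train_auc)
auc(test_auc)
# plot(train_auc)   # optionally visualize the training ROC curve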