Supervised: Classification Models
Examples of supervised data mining
- In supervised data mining, the target variable is identified
- The goal is to predict or explain a certain outcome
- Common applications:
- prediction models
- classification models
Classification model: response variable is categorical
- generalized linear models, CART, gradient boosting, neural networks, deep learning
- example: problems whose response takes values in {A, B, C}, {dog, cat}, or {0, 1}.
- a logistic regression model can perform this classification.
Example: Predicting a binary purchase action using the Purchase dataset
Suppose we need to predict whether a customer will purchase a product based on factors such as their age, gender, browsing history, and the time of day they visit the website.
Splitting the data into training and testing sets
library(readr)
Purchase <- read_csv("examples/Purchase.csv")
# View(Purchase)
sample_index <- sample(nrow(Purchase), nrow(Purchase) * 0.80)  # 80% for training
train <- Purchase[sample_index, ]
test <- Purchase[-sample_index, ]
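For a reproducible split, fix the random seed before sampling (a minimal sketch; the seed value 123 is arbitrary):
set.seed(123)  # any fixed value makes the 80/20 split reproducible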
Fitting a logistic regression model
The estimated coefficients of a logistic model do not allow us to directly determine the partial effect of a predictor variable on the probability. It is preferable to interpret logistic regression coefficients in terms of odds rather than probabilities. Odds are defined as the ratio of the probability of success to the probability of failure.
Let \(p = P(y = 1)\), with \(p \in [0, 1]\).
Odds = \(\frac{p}{1-p}\), between 0 and infinity.
After fitting, the estimated odds can be expressed as \(\frac{\hat{p}}{1-\hat{p}} = \exp(b_0 + b_1 x_1)\).
- The natural log of the odds (the logit) is a linear function: \(\log(\text{odds}) = b_0 + b_1 x_1\).
- So \(b_1 \times 100\) is the approximate percentage change in the odds when the predictor variable increases by one unit.
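A quick numeric check of this approximation, using a hypothetical coefficient \(b_1 = 0.05\):
b1 <- 0.05   # hypothetical value, not from a fitted model
exp(b1)      # 1.0513: the odds are multiplied by about 1.051, i.e.,
             # roughly a 5.1% increase, close to b1 * 100 = 5%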
# Fitting a logistic regression model
model <- glm(Purchase ~ Age + Browsing_History +
               Gender + Time_Of_Day,
             data = train,
             family = 'binomial')
summary(model)
Call:
glm(formula = Purchase ~ Age + Browsing_History + Gender + Time_Of_Day,
family = "binomial", data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.160626 0.338652 -6.380 1.77e-10 ***
Age 0.035148 0.006814 5.158 2.50e-07 ***
Browsing_History 2.048101 0.168325 12.168 < 2e-16 ***
GenderMale -0.043162 0.162931 -0.265 0.7911
Time_Of_DayEvening 0.357064 0.203671 1.753 0.0796 .
Time_Of_DayMorning -0.171330 0.194250 -0.882 0.3778
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1099.77 on 799 degrees of freedom
Residual deviance: 898.41 on 794 degrees of freedom
AIC: 910.41
Number of Fisher Scoring iterations: 4
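To read these estimates on the odds scale, exponentiate the coefficients; for example, each additional year of Age multiplies the odds of purchase by \(\exp(0.035) \approx 1.036\):
# Odds ratios: exponentiated coefficients of the fitted model
exp(coef(model))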
Making predictions
train_pred <- predict(model, newdata = train, type = 'response')
test_pred <- predict(model, newdata = test, type = 'response')
What about the purchase probability of a 35-year-old female customer who browsed the product in the evening?
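One way to answer this is with predict() on a one-row data frame (a sketch; the 0/1 coding of Browsing_History and the factor labels 'Female' and 'Evening' are assumptions inferred from the coefficient names in the summary above):
# Hypothetical new customer; variable codings are assumed, not confirmed
new_customer <- data.frame(Age = 35,
                           Browsing_History = 1,     # assumed: 1 = has browsed
                           Gender = 'Female',        # baseline level, per GenderMale
                           Time_Of_Day = 'Evening')
predict(model, newdata = new_customer, type = 'response')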
Evaluating the model
How should I evaluate a logistic regression model?
# Evaluating the accuracy
train_accuracy <- mean((train_pred > 0.5) == train$Purchase)
print(paste('Train_Accuracy:', train_accuracy))
[1] "Train_Accuracy: 0.725"
test_accuracy <- mean((test_pred > 0.5) == test$Purchase)
print(paste('Test_Accuracy:', test_accuracy))
[1] "Test_Accuracy: 0.69"
# Calculating AUC for training and testing sets
library(pROC)
train_auc <- roc(train$Purchase, train_pred)
test_auc <- roc(test$Purchase, test_pred)
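The roc() objects store the full ROC curves; pROC's auc() extracts the scalar AUC values, and plot() draws the curve:
# Extracting the area under each ROC curve
auc(train_auc)
auc(test_auc)
# plot(train_auc)   # optionally visualize the training ROC curve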