Variable selection for logistic regression

Variable selection methods

Variable selection methods are techniques for identifying the most relevant and impactful features (variables) in a dataset. By focusing only on the most important variables, they reduce noise and enhance the predictive performance of machine learning models.

Variable selection is important in data mining because it improves the quality and efficiency of predictive models, reduces complexity, and makes the results easier to interpret.

Application: Spam detector

Jennifer Lee is a data scientist working for a telecommunications company. The company has been experiencing a surge in customer complaints about spam and phishing emails that attempt to steal personal information. Jennifer is tasked with developing a machine learning model to detect and filter out these unwanted or malicious emails. Her objective is to implement spam filters that analyze email content for suspicious URLs, phishing keywords, and unusual sender behavior, proactively protecting customers from phishing attacks.

The objective is to detect spam based on the number of recipients, the number of hyperlinks, and the number of characters for each email.

Download and import the Spam.csv dataset

# Reading the "Spam.csv" file from the URL and storing it in the 'Spam' dataframe.
Spam <- read.csv("https://www.educateusgpt.org/examples/Spam.csv", header = TRUE)

# Providing a summary of the 'Spam' dataframe, including statistical summaries.
summary(Spam)
     Record           Spam         Recipients      Hyperlinks    
 Min.   :  1.0   Min.   :0.000   Min.   :12.00   Min.   : 0.000  
 1st Qu.:125.8   1st Qu.:0.000   1st Qu.:12.00   1st Qu.: 3.000  
 Median :250.5   Median :1.000   Median :13.00   Median : 6.000  
 Mean   :250.5   Mean   :0.516   Mean   :14.42   Mean   : 6.226  
 3rd Qu.:375.2   3rd Qu.:1.000   3rd Qu.:15.00   3rd Qu.: 9.000  
 Max.   :500.0   Max.   :1.000   Max.   :52.00   Max.   :12.000  
   Characters   
 Min.   : 18.0  
 1st Qu.: 38.0  
 Median : 58.0  
 Mean   : 58.6  
 3rd Qu.: 79.0  
 Max.   :103.0  
# Remove the 'Record' ID column, which carries no information for prediction
Spam <- Spam[,-1]
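
Before fitting any model, it is worth confirming the variable types and checking how balanced the two classes are. A minimal sketch using the columns shown in the summary above:

# Inspect the structure of the remaining columns
str(Spam)

# Class balance of the response: counts of non-spam (0) and spam (1) emails
table(Spam$Spam)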

Fit the logistic regression model

Develop the logistic regression model for the Spam dataset.

# Splitting the dataset into training and testing sets.
# set.seed(1)  # any fixed seed makes the random split reproducible
# sample_index <- sample(nrow(Spam), nrow(Spam)*0.80)
# train <- Spam[sample_index,]
# test <- Spam[-sample_index,]

# Or use a simpler deterministic split: first 400 rows for training, last 100 for testing
train <- Spam[1:400,]
test <- Spam[401:500,]

# Train a logistic regression model with all variables
Logit_mod <- glm(Spam ~ Recipients + Hyperlinks + Characters,
                 data = train, 
                 family = binomial)

# Print the summary of the model
summary(Logit_mod)

Call:
glm(formula = Spam ~ Recipients + Hyperlinks + Characters, family = binomial, 
    data = train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.807162   0.812737  -5.915 3.32e-09 ***
Recipients   0.160926   0.042556   3.782 0.000156 ***
Hyperlinks   0.523397   0.049921  10.484  < 2e-16 ***
Characters  -0.010583   0.005542  -1.910 0.056181 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 553.71  on 399  degrees of freedom
Residual deviance: 353.62  on 396  degrees of freedom
AIC: 361.62

Number of Fisher Scoring iterations: 5
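
The held-out test set created above is not used by the selection procedures that follow, but it can be used to gauge out-of-sample performance. A minimal sketch, assuming a 0.5 probability cutoff (the cutoff and the object names pred_prob and pred_class are assumptions, not part of the original exercise):

# Predicted spam probabilities on the test set
pred_prob <- predict(Logit_mod, newdata = test, type = "response")

# Classify using an assumed 0.5 cutoff
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Confusion matrix: predicted vs. actual labels
table(Predicted = pred_class, Actual = test$Spam)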

Best subset selection

Best subset selection evaluates all possible combinations of predictors (features) to find the subset of variables that minimizes a chosen criterion, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).

If you cannot write your own code, try asking the EducateUsGPT assistant for help.

Ask Educate Us GPT

Show me sample code in R to apply the best subset selection method to the model object 'Logit_mod'.

# Load the leaps package for regsubsets function
library(leaps)

# Search all candidate subsets, keeping the best model of each size (up to 3 variables)
best_subset_model <- 
  regsubsets(Spam ~ ., 
             data = train, 
             nbest=1, 
             nvmax = 3)

# Summarize the subset selection results
best_model <- summary(best_subset_model)
best_model
Subset selection object
Call: regsubsets.formula(Spam ~ ., data = train, nbest = 1, nvmax = 3)
3 Variables  (and intercept)
           Forced in Forced out
Recipients     FALSE      FALSE
Hyperlinks     FALSE      FALSE
Characters     FALSE      FALSE
1 subsets of each size up to 3
Selection Algorithm: exhaustive
         Recipients Hyperlinks Characters
1  ( 1 ) " "        "*"        " "       
2  ( 1 ) "*"        "*"        " "       
3  ( 1 ) "*"        "*"        "*"       

# Visualize which variables enter the best model of each size, ranked by BIC
plot(best_subset_model, scale="bic")
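
Note that regsubsets performs the subset search with least-squares fits, so on the 0/1 Spam response it is effectively selecting for a linear probability model rather than for the logistic regression itself. The summary object also stores the BIC of the best model of each size, so the winning subset can be picked programmatically; a minimal sketch (best_size is a hypothetical name):

# BIC of the best model of each size (1, 2, and 3 variables)
best_model$bic

# Subset size with the lowest BIC
best_size <- which.min(best_model$bic)

# Coefficients of the best subset of that size
coef(best_subset_model, id = best_size)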

Forward/Backward/Stepwise selection

To perform forward, backward, or stepwise selection in R, we first define the starting models:

# nullmodel is the model with no variables in it,
nullmodel <- lm(Spam ~ 1, data=train)
# while fullmodel is the model with every variable in it.
fullmodel <- Logit_mod

Backward Elimination

model_step_b <- step(fullmodel, direction='backward')
Start:  AIC=361.62
Spam ~ Recipients + Hyperlinks + Characters

             Df Deviance    AIC
<none>            353.62 361.62
- Characters  1   357.32 363.32
- Recipients  1   374.71 380.71
- Hyperlinks  1   542.19 548.19
  1. Start: AIC=361.62: This is the Akaike Information Criterion (AIC) of the full model before any variables are removed.

  2. - Recipients, - Characters, - Hyperlinks: These lines show the deviance and AIC that would result if each variable were removed from the model individually. Removing any one of them would increase the AIC, indicating a worse fit, so the best model is the starting model with all three predictors included.
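
By default, step penalizes model size with the AIC penalty (k = 2). Setting k = log(n) makes the same search use BIC instead; a minimal sketch (the BIC variant is an addition, not part of the original exercise):

# Backward elimination with the BIC penalty instead of AIC
model_step_b_bic <- step(fullmodel, direction='backward', k = log(nrow(train)))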

Forward Selection

model_step_f <- 
  step(nullmodel, 
       scope = list(lower = nullmodel, upper = fullmodel), 
       direction='forward')
Start:  AIC=-553.33
Spam ~ 1

             Df Sum of Sq    RSS     AIC
+ Hyperlinks  1    37.528 62.270 -739.99
+ Recipients  1     1.866 97.931 -558.88
+ Characters  1     0.872 98.925 -554.84
<none>                    99.798 -553.33

Step:  AIC=-739.99
Spam ~ Hyperlinks

             Df Sum of Sq    RSS     AIC
+ Recipients  1    3.6487 58.621 -762.15
+ Characters  1    0.6559 61.614 -742.23
<none>                    62.270 -739.99

Step:  AIC=-762.15
Spam ~ Hyperlinks + Recipients

             Df Sum of Sq    RSS     AIC
+ Characters  1   0.52274 58.099 -763.73
<none>                    58.621 -762.15

Step:  AIC=-763.73
Spam ~ Hyperlinks + Recipients + Characters
  1. Start: AIC=-553.33, Spam ~ 1. Adding ‘Hyperlinks’ decreases the AIC the most, so the algorithm adds ‘Hyperlinks’ first.
  2. Step: AIC=-739.99 shows the AIC for the model Spam ~ Hyperlinks. From this model, adding ‘Recipients’ decreases the AIC further, to -762.15, so ‘Recipients’ is added next.
  3. Likewise, in the third step, adding ‘Characters’ decreases the AIC to -763.73, so the final selected model has the lowest AIC and contains all three predictors: ‘Hyperlinks’, ‘Recipients’, and ‘Characters’.

Does the Forward selection give you the same selected model?
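
Note that the AIC values in this forward trace are negative and far from the logistic model's AIC of 361.62. That is because nullmodel was fit with lm, so step evaluates candidates with a least-squares (RSS-based) AIC rather than the binomial deviance. To run the forward search on the logistic model itself, a minimal sketch (null_glm and model_step_f_glm are hypothetical names; the resulting trace will differ from the one above):

# Null logistic model with the intercept only
null_glm <- glm(Spam ~ 1, data = train, family = binomial)

# Forward selection on the deviance-based AIC scale
model_step_f_glm <-
  step(null_glm,
       scope = list(lower = null_glm, upper = Logit_mod),
       direction = 'forward')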

Stepwise Selection

Ask Educate Us GPT

Show me sample code in R to apply the stepwise selection method to the model object 'Logit_mod'.

# Applying stepwise selection
stepwise_model <- step(Logit_mod, direction = "both")
Start:  AIC=361.62
Spam ~ Recipients + Hyperlinks + Characters

             Df Deviance    AIC
<none>            353.62 361.62
- Characters  1   357.32 363.32
- Recipients  1   374.71 380.71
- Hyperlinks  1   542.19 548.19

Does the stepwise selection give you the same selected model?
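
One way to answer this is to compare the selected models directly. A minimal sketch (model_step_f is excluded from the AIC comparison because it was fit by least squares, so its AIC is not on the same scale as the logistic fits):

# Formulas of the models chosen by backward and stepwise selection
formula(model_step_b)
formula(stepwise_model)

# AIC comparison of the two logistic fits
AIC(model_step_b, stepwise_model)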
