Evaluation metrics: ROC and AUC

Logistic regression and ROC

Practitioners often want to evaluate the overall quality of a classification model. However, metrics such as accuracy, precision, recall, and F1 score all depend on the choice of cut-off probability, which makes it difficult to assess a model's overall performance from any single cut-off.

Hence, two of the most popular tools are the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. Both evaluate the model's performance across all cut-off probabilities and give a more comprehensive picture of its predictive power.
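As a toy illustration (hypothetical numbers, not the credit data used below), the same predicted probabilities produce different confusion matrices, and hence different accuracy and recall, as the cut-off moves:

# Toy example with hypothetical probabilities and observed labels.
prob <- c(0.10, 0.25, 0.40, 0.55, 0.70, 0.90)
obs  <- c(0, 0, 1, 0, 1, 1)

# Accuracy and recall change as the cut-off probability changes.
for (cutoff in c(0.3, 0.5, 0.7)) {
  pred <- as.numeric(prob > cutoff)
  cat("cutoff =", cutoff,
      "| accuracy =", round(mean(pred == obs), 2),
      "| recall =", round(sum(pred == 1 & obs == 1) / sum(obs == 1), 2), "\n")
}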

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a graphical representation of a classification model's performance across different classification thresholds. It plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity) at various threshold settings.

True positive rate, \(\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}}\), is the proportion of actual positives correctly predicted as positive.

False positive rate, \(\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}}=1-\frac{\text{TN}}{\text{FP}+\text{TN}}\), is the proportion of actual negatives incorrectly predicted as positive.
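For instance, with hypothetical confusion-matrix counts at one fixed cut-off, TPR and FPR follow directly from the definitions above:

# Hypothetical confusion-matrix counts at a single cut-off
TP <- 150; FN <- 50; FP <- 80; TN <- 720

TPR <- TP / (TP + FN)   # sensitivity: 150/200 = 0.75
FPR <- FP / (FP + TN)   # 1 - specificity: 80/800 = 0.10
c(TPR = TPR, FPR = FPR)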

Example: Credit Card Default

We will use the Credit Card Default data for this lab and illustration. The details of the data can be found here. Think about what kind of factors could cause people to fail to pay their credit balance. This study emphasizes that accurately estimating the default probability is more valuable for risk management than a binary classification of clients as credible or not credible.

This dataset usually includes various features such as demographic information (age, gender, education), credit card usage details (credit limit, balance, payment history), and whether or not the cardholder defaulted on their payments.

# Import the credit data from a CSV file hosted online.
# read.csv() reads the data from the specified URL;
# header = TRUE indicates that the first row contains column names.
credit_data <- read.csv(file = "https://xiaorui.site/Data-Mining-R/lecture/data/credit_default.csv", header = TRUE)

A brief exploratory data analysis

Let's explore what information we have.

colnames(credit_data)
 [1] "LIMIT_BAL"                  "SEX"                       
 [3] "EDUCATION"                  "MARRIAGE"                  
 [5] "AGE"                        "PAY_0"                     
 [7] "PAY_2"                      "PAY_3"                     
 [9] "PAY_4"                      "PAY_5"                     
[11] "PAY_6"                      "BILL_AMT1"                 
[13] "BILL_AMT2"                  "BILL_AMT3"                 
[15] "BILL_AMT4"                  "BILL_AMT5"                 
[17] "BILL_AMT6"                  "PAY_AMT1"                  
[19] "PAY_AMT2"                   "PAY_AMT3"                  
[21] "PAY_AMT4"                   "PAY_AMT5"                  
[23] "PAY_AMT6"                   "default.payment.next.month"

Let's look at how many people actually defaulted in this sample.

table(credit_data$default.payment.next.month)

   0    1 
9368 2632 
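That is roughly a 22% default rate, so the classes are imbalanced (this matters for the PR curve later). The proportions can be checked directly:

# Proportion of defaulters vs. non-defaulters
prop.table(table(credit_data$default.payment.next.month))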

The name of the response variable is too long! Let's shorten it by renaming. Recall the rename() function from dplyr.

library(dplyr)
credit_data <- rename(credit_data, default=default.payment.next.month)

How about the variable type and summary statistics?

str(credit_data)    # structure - see variable type
summary(credit_data) # summary statistics

We see that all variables are int, but we know SEX, EDUCATION, and MARRIAGE are categorical, so we convert them to factor, the categorical variable type in R.

credit_data$SEX <- as.factor(credit_data$SEX)
credit_data$EDUCATION <- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE <- as.factor(credit_data$MARRIAGE)

We omit most of the EDA here (one quick example follows), but you shouldn't skip it in your own data analysis.
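For example, a boxplot (an illustrative sketch; the variable choice is arbitrary) can show whether defaulters tend to have lower credit limits:

# EDA sketch: distribution of credit limit by default status
boxplot(LIMIT_BAL ~ default, data = credit_data,
        xlab = "Default next month (0 = no, 1 = yes)",
        ylab = "Credit limit (LIMIT_BAL)")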

Fit logistic regression model

Randomly split the data to training (80%) and testing (20%) datasets:

# calling set.seed() before sampling would make the split reproducible
index <- sample(nrow(credit_data), nrow(credit_data)*0.80)
credit_train <- credit_data[index, ]
credit_test <- credit_data[-index, ]

# or use a more straightforward splitting
# credit_train <- credit_data[1:10000,]
# credit_test <- credit_data[10001:12000,]

Train a logistic regression model with all variables:

credit_logit <- 
  glm(default~., 
      family = binomial, 
      data = credit_train)
summary(credit_logit)

Call:
glm(formula = default ~ ., family = binomial, data = credit_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.358e-01  1.530e-01  -6.117 9.52e-10 ***
LIMIT_BAL   -8.148e-07  2.786e-07  -2.925  0.00344 ** 
SEX2        -1.251e-01  5.421e-02  -2.307  0.02104 *  
EDUCATION2  -1.165e-01  6.254e-02  -1.863  0.06246 .  
EDUCATION3  -1.515e-01  8.459e-02  -1.790  0.07339 .  
EDUCATION4  -1.378e+00  3.475e-01  -3.964 7.36e-05 ***
MARRIAGE2   -1.349e-01  6.122e-02  -2.204  0.02754 *  
MARRIAGE3   -7.853e-02  2.365e-01  -0.332  0.73987    
AGE          3.484e-03  3.313e-03   1.052  0.29291    
PAY_0        5.962e-01  3.143e-02  18.969  < 2e-16 ***
PAY_2        8.066e-02  3.581e-02   2.253  0.02428 *  
PAY_3        5.985e-02  3.936e-02   1.521  0.12838    
PAY_4        3.519e-02  4.430e-02   0.794  0.42694    
PAY_5        4.607e-02  4.702e-02   0.980  0.32721    
PAY_6       -4.688e-03  3.945e-02  -0.119  0.90539    
BILL_AMT1   -6.156e-06  1.973e-06  -3.120  0.00181 ** 
BILL_AMT2    5.871e-06  2.487e-06   2.361  0.01825 *  
BILL_AMT3   -4.691e-07  2.139e-06  -0.219  0.82643    
BILL_AMT4   -3.080e-06  2.363e-06  -1.304  0.19238    
BILL_AMT5    2.926e-06  2.773e-06   1.055  0.29140    
BILL_AMT6   -2.096e-07  2.257e-06  -0.093  0.92602    
PAY_AMT1    -1.061e-05  3.341e-06  -3.177  0.00149 ** 
PAY_AMT2    -5.463e-06  3.126e-06  -1.748  0.08052 .  
PAY_AMT3    -1.475e-06  2.838e-06  -0.520  0.60315    
PAY_AMT4    -2.013e-06  2.670e-06  -0.754  0.45088    
PAY_AMT5    -1.901e-06  3.123e-06  -0.609  0.54265    
PAY_AMT6    -2.335e-06  1.992e-06  -1.172  0.24102    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10189.5  on 9599  degrees of freedom
Residual deviance:  8941.7  on 9573  degrees of freedom
AIC: 8995.7

Number of Fisher Scoring iterations: 5

You have seen glm() before. In this lab, it is the main function used to build the logistic regression model, because logistic regression is a member of the generalized linear model family. In glm(), the only new argument is family, which specifies the distribution of your response variable. You may also specify the link function after the name of the distribution, for example, family=binomial(logit) (the default link is logit). You can likewise specify family=binomial(link = "probit") to run a probit regression. glm() can also be used to build many other generalized linear models.
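For example, a probit fit on the same data would look like this (a sketch; the rest of the lab continues with the logit model):

# Probit regression: same linear predictor, probit link instead of logit
credit_probit <- glm(default ~ .,
                     family = binomial(link = "probit"),
                     data = credit_train)
summary(credit_probit)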

In-sample AUC-ROC

Next, we predict the default probability for the training set (also called in-sample prediction).

install.packages('ROCR')  # run once if ROCR is not yet installed
# Predicted probabilities for the training set
pred_train <- predict(credit_logit, type="response")

library(ROCR)
# Create a prediction object 'pred_obj_train' using the 'prediction()'
# function from the 'ROCR' package.
pred_obj_train <- prediction(pred_train, credit_train$default)

# This line of code calculates True Positive Rate (TPR) and False Positive Rate
# (FPR) based on the prediction object 'pred_obj_train' using the 'performance()'
# function from the 'ROCR' package. 
# It stores the resulting performance object in the variable 'perf_train'.
perf_train <- performance(pred_obj_train, "tpr", "fpr")

# Retrieve the False Positive Rate (FPR) values from the 'perf_train' object.
# The 'attr()' function accesses the attributes of the performance object;
# "x.values" holds the FPR values (the x-axis of the ROC curve).
attr(perf_train, "x.values")[[1]][1:10]
 [1] 0.0000000000 0.0000000000 0.0000000000 0.0001340662 0.0002681325
 [6] 0.0004021987 0.0004021987 0.0005362649 0.0005362649 0.0005362649
# Retrieve the cut-off values (thresholds) from the 'perf_train' object.
# Similarly, 'attr()' accesses the attributes of the performance object;
# "alpha.values" holds the cut-off probabilities used to compute each point.
attr(perf_train, "alpha.values")[[1]][1:10]
 [1]       Inf 0.9931485 0.9928588 0.9923422 0.9921931 0.9921249 0.9911484
 [8] 0.9846340 0.9784866 0.9770306
# This line of code generates a plot of the ROC curve using the performance object 'perf_train'. 
# The 'colorize=TRUE' argument adds color to the plot for better visualization.
plot(perf_train, colorize=TRUE)

# This line of code calculates the Area Under the ROC Curve (AUC) using the 'performance()' function with "auc" as the argument. 
# It extracts the AUC value from the resulting performance object and returns it as
# a numeric value.
# The 'unlist()' function is used to convert the AUC value from a list to a numeric vector.
unlist(slot(performance(pred_obj_train, "auc"), "y.values"))
[1] 0.7270473

Be careful: the function prediction() is different from predict(). It comes from the ROCR package and is used specifically to prepare for the ROC curve. Recall from our lecture that this function essentially calculates many confusion matrices with different cut-off probabilities. Therefore, it requires two vectors as inputs: the predicted probabilities and the observed responses (0/1).

The next line, performance(), calculates the True Positive Rate (TPR) and False Positive Rate (FPR) from all the confusion matrices obtained in the previous step. Then you can simply draw the ROC curve, which plots TPR (y-axis) against FPR (x-axis).

The last line calculates the Area Under the Curve (AUC). I recommend grouping these four lines of code together and reusing them whenever you need the ROC curve and AUC, as in the sketch below. If you don't want to draw the ROC curve (because it takes time), you can simply comment out the plot line.
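A minimal wrapper along these lines (a hypothetical helper, not part of ROCR) keeps the steps together:

# Hypothetical helper: ROC curve and AUC from predicted probabilities
# and observed 0/1 responses, using ROCR.
get_roc_auc <- function(prob, obs, draw = TRUE) {
  pred_obj <- prediction(prob, obs)
  if (draw) {
    plot(performance(pred_obj, "tpr", "fpr"), colorize = TRUE)
  }
  unlist(slot(performance(pred_obj, "auc"), "y.values"))
}

# Usage: get_roc_auc(pred_train, credit_train$default)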

Out-of-sample AUC-ROC (more important)

# Generate predictions for the test dataset using the fitted logistic
# regression model 'credit_logit'.
# 'newdata = credit_test' points predict() at the test dataset, and
# 'type="response"' returns the predicted probabilities of default.
pred_test <- predict(credit_logit, newdata = credit_test, type="response")

The following code draws the ROC curve and computes the out-of-sample AUC.

# Create a prediction object 'pred_obj_test' using predicted probabilities 'pred_test' 
# and actual responses 'credit_test$default'.
pred_obj_test <- prediction(pred_test, credit_test$default)

# Calculate True Positive Rate (TPR) and False Positive Rate (FPR) using the prediction object.
perf_test <- performance(pred_obj_test, "tpr", "fpr")

# Plot the ROC curve based on TPR and FPR, with colorization for better visualization.
plot(perf_test, colorize=TRUE)

# Retrieve the Area Under the ROC Curve (AUC) from the performance object and convert
# it into a numeric value.
unlist(slot(performance(pred_obj_test, "auc"), "y.values"))
[1] 0.7383772

(Optional) Precision-Recall Curve


Why should one use the Precision-Recall Curve rather than the ROC curve?

The Precision-Recall curve is particularly useful when dealing with imbalanced datasets, where one class significantly outnumbers the other. In such cases, the PR curve provides better insight into the model's performance than the ROC curve.
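One way to see why: on a PR curve, a no-skill classifier sits at a precision equal to the positive-class prevalence (about 0.22 in this sample), whereas on a ROC curve the no-skill baseline is the diagonal regardless of imbalance. The PR baseline can be computed directly:

# Baseline precision for the PR curve = prevalence of the positive class
mean(credit_train$default)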

We use the PRROC package to draw the PR curve; it can also draw the ROC curve. More details of the package can be found here.

install.packages("PRROC")
# Importing necessary library for PRROC package.
library(PRROC)

# Splitting predicted scores for default=1 and default=0 for training dataset.
score1_train <- pred_train[credit_train$default==1]
score0_train <- pred_train[credit_train$default==0]

# Calculating ROC curve and AUC for in-sample prediction.
roc <- roc.curve(score1_train, score0_train, curve = TRUE)
roc$auc
[1] 0.7270473
# Calculating precision-recall curve for in-sample prediction.
pr <- pr.curve(score1_train, score0_train, curve = TRUE)
pr

  Precision-recall curve

    Area under curve (Integral):
     0.5114052 

    Area under curve (Davis & Goadrich):
     0.5114005 

    Curve for scores from  0.0003282111  to  0.9931485 
    ( can be plotted with plot(x) )
# Plotting the precision-recall curve for in-sample prediction.
plot(pr, main="In-sample PR curve")

# Out-of-sample prediction: 
# Splitting predicted scores for default=1 and default=0 for test dataset.
score1_test <- pred_test[credit_test$default==1]
score0_test <- pred_test[credit_test$default==0]

# Calculating ROC curve and AUC for out-of-sample prediction.
roc_test <- roc.curve(score1_test, score0_test, curve = TRUE)
roc_test$auc
[1] 0.7383772
# Calculating precision-recall curve for out-of-sample prediction.
pr_test <- pr.curve(score1_test, score0_test, curve = TRUE)
pr_test

  Precision-recall curve

    Area under curve (Integral):
     0.5127965 

    Area under curve (Davis & Goadrich):
     0.5127492 

    Curve for scores from  3.566311e-06  to  0.9943394 
    ( can be plotted with plot(x) )
# Plotting the precision-recall curve for out-of-sample prediction.
plot(pr_test, main="Out-of-sample PR curve")

Summary

Things to remember

  • Know how to use glm() to build a logistic regression model;

  • Know how to get the ROC curve and AUC from predicted probabilities;

  • Know how to get the PR curve and its AUC from predicted probabilities.
