Model Selection By Performance Comparison

Ruchir Kakkad
7 min read · Nov 30, 2018

For those who are just starting their journey into machine learning, learning about all the models and the differences between them can seem daunting. After we learn how a model works and how to use it, we need to step into the real world and apply what we have learned to real problems. With real-world problems we are not handed a specific model to tune; we have to evaluate several models and then decide which one to use.

We have a problem:

When we start to think about solving a problem in Machine Learning, we have to take many points into consideration:

  1. Problem Definition
  2. Availability of Dataset
  3. Expected solutions
  4. Business constraints

Given all of the above information, our job as an AI developer is to design and train a model on the given dataset to get the expected results. Building a model can be straightforward once we know which model we want to build, and there are a ton of resources out there to help us. One of the best Python libraries for building simple machine learning models is scikit-learn (scikit-learn.org). With scikit-learn the amount of code we have to write is reduced significantly. We can find implementations of almost all of the popular models there, so our job is reduced to picking the best model for our task and tuning a few parameters. This can be done with only a few lines of code.
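To make this concrete, here is a minimal sketch of the scikit-learn workflow. The dataset (scikit-learn's built-in Iris data) and the model choice are placeholders for illustration, not the setup used later in this article:

```python
# A minimal scikit-learn workflow: split, fit, predict.
from sklearn.datasets import load_iris  # stand-in dataset for illustration
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # train on the training split
predictions = model.predict(X_test)  # predict on unseen points
```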

Our goal is to measure the performance of different classification models and try to pick the best one for our task. For this purpose we will see how well our models perform on the same dataset.

The dataset:

The dataset we chose is the Car Evaluation Data Set from Kaggle. The data consists of a number of features such as:

  • relative buying cost
  • maintenance cost
  • number of doors
  • boot space, etc.

Our task is to predict how acceptable a given car is.

There are 4 categories that a car may fall into: unacceptable, acceptable, good, and very good, so this is a classification problem with 4 classes. Now we will try to pick the model that best predicts which of these classes a car belongs to.
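Here is a sketch of how the data might be loaded and encoded. The file name and column names are assumptions based on the UCI version of the dataset, so adjust them to your copy:

```python
# Loading and encoding the Car Evaluation data.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical local file name; the UCI column order is assumed.
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv("car_evaluation.csv", names=cols)

# All features are categorical strings, so one-hot encode them.
X = pd.get_dummies(df.drop(columns="class"))

# Encode the 4 target classes (unacc, acc, good, vgood) as integers.
y = LabelEncoder().fit_transform(df["class"])
```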

How do we pick the correct model?

There are many ways in which we can measure how our model is performing. The core idea of performance measurement is how accurately our model can predict on data points it has not seen. There are other considerations like latency, memory usage, etc., but we will only deal with the prediction accuracy of a model in this article.

The metrics we choose to evaluate in this article are:

  1. Accuracy
  2. Precision
  3. Recall
  4. F1 Score
  5. ROC (AUC)
  6. Log Loss

Each of these metrics gives us a different sense of performance. Though a good model performs well on most of them, we should choose a metric based on our problem statement.

Confusion Matrix: To understand most of these metrics we first need to understand what a Confusion Matrix is. Let’s assume we have a 2-class classification problem; we call one class ‘Yes’ and the other ‘No’. Our test data consists of points with their actual classes, and we also have the model’s predicted classes for the same points.

[Image: Confusion matrix for a dataset with 165 points. Source: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/]

TN — True Negatives (points that are actually No and were correctly predicted No)
TP — True Positives (points that are actually Yes and were correctly predicted Yes)
FN — False Negatives (points that are actually Yes but were incorrectly predicted No)
FP — False Positives (points that are actually No but were incorrectly predicted Yes)
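With scikit-learn the confusion matrix takes one call. This sketch assumes y_test and predictions from a fitted model, as in the earlier snippet:

```python
# Computing a confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)
print(cm)  # rows are actual classes, columns are predicted classes
```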

Let’s take a look at what these metrics mean:

Let n be the total number of points in our test dataset.

1. Accuracy: Accuracy measures the fraction of predictions the model got right, i.e. the number of correct predictions divided by the total number of predictions on the test dataset. [source: https://en.wikipedia.org/wiki/Accuracy_and_precision]

Accuracy = (number of correct predictions) / n

In terms of the Confusion Matrix this becomes:
Acc = (TP + TN) / n

2. Precision: Of all the points we predicted as Yes, the fraction that are actually Yes.
Precision = TP / predicted Yes = TP / (TP + FP)

3. Recall: Of all the points that are actually Yes, the fraction we correctly predicted as Yes.
Recall = TP / actual Yes = TP / (TP + FN)

4. F1 Score: The F1 score takes into account both Precision and Recall; it is their harmonic mean.
F1 = 2 * (Precision * Recall) / (Precision + Recall)

5. ROC (AUC): The ROC curve is a plot of FPR (x-axis) against TPR (y-axis) at different classification thresholds; AUC is the area under this curve. Where:
FPR = FP rate = FP / actual No = FP / (FP + TN)
TPR = TP rate = TP / actual Yes = TP / (TP + FN)

6. Log Loss: Log loss measures how far the predicted class probabilities are from the actual labels, penalizing confident wrong predictions heavily. For more info go to: http://wiki.fast.ai/index.php/Log_Loss
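As a sketch, all six metrics can be computed with scikit-learn. For our 4-class problem, precision, recall, and F1 need an averaging strategy and ROC AUC needs predicted probabilities; the 'weighted' and 'ovr' settings below are illustrative choices, not necessarily the ones behind the results shown later:

```python
# Computing the six metrics for a fitted multi-class model.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

probs = model.predict_proba(X_test)  # class probabilities, needed below

print("Accuracy :", accuracy_score(y_test, predictions))
print("Precision:", precision_score(y_test, predictions, average="weighted"))
print("Recall   :", recall_score(y_test, predictions, average="weighted"))
print("F1 Score :", f1_score(y_test, predictions, average="weighted"))
print("ROC AUC  :", roc_auc_score(y_test, probs, multi_class="ovr"))
print("Log Loss :", log_loss(y_test, probs))
```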

Testing our Models:

Now that we have defined our metrics let us try and see what results we get on different models.

We will use the following models in our analysis:

  1. Logistic Regression
  2. Decision Trees
  3. Random Forest
  4. K Nearest Neighbors
  5. Naive Bayes (Multinomial for multiple classes)

Let’s try them one by one.
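One way to run all five models on the same train/test split is a simple loop. The constructors below mostly use default parameters, and X_train, X_test, y_train, y_test are assumed from the earlier split:

```python
# Fit each model on the same split and print its accuracy and
# confusion matrix.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB  # needs non-negative features,
                                               # which one-hot encoding gives us
from sklearn.metrics import accuracy_score, confusion_matrix

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": MultinomialNB(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
```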

Logistic Regression:

[Image: Confusion matrices]

Decision Tree:

[Image: Confusion matrices]

Random Forest:

[Image: Confusion matrices]

K Nearest Neighbors:

[Image: Confusion matrices]

Naive Bayes:

[Image: Confusion matrices]

Comparing all of our metrics, we get graphs like these:

[Image: Comparison graphs of each metric across the five models]
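A sketch of how one such comparison chart could be produced with matplotlib, assuming the fitted models dict and test split from the loop above:

```python
# Bar chart comparing one metric (accuracy) across the models.
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

results = {name: accuracy_score(y_test, clf.predict(X_test))
           for name, clf in models.items()}

plt.bar(list(results.keys()), list(results.values()))
plt.ylabel("Accuracy")
plt.title("Model comparison")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()
```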

Observations:

From the data we collected by running different models on our dataset, we can see that Accuracy, Precision, Recall, and F1 scores all track each other closely, which means that in our case we can use any of them to evaluate our models.

The highest-performing model on these metrics is the simple Decision Tree, with Random Forest and Logistic Regression competing for second place.

The Confusion Matrices give us a clearer picture of the performance. We want high numbers on the diagonal; high numbers off the diagonal mean classification errors. We can see here that Decision Trees make the fewest misclassifications.

The situation is very different if we look at the log loss comparisons. Here Decision Trees perform much worse than all the other models, while Logistic Regression gives us the best results.

These discrepancies go against our intuition, as we expect more accurate models to have lower losses. A likely cause is that a fully grown decision tree outputs hard 0-or-1 probabilities from its leaves, so every misclassification incurs close to the maximum log loss penalty, while logistic regression outputs softer, better-calibrated probabilities. A toy demonstration of this effect follows.
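The numbers below are made up purely for illustration; they show how a single confident wrong prediction dominates the loss:

```python
# Why hard 0/1 predictions hurt log loss.
import numpy as np

def toy_log_loss(y_true, p, eps=1e-15):
    # Binary log loss with probability clipping to avoid log(0),
    # similar to what scikit-learn does internally.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 1, 1, 0])
hard = np.array([1.0, 1.0, 0.0, 0.0])  # tree-like probabilities, one wrong
soft = np.array([0.9, 0.9, 0.4, 0.1])  # softer probabilities, same mistake

print(toy_log_loss(y_true, hard))  # ~8.6: the confident wrong 0.0 dominates
print(toy_log_loss(y_true, soft))  # ~0.3: the same mistake costs far less
```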

But for our case the Decision Tree classifier seems to perform best. If we wanted to play it safe (which is usually a good strategy in real life) we would choose either Random Forest or Logistic Regression.

Caution:

While looking at these results we need to keep a few things in mind:

  1. Our dataset contains only about 1,700 points, which is hardly enough to train our models satisfactorily. In real life we will have much larger datasets, and hence we may see a different spread of performance across models.
  2. We have only tuned our models slightly. We will need to evaluate our models with more rigorous tuning to decide which one best suits our data. We can use scikit-learn’s model selection utilities (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) to find the best parameters for our models, as sketched below.
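A minimal sketch of such tuning with GridSearchCV; the parameter grid here is illustrative, not a recommendation:

```python
# Cross-validated grid search over a few decision tree parameters.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [3, 5, 10, None],
              "min_samples_split": [2, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # its mean cross-validated score
```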

Conclusion:

We can use different metrics to compare different models and select the one that best suits our needs. These metrics are important, but there are other considerations as well: latency requirements, minimum accuracy requirements, extra emphasis on reducing False Negatives, and so on.

We hope this has helped you select the best model based on performance measurement.


Ruchir Kakkad

Ruchir Kakkad is a co-founder of WebOccult Technologies. He has 10+ years of diversified experience in the areas of innovation and transformation.