This post covers AI classification and gives a brief introduction to decision trees (which will be explored in more detail in a future post). The idea is to give a good overall picture of what classification is in AI.

The topics we are going to look at in this post are:

  • Classification Algorithms
  • Data Preparation
  • Scikit-Learn
  • Model Evaluation
    • Overfitting & Underfitting
    • Hyperparameters and Cross-Validation
    • Metrics
  • Decision Tree Introduction

Classification is a supervised learning technique: a model is trained on an existing dataset (where the output is known) in order to predict the output for future data (where the output is not known). This is one of the most common types of problem tackled in machine learning.

The aim of classification is to predict a “target feature” of a dataset using the other features in the dataset. An example would be a bank trying to predict whether a customer will leave based on their income, where they live, how old they are, etc. This kind of problem, where the target is one of two values (true or false, 0 or 1), is called binary classification. It differs from regression, which aims to predict a numerical value, e.g. how much someone earns in a year. If our target attribute has more than two possible values then the problem is called multi-class, e.g. MNIST digit recognition.

In some cases we may be trying to predict several features at the same time, in which case we would call it a multi-label or multi-output problem.


Classification Algorithms

  • Binary Algorithms
    • Support Vector Machine
    • Logistic Regression
    • Stochastic Gradient Descent
  • Multiclass Algorithms
    • Naive Bayes
    • Decision Trees

Binary algorithms can also be used for multi-class problems with M labels, using one of two strategies (a Scikit-Learn sketch follows the list below).

  • One v One strategy: trains one binary predictor for every pair of labels, i.e. \frac{M(M-1)}{2} predictors, each trained on roughly \frac{2N}{M} instances (for a balanced problem with N instances).
    • The final prediction is the class that wins the most “duels”.
  • One v All strategy: trains one binary predictor to distinguish each class from all of the others, i.e. examples of 0 versus examples of non-0.
    • M predictors each trained on a full dataset.
    • Final prediction is the class with the highest score (requires the model to output a probability / score).
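As a minimal sketch of the two strategies (using a linear SVM as the underlying binary predictor and the 10-class digits dataset that ships with Scikit-Learn; both choices are just for illustration):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One v One: M(M-1)/2 = 45 binary classifiers, one per pair of digits.
ovo = OneVsOneClassifier(LinearSVC()).fit(X_tr, y_tr)

# One v All: M = 10 binary classifiers, one per digit vs. all of the others.
ova = OneVsRestClassifier(LinearSVC()).fit(X_tr, y_tr)

print(ovo.score(X_te, y_te), ova.score(X_te, y_te))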

Some algorithms provide an indication of the “confidence” of the prediction (i.e. providing a probability for each class rather than simply the class value).


Data Preparation

Splitting the data into training and testing sets – The dataset is split into two parts, a training set and a testing set. The testing set represents “unseen” data that will be used to evaluate the trained model (how good it is at predicting on new, unseen data).

Data cleaning (imputation) – Deal with entries containing missing data (i.e. null values): only some algorithms can work with missing values directly, while others require us to drop those entries or fill in (impute) the missing values.

Data encoding & scaling – Categorical data needs to be encoded using ordinal or one-hot encoding (see the example below). Some algorithms, such as decision trees, can work directly with categorical data (although Scikit-Learn doesn’t support this). Scaling (standardising) numerical features can also be helpful for numerical methods (e.g. PCA, SVM) but is not necessary for decision trees (which will be covered in a future post). For some classification problems the data may be unbalanced, and we may need to consider undersampling or oversampling, e.g. using the imbalanced-learn Python package.

// One-hot encoding example for “cat”, “dog” & “rabbit”
// 1,0,0 (cat)
// 0,1,0 (dog)
// 0,0,1 (rabbit)
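A quick sketch of producing that encoding with Scikit-Learn’s OneHotEncoder (the animal values are just the toy example above):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

animals = np.array([["cat"], ["dog"], ["rabbit"]])
encoder = OneHotEncoder()
print(encoder.fit_transform(animals).toarray())
> [[1. 0. 0.]
>  [0. 1. 0.]
>  [0. 0. 1.]]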

Pipelines – Data preparation tasks are often composed into pipelines, making it easier to repeat and tune the whole process. A pipeline is usually a sequence of transformers, each transforming the input for the next transformer (or for the final predictor).
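As a sketch of what such a pipeline might look like (the column names income, age and country are hypothetical, purely to illustrate the idea):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Impute and scale the numeric columns, one-hot encode the categorical ones.
preprocessing = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

model = Pipeline([
    ("prep", preprocessing),
    ("classifier", LogisticRegression()),
])

# model.fit(X_train, y_train) would impute, scale, encode and train in one call.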


Scikit-Learn

Estimators – The fit() method learns parameters from data, i.e. fit(x) for unsupervised learning and fit(x, y) for supervised learning. Hyperparameters, which are normally set in the constructor, can be accessed via get_params() and set_params() or as public instance variables.

// Hyperparameters are configuration settings external to the model whose values cannot be learned from the data itself (we will look at this again in the Model Evaluation section).
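For example (a LogisticRegression is used here purely to illustrate the estimator interface):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=0.5)   # hyperparameter set in the constructor
print(clf.get_params()["C"])      # 0.5
clf.set_params(C=1.0)             # change it before (re)fitting
print(clf.C)                      # 1.0 – also available as a public instance variable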

Transformers – Estimators that can also transform data using a transform(x) method, e.g. pca.fit(x) identifies the PCA projection hyperplane and X_transf = pca.transform(x) applies that projection to x. pca.fit_transform(x) performs both steps and can be more efficient.

// PCA stands for Principal Component Analysis and is used to reduce the dimensionality of data.
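A small sketch of the transformer interface, using PCA on the iris dataset (the dataset choice is just for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
pca.fit(X_iris)                       # learns the projection hyperplane
X_transf = pca.transform(X_iris)      # applies the projection
X_transf = pca.fit_transform(X_iris)  # both steps in one (often more efficient) call
print(X_transf.shape)
> (150, 2)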

Predictors – Estimators that provide a predict(x) method, returning a predicted class for each observation.

Pipelines – Pipelines allow us to chain together a set of transformers and a final predictor. pipeline.fit(x) invokes fit_transform() on each intermediate step and fit() on the last estimator; pipeline.predict(x) invokes transform() on each step and predict() on the last predictor. This allows the hyperparameters of the different estimators to be optimised across the entire pipeline.

from sklearn.datasets import fetch_openml
cs = fetch_openml("credit", version=1)

// At this point, cs will contain two objects: a “data” object containing all of the features that we are going to use to make our predictions, and a “target” object containing the feature that we are trying to predict.

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(cs.data, cs.target)

classifier.predict([[.9, 40, 0, .9, 5000, 5, 0, 3, 0, 4]])
> array([1.])

classifier.predict_proba([[.9, 40, 0, .9, 5000, 5, 0, 3, 0, 4]])
> array([[0.48540176, 0.51459824]])
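// predict_proba() returns one probability per class (the values in a row sum to 1), whereas predict() simply returns the class with the highest probability.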

Model Evaluation

Model evaluation is where we measure how well the resulting model performs. We can first try the model on the training examples (as we already know their labels) to see how accurate it is.

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()

tree.fit(cs.data, cs.target)
tree.score(cs.data, cs.target)
> 1.0

The result here is what we call the training accuracy and is not a good representation of the accuracy of our model, because we are testing the model on the same data that we trained it with. To fix this we split our data into a training set and a testing set. That way, once we have trained our model, we can use the “unseen” test data to evaluate it.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cs.data, cs.target)

tree.fit(X_train, y_train)
print(tree.score(X_train, y_train))
print(tree.score(X_test, y_test))
> 1.0
> 0.67643492843

// This shows that the model is overfitting the data as there is a big difference between the training and testing accuracy.

Overfitting & Underfitting

Models which perform much worse on unseen data than on the training data are said to be overfitting, i.e. they are memorising the training set rather than learning patterns that generalise to new inputs.

To prevent overfitting we have two options:

  • Select a simpler model or constrain the model (called regularisation).
  • Get more data.

The opposite (underfitting) occurs when the model is not powerful enough to properly learn the training data.

Options to prevent underfitting are:

  • Select a more complex model or reduce regularisation.
  • Add/improve features to make it easier for the model to fit to the data.

A model generalisation error can be split into three components:

  • bias: due to wrong assumptions in the model (e.g. trying to fit quadratic data with a linear model).
  • variance: excessive sensitivity to small variations in the training data. Typical of complex models such as high-degree polynomials or neural networks with many hidden layers.
  • irreducible error: Due to noise in the data.

Increased model complexity will reduce bias and increase variance, hence the bias-variance tradeoff.

Hyperparameters & Cross-Validation

An example of a hyperparameter is setting max_depth or max_leaf_nodes on a DecisionTreeClassifier.

tree.max_depth = 3
tree.fit(X_train, y_train)

print(tree.score(X_train, y_train))
print(tree.score(X_test, y_test))
> 0.74494725464782
> 0.74957361893742

// This is an example of setting a hyperparameter (notice that the training and testing scores are now much closer together).

It is important to keep in mind that test data should not be used to tune hyperparameters; instead they should be evaluated using a portion of the training data called a validation set. To avoid sacrificing a big portion of the training set for validation, we can keep a small set for validation (e.g. 10%) and perform the training 10 times, each time holding out a different 10% for validation. This is known as k-fold cross-validation (here with k = 10); a sketch using Scikit-Learn is shown below.

Credit: Matteo Migliavacca (University of Kent)
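A sketch of this in Scikit-Learn, using cross_val_score for the 10-fold procedure described above and GridSearchCV to tune max_depth the same way (both applied only to the training data):

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross-validation: each fold takes a turn as the validation set.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X_train, y_train, cv=10)
print(scores.mean())

# Grid search combines cross-validation with a search over hyperparameter values,
# so the test set is only used once, at the very end.
search = GridSearchCV(DecisionTreeClassifier(), param_grid={"max_depth": [2, 3, 5, 10]}, cv=10)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))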

Metrics

Accuracy: calculated as the number of correct predictions divided by the total number of predictions.

For a binary classifier, accuracy can be calculated as \frac{TP+TN}{TP+TN+FP+FN}

// Where TP is true positives, TN is true negatives, FP is false positives and FN is false negatives.

This isn’t always the best metric for evaluating a model. Some classes may be more important than others, e.g. for self-driving cars recognising a pedestrian is more important than recognising a street sign. Some types of error may be more costly than others, e.g. false positives versus false negatives in medical tests. Some datasets may also be unbalanced: a classifier that always predicts negative will be correct 99% of the time on a dataset that has 99% negative labels.

Alternative metrics include:

  • Precision: How often the classifier is correct when it makes a positive prediction. In binary classification we usually concentrate on the positive class, so \frac{TP}{TP+FP}.
  • Recall: How often instances belonging to a class are picked up by the classifier.
    • For the positive class (sensitivity): \frac{TP}{P} = \frac{TP}{TP+FN}.
    • For the negative class (specificity): \frac{TN}{N} = \frac{TN}{TN+FP}.
  • F1 Score: The harmonic mean of precision and recall, \frac{TP}{TP+(FP+FN)/2}.

For an unbalanced dataset we look at balanced accuracy, which is the average recall across classes (unweighted); for balanced classes the normal accuracy score is fine. For binary problems balanced accuracy is the average of sensitivity and specificity.
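These metrics are all available in sklearn.metrics. A sketch using the tree and test split from earlier (the binary scores assume the positive class is encoded as 1, Scikit-Learn’s default pos_label):

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

y_pred = tree.predict(X_test)

print(confusion_matrix(y_test, y_pred))         # [[TN, FP], [FN, TP]] for a binary problem
print(accuracy_score(y_test, y_pred))           # (TP + TN) / (TP + TN + FP + FN)
print(precision_score(y_test, y_pred))          # TP / (TP + FP)
print(recall_score(y_test, y_pred))             # TP / (TP + FN), i.e. sensitivity
print(f1_score(y_test, y_pred))                 # harmonic mean of precision and recall
print(balanced_accuracy_score(y_test, y_pred))  # average recall across the classes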


Decision Trees

The decision tree is the “Swiss Army knife” of classification. It works by recursively splitting the data on an attribute-condition-value so that the resulting sub-spaces are more pure (homogeneous) than the initial dataset. Each split represents an internal node of the decision tree, and the prediction is the most common class found in the leaf an instance ends up in.
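As a small sketch (reusing the training split from earlier), we can fit a shallow tree and print the splits it learned with export_text:

from sklearn.tree import DecisionTreeClassifier, export_text

shallow_tree = DecisionTreeClassifier(max_depth=2)
shallow_tree.fit(X_train, y_train)

# Each internal node is an attribute-condition-value split; each leaf predicts
# the most common class among the training instances that reach it.
print(export_text(shallow_tree))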

Points to note with decision trees:

  • Fast to compute as they are greedy and decisions are made at each split.
  • Easy to interpret.
  • Non-parametric: A tree can adapt to the complexity of a decision boundary but can overfit.
    • You can specify max depth/leaves or min instances/gain to split.
  • Orthogonal decision boundaries.
  • Constant piece-wise prediction (poor extrapolators).
  • Can support datasets with missing values and categorical data (although Scikit-Learn doesn’t support this).

That brings us to the end of this post on AI classification using Scikit-Learn.
