Cost functions in machine learning

Cost functions are a critical component of machine learning models. They allow models to measure how well they are estimating the true values and enable optimization of the model parameters. This article will explain what cost functions are, outline the most common types, and describe their role in optimizing machine learning models.

A cost function, also called a loss function or objective function, is used in machine learning to quantify the difference between a model's predictions and the true target values it is trying to predict. The terms are often used interchangeably, although "loss" sometimes refers to the error on a single example and "cost" to the average over a dataset. A cost function measures how well the model is performing and allows it to be improved through optimization. Choosing the right cost function for a given machine learning task is crucial: it defines the landscape that is traversed during training to fit the model parameters.

What Are Cost Functions?

Formally, a cost function maps a set of predictions and a set of true target values to a cost value that quantifies the model's error. The goal is to minimize this cost, which can be done by changing the model parameters, like weights and biases.

Cost functions take in the predicted values from the machine learning model and compare them to the known true target values. They then output a single number that summarizes the model's prediction error. The lower the cost, the more closely the model's estimates match the ground truth.

Since the cost function outputs a quantitative value, it can be used with optimization algorithms like gradient descent to iteratively improve the model by modifying its parameters to reduce cost. This process is what "trains" the machine learning model to make better predictions.

Different Types of Cost Functions

There are various cost functions that are commonly used in different machine learning models and tasks:

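Mean Squared Error (MSE) - One of the most widely used cost functions for regression, MSE computes the average squared difference between the predicted values and the true values. It is easy to implement and, being smooth and differentiable everywhere, easy to optimize. Because errors are squared, large errors dominate the total, which makes MSE sensitive to outliers. MSE is the standard choice for regression problems such as linear regression. In the formula below, N is the number of examples, and the two terms inside the sum are the true value and the predicted value for example i.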

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
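As a minimal sketch of the formula above, MSE can be computed in plain Python. The function name `mean_squared_error` and the sample values are illustrative assumptions, not part of any particular library or dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Illustrative values (assumed for this sketch)
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse)  # squared errors 1, 0, 4, 1 -> 6 / 4 = 1.5
```

Note how the single error of size 2 contributes 4 of the 6 units of total squared error.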

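Mean Absolute Error (MAE) - MAE calculates the average absolute difference between predictions and actual values. Because it doesn't square the errors, it is less sensitive to outliers than MSE. MAE is expressed in the same units as the target, so it reads directly as the typical size of an error. Its drawback for optimization is that the absolute value is not differentiable where the error is exactly zero.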

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
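The difference in outlier sensitivity can be seen in a small comparison. This is a sketch with made-up numbers: the helper names and the two prediction lists are assumptions chosen so that only one prediction changes between the two cases.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values (assumed for this sketch)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred_clean = [2.0, 3.0, 4.0, 5.0]     # every error is 1
y_pred_outlier = [2.0, 3.0, 4.0, 14.0]  # the last error is 10

mae_clean = mean_absolute_error(y_true, y_pred_clean)
mse_clean = mean_squared_error(y_true, y_pred_clean)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)
mse_outlier = mean_squared_error(y_true, y_pred_outlier)

print(mae_clean, mse_clean)      # 1.0 1.0
print(mae_outlier, mse_outlier)  # 3.25 25.75
```

One bad prediction roughly triples the MAE but multiplies the MSE by more than 25.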

Hinge Loss - Used for classification, hinge loss penalizes predictions that are wrong, and also predictions that are correct but fall inside the margin. It is primarily used in Support Vector Machines (SVMs), supervised learning algorithms for classification (with variants for regression). The objective of an SVM is to find the hyperplane that maximizes the margin between the classes, where the margin is the distance from the hyperplane to the closest training points (the "support vectors") of each class. In the binary form below, the true label y takes the value -1 or +1 and the prediction is the model's raw score rather than a probability; the loss is zero once their product is at least 1.

$$\text{Hinge Loss} = \max(0, 1 - y \cdot \hat{y})$$
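A minimal sketch of the binary hinge loss follows. It assumes labels encoded as -1 or +1 and a raw, unbounded model score; the function name `hinge_loss` and the example scores are illustrative.

```python
def hinge_loss(y, score):
    """Binary hinge loss for a label y in {-1, +1} and a raw model score."""
    if y not in (-1, 1):
        raise ValueError("y must be -1 or +1")
    return max(0.0, 1.0 - y * score)


# Illustrative scores (assumed for this sketch)
confident_correct = hinge_loss(1, 2.5)  # outside the margin -> 0.0
inside_margin = hinge_loss(1, 0.5)      # correct side, inside the margin -> 0.5
wrong_side = hinge_loss(-1, 0.5)        # wrong side of the boundary -> 1.5

print(confident_correct, inside_margin, wrong_side)
```

The middle case shows the margin at work: the prediction has the right sign, yet it still incurs a loss because the score is less than 1.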

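Cross-Entropy Loss - Measures the divergence between two probability distributions: the predicted distribution and the true distribution. It is the usual choice for logistic regression and for neural network classifiers because it penalizes confident wrong predictions heavily, growing without bound as the predicted probability of the correct class approaches zero. For binary classification, with a label of 0 or 1 and a predicted probability for the positive class: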

$$\text{Cross-Entropy Loss} = - \left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$$

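For multiclass classification with M classes, where the true label is one-hot encoded and the predictions are class probabilities: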

$$\text{Cross-Entropy Loss} = - \sum_{c=1}^{M} y_{c} \log(\hat{y}_{c})$$
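Both forms of cross-entropy can be sketched in plain Python. The function names, the clipping constant `eps`, and the example probabilities are assumptions made for illustration; clipping is a common safeguard that keeps the logarithm finite when a predicted probability is exactly 0 or 1.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy for a label y in {0, 1} and a predicted probability y_hat."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # keep log() finite
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Cross-entropy for a one-hot label and a list of class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_onehot, y_hat))


# Illustrative probabilities (assumed for this sketch)
loss_confident_right = binary_cross_entropy(1, 0.9)  # -log(0.9), a small loss
loss_confident_wrong = binary_cross_entropy(1, 0.1)  # -log(0.1), a large loss
loss_multiclass = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])  # -log(0.7)

print(loss_confident_right, loss_confident_wrong, loss_multiclass)
```

With two classes, the multiclass formula reduces to the binary one, which is why the two are usually treated as the same loss.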

Custom Loss Functions - For unique problems, standard loss functions may not fit. Custom functions can be created, allowing for complete flexibility based on the model and data.

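For example, a weighted cross-entropy, which scales the positive-class and negative-class terms by separate weights (a common way to counter class imbalance), can be defined as: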

$$\text{Weighted Cross-Entropy Loss} = - \left( w_1 \cdot y \log(\hat{y}) + w_0 \cdot (1 - y) \log(1 - \hat{y}) \right)$$

Code example

import tensorflow as tf

# Dummy true labels and predictions
y_true = tf.constant([0, 1, 0, 1], dtype=tf.float32)
y_pred = tf.constant([0.1, 0.8, 0.2, 0.7], dtype=tf.float32)

# Custom weights
w1 = 2.0  # Weight for positive class
w0 = 1.0  # Weight for negative class

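# Clip predictions away from 0 and 1 so the logarithms stay finite
eps = 1e-7
y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)

# Weighted Cross-Entropy Loss, averaged over the examples
loss = -tf.reduce_mean(
    w1 * y_true * tf.math.log(y_pred) +
    w0 * (1 - y_true) * tf.math.log(1 - y_pred)
)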

print("Weighted Cross-Entropy Loss:", loss.numpy())

How Cost Functions Drive Model Optimization

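Many common cost functions have useful mathematical properties: MSE, for example, is smooth and, for linear models, convex. Not every cost function shares them, though. MAE and hinge loss have kinks where the derivative is undefined (subgradients are used at those points), and the cost surface of a neural network is generally non-convex. Where the cost is differentiable, its gradient (slope) with respect to the model parameters can be computed, and gradient descent uses that gradient to update the parameters.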

The gradient points in the direction of steepest increase in the cost function. Therefore, taking small steps in the negative gradient direction iteratively moves the parameters toward lower cost. With a suitable step size (the learning rate) and enough iterations, the parameters converge toward a local or global minimum.
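The update rule described above can be sketched for the simplest case: fitting a line y = w * x + b by gradient descent on MSE. Everything in this example is an assumption made for illustration, including the toy data (generated from y = 2x + 1), the learning rate, the step count, and the helper name `mse_and_gradients`. The gradients are the analytic partial derivatives of MSE with respect to w and b.

```python
def mse_and_gradients(w, b, xs, ys):
    """Return the MSE of the line w * x + b and its gradients w.r.t. w and b."""
    n = len(xs)
    residuals = [(w * x + b) - y for x, y in zip(xs, ys)]
    cost = sum(r * r for r in residuals) / n
    grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_b = 2.0 / n * sum(residuals)
    return cost, grad_w, grad_b


# Toy data generated from y = 2x + 1 (assumed for this sketch)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0       # initial parameters
learning_rate = 0.05  # step size, chosen small enough to converge on this data

initial_cost, _, _ = mse_and_gradients(w, b, xs, ys)

for step in range(2000):
    _, grad_w, grad_b = mse_and_gradients(w, b, xs, ys)
    # Step against the gradient to reduce the cost
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_cost, _, _ = mse_and_gradients(w, b, xs, ys)
print("w:", round(w, 4), "b:", round(b, 4))  # approaches w = 2, b = 1
print("cost:", initial_cost, "->", final_cost)
```

Because MSE with a linear model is convex, there is a single minimum and the starting point does not matter here. With a learning rate that is too large, the same loop would diverge instead of converging.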

This optimization process fits the model parameters, such as weights and biases, so that the mapping from inputs to outputs minimizes the difference from ground truth as measured by the cost function. Using the right cost function guides the model toward better predictions.

Different cost functions, combined with different models, produce different optimization landscapes. A landscape may have a single global minimum or many local minima, and on the latter a gradient-based optimizer can get stuck in a suboptimal local minimum. Choosing an appropriate cost function that matches the problem is key.

Cost functions are a foundational component of nearly all machine learning models. By quantifying prediction error, they allow iterative improvement of model parameters to minimize the difference from ground truth. Choosing the right cost function for the task makes training more efficient and improves model performance. An understanding of cost functions unlocks the core of how machine learning models actually work.