# Loss Functions
Loss functions (also called cost functions) are mathematical functions that quantify the difference between a model’s predictions and the actual target values. In machine learning and optimization, the goal of training is to minimize this loss, thereby improving how well the model fits the data.
## Key Properties
- Non‑negativity: Loss is typically ≥ 0, with zero indicating a perfect fit.
- Differentiability: Most optimization algorithms (e.g., gradient descent) require the loss function to be differentiable with respect to the model parameters.
- Task‑specific design: Different tasks (regression, classification, ranking, etc.) call for different loss functions.
## Common Loss Functions
### 1. Regression Losses
Used when the target variable is continuous. A NumPy sketch of all three losses follows the list.
- Mean Squared Error (MSE)
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
Characteristics: Differentiable everywhere; penalizes large errors heavily due to squaring. Sensitive to outliers.
- Mean Absolute Error (MAE)
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$$
Characteristics: Less sensitive to outliers than MSE, but not differentiable where the error is exactly zero.
- Huber Loss
A combination of MSE and MAE: quadratic for small errors, linear for large errors, with the threshold δ controlling the transition. For the residual $r = y - \hat{y}$:
$$L_\delta(r) = \begin{cases} \tfrac{1}{2}r^2 & \text{if } \lvert r\rvert \le \delta \\ \delta\left(\lvert r\rvert - \tfrac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
Characteristics: Differentiable everywhere; robust to outliers.
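The three losses above are easy to write by hand. A minimal NumPy sketch (the function names and the δ default of 1.0 are our own illustrative choices, not taken from any particular library):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    r = y_true - y_pred
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.0])   # last point is a large error (outlier)
print(mse(y_true, y_pred))    # dominated by the outlier
print(mae(y_true, y_pred))    # grows only linearly with the outlier
print(huber(y_true, y_pred))  # quadratic near zero, linear for the outlier
```

Running this on the toy data makes the outlier sensitivity visible: MSE is dominated by the single large residual, while MAE and Huber grow only linearly with it.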
### 2. Classification Losses
Used when the target is a discrete class label. A sketch of the first three losses follows this list.
- Binary Cross‑Entropy (Log Loss)
For binary classification with labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{p}_i$:
$$\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$$
Characteristics: Works with probabilistic outputs (e.g., sigmoid); heavily penalizes confident wrong predictions.
- Categorical Cross‑Entropy
Extension to multi‑class classification with $C$ classes and one‑hot encoded labels $y_{i,c}$:
$$\mathrm{CCE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{i,c}\log \hat{p}_{i,c}$$
Characteristics: Used with softmax output; encourages the predicted probability distribution to match the true distribution.
- Hinge Loss
Commonly used with Support Vector Machines (SVMs) for “maximum‑margin” classification, with a raw score $f(x)$:
$$L = \max\bigl(0,\; 1 - y \cdot f(x)\bigr)$$
Characteristics: For targets in {−1, +1}; penalizes both misclassified points and correctly classified points that fall inside the margin.
- Kullback–Leibler (KL) Divergence
Measures how one probability distribution $P$ diverges from a reference distribution $Q$:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}$$
Often used in variational autoencoders or when the target is itself a distribution.
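A minimal NumPy sketch of the first three classification losses above (clipping probabilities by a small eps is a common numerical-stability trick; the constant used here is our own choice):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    """Categorical cross-entropy for one-hot labels and softmax outputs."""
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

def hinge(y_true, scores):
    """Hinge loss for labels in {-1, +1} and raw (unsquashed) scores."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

# Binary example: the confident wrong prediction (0.01 for a positive) dominates.
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.01])))

# Multi-class example: one-hot targets against softmax-style probabilities.
y_onehot = np.array([[1, 0, 0], [0, 1, 0]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_onehot, probs))

# Hinge example: the correct but unconfident score (0.4) is still penalized.
print(hinge(np.array([1, -1, 1]), np.array([2.0, -1.5, 0.4])))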
### 3. Ranking / Object Detection Losses
- Contrastive Loss – Used in Siamese networks to bring similar samples closer and push dissimilar ones apart.
- Triplet Loss – Common in face recognition and embedding learning; ensures an anchor is closer to positive samples than to negatives.
- Focal Loss – Modified cross‑entropy for object detection (e.g., RetinaNet) that down‑weights easy examples and focuses on hard ones.
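As an illustration of the last item, a minimal binary focal loss sketch following the RetinaNet formulation; the defaults γ = 2 and α = 0.25 come from that paper, while the function name and interface are our own:

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma so that
    well-classified (easy) examples contribute little to the total."""
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class-balancing weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# The easy example (p_t = 0.95) is down-weighted far more than the hard one (p_t = 0.3).
print(focal_loss(np.array([1, 1]), np.array([0.95, 0.3])))
```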
### 4. Custom Losses
In many applications, problem‑specific losses are designed: for imbalanced datasets, for structured prediction (such as intersection‑over‑union in segmentation), or to incorporate domain constraints.
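As one example of such a custom loss, here is a soft (differentiable) IoU loss of the kind sometimes used for segmentation; the smoothing constant eps and the exact formulation vary between implementations, so treat this as a sketch:

```python
import numpy as np

def soft_iou_loss(y_true, p_pred, eps=1e-7):
    """1 - soft IoU: a differentiable surrogate for intersection-over-union,
    computed on predicted foreground probabilities rather than hard masks."""
    intersection = np.sum(y_true * p_pred)
    union = np.sum(y_true) + np.sum(p_pred) - intersection
    return 1.0 - (intersection + eps) / (union + eps)

# Toy 2x2 masks: ground-truth foreground vs. predicted foreground probabilities.
y_true = np.array([[1, 0], [1, 1]], dtype=float)
p_pred = np.array([[0.9, 0.1], [0.6, 0.8]])
print(soft_iou_loss(y_true, p_pred))
```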
## Choosing a Loss Function
- Regression: MSE if outliers are not a concern; MAE or Huber if robustness is needed.
- Binary classification: Binary cross‑entropy for probabilistic output; hinge for maximum‑margin.
- Multi‑class classification: Categorical cross‑entropy.
- Imbalanced data: Weighted cross‑entropy (sketched below), focal loss, or custom losses that penalize false negatives/positives differently.
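For the imbalanced-data case, the simplest option is often to re-weight a standard loss. A sketch of class-weighted binary cross-entropy (the pos_weight value of 5.0 is purely illustrative):

```python
import numpy as np

def weighted_binary_cross_entropy(y_true, p_pred, pos_weight=5.0, eps=1e-12):
    """Binary cross-entropy with the positive class up-weighted, e.g. when
    positives are rare and false negatives are especially costly."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(pos_weight * y_true * np.log(p)
                    + (1 - y_true) * np.log(1 - p))

# Missing the rare positive (second entry) now costs 5x what it would unweighted.
print(weighted_binary_cross_entropy(np.array([0, 1]), np.array([0.2, 0.3])))
```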
## Summary
The loss function defines the optimization objective and thus directly influences the learned model. Selecting an appropriate loss is as critical as choosing the model architecture. Many frameworks (TensorFlow, PyTorch, etc.) provide a rich set of built‑in loss functions, and they also allow users to define custom ones.
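To make the last point concrete, here is a minimal PyTorch sketch of a user-defined loss (the class name and weighting scheme are our own illustration, not a built-in): any differentiable tensor expression works, and autograd supplies the gradients automatically.

```python
import torch
import torch.nn as nn

class WeightedMSELoss(nn.Module):
    """Hypothetical custom loss: MSE with per-sample weights."""
    def forward(self, pred, target, weights):
        return torch.mean(weights * (pred - target) ** 2)

loss_fn = WeightedMSELoss()
pred = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.5, 2.0, 2.0])
weights = torch.tensor([1.0, 1.0, 4.0])  # emphasize the last sample
loss = loss_fn(pred, target, weights)
loss.backward()                          # gradients flow exactly as with built-ins
print(loss.item())
```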