Activation Functions¶
Activation functions are mathematical functions that determine the output of a neural network node (neuron) given its weighted input. They introduce non‑linearity into the network, allowing it to learn complex patterns and representations from data. Without activation functions, the entire network would be equivalent to a single linear transformation, regardless of its depth.
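To make the last point concrete, here is a minimal NumPy sketch (illustrative, not from the original text) showing that two stacked linear layers without an activation collapse to a single linear map, while inserting a ReLU between them breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # batch of 4 inputs, 3 features
W1 = rng.normal(size=(3, 5))   # first "layer"
W2 = rng.normal(size=(5, 2))   # second "layer"

# Two stacked linear layers...
deep_linear = (x @ W1) @ W2
# ...are exactly one linear layer with weights W1 @ W2.
single_linear = x @ (W1 @ W2)
print(np.allclose(deep_linear, single_linear))   # True

# A non-linearity between the layers breaks this equivalence.
with_relu = np.maximum(x @ W1, 0) @ W2
print(np.allclose(with_relu, single_linear))     # False (in general)
```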
Why Are Activation Functions Needed?¶
- Non‑linearity: Real‑world data is rarely linearly separable; non‑linear activations enable the network to approximate any continuous function on a compact domain (universal approximation theorem).
- Gradient flow: During backpropagation, the derivative of the activation function is used to update weights. The choice of activation affects how gradients propagate through the network.
- Output interpretation: Certain activations (e.g., sigmoid, softmax) are used in the final layer to produce probabilities or class scores.
Common Activation Functions¶
1. Sigmoid (Logistic)¶
- Range: (0, 1)
- Use: Output layer for binary classification (predicting the probability of the positive class); rarely used in hidden layers in modern networks.
- Pros: Smooth, differentiable, outputs bounded.
- Cons: Vanishing gradient for large |x|; outputs not zero‑centered (can slow convergence).
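As a quick illustration of the bounded output and the vanishing gradient for large |x|, a small NumPy sketch of the sigmoid and its derivative σ′(x) = σ(x)(1 − σ(x)) (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # output in (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # maximum value 0.25, reached at x = 0

xs = np.array([0.0, 2.0, 10.0])
print(sigmoid(xs))        # approx [0.5    0.881  0.99995]
print(sigmoid_grad(xs))   # approx [0.25   0.105  4.5e-05] -> gradient vanishes for large |x|
```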
2. Tanh (Hyperbolic Tangent)¶
- Range: (-1, 1)
- Use: Hidden layers (especially in older RNNs).
- Pros: Zero‑centered, stronger gradients than sigmoid.
- Cons: Still suffers from vanishing gradient for extreme values.
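A matching NumPy sketch for tanh, showing the zero‑centered output and the derivative 1 − tanh²(x), whose maximum of 1 is larger than sigmoid's 0.25:

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # maximum value 1.0, reached at x = 0

xs = np.array([-2.0, 0.0, 2.0])
print(np.tanh(xs))       # approx [-0.964  0.     0.964] -> symmetric around zero
print(tanh_grad(xs))     # approx [ 0.071  1.     0.071] -> still tiny for large |x|
```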
3. ReLU (Rectified Linear Unit)¶
- Range: [0, ∞)
- Use: Default choice for many modern deep networks (CNNs, transformers).
- Pros: Computationally efficient; mitigates vanishing gradient; promotes sparsity.
- Cons: Dying ReLU – a neuron whose pre‑activation stays negative outputs zero and receives zero gradient, so it can remain inactive permanently; not zero‑centered.
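A minimal ReLU sketch, including the subgradient used in practice (0 for x ≤ 0, 1 for x > 0), which shows why a neuron stuck in the negative region stops receiving gradient:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient: 1 for positive inputs, 0 otherwise (the convention at x = 0 varies).
    return (x > 0).astype(x.dtype)

xs = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(xs))       # [0.  0.  0.  2.]
print(relu_grad(xs))  # [0.  0.  0.  1.] -> a neuron with always-negative input never updates
```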
4. Leaky ReLU¶
- Range: (-∞, ∞)
- Pros: Fixes dying ReLU by allowing a small, non‑zero slope (α) for negative inputs; see the sketch after the PReLU entry below.
- Cons: Additional hyperparameter α; performance gains over ReLU are sometimes marginal.
5. Parametric ReLU (PReLU)¶
Similar to Leaky ReLU, but α is learned during training.
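A small sketch covering both variants (the default α = 0.01 and the function names are illustrative): Leaky ReLU uses a fixed negative‑side slope α, while PReLU treats the same α as a learnable parameter whose gradient is x where x < 0 and 0 elsewhere.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: fixed small slope alpha for negative inputs.
    return np.where(x > 0, x, alpha * x)

def prelu_alpha_grad(x):
    # PReLU learns alpha; d(output)/d(alpha) is x on the negative side, 0 elsewhere.
    return np.where(x > 0, 0.0, x)

xs = np.array([-4.0, -1.0, 0.5, 3.0])
print(leaky_relu(xs))        # [-0.04 -0.01  0.5   3.  ]
print(prelu_alpha_grad(xs))  # [-4.   -1.    0.    0.  ] -> alpha receives a gradient signal
```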
6. ELU (Exponential Linear Unit)¶
- Range: (-α, ∞)
- Pros: Smooth, saturates for negative inputs (robust to noise), pushes mean activation closer to zero.
- Cons: Requires an exponential for negative inputs, so it is slower to compute than ReLU.
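An ELU sketch with the usual definition f(x) = x for x > 0 and α(eˣ − 1) otherwise (α = 1 here), showing the smooth saturation towards −α:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) for negative inputs.
    # The inner clamp only avoids overflow in the unused positive branch of np.where.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

xs = np.array([-10.0, -1.0, 0.0, 2.0])
print(elu(xs))   # approx [-1.000 -0.632  0.  2.] -> negative side saturates at -alpha
```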
7. Swish (SiLU – Sigmoid Linear Unit)¶
- Range: Bounded below, unbounded above.
- Pros: Smooth, non‑monotonic; often outperforms ReLU in deeper networks (e.g., in EfficientNet).
- Cons: Computationally more expensive.
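Swish with β = 1 (i.e. SiLU) is simply x · σ(x); a short sketch showing the non‑monotonic dip just below zero:

```python
import numpy as np

def silu(x):
    # Swish / SiLU: x * sigmoid(x); smooth and non-monotonic near zero.
    return x / (1.0 + np.exp(-x))

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(silu(xs))  # approx [-0.033 -0.269  0.  0.731  4.966]
                 # -> bounded below (~ -0.28), unbounded above
```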
8. Softmax¶
- Range: each output lies in (0, 1) and the outputs sum to 1.
- Use: Almost exclusively in the output layer for multi‑class classification.
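A numerically stable softmax sketch for a multi‑class output layer (subtracting the maximum logit avoids overflow without changing the result):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)            # approx [0.659 0.242 0.099]
print(probs.sum())      # 1.0 -> a valid probability distribution over classes
```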
9. Linear (Identity)¶
- Use: Output layer for regression tasks (predicting continuous values).
Activation Functions in Different Parts of the Network¶
- Hidden layers: ReLU (or its variants) is the most common starting point. For very deep networks, Swish or GELU (used in Transformers) may be beneficial.
- Output layer:
  - Regression → Linear.
  - Binary classification → Sigmoid.
  - Multi‑class classification → Softmax.
  - Multi‑label classification → Sigmoid per class.
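The difference between the last two cases is easy to miss: multi‑class classification uses one softmax over mutually exclusive classes, whereas multi‑label classification applies an independent sigmoid to each class. A self‑contained sketch of the contrast:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])

# Multi-class: one softmax over all classes; probabilities compete and sum to 1.
multi_class = np.exp(logits) / np.exp(logits).sum()
print(multi_class, multi_class.sum())   # approx [0.79 0.04 0.18], sums to 1.0

# Multi-label: an independent sigmoid per class; each label can be "on" on its own.
multi_label = 1.0 / (1.0 + np.exp(-logits))
print(multi_label)                      # approx [0.88 0.27 0.62], does not sum to 1
```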
Vanishing Gradient Problem¶
Activations like sigmoid and tanh squash large inputs, making their derivatives very small. When these derivatives are multiplied across many layers (backpropagation), the gradient can vanish, preventing early layers from learning. ReLU and its variants largely avoid this because they have a constant derivative (1) for positive inputs.
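A back‑of‑the‑envelope sketch of this effect: multiplying per‑layer sigmoid derivatives (at most 0.25 each) across 20 layers shrinks the gradient by roughly twelve orders of magnitude, whereas ReLU's derivative of 1 along the active path leaves it intact.

```python
layers = 20

# Sigmoid: even the best-case derivative is 0.25 per layer (at x = 0).
print(0.25 ** layers)   # ~9.1e-13 -> the gradient reaching early layers has effectively vanished

# ReLU: the derivative is exactly 1 along the active (positive) path.
print(1.0 ** layers)    # 1.0 -> gradient magnitude preserved through depth
```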
Choosing an Activation Function¶
- Start with ReLU: It’s simple, fast, and works well for most feedforward and convolutional networks.
- If dying ReLU occurs: Try Leaky ReLU, ELU, or Swish.
- For RNNs/LSTMs: Tanh is still common, but ReLU variants with careful initialization can also work.
- For transformers (NLP): GELU (Gaussian Error Linear Unit) or Swish are popular; a GELU sketch follows this list.
- Output layers: Follow the task‑specific guidelines above.
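For reference, a sketch of GELU using the common tanh approximation (the exact form is x · Φ(x), where Φ is the standard normal CDF):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(xs))   # approx [-0.045 -0.154  0.  0.346  1.955]
                  # -> smooth, with a slight negative dip below zero
```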
Summary¶
Activation functions are essential for neural networks to model complex, non‑linear relationships. The choice of activation significantly impacts training dynamics, convergence speed, and final performance. Modern practice favours ReLU for hidden layers due to its simplicity and effectiveness, while task‑specific activations are used at the output.