Optimization - Gradient descent

Grégoire Passault

Gradient descent

Previously, we saw how to solve linear systems of equations (exactly, or approximately).

Problem: how to find the parameters when the model is not linear in them?


Gradient

Suppose we have a function $f: \mathbb{R}^n \to \mathbb{R}$.

Let's take for example a function of two variables, $f(x_1, x_2)$.

In that case, the linear approximation becomes:

$$f(x_1 + \delta_1, x_2 + \delta_2) \approx f(x_1, x_2) + \frac{\partial f}{\partial x_1} \delta_1 + \frac{\partial f}{\partial x_2} \delta_2$$

Note: the derivatives above are taken at $(x_1, x_2)$.

In matrix form:

$$f(x + \delta) \approx f(x) + \nabla f^\top \delta$$

We call $\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix}$ the gradient of $f$.


Computing the gradient

For example, let's take a function $f(x_1, x_2)$.

⚙️ What is $\nabla f$, the gradient of $f$?
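A gradient computed by hand can always be checked numerically with finite differences. A minimal sketch, assuming for illustration the function $f(x_1, x_2) = x_1^2 + 3 x_2^2$ (whose gradient is $[2 x_1, 6 x_2]$):

import numpy as np

# Hypothetical example function: f(x1, x2) = x1^2 + 3*x2^2
def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2

# Its analytical gradient: [2*x1, 6*x2]
def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

# Finite-differences check of the gradient
def numerical_gradient(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

x = np.array([1.0, 2.0])
print(grad_f(x))                 # [ 2. 12.]
print(numerical_gradient(f, x))  # close to [ 2. 12.]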


Gradient descent

Suppose we want to minimize $f(x)$.
Idea: start at some initial guess $x_0$, and iteratively update $x$ by taking small steps.

The gradient provides a local, linear approximation of $f$:

$$f(x + \delta) \approx f(x) + \nabla f(x)^\top \delta$$

It gives the direction of the steepest ascent of $f$ (and $-\nabla f$ that of the steepest descent).


The algorithm is then:

  1. Select an initial guess $x_0$.
  2. Compute the gradient $\nabla f(x_k)$.
  3. Update $x$ by taking a small step in the opposite direction of the gradient:
    $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
  4. Repeat until convergence.

$\alpha$ is called the learning rate.
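A minimal sketch of this loop in Python (the quadratic function, the learning rate and the iteration count are arbitrary choices for illustration):

import numpy as np

# Toy function to minimize: f(x) = x1^2 + x2^2, minimum at the origin
def grad_f(x):
    return 2 * x

alpha = 0.1                   # learning rate
x = np.array([3.0, -2.0])     # initial guess

for k in range(100):
    x = x - alpha * grad_f(x) # small step against the gradient

print(x)  # close to [0, 0]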


We can now find the $x$ that minimizes $f(x)$, even when $f$ is not linear.

Some problems remain:

  • What model to use?
  • How to compute the gradient?
  • How to choose/adjust the learning rate $\alpha$?
  • How to tackle over/underfitting?

Perceptron


A perceptron is a function modelling a neuron: a weighted sum of its inputs, followed by an activation function.


The output of the perceptron is:

$$y = \sigma\left(\sum_i w_i x_i + b\right)$$

Where:

  • $x_i$ are the inputs,
  • $w_i$ are the weights,
  • $b$ is the bias,
  • $\sigma$ is the activation function.
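As a sketch, a perceptron takes only a few lines of Python (tanh is assumed here as the activation function; the weights are arbitrary):

import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the inputs plus the bias, passed through the activation
    return np.tanh(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])  # inputs
w = np.array([0.1, 0.4, -0.2])  # weights
b = 0.3                         # bias
print(perceptron(x, w, b))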

Activation functions

The goal of the activation function is to introduce non-linearity in the model.
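Some standard examples (a sketch; ELU is the one used in the PyTorch code later in this document):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)        # zero for negative inputs

def elu(x, alpha=1.0):
    # Like ReLU, but smooth and non-zero for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))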



Neural networks


In a neural network, multiple perceptrons are combined in layers.

Each node of the network is then a perceptron.


We call this a multi-layer perceptron (MLP).
The layers are fully connected.
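A minimal sketch of the forward pass through two fully connected layers (the dimensions and the tanh activation are arbitrary choices):

import numpy as np

def layer(x, W, b):
    # Fully connected: every output depends on every input
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                # input, dimension 3
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # hidden layer: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # output layer: 4 -> 2

y = layer(layer(x, W1, b1), W2, b2)
print(y)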

Weights

The trainable parameters of the network are the weights and biases of each layer.
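A fully connected layer with $n$ inputs and $m$ outputs has $m \times n$ weights plus $m$ biases. A quick check in PyTorch (the dimensions are an arbitrary example):

import torch as th

layer = th.nn.Linear(3, 4)  # 3 inputs, 4 outputs
# 4 x 3 = 12 weights plus 4 biases
print(sum(p.numel() for p in layer.parameters()))  # 16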

Learning


Loss function

The loss function measures how far the network's predictions $\hat{y}$ are from the targets $y$. Two common choices:

Categorical cross-entropy (classification):

$$\mathcal{L} = -\frac{1}{N} \sum_{k=1}^{N} \sum_{c} y_{k,c} \log \hat{y}_{k,c}$$

Mean squared error (regression):

$$\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} (y_k - \hat{y}_k)^2$$



Backpropagation

Suppose we have:

$$y = f_3(f_2(f_1(x)))$$

Where $f_1$, $f_2$ and $f_3$ are the layers of the network.

How to compute $\frac{\partial y}{\partial x}$?

The chain rule gives the answer: the derivative of a composition is the product of the derivatives of each stage, evaluated from the output back towards the input (hence the name backpropagation):

$$\frac{\partial y}{\partial x} = \frac{\partial f_3}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial x}$$
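In practice, libraries apply the chain rule automatically. A minimal sketch with PyTorch autograd (the three functions below stand in for layers and are arbitrary):

import torch as th

x = th.tensor([1.0, 2.0], requires_grad=True)

f1 = lambda v: 3.0 * v        # placeholder "layers"
f2 = lambda v: th.tanh(v)
f3 = lambda v: v.sum() ** 2

y = f3(f2(f1(x)))
y.backward()   # applies the chain rule from y back to x
print(x.grad)  # dy/dx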

Mini-batches

In practice, we don't compute the loss over the whole dataset at each iteration. Instead, we use mini-batches.

For example, we will sample $N$ indices $i_1, \dots, i_N$, and compute the loss over these samples:

$$\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} \ell(f(x_{i_k}), y_{i_k})$$

We call an epoch a full pass over the dataset.
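A sketch of mini-batch sampling (placeholder data; the batch size is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(1000, 1))  # placeholder dataset
ys = rng.normal(size=(1000, 1))

batch_size = 32
for step in range(len(xs) // batch_size):  # roughly one epoch
    indices = rng.choice(len(xs), batch_size, replace=False)
    batch_xs, batch_ys = xs[indices], ys[indices]
    # ... compute the loss and gradient on this mini-batch only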


Training and validation

When learning, we want to avoid overfitting.

To achieve this, the typical strategy is to split the dataset into training and validation sets.

For example, 80% of the data is used for training, and 20% for validation.
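A sketch of such a split with the 80/20 ratio (placeholder data; shuffling first is a common, but optional, choice):

import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(1000, 1))  # placeholder dataset
ys = rng.normal(size=(1000, 1))

indices = rng.permutation(len(xs))  # shuffle before splitting
split = int(0.8 * len(xs))          # 80% train, 20% validation
train_xs, val_xs = xs[indices[:split]], xs[indices[split:]]
train_ys, val_ys = ys[indices[:split]], ys[indices[split:]]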


By monitoring the loss on the validation set, we can detect overfitting: the training loss keeps decreasing while the validation loss stalls and starts to increase.


Optimizer

Gradient descent as presented above is rarely used as-is in practice. Many extra features are added to improve convergence.

The most common optimizer is the Adam optimizer.

It mostly adds momentum and adaptive learning rates.
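For reference, a sketch of Adam's update rule (bias-correction terms omitted for brevity): the gradient is averaged over time (momentum), and each coordinate's step is scaled by a running average of the squared gradient (adaptive learning rate):

$$m_{k+1} = \beta_1 m_k + (1 - \beta_1) \nabla f(x_k)$$
$$v_{k+1} = \beta_2 v_k + (1 - \beta_2) \nabla f(x_k)^2$$
$$x_{k+1} = x_k - \alpha \frac{m_{k+1}}{\sqrt{v_{k+1}} + \epsilon}$$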


PyTorch


PyTorch is a popular library for deep learning. It provides:

  • Tensors (like Numpy arrays) with GPU support,
  • Automatic differentiation (computing gradients),
  • Neural network layers,
  • Optimizers,
  • ...
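A tiny sketch of the first two points (tensor arithmetic, and moving to the GPU when one is available):

import torch as th

a = th.arange(6, dtype=th.float).reshape(2, 3)  # behaves like a Numpy array
b = a * 2.0 + 1.0

if th.cuda.is_available():  # the same code can run on GPU
    b = b.to("cuda")
print(b.sum())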

Learning a simple function

Let's first write a module for a multi-layer perceptron (mlp.py):

import torch as th

# Defining a simple MLP
class MLP(th.nn.Module):
    def __init__(self, input_dimension: int, output_dimension: int):
        super().__init__()

        self.net = th.nn.Sequential(
            th.nn.Linear(input_dimension, 256),
            th.nn.ELU(),
            th.nn.Linear(256, 256),
            th.nn.ELU(),
            th.nn.Linear(256, 256),
            th.nn.ELU(),
            th.nn.Linear(256, output_dimension),
        )

    def forward(self, x):
        return self.net(x)

We will then create some (noisy) data:

import numpy as np

xs = np.linspace(0, 6, 1000)
ys = np.sin(xs)*2.5 + 1.5 + np.random.randn(1000)*0.1

This data needs to be converted into PyTorch tensors:

import torch as th

xs = th.tensor(xs, dtype=th.float).unsqueeze(1)  # shape (1000,) -> (1000, 1)
ys = th.tensor(ys, dtype=th.float).unsqueeze(1)

We can now create our MLP and the optimizer:

from mlp import MLP

# Creating the network and optimizer
net = MLP(1, 1)  # one input (x), one output (y)
optimizer = th.optim.Adam(net.parameters(), 1e-3)  # learning rate of 1e-3

Finally, we can train our model:

# Training the model
for epoch in range(512):
    # Compute the loss
    loss = th.nn.functional.mse_loss(net(xs), ys)

    # Computing the gradient and updating weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch}, loss={loss.item()}")

Putting it all together, and adding some plotting, we get the full example code.


We can also plot the loss over time to check that training converges.
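A sketch of how to produce such a plot (assuming the loss values are appended to a list inside the training loop above):

import matplotlib.pyplot as plt

losses = []  # inside the loop: losses.append(loss.item())

plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.show()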


Let's practice

PanTilt exercise
