Gradient Descent
A brief introduction to the gradient descent algorithm
Both polynomial and exponential curve fitting admit best-fit least squares solutions [2]. However, these are specialized cases, so a more general framework is needed. In this post, we discuss the gradient descent algorithm, also known as steepest descent.
Consider the following function [1]
$$f(\theta_1, \theta_2) = \frac{1}{2}(\theta_{1}^2 - \theta_2)^2 + \frac{1}{2}(\theta_1 -1)^2$$
The gradient of the function $\nabla f$ is given by
$$\nabla f = (2(\theta_{1}^2 - \theta_2)\theta_1 +(\theta_1 - 1), -(\theta_{1}^2 - \theta_2))$$
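As a quick sanity check of this hand-derived gradient, one can differentiate $f$ symbolically. The short sketch below assumes the SymPy package is available and is not part of the gradient descent implementation itself.

# Quick symbolic check of the hand-derived gradient (assumes SymPy is installed)
import sympy as sp

t1, t2 = sp.symbols('theta1 theta2')
f_sym = sp.Rational(1, 2)*(t1**2 - t2)**2 + sp.Rational(1, 2)*(t1 - 1)**2

df_dt1 = sp.diff(f_sym, t1)   # equivalent to 2*theta1*(theta1**2 - theta2) + (theta1 - 1)
df_dt2 = sp.diff(f_sym, t2)   # equivalent to -(theta1**2 - theta2)
print(df_dt1)
print(df_dt2)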
The gradient points in the direction of steepest ascent of $f$, so its negative, $-\nabla f$, gives the direction of steepest descent, i.e. the direction along which $f$ decreases fastest and in which we move towards a minimum [2].
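To make this concrete, a small step from an arbitrary point along $-\nabla f$ should lower the value of $f$. The few lines below check this numerically; the point $(0.5, 2.0)$ and the step size $10^{-2}$ are arbitrary choices for illustration.

# Numerical check: a small step along -grad f decreases f
def f(theta1, theta2):
    return 0.5*(theta1**2 - theta2)**2 + 0.5*(theta1 - 1.0)**2

theta1, theta2 = 0.5, 2.0                                   # an arbitrary point
g1 = 2.0*theta1*(theta1**2 - theta2) + (theta1 - 1.0)       # df/dtheta1
g2 = -(theta1**2 - theta2)                                  # df/dtheta2

print(f(theta1, theta2))                        # 1.65625 at the original point
print(f(theta1 - 1e-2*g1, theta2 - 1e-2*g2))    # ~1.575, smaller after the step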
We are interested in finding the values of $\theta_1, \theta_2$ that minimize $f$. Gradient descent is an iterative algorithm that uses the gradient of the function to update the parameters. The update rule is [1, 2]
$$\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k-1} - \eta \nabla f|_{\boldsymbol{\theta}_{k-1}} $$
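In code, one iteration of this rule is a single line. The sketch below applies it to the example function using a vectorized grad_f; this vector form is an illustrative rewrite, and the complete scalar implementation is given further down.

import numpy as np

def grad_f(theta):
    # Gradient of the example f, written for a parameter vector theta = (theta1, theta2)
    t1, t2 = theta
    return np.array([2.0*t1*(t1**2 - t2) + (t1 - 1.0), -(t1**2 - t2)])

theta = np.array([0.0, 0.0])              # initial guess theta_0
eta = 0.1                                 # learning rate
for _ in range(100):
    theta = theta - eta*grad_f(theta)     # theta_k = theta_{k-1} - eta * grad f(theta_{k-1})
print(theta)                              # moves towards the minimizer (1, 1)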
$\eta$ is the so-called learning rate and controls how large a step we take in the direction of the negative gradient. A small $\eta$ slows down convergence, whilst too large a value may cause the iterates to overshoot and the algorithm to fail to converge. This is shown in the two figures below:
Figure 1. Gradient descent with $\eta = 0.1$.
Figure 2. Gradient descent with $\eta = 0.6$.
The code below is a simple implementation of the gradient descent algorithm.
import numpy as np
import matplotlib.pyplot as plt


def f(theta1, theta2):
    """The objective function f(theta1, theta2)."""
    return 0.5*(theta1**2 - theta2)**2 + 0.5*(theta1 - 1.0)**2


def f_grad(theta1, theta2):
    """The gradient of f as the tuple (df/dtheta1, df/dtheta2)."""
    return (2.0*theta1*(theta1**2 - theta2) + (theta1 - 1.0), -(theta1**2 - theta2))


def gd(eta, itrs, tol):
    """Run gradient descent from (0, 0) and return the sequence of iterates."""
    coeffs_series = []
    coeffs = [0.0, 0.0]
    coeffs_series.append([coeffs[0], coeffs[1]])
    val_old = f(theta1=coeffs[0], theta2=coeffs[1])

    for itr in range(itrs):
        # Update rule: theta_k = theta_{k-1} - eta * grad f(theta_{k-1})
        grad = f_grad(theta1=coeffs[0], theta2=coeffs[1])
        coeffs[0] -= eta*grad[0]
        coeffs[1] -= eta*grad[1]
        coeffs_series.append([coeffs[0], coeffs[1]])

        # Use the change in the objective value as the convergence criterion
        val = f(theta1=coeffs[0], theta2=coeffs[1])
        abs_error = np.abs(val - val_old)

        if itr % 10 == 0:
            print(">Iteration {0} absolute error {1} exit tolerance {2}".format(itr, abs_error, tol))

        if abs_error < tol:
            print(">GD converged with residual {0}".format(abs_error))
            return coeffs_series

        val_old = val

    return coeffs_series


# Run gradient descent and collect the iterates
coeffs_series = gd(eta=0.1, itrs=100, tol=1.0e-4)
coeffs_x = [item[0] for item in coeffs_series]
coeffs_y = [item[1] for item in coeffs_series]

# Plot the contours of f and the path followed by the iterates
theta1 = np.linspace(0.0, 2.0, 100)
theta2 = np.linspace(-0.5, 3.0, 100)
X, Y = np.meshgrid(theta1, theta2)
Z = f(X, Y)
plt.contour(X, Y, Z, 60, colors='black')
plt.plot(coeffs_x, coeffs_y, 'r-o')
plt.show()
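With gd in place, the behaviour shown in Figures 1 and 2 can be reproduced by rerunning the algorithm with the two learning rates and overlaying the resulting paths on the same contour plot; the snippet below is one minimal way to do this.

# Compare the iterate paths for a small and a large learning rate
path_small = gd(eta=0.1, itrs=100, tol=1.0e-4)
path_large = gd(eta=0.6, itrs=100, tol=1.0e-4)

plt.contour(X, Y, Z, 60, colors='black')
plt.plot([p[0] for p in path_small], [p[1] for p in path_small], 'r-o', label='eta = 0.1')
plt.plot([p[0] for p in path_large], [p[1] for p in path_large], 'b-s', label='eta = 0.6')
plt.legend()
plt.show()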
References

- [1] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press.
- [2] Steven L. Brunton and J. Nathan Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, Cambridge University Press.