6.9. Problems with the Hessian Matrix
In Newton’s method, the Hessian matrix can pose significant numerical challenges. Recall the Newton method step

\[
\mathbf{x}_{k+1} = \mathbf{x}_k + \Delta \mathbf{x},
\]
where the step is \(\Delta \mathbf{x} = -H^{-1} \nabla f\). The Hessian may be singular or nearly singular (ill-conditioned), making its inversion numerically unstable or impossible. The condition number of a matrix \(H\) is defined as

\[
\kappa(H) = \frac{\lambda_{\max}}{\lambda_{\min}},
\]
where \(\lambda_{\max}\) and \(\lambda_{\min}\) are the largest and smallest eigenvalues of \(H\). Equivalently, the condition number can be defined as the ratio of the largest to smallest singular values. The condition number measures how sensitive the solution to \(H \Delta\mathbf{x} = -\nabla f\) is to perturbations. A large condition number indicates that the Hessian is ill-conditioned, which can lead to numerical instability when computing the Newton step.
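As a quick numerical check, here is a minimal NumPy sketch (the Hessian below is a made-up example, not one from the lecture) comparing the eigenvalue-ratio and singular-value definitions:

```python
import numpy as np

# A hypothetical Hessian with one very small eigenvalue (nearly singular).
H = np.array([[1000.0, 0.0],
              [0.0,    0.001]])

# Condition number from the eigenvalue ratio (valid for symmetric H).
eigvals = np.linalg.eigvalsh(H)
kappa_eig = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))

# Equivalent computation via singular values.
kappa_svd = np.linalg.cond(H)

print(f"condition number (eigenvalues):     {kappa_eig:.1e}")   # ~1e6
print(f"condition number (singular values): {kappa_svd:.1e}")   # ~1e6
```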
When the Hessian is ill-conditioned, the Newton step \(\Delta \mathbf{x} = -H^{-1}\nabla f\) may be unreliable. Small errors in computing \(\nabla f\) or \(H\) can lead to large errors in \(\Delta \mathbf{x}\), and the method may take very large steps that overshoot the minimum or even diverge (trust me, I have seen Newton’s method venture off to infinity often). Additionally, away from the optimum, the Hessian may not be positive definite, which means the Newton direction may not even be a descent direction.
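The following sketch (again with a made-up ill-conditioned Hessian) shows this amplification concretely: a relative perturbation of about \(10^{-3}\) in the gradient changes the Newton step by a relative factor of about \(10^{3}\), consistent with the condition number of roughly \(10^{6}\).

```python
import numpy as np

# Ill-conditioned (but positive definite) Hessian, chosen for illustration.
H = np.array([[1000.0, 0.0],
              [0.0,    0.001]])
g = np.array([1.0, 0.0])              # gradient at the current point
perturbation = np.array([0.0, 1e-3])  # small error in the computed gradient

# Newton step with the exact gradient and with the perturbed gradient.
step = np.linalg.solve(H, -g)
step_perturbed = np.linalg.solve(H, -(g + perturbation))

rel_change_grad = np.linalg.norm(perturbation) / np.linalg.norm(g)
rel_change_step = np.linalg.norm(step_perturbed - step) / np.linalg.norm(step)

print(f"relative change in gradient: {rel_change_grad:.1e}")  # ~1e-3
print(f"relative change in step:     {rel_change_step:.1e}")  # ~1e+3
```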
To address these issues, we can introduce a damping factor (also called a step size or learning rate) into Newton’s method. The damped Newton update is

\[
\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha H^{-1} \nabla f,
\]
where \(\alpha \in (0, 1]\) is the damping factor. When \(\alpha = 1\), we recover the standard Newton method; smaller values of \(\alpha\) take a shorter step along the same Newton direction. The damping factor serves several purposes. First, it prevents the method from taking excessively large steps when far from the optimum. Second, it can help stabilize the method when the Hessian is ill-conditioned. Third, it can be adjusted adaptively using line search methods to ensure sufficient decrease in the objective function at each iteration.
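A minimal sketch of the damped update (the helper name `damped_newton_step` and the quadratic test function are illustrative, not from any particular library):

```python
import numpy as np

def damped_newton_step(x, grad, hess, alpha=1.0):
    """One damped Newton update: x_new = x - alpha * H^{-1} grad(x).

    Solving the linear system is preferred over forming H^{-1} explicitly.
    """
    g = grad(x)
    H = hess(x)
    delta_x = np.linalg.solve(H, -g)   # Newton direction
    return x + alpha * delta_x

# Example: minimize the simple quadratic f(x, y) = x**2 + 10*y**2.
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])

x = np.array([5.0, 5.0])
for _ in range(5):
    x = damped_newton_step(x, grad, hess, alpha=0.5)
print(x)   # approaches the minimizer at the origin
```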
One common strategy for choosing \(\alpha\) is backtracking line search, where we start with \(\alpha = 1\) and repeatedly reduce it (e.g., \(\alpha \leftarrow \beta \alpha\) for \(\beta \in (0,1)\)) until the Armijo condition is satisfied (this is covered in a later lecture). Backtracking line search ensures that each step produces a sufficient decrease in the objective function while allowing the method to take full Newton steps when appropriate.
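A hedged sketch of backtracking line search along the Newton direction; the constants `beta = 0.5` and `c = 1e-4` are common textbook defaults rather than values fixed by this lecture, and the Rosenbrock test function is just an illustrative choice:

```python
import numpy as np

def backtracking_newton_step(x, f, grad, hess, beta=0.5, c=1e-4, max_backtracks=50):
    """One Newton step with backtracking line search (Armijo condition)."""
    g = grad(x)
    d = np.linalg.solve(hess(x), -g)          # Newton direction
    alpha = 1.0                               # always try the full Newton step first
    for _ in range(max_backtracks):
        # Armijo sufficient-decrease condition: f(x + a*d) <= f(x) + c*a*g^T d
        if f(x + alpha * d) <= f(x) + c * alpha * (g @ d):
            break
        alpha *= beta                         # shrink the step
    return x + alpha * d, alpha

# Example on the Rosenbrock function, where full Newton steps can overshoot.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
    200 * (x[1] - x[0]**2),
])
hess = lambda x: np.array([
    [2 - 400 * (x[1] - x[0]**2) + 800 * x[0]**2, -400 * x[0]],
    [-400 * x[0], 200.0],
])

x = np.array([-1.2, 1.0])
for _ in range(25):
    x, alpha = backtracking_newton_step(x, f, grad, hess)
print(x)   # close to the minimizer at (1, 1)
```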
Another pitfall arises when the objective function is not convex. In this case the Hessian is not positive definite everywhere, and Newton’s method may converge to a saddle point or a local minimum.
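A small illustration (with a made-up test function): on \(f(x, y) = x^2 - y^2\), the Hessian is indefinite and the full Newton step jumps straight to the saddle point at the origin, since Newton’s method only seeks points where the gradient vanishes.

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle point at the origin, not a minimum.
grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, -2.0]])

x = np.array([1.0, 1.0])
H = hess(x)
print(np.linalg.eigvalsh(H))           # [-2.  2.] -> indefinite Hessian

# The full Newton step lands exactly on the stationary point, i.e. the saddle.
x_new = x + np.linalg.solve(H, -grad(x))
print(x_new)                           # [0. 0.]
```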