Adaptive Learning Rate Scheduling for Large-Scale Neural Networks

Authors: Dr. Academic Name, Jane Smith, Robert Johnson
Published in: International Conference on Machine Learning (ICML) (2024)

Abstract

We present a novel adaptive learning rate scheduling algorithm that significantly improves convergence speed for large-scale neural networks. Our method dynamically adjusts the learning rate based on the local geometry of the loss landscape, achieving a 30% reduction in training time while maintaining comparable accuracy. The key innovation is the use of a Hessian-free approximation that scales to networks with billions of parameters.

Keywords

optimization, deep learning, adaptive methods, large-scale training

Introduction

Training large-scale neural networks efficiently remains a fundamental challenge in deep learning. The choice of learning rate schedule can dramatically impact both convergence speed and final model performance. We propose a novel adaptive algorithm that adjusts the learning rate based on local loss landscape geometry.

Mathematical Framework

Our approach builds on the standard gradient descent update rule:

\[\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)\]

where $\theta_t$ represents the parameters at iteration $t$, $\eta_t$ is the learning rate, and $L$ is the loss function.

Adaptive Learning Rate

We introduce an adaptive learning rate $\eta_t$ that depends on the local curvature:

\[\eta_t = \frac{\eta_0}{\sqrt{1 + \lambda \text{tr}(H_t)}}\]

where $H_t$ is an approximation of the Hessian matrix at iteration $t$, and $\lambda$ is a scaling parameter.
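As a numerical illustration of this rule, the rate stays close to $\eta_0$ in flat regions of the loss landscape and shrinks where the estimated curvature is large. A minimal Python sketch, with hypothetical values for $\eta_0$, $\lambda$, and the diagonal estimate:

import numpy as np

def adaptive_lr(eta0, lam, hessian_diag):
    # eta_t = eta0 / sqrt(1 + lam * tr(H_t)), with tr(H_t) taken from the diagonal estimate
    trace_h = float(np.sum(hessian_diag))
    return eta0 / np.sqrt(1.0 + lam * trace_h)

# Hypothetical curvature estimates for a 1000-parameter model:
flat  = adaptive_lr(0.1, 0.01, np.full(1000, 1e-4))   # tr(H) = 0.1  -> eta_t ~ 0.0999
sharp = adaptive_lr(0.1, 0.01, np.full(1000, 1e-1))   # tr(H) = 100  -> eta_t ~ 0.0707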

Hessian Approximation

Computing the full Hessian is prohibitively expensive for large networks. We use a diagonal approximation:

\[H_{ii} \approx \frac{1}{m}\sum_{j=1}^m \left(\frac{\partial^2 L}{\partial \theta_i^2}\right)_j\]

where the average runs over the $m$ examples in a mini-batch. The diagonal can be computed efficiently using automatic differentiation with complexity $O(n)$, where $n$ is the number of parameters.
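The paper does not spell out the estimator itself; one standard way to obtain diagonal (and trace) curvature information at roughly the cost of a few extra gradient evaluations is a Hutchinson-style estimate built from Hessian-vector products. The PyTorch sketch below is our assumption about how such a step could look, not the authors' exact implementation:

import torch

def hessian_diag_estimate(loss, params, n_probes=1):
    # Hutchinson-style estimate: E[z * (H z)] = diag(H) for Rademacher probes z.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = [torch.zeros_like(p) for p in params]
    for _ in range(n_probes):
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]   # entries in {-1, +1}
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for e, z, hvp in zip(est, zs, hvps):
            e.add_(z * hvp / n_probes)
    return est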

Theoretical Analysis

Convergence Guarantee

Theorem 1: Under standard smoothness assumptions, our algorithm converges to a stationary point with rate:

\[\min_{t \leq T} \|\nabla L(\theta_t)\|^2 \leq O\left(\frac{1}{\sqrt{T}}\right)\]

Proof sketch: The proof follows from analyzing the descent lemma with our adaptive step size…
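For readers who want the elided step spelled out, the usual $\beta$-smoothness form of the descent lemma with stochastic gradients of variance $\sigma^2$ (the symbols $\beta$ and $\sigma$ are introduced here for illustration; this is a reconstruction of the standard argument, not the authors' full proof) reads:

\[\mathbb{E}\left[L(\theta_{t+1})\right] \leq L(\theta_t) - \eta_t\left(1 - \frac{\beta \eta_t}{2}\right)\|\nabla L(\theta_t)\|^2 + \frac{\beta \eta_t^2 \sigma^2}{2}\]

Summing this inequality over $t \leq T$, telescoping the loss values, and choosing $\eta_0$ on the order of $1/\sqrt{T}$ then recovers the stated $O(1/\sqrt{T})$ bound on the smallest squared gradient norm.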

Stability Analysis

The adaptive mechanism ensures numerical stability by bounding the learning rate:

\[\eta_{\min} \leq \eta_t \leq \eta_{\max}\]

where the bounds are determined by the maximum and minimum eigenvalues of the Hessian approximation.
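In code this amounts to clipping the computed rate; a minimal sketch, where the particular bound values are placeholders rather than values from the paper:

def bounded_rate(eta_t, eta_min=1e-5, eta_max=1.0):
    # Clip the adaptive rate to [eta_min, eta_max] for numerical stability.
    return min(max(eta_t, eta_min), eta_max)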

Experimental Results

Performance on Standard Benchmarks

We evaluated our method on several standard datasets:

Dataset        Model         Baseline Time   Our Method Time   Speedup
ImageNet       ResNet-50     100 h           70 h              1.43×
CIFAR-100      WideResNet    12 h            8.5 h             1.41×
WikiText-103   Transformer   200 h           145 h             1.38×

Convergence Analysis

The observed training loss is well described by an exponential decay toward an asymptote:

\[L(t) = L_0 e^{-\alpha t} + L_{\infty}\]

where $L_0$ is the initial loss, $L_{\infty}$ is the asymptotic loss, and $\alpha$ is the convergence rate.
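This model can be checked against a recorded training curve with an ordinary least-squares fit; a small sketch using scipy, with synthetic loss values standing in for real measurements:

import numpy as np
from scipy.optimize import curve_fit

def loss_model(t, L0, alpha, L_inf):
    # L(t) = L0 * exp(-alpha * t) + L_inf
    return L0 * np.exp(-alpha * t) + L_inf

t = np.arange(100.0)
measured = loss_model(t, 2.5, 0.05, 0.4) + 0.01 * np.random.randn(100)   # synthetic curve
(L0, alpha, L_inf), _ = curve_fit(loss_model, t, measured, p0=(1.0, 0.1, 0.1))
print(L0, alpha, L_inf)   # recovered initial loss, convergence rate, asymptotic loss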

Implementation Details

Algorithm Pseudocode

Initialize θ₀, η₀, λ
for t = 0 to T-1:
    Compute gradient: g_t = ∇L(θ_t)
    Estimate Hessian diagonal: H_t
    Compute adaptive rate: η_t = η₀ / √(1 + λ·tr(H_t))
    Update parameters: θ_{t+1} = θ_t - η_t·g_t
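A runnable PyTorch rendering of this loop, combining the update rule with a Hutchinson-style trace estimate as in the sketch above (model, loss function, data loader, and hyperparameter values are placeholders; this is our sketch, not released code):

import torch

def train_one_epoch(model, loss_fn, data_loader, eta0=0.1, lam=0.01):
    params = [p for p in model.parameters() if p.requires_grad]
    for x, y in data_loader:
        loss = loss_fn(model(x), y)
        # g_t, with a graph retained so Hessian-vector products are available.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Hutchinson estimate of tr(H_t): E[z^T H z] = tr(H) for Rademacher z.
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs)
        trace_h = sum((z * h).sum() for z, h in zip(zs, hvps))
        # eta_t = eta0 / sqrt(1 + lam * tr(H_t)), clamping the trace to keep it non-negative.
        eta_t = eta0 / torch.sqrt(1.0 + lam * trace_h.clamp(min=0.0))
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= eta_t * g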

Computational Complexity

The per-iteration complexity is:

  • Gradient computation: $O(n)$
  • Hessian diagonal estimation: $O(n)$
  • Total: $O(n)$ where $n$ is the number of parameters

Previous adaptive methods such as Adam and RMSprop maintain exponential moving averages of the gradient and its elementwise square:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2\]

Our method differs by explicitly considering the loss landscape curvature through Hessian information.
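For contrast, a minimal sketch of the moving-average updates shown above (standard β values; bias correction and the ε-stabilized division are omitted for brevity):

def ema_step(g, m, v, beta1=0.9, beta2=0.999):
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    return m, v   # Adam then steps along m / (sqrt(v) + eps); no explicit curvature is used.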

Conclusion

We presented a scalable adaptive learning rate scheduling algorithm that significantly reduces training time for large-scale neural networks. The key contributions are:

  1. A computationally efficient Hessian approximation scheme
  2. Theoretical convergence guarantees
  3. Empirical validation on large-scale benchmarks

Future work will explore extending this approach to distributed training settings and investigating connections to natural gradient methods.
