Adaptive Learning Rate Scheduling for Large-Scale Neural Networks
Abstract
We present a novel adaptive learning rate scheduling algorithm that significantly improves convergence speed for large-scale neural networks. Our method dynamically adjusts the learning rate based on the local geometry of the loss landscape, achieving a 30% reduction in training time while maintaining comparable accuracy. The key innovation is the use of a Hessian-free approximation that scales to networks with billions of parameters.
Introduction
Training large-scale neural networks efficiently remains a fundamental challenge in deep learning [1]. The choice of learning rate schedule can dramatically affect both convergence speed and final model performance. We propose a novel adaptive algorithm that adjusts the learning rate based on the local geometry of the loss landscape.
Mathematical Framework
Our approach builds on the standard gradient descent update rule:
\[\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)\]
where $\theta_t$ represents the parameters at iteration $t$, $\eta_t$ is the learning rate, and $L$ is the loss function.
Adaptive Learning Rate
We introduce an adaptive learning rate $\eta_t$ that depends on the local curvature:
\[\eta_t = \frac{\eta_0}{\sqrt{1 + \lambda\, \text{tr}(H_t)}}\]
where $H_t$ is an approximation of the Hessian matrix at iteration $t$, and $\lambda$ is a scaling parameter.
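As a concrete illustration, the schedule is a one-line computation once a diagonal Hessian estimate is available. The sketch below is ours rather than reference code from the paper; `hessian_diag` stands in for any estimate of the diagonal of $H_t$, and the numbers are placeholders.

```python
import numpy as np

def adaptive_eta(eta0, lam, hessian_diag):
    """eta_t = eta0 / sqrt(1 + lam * tr(H_t)), with tr(H_t) taken as the sum
    of a diagonal Hessian estimate. Assumes 1 + lam * tr(H_t) > 0."""
    return eta0 / np.sqrt(1.0 + lam * float(np.sum(hessian_diag)))

# Placeholder curvature estimate for a 3-parameter model
eta_t = adaptive_eta(eta0=0.1, lam=1e-3, hessian_diag=np.array([2.0, 5.0, 0.5]))
```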
Hessian Approximation
Computing the full Hessian is prohibitively expensive for large networks. We use a diagonal approximation:
\[H_{ii} \approx \frac{1}{m}\sum_{j=1}^m \left(\frac{\partial^2 L}{\partial \theta_i^2}\right)_j\]
This can be computed efficiently using automatic differentiation with complexity $O(n)$, where $n$ is the number of parameters.
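The paper does not spell out the estimator, so the PyTorch sketch below uses a standard Hutchinson-style estimate, $\text{diag}(H) \approx \mathbb{E}[z \odot Hz]$ with Rademacher vectors $z$, where each Hessian-vector product costs one extra backward pass; the function name and sample count are illustrative assumptions.

```python
import torch

def estimate_hessian_diag(loss_fn, params, n_samples=8):
    """Hutchinson-style estimate of diag(H): average of z * (H z) over Rademacher z."""
    loss = loss_fn(params)
    grad, = torch.autograd.grad(loss, params, create_graph=True)
    est = torch.zeros_like(params)
    for _ in range(n_samples):
        z = torch.randint_like(params, low=0, high=2) * 2 - 1   # entries in {-1, +1}
        hvp, = torch.autograd.grad(grad, params, grad_outputs=z, retain_graph=True)
        est += z * hvp
    return est / n_samples

# Sanity check on a quadratic L(p) = 0.5 * p^T A p, whose Hessian is A
A = torch.diag(torch.tensor([1.0, 4.0, 9.0]))
p = torch.randn(3, requires_grad=True)
print(estimate_hessian_diag(lambda q: 0.5 * q @ A @ q, p))   # approximately [1, 4, 9]
```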
Theoretical Analysis
Convergence Guarantee
Theorem 1: Under standard smoothness assumptions, our algorithm converges to a stationary point with rate:
\[\min_{t \leq T} \|\nabla L(\theta_t)\|^2 \leq O\left(\frac{1}{\sqrt{T}}\right)\]
Proof sketch: The proof follows from analyzing the descent lemma with our adaptive step size…
Stability Analysis
The adaptive mechanism ensures numerical stability by bounding the learning rate:
\[\eta_{\min} \leq \eta_t \leq \eta_{\max}\]
where the bounds are determined by the maximum and minimum eigenvalues of the Hessian approximation.
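In code the bound is a simple clip applied after the schedule; the fixed values below are placeholders, since the paper derives $\eta_{\min}$ and $\eta_{\max}$ from the extreme eigenvalues of the Hessian approximation.

```python
import numpy as np

def bounded_eta(eta_t, eta_min=1e-4, eta_max=1.0):
    # Placeholder bounds; the paper derives them from the largest and smallest
    # eigenvalues of the diagonal Hessian approximation.
    return float(np.clip(eta_t, eta_min, eta_max))
```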
Experimental Results
Performance on Standard Benchmarks
We evaluated our method on several standard datasets:
| Dataset | Model | Baseline Time | Time with Our Method | Speedup |
|---|---|---|---|---|
| ImageNet | ResNet-50 | 100h | 70h | 1.43× |
| CIFAR-100 | WideResNet | 12h | 8.5h | 1.41× |
| WikiText-103 | Transformer | 200h | 145h | 1.38× |
Convergence Analysis
The observed training curves are well described by an exponential decay model:
\[L(t) = L_0 e^{-\alpha t} + L_{\infty}\]
where $L_0$ is the initial loss, $L_{\infty}$ is the asymptotic loss, and $\alpha$ is the convergence rate.
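When training curves are logged, the three constants can be recovered with an ordinary nonlinear least-squares fit; the sketch below uses synthetic data, so the fitted values are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_model(t, L0, alpha, L_inf):
    return L0 * np.exp(-alpha * t) + L_inf

# Synthetic curve standing in for logged training losses
t = np.arange(200, dtype=float)
logged = loss_model(t, 2.5, 0.03, 0.4) + 0.01 * np.random.randn(t.size)

(L0, alpha, L_inf), _ = curve_fit(loss_model, t, logged, p0=(1.0, 0.1, 0.1))
print(f"L0={L0:.2f}, alpha={alpha:.3f}, L_inf={L_inf:.2f}")
```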
Implementation Details
Algorithm Pseudocode
```
Initialize θ₀, η₀, λ
for t = 0 to T−1:
    Compute gradient: g_t = ∇L(θ_t)
    Estimate Hessian diagonal: H_t
    Compute adaptive rate: η_t = η₀ / √(1 + λ·tr(H_t))
    Update parameters: θ_{t+1} = θ_t − η_t·g_t
```
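For readers who prefer running code, the following self-contained PyTorch sketch instantiates the pseudocode on a toy quadratic objective, using a one-sample Hutchinson estimate of $\text{tr}(H_t)$ and the learning-rate bounds from the stability analysis. The single-tensor setup, hyperparameter values, and estimator choice are assumptions made for illustration, not the authors' implementation.

```python
import torch

def adaptive_lr_train(loss_fn, theta, eta0=0.1, lam=1.0, steps=500,
                      eta_min=1e-4, eta_max=1.0):
    """Sketch of the adaptive schedule for a single parameter tensor."""
    for _ in range(steps):
        loss = loss_fn(theta)
        g, = torch.autograd.grad(loss, theta, create_graph=True)
        # One-sample Hutchinson estimate: z . (H z) approximates tr(H_t)
        z = torch.randint_like(theta, low=0, high=2) * 2 - 1
        hvp, = torch.autograd.grad(g, theta, grad_outputs=z)
        trace_est = (z * hvp).sum()
        # Adaptive rate with stability bounds
        eta = eta0 / torch.sqrt(1.0 + lam * trace_est)
        eta = float(torch.clamp(eta, eta_min, eta_max))
        with torch.no_grad():
            theta -= eta * g
    return theta

# Toy usage: quadratic loss 0.5 * p^T A p with known Hessian A
A = torch.diag(torch.tensor([1.0, 10.0, 100.0]))
theta = torch.randn(3, requires_grad=True)
adaptive_lr_train(lambda p: 0.5 * p @ A @ p, theta)
print(theta)   # should approach the minimizer at the origin
```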
Computational Complexity
The per-iteration complexity is:
- Gradient computation: $O(n)$
- Hessian diagonal estimation: $O(n)$
- Total: $O(n)$ where $n$ is the number of parameters
Related Work
Previous adaptive methods such as Adam [3] and RMSprop adapt the step size using exponential running averages of the gradient and its elementwise square (Adam maintains both moments, RMSprop only the second):
\[m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2\]
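For comparison, the bookkeeping behind these running averages looks roughly as follows (bias correction and the parameter update are omitted; this is an illustration rather than a full Adam implementation):

```python
import numpy as np

def update_moments(m, v, g, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and its elementwise square."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    return m, v

# Example over a stream of synthetic gradients
m = v = np.zeros(3)
for g in np.random.randn(10, 3):
    m, v = update_moments(m, v, g)
```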
Our method differs by explicitly incorporating loss-landscape curvature through approximate Hessian information [2].
Conclusion
We presented a scalable adaptive learning rate scheduling algorithm that significantly reduces training time for large-scale neural networks. The key contributions are:
- A computationally efficient Hessian approximation scheme
- Theoretical convergence guarantees
- Empirical validation on large-scale benchmarks
Future work will explore extending this approach to distributed training settings and investigating connections to natural gradient methods.
References
[1] Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223-311.
[2] Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. ICML.
[3] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR.