Wednesday, December 17, 2025

higher batch size in ML model training

Better accuracy with a higher batch size is a common observation in deep learning, though it often sparks debate because the "best" batch size usually depends on how the other hyperparameters are tuned alongside it.

If you're seeing this effect, it's likely due to how the model navigates the loss landscape. Here is a breakdown of why larger batches can lead to better performance.


1. More Accurate Gradient Estimates

When you use a small batch, the gradient (the direction the model moves to improve) is calculated based on just a few examples. This makes the gradient "noisy." A larger batch size provides a more accurate estimate of the true gradient of the entire dataset.

  • Small Batch: The model takes jagged, erratic steps. While this noise can help it escape "sharp" local minima, it can also prevent the model from ever settling into the absolute best spot.

  • Large Batch: The path toward the minimum is smoother and more stable, allowing the model to converge more reliably toward a high-quality solution. (A quick numerical sketch of this effect follows the list.)
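As a rough illustration, the NumPy sketch below uses a synthetic linear-regression problem (the dataset, dimensions, and batch sizes are arbitrary stand-ins, not taken from any real setup) and compares minibatch gradients against the full-dataset gradient. The average error shrinks as the batch grows, roughly with the square root of the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (a stand-in for a real dataset).
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)                          # current (untrained) parameters
full_grad = 2 * X.T @ (X @ w - y) / n    # "true" gradient over the whole dataset

def minibatch_grad(batch_size):
    """Gradient of the mean-squared error on one random minibatch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

for bs in (8, 64, 512, 4096):
    # Average distance between 200 minibatch gradients and the full gradient.
    errs = [np.linalg.norm(minibatch_grad(bs) - full_grad) for _ in range(200)]
    print(f"batch size {bs:>5}: mean gradient error = {np.mean(errs):.3f}")
```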

2. Generalization and "Flat" Minima

There is a theory in deep learning that larger batches help the model find wider, flatter minima in the loss landscape, provided the learning rate is scaled correctly.

  • Sharp minima are very sensitive; a tiny change in input data can lead to a huge jump in error (poor generalization).

  • Flat minima are robust; the model performs consistently even if the data varies slightly, leading to higher validation and test accuracy. (A small sharpness experiment follows the list.)
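To make the sharp-vs-flat intuition concrete, here is a deliberately contrived 1-D "loss landscape" (not a real network) with one narrow valley and one wide valley at the same depth. Perturbing the weight slightly barely moves the loss at the flat minimum but blows it up at the sharp one.

```python
import numpy as np

# Toy 1-D loss landscape (illustrative only): a narrow "sharp" valley at w = -2
# and a wide "flat" valley at w = +2, both reaching the same minimum loss of 0.
def loss(w):
    sharp = 1 - np.exp(-50 * (w + 2) ** 2)   # sharp minimum
    flat = 1 - np.exp(-0.5 * (w - 2) ** 2)   # flat minimum
    return np.minimum(sharp, flat)

rng = np.random.default_rng(0)
for name, w_star in [("sharp", -2.0), ("flat", 2.0)]:
    # Small random perturbations of the weight, standing in for a slight shift
    # between training and test conditions.
    perturbed = w_star + rng.normal(scale=0.05, size=1000)
    print(f"{name} minimum: loss rises from {loss(w_star):.4f} "
          f"to {loss(perturbed).mean():.4f} under small perturbations")
```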

3. Better Use of Regularization (BatchNorm)

If your model uses Batch Normalization, the batch size directly impacts how the means and variances are calculated.

  • With very small batches, the batch statistics are highly volatile, which can actually destabilize training.

  • With larger batches, the Batch Norm layers have a more "representative" sample of the data to normalize against, leading to more stable training and often higher final accuracy. (The sketch below makes this concrete.)
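A quick way to see this is to measure how much the per-batch mean (one of the statistics a BatchNorm layer computes during training) fluctuates at different batch sizes. The activation distribution below is made up for illustration; the exact numbers are not from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations for a single channel (stand-in for real feature maps).
activations = rng.normal(loc=1.5, scale=2.0, size=100_000)

for bs in (4, 32, 256, 2048):
    # The per-batch mean is one of the statistics BatchNorm normalizes with.
    batch_means = [activations[rng.choice(len(activations), size=bs, replace=False)].mean()
                   for _ in range(500)]
    print(f"batch size {bs:>5}: std of batch means = {np.std(batch_means):.4f}")
```

The smaller the batch, the more the normalization statistics bounce around from step to step, which is exactly the instability described above.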

4. The "Learning Rate" Connection

Often, a higher batch size allows you to use a higher learning rate without the model diverging. This is known as the Linear Scaling Rule. By increasing both, you effectively cover more ground in the loss landscape per step, which can lead the model to a better optimum than it might have reached with smaller, slower steps.
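Concretely, the Linear Scaling Rule says that if you multiply the batch size by k, you multiply the learning rate by k as well, usually with a short warmup. The sketch below shows the arithmetic; the base learning rate, batch sizes, and warmup length are placeholders, not recommendations for your model.

```python
# Hypothetical baseline: lr = 0.1 was tuned for a batch size of 256.
base_lr, base_batch = 0.1, 256
new_batch = 1024

# Linear Scaling Rule: scale the learning rate by the same factor as the batch size.
scaled_lr = base_lr * (new_batch / base_batch)   # 0.1 * 4 = 0.4

# Large-batch recipes commonly ramp up to the scaled rate over a few hundred steps.
warmup_steps = 500

def lr_at(step):
    """Learning-rate schedule: linear warmup to the scaled rate, then constant."""
    if step < warmup_steps:
        return base_lr + (scaled_lr - base_lr) * step / warmup_steps
    return scaled_lr

print(f"scaled lr = {scaled_lr}")
print(f"lr at step 0 = {lr_at(0):.3f}, step 250 = {lr_at(250):.3f}, step 1000 = {lr_at(1000):.3f}")
```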


A Note of Caution

While large batches offer stability, there is a point of diminishing returns. If the batch size is too large relative to your learning rate, the model can suffer from the "generalization gap": training loss keeps falling, but the optimizer settles into a sharp minimum near where it started instead of exploring the landscape, and validation accuracy lags behind.

Summary Table: Batch Size Trade-offs

Feature          | Small Batch                      | Large Batch
Gradient Quality | Noisy / High Variance            | Stable / Low Variance
Training Speed   | Slower (less parallelization)    | Faster (high GPU utilization)
Convergence      | Can "wiggle" out of local minima | Converges smoothly to the nearest minimum
Memory Usage     | Low                              | High

