ICML Workshop 2024

Where Do Large Learning Rates Lead Us? A Feature Learning Perspective

It is conventional wisdom that using large learning rates (LRs) early in training improves generalization. Following a line of research devoted to understanding this effect mechanistically, we conduct an empirical study in a controlled setting, focusing on the feature learning properties of training with different initial LRs. We show that the range of initial LRs that yields the best generalization of the final solution produces a sparse set of learned features, with a clear focus on those most relevant to the task. In contrast, training that starts with too small an LR attempts to learn all features simultaneously, resulting in poor generalization, while initial LRs that are too large fail to extract meaningful patterns from the data.
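The three regimes above hinge on the notion of an *initial* LR that is later decayed. A minimal sketch of such a setup is below; the step-decay schedule, function name, and all numeric values are illustrative assumptions, not the paper's actual training protocol.

```python
def lr_schedule(initial_lr, step, decay_step=1000, decay_factor=0.1):
    """Hold `initial_lr` early in training, then decay it.

    This is a hypothetical step-decay schedule used only to
    illustrate what "initial LR" means in the abstract.
    """
    if step < decay_step:
        return initial_lr
    return initial_lr * decay_factor

# Three illustrative regimes: too small, well-chosen large, too large.
for init_lr in (1e-4, 1e-1, 10.0):
    early = lr_schedule(init_lr, step=0)      # LR used early in training
    late = lr_schedule(init_lr, step=5000)    # LR after decay
    print(init_lr, early, late)
```

Under this sketch, all runs share the same decay rule and differ only in `initial_lr`, which isolates the early-training LR as the variable of interest.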