Yongjae Oh: [Deepest S18 Seminar] Geometry of solution space: Flat minima, replica theory and mode connectivity

On Saturday, November 15, I hosted a seminar at Deepest (a student deep learning club at SNU). I focused mainly on theoretical topics related to the loss function landscape in deep learning and the concept of flat minima, and briefly covered some additional topics on global geometry.

The loss function $L\left(\theta; \{x_\mu\}_{\mu=1}^P\right)$ is a function of the many weights $\theta$ of a neural network, and the "landscape" of this function is determined by the dataset $\{x_\mu\}_{\mu=1}^{P}$. S. Hochreiter and J. Schmidhuber proposed that models lying in 'flat minima' of the landscape generalize better (S. Hochreiter and J. Schmidhuber, NIPS (now NeurIPS) 1994).

The geometric intuition behind this argument is simple: if the loss evaluated at the model is to change little when the dataset changes (training set -> test set) and the landscape is distorted accordingly, the model should lie in a broad region of the loss landscape. This also agrees with the discussion of model complexity in statistics and information theory, which says that simpler models generalize better (J. Schmidhuber, ICLR 1995).
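The intuition above can be illustrated with a toy one-dimensional landscape. This is only a minimal sketch under strong assumptions: the loss `train_loss`, the curvatures 50 and 0.5, and the idea of modelling the dataset change as a rigid translation `shift` of the landscape are all hypothetical choices made for illustration, not something taken from the seminar or the cited papers.

```python
import numpy as np

def train_loss(theta):
    # Toy landscape with two minima of equal depth (loss 0):
    # a sharp one at theta = -2 and a flat one at theta = +2.
    sharp = 50.0 * (theta + 2.0) ** 2
    flat = 0.5 * (theta - 2.0) ** 2
    return np.minimum(sharp, flat)

def shifted_loss(theta, shift):
    # Hypothetical "test" landscape: the training landscape translated by `shift`.
    return train_loss(theta - shift)

theta_sharp, theta_flat = -2.0, 2.0
shift = 0.1  # assumed size of the train -> test distortion

# Generalization gap: how much the loss at a fixed model grows after the shift.
gap_sharp = shifted_loss(theta_sharp, shift) - train_loss(theta_sharp)
gap_flat = shifted_loss(theta_flat, shift) - train_loss(theta_flat)

print(f"gap at sharp minimum: {gap_sharp:.4f}")
print(f"gap at flat minimum:  {gap_flat:.4f}")
```

Both minima have the same training loss, but the model sitting in the flat minimum suffers a much smaller loss increase under the same distortion.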

In fact, one can show that maximum a posteriori (MAP) inference is equivalent to minimizing the cross-entropy loss while keeping the model simple (L2 regularization when $-\ln P(\theta)\propto \theta^{2}$). This is exactly what we do in deep learning. Moreover, it is also equivalent to the idea of minimum description length (MDL), which states that the total number of bits needed to describe the model and the data should be as small as possible.
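The MAP argument can be written out in a few lines. The sketch below assumes i.i.d. samples and a Gaussian prior $P(\theta)\propto e^{-\lambda\theta^{2}}$; the symbols $\theta_{\mathrm{MAP}}$ and $\lambda$ are introduced here for illustration only.

$$
\begin{aligned}
\theta_{\mathrm{MAP}}
&= \arg\max_{\theta}\, P\left(\theta \mid \{x_\mu\}_{\mu=1}^{P}\right)
 = \arg\max_{\theta}\, P\left(\{x_\mu\}_{\mu=1}^{P} \mid \theta\right) P(\theta) \\
&= \arg\min_{\theta} \left[ -\sum_{\mu=1}^{P} \ln P(x_\mu \mid \theta) \;-\; \ln P(\theta) \right] \\
&= \arg\min_{\theta} \left[ -\sum_{\mu=1}^{P} \ln P(x_\mu \mid \theta) \;+\; \lambda\,\theta^{2} \right].
\end{aligned}
$$

The first term is the negative log-likelihood of the data (the cross-entropy loss, up to normalization) and the second is the L2 penalty. Measured in base-2 logarithms, the same two terms can be read as code lengths: the bits needed to describe the data given the model plus the bits needed to describe the model, which is the quantity MDL asks to minimize.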

The authors show that finding flat minima (minima surrounded by as large a volume as possible of points with nearly the same loss) is also linked to MDL. They further propose a gradient descent method that is intentionally biased toward such minima and leads to good generalization.

Research on flat minima continues in the contemporary deep learning era. It is now widely accepted that SGD natively prefers flat minima without any added bias. There are also a few works that rigorously connect flat minima with generalization performance using the PAC-Bayes framework.

Another active line of research analyzes the global geometry of the disordered solution space of deep learning using statistical physics (E. Gardner, J. Phys. A: Math. Gen. (1988)). These studies started from a simple perceptron model with random data, but have since been extended to more realistic architectures and datasets. In this paradigm, the frequently appearing factor $\alpha=P/N$ ($P$: data size, $N$: model size) is crucial for the collective behavior of the neural network. For example, for random datasets, when $\alpha$ is small the space of solutions tends to be connected, and when $\alpha$ is large it tends to be fragmented, which can be roughly related to flat minima. Moreover, in the limit where $N, P \to \infty$ with $\alpha$ kept finite, an accurate explicit formula for the training loss on practical datasets (MNIST, etc.) has been obtained.
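To make the role of $\alpha=P/N$ concrete, here is a minimal numerical sketch of the simplest question in this setting: whether a perceptron can fit random labels at all. It does not probe connectivity or fragmentation of the solution space, and it is not the replica calculation itself. The size `N = 40`, the seed, the two values of `alpha`, and the use of the classic perceptron learning rule are all assumptions made for illustration. For reference, the classical storage capacity of this model in the large-$N$ limit is $\alpha = 2$.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=200):
    # Classic perceptron rule (no bias term): update on every misclassified pattern.
    P, N = X.shape
    w = np.zeros(N)
    for _ in range(max_epochs):
        mistakes = 0
        for x_mu, y_mu in zip(X, y):
            if y_mu * (w @ x_mu) <= 0:
                w += y_mu * x_mu
                mistakes += 1
        if mistakes == 0:  # every pattern is classified correctly
            break
    return w

def training_error(w, X, y):
    # Fraction of patterns that are misclassified by weights w.
    return float(np.mean(y * (X @ w) <= 0))

rng = np.random.default_rng(0)
N = 40  # model size (number of weights), an arbitrary illustrative choice
errors = {}
for alpha in (0.5, 4.0):
    P = int(alpha * N)  # number of random patterns
    X = rng.standard_normal((P, N))
    y = rng.choice([-1.0, 1.0], size=P)
    w = train_perceptron(X, y)
    errors[alpha] = training_error(w, X, y)
    print(f"alpha = {alpha}: training error = {errors[alpha]:.3f}")
```

At small $\alpha$ the random labels can be fit exactly, while well above the capacity no zero-error solution exists and the training error stays positive.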

Lastly, this method can theoretically reproduce the well-known '(linear) mode connectivity', which was reported empirically in deep learning in the late 2010s, and furthermore predicts that the connected region is star-shaped (B. L. Annesi et al., PRL 2023).
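Linear mode connectivity is usually checked by evaluating the loss along the straight line between two solutions and measuring how far it rises above the endpoints. The sketch below shows that measurement on two toy losses. The function names `linear_path_barrier`, `double_well`, and `convex_bowl` are hypothetical, and the toy losses stand in for a real network loss; this is an illustration of the diagnostic, not the analysis of the cited paper.

```python
import numpy as np

def linear_path_barrier(loss_fn, theta_a, theta_b, n_points=21):
    # Evaluate the loss on the segment (1 - t) * theta_a + t * theta_b, t in [0, 1],
    # and return the largest excess over the linear interpolation of the endpoint losses.
    ts = np.linspace(0.0, 1.0, n_points)
    path_losses = np.array([loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts])
    endpoint_line = (1.0 - ts) * path_losses[0] + ts * path_losses[-1]
    return float(np.max(path_losses - endpoint_line))

def double_well(theta):
    # Two separate minima at theta = -1 and theta = +1 with a bump in between.
    return float(np.sum((theta ** 2 - 1.0) ** 2))

def convex_bowl(theta):
    # Convex loss: the straight path between any two points never rises above the endpoints.
    return float(np.sum(theta ** 2))

theta_a = np.array([-1.0])
theta_b = np.array([1.0])

barrier_double_well = linear_path_barrier(double_well, theta_a, theta_b)
barrier_bowl = linear_path_barrier(convex_bowl, theta_a, theta_b)

print(f"double well barrier: {barrier_double_well:.3f}")
print(f"convex bowl barrier: {barrier_bowl:.3f}")
```

A barrier close to zero means the two solutions are linearly mode connected; a positive barrier means the straight path crosses a region of higher loss.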

See this post on LinkedIn: link

See this post on Facebook: link

(Korean version)

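On Saturday, November 15, I hosted a seminar at Deepest, a deep learning club at Seoul National University. I covered theoretical topics on the loss function landscape in deep learning and on flat minima in detail, and also introduced a few additional topics on more global structure.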

The loss function $L\left(\theta; \{x_\mu\}_{\mu=1}^P\right)$ takes the many weights $\theta$ of the network's connections as its domain, and the shape of this function, its "landscape", is determined by the dataset $\{x_\mu\}_{\mu=1}^{P}$. Choosing a point on the loss landscape fixes the network weights, so it amounts to fixing a model. Sepp Hochreiter, who also proposed LSTM, a representative model for time-series processing, proposed at NIPS (now NeurIPS) 1994 that models located at flat minima of the loss landscape should generalize well (S. Hochreiter and J. Schmidhuber, NIPS 1994).

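The geometric intuition behind this argument is simple. When the dataset changes from the training set to the test set, the loss landscape is distorted. For the loss value evaluated at the original model to change little despite this distortion (that is, for the model to generalize well), the model must sit in a broad region of the landscape. As described in more detail below using the concept of MDL, this is also consistent with the argument, made in statistics and information theory in connection with model complexity, that simpler models generalize better (J. Schmidhuber, ICLR 1995).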

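In the context of Bayesian inference, maximum a posteriori (MAP) inference is equivalent to minimizing the cross-entropy loss while keeping the model simple (L2 regularization when $-\ln P(\theta)\propto \theta^{2}$). This is exactly what we often do in deep learning. Unlike the smooth data distribution that exists in our imagination, what the machine is actually given is only individual samples (that is, a distribution made of delta functions), so pursuing a simple model in this way is very important for generalization. Going further, this is also equivalent to minimum description length (MDL), one of the principles of model selection, which states that the total number of bits needed to describe the model and the data should be as small as possible.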

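The authors show that finding flat minima, that is, minima surrounded by as large a volume as possible of points with the same loss value as the minimum itself, is also related to MDL. They further propose a new gradient descent method that is intentionally biased to prefer such minima, and confirm that it finds models that generalize well.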

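Research on flat minima continues in the modern era, in which deep learning has come to the forefront of information technology. It is now fairly widely accepted that SGD, through its particular statistical and dynamical properties, natively tends to prefer flat minima without any separate intentional bias. In other words, SGD does not merely make optimization more efficient by reducing the batch size; as a result it also reaches better minima. Recently there have also been works that connect flat minima and generalization performance more rigorously using the PAC-Bayes framework.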

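Another line of research, long-standing yet still active today, analyzes the global structure of the disordered solution space of deep learning using statistical mechanics (E. Gardner, J. Phys. A: Math. Gen. (1988)). These theoretical studies mostly started from simple settings such as a single perceptron and random datasets, but they have increasingly been applied to more complex and realistic settings, and they explain phenomena observed in deep learning quite well. In this paradigm, the factor $\alpha=P/N$ ($P$: dataset size, $N$: model size) frequently plays an important role. In particular, in the limit where $N$ and $P$ both go to infinity while $\alpha$ is kept finite, many important properties of neural networks are predicted by statistical mechanics (the predictions often agree fairly well with actual experiments already at a few hundred to a few thousand dimensions).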

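For example, for random datasets, the smaller $\alpha$ is, the more the solutions of the perceptron classification problem (regions with low loss, in more realistic neural networks) tend to be connected to one another, and the larger $\alpha$ is, the more the solutions tend to be fragmented. In the language of statistical mechanics, these correspond to replica symmetry (RS) and its breaking (RSB), respectively. One can also consider the relation to flat minima. Of course, flat minima are a local concept whereas RS/RSB are far more global, so a one-to-one correspondence is difficult to establish, but a rough relation can be expected and should be checked more broadly using, for example, the Franz-Parisi entropy.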

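Finally, this statistical mechanics method also theoretically reproduces '(linear) mode connectivity', which has been widely reported empirically in the deep learning theory community since the late 2010s. This refers to the observation that regions with small loss values on the loss landscape are very broadly connected through flat straight-line paths. Furthermore, the method makes the new theoretical prediction that this connected structure is star-shaped (B. L. Annesi et al., PRL 2023), and attempts to verify this prediction in more realistic neural networks are ongoing.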


Sunday, November 23, 2025
