Distillation

bellmake 2025. 4. 3. 13:16

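✅ Key Characteristics

Model Compression
- Transfers the knowledge of a large model (teacher) to a small model (student), producing a lighter model.
Soft Targets
- The student learns not only from the ground-truth labels (hard labels) but also from the teacher's predicted probability distribution (soft labels).
- These soft labels carry information about how similar the classes are to one another.
Better Generalization
- The student can generalize better than when it is trained on hard labels alone.
Architectural Flexibility
- The student does not need to share the teacher's architecture. It can be much smaller or have an entirely different structure.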
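
To make the hard label vs. soft label distinction concrete, here is a minimal sketch in plain Python. The class names and the teacher logits are made-up values for illustration only; they do not come from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "car"]

# Hard label: one-hot, says only "this is a cat".
hard_label = [1.0, 0.0, 0.0]

# Hypothetical teacher logits for the same image (assumed values).
teacher_logits = [5.0, 3.0, -2.0]

# Soft label: the teacher's full predicted distribution.
soft_label = softmax(teacher_logits)

for name, hard, soft in zip(classes, hard_label, soft_label):
    print(f"{name}: hard={hard:.1f} soft={soft:.4f}")
```

In this toy case the soft label gives "dog" far more probability than "car", which is the kind of class-similarity information a one-hot label cannot express.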

[Figure: Soft Target]

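✅ Definition of Distillation Loss

Distillation Loss refers to the full loss function used during Knowledge Distillation.
It usually consists of two components.

🧩 1. Cross-Entropy (Hard label loss)

Pushes the student to predict the ground-truth label correctly.
This is the same loss used in conventional supervised learning.

🧩 2. KL Divergence (Soft target loss)

Pushes the student to imitate the teacher's soft output, i.e. the probability distribution obtained by applying a (temperature-scaled) softmax to the teacher's logits.
This KLDiv term is the part that performs what is commonly called logit matching.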
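
The two components can be sketched separately as follows. This is a plain-Python illustration, not a framework implementation; the probability values are invented, and the KL term is written as KL(teacher || student), which is the direction usually used for the soft-target loss.

```python
import math

def cross_entropy(student_probs, true_index):
    """Hard label loss: negative log-probability the student gives the true class."""
    return -math.log(student_probs[true_index])

def kl_divergence(teacher_probs, student_probs):
    """Soft target loss: KL(teacher || student) over the class distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0  # terms with p == 0 contribute nothing
    )

# Assumed example distributions over three classes.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]

ce = cross_entropy(student_probs, true_index=0)
kl = kl_divergence(teacher_probs, student_probs)
print(f"hard label loss (CE): {ce:.4f}")
print(f"soft target loss (KL): {kl:.4f}")
```

The KL term is zero only when the student's distribution equals the teacher's, so minimizing it pulls the whole distribution toward the teacher, not just the top class.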

[Figure: Distillation Loss]

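✅ Summary by Analogy

Distillation Loss = the whole meal
Loss_KD = the main dish
KLDiv = the recipe (the way the logit differences get cooked)

# Distillation Loss = Hard label loss + Soft label loss
# KLDiv(student, teacher) measures how far the student's distribution is from the teacher's, i.e. KL(teacher || student)
Loss_total = alpha * CrossEntropy(y_true, student) + (1 - alpha) * KLDiv(student, teacher)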

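🛠 Representative Methods (basic knowledge distillation)

1. Soft Target Matching

Loss composition:
Loss = α * CE(y, student(x)) + (1-α) * KLDiv(student_T(x), teacher_T(x))
- CE: Cross Entropy against the ground truth (hard label)
- KLDiv: KL Divergence between the student and teacher output distributions
- T: Temperature (usually set > 1 to soften the distributions); student_T and teacher_T are the outputs of softmax(logits / T)
- α: weight in [0, 1] that balances the two losses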
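
Putting the pieces together, the loss above could look roughly like this in plain Python for a single example. The function names, default values, and logits are assumptions for illustration. Note also that the `temperature ** 2` factor is not part of the formula above: it follows the common practice from Hinton et al. of rescaling the soft term so its gradient magnitude stays comparable as T changes. Drop it to match the formula literally.

```python
import math

def softmax(logits, temperature=1.0):
    """softmax(logits / T), computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      alpha=0.5, temperature=4.0):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard label term: ordinary cross-entropy at T = 1.
    hard = -math.log(softmax(student_logits)[true_index])

    # Soft target term: KL divergence between temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q)
               for p, q in zip(p_teacher, p_student) if p > 0)

    # T^2 scaling is an added convention, not part of the formula in the text.
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft

# Assumed toy logits for one training example whose true class is index 0.
student_logits = [2.0, 1.0, 0.5]
teacher_logits = [5.0, 3.0, -2.0]
print(distillation_loss(student_logits, teacher_logits, true_index=0))
```

Setting `alpha=1.0` reduces this to plain supervised training, while `alpha=0.0` trains the student purely on the teacher's softened outputs.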

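2. Temperature Scaling

Softens the output distribution so the student can learn the relationships between classes.
- Example: softmax(logits / T)
- The larger T is, the flatter the distribution becomes, so the relative probabilities of the non-target classes become more visible to the student. If T is too large, the distribution approaches uniform and that signal fades.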
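
The flattening effect is easy to check numerically. The sketch below uses made-up logits and prints the softened distribution and its entropy for a few temperatures; entropy is used here only as a convenient measure of flatness.

```python
import math

def softmax(logits, temperature=1.0):
    """softmax(logits / T), computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; larger means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [5.0, 3.0, -2.0]  # assumed teacher logits

for t in (1.0, 2.0, 4.0, 20.0):
    probs = softmax(logits, t)
    rounded = [round(p, 3) for p in probs]
    print(f"T={t:>4}: probs={rounded} entropy={entropy(probs):.3f}")
```

As T grows, the entropy climbs toward its maximum of log(3) for three classes, which is the uniform distribution where no class-similarity information is left.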

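🔥 Other, More Advanced Distillation Techniques (briefly)

FitNets: the student also learns to match the teacher's intermediate-layer feature maps
Attention Transfer: the student imitates the teacher's attention maps
Self-Distillation: a single model trains itself by using its own outputs at different stages as the teacher signal (e.g. deeper-layer outputs supervising shallow layers)
Task-specific Distillation: uses distillation strategies tailored to the task, in NLP, CV, and so on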
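
As a rough illustration of the FitNets idea from the list above, the sketch below matches a student feature vector to a teacher feature vector through a small linear "regressor" that bridges the size gap. Everything here is a simplification for illustration: the feature values and regressor weights are invented, features are flat vectors rather than feature maps, and the loss is a plain mean squared error.

```python
def project(features, weights):
    """Map student features to the teacher's dimension with a linear regressor."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def hint_loss(student_features, teacher_features, regressor_weights):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_features, regressor_weights)
    squared_errors = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared_errors) / len(teacher_features)

# Assumed toy values: a 2-dim student feature and a 3-dim teacher feature.
student_features = [0.5, -1.0]
teacher_features = [1.0, 0.0, -0.5]

# Hypothetical 3x2 regressor; in practice these weights are learned with the student.
regressor_weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

loss = hint_loss(student_features, teacher_features, regressor_weights)
print(f"hint loss: {loss:.4f}")
```

The regressor is what allows the student's intermediate layer to be narrower than the teacher's, which fits the point made earlier that the two models need not share an architecture.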

Language Model Distillation