Knowledge Distillation in Neural Networks

Nov 7, 2021

Problem

Your model is too big to run on an edge device, and it's difficult to get similar performance from a smaller model by training it from scratch.

Solution

Knowledge distillation

In knowledge distillation, a smaller model is trained to learn from what a bigger model has already learned.
The following shows how to use the larger model (often called the teacher model) to train a smaller model (the student model).

What we have: pre-trained teacher model, data
What we need to do: Train the student model

How

Forward pass the data through both the teacher model and the student model. For backpropagation, the objective used to update the student considers:

A student loss function: the deviation of the student's prediction from the ground truth
A distillation loss function: the deviation of the student's soft targets from the soft targets of the teacher model. Soft targets are produced by dividing a model's prediction by a factor (called temperature, T) before applying softmax.

The soft targets of each model:

Teacher model: Softmax(Prediction[teacher model] / T)

Student model: Softmax(Prediction[student model] / T)

Distillation loss: the deviation/difference between the soft targets of the student model and those of the teacher model.
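To make the soft targets concrete, here is a minimal sketch in plain Python. It assumes that each model's "prediction" is its raw pre-softmax scores (logits) and that the deviation is measured with KL divergence, which is one common choice. The function names, the logits, and the temperature value are all made up for illustration.

```python
import math

def soft_targets(logits, temperature=1.0):
    """Softmax(logits / T). A higher T gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical logits for one sample over three classes.
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [4.0, 3.0, 0.5]
T = 4.0  # illustrative temperature, not a recommended value

teacher_soft = soft_targets(teacher_logits, T)
student_soft = soft_targets(student_logits, T)

# Distillation loss: how far the student's soft targets are from the teacher's.
distillation_loss = kl_divergence(teacher_soft, student_soft)
```

With T = 1 the teacher's output is close to a one-hot vector. Raising T spreads probability onto the other classes, which is the extra information the student learns from.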

Example

In the example on keras.io, the student model is trained on a weighted sum of the two losses: a fraction (0.1) of the loss comes from the ground truth and the remaining part (0.9) comes from the soft targets of the teacher model. KL divergence is used as the distillation loss between the soft targets.
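The weighting can be sketched as follows. This is not the keras.io code itself, only a plain-Python illustration of the same idea. It assumes cross-entropy for the student loss, and the function name `distillation_objective`, the logits, and the temperature value are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_objective(student_logits, teacher_logits, true_class,
                           alpha=0.1, temperature=4.0):
    # Student loss: cross-entropy against the ground-truth class (T = 1).
    student_probs = softmax(student_logits)
    student_loss = -math.log(student_probs[true_class])

    # Distillation loss: KL(teacher || student) on the soft targets.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    distill_loss = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

    # alpha weights the ground-truth part, (1 - alpha) the teacher part.
    return alpha * student_loss + (1 - alpha) * distill_loss

# Hypothetical logits for one sample whose true class is 0.
loss = distillation_objective([4.0, 3.0, 0.5], [6.0, 2.0, 1.0], true_class=0)
```

Setting alpha to 1 ignores the teacher and recovers ordinary supervised training. Setting it to 0 trains the student on the teacher's soft targets alone.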

Paper by Hinton et al.: "Distilling the Knowledge in a Neural Network".

Do share your wonderful feedback.

Written by Anas R.
