Knowledge Distillation in Neural Networks

Anas R.
1 min read · Nov 7, 2021



Your model is too big to run on an edge device, and it’s difficult to get similar performance from a smaller model by training it from scratch.


Knowledge distillation

In knowledge distillation, you teach a smaller model using what a bigger, pre-trained model has already learned.
Here is how to use the larger model (often called the teacher model) to train a smaller one (the student model).

What we have: pre-trained teacher model, data
What we need to do: Train the student model


Forward pass the data through both the teacher model and the student model. For backpropagation, the objective combines:

  • A student loss function: the deviation of the student’s prediction from the ground truth
  • A distillation loss function: the deviation of the student’s soft targets from the teacher’s soft targets. The soft targets are produced by scaling each model’s outputs with a factor (called the temperature, T) before applying softmax.
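The temperature-scaled softmax that produces soft targets can be sketched in plain Python (the function name and the example logits below are illustrative, not from the original post):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: higher T yields softer, more uniform targets."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class problem
logits = [4.0, 1.0, 0.2]
hard = softmax_with_temperature(logits, T=1.0)  # peaked distribution
soft = softmax_with_temperature(logits, T=5.0)  # softer distribution
```

With T = 1 this is the ordinary softmax; raising T spreads probability mass onto the smaller classes, which is exactly what makes soft targets informative for the student.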

The soft targets are computed from each model’s outputs:

Teacher model: Softmax(Prediction[teacher model] / T)

Student model: Softmax(Prediction[student model] / T)

Distillation loss: the deviation/difference between the soft targets of the student and teacher models.
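One concrete choice for that deviation is KL divergence between the two soft-target distributions. A minimal sketch, assuming hypothetical soft targets (the function name and numbers are illustrative):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far the student's soft targets q are from the teacher's p.
    eps guards against log(0) for near-zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical soft targets for a 3-class problem
teacher_soft = [0.7, 0.2, 0.1]
student_soft = [0.5, 0.3, 0.2]
distill_loss = kl_divergence(teacher_soft, student_soft)
```

KL divergence is zero when the two distributions match and grows as the student’s soft targets drift away from the teacher’s.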


In one example implementation, the student model is trained using a fraction (0.1) of the loss from the ground truth and the other part (0.9) of the loss from the soft targets of the teacher model. KL divergence is used as the distillation loss between the soft targets.
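That weighted objective can be sketched as follows. All names and inputs here are assumptions for illustration; the T² scaling on the distillation term comes from Hinton et al., who use it to keep gradient magnitudes comparable across temperatures:

```python
import math

def cross_entropy(probs, label):
    """Student loss: negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label] + 1e-12)

def kl(p, q, eps=1e-12):
    """Distillation loss: KL divergence from teacher soft targets p to student's q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_objective(student_probs, student_soft, teacher_soft, label,
                           alpha=0.1, T=5.0):
    """alpha * hard-label loss + (1 - alpha) * T^2 * distillation loss."""
    return (alpha * cross_entropy(student_probs, label)
            + (1 - alpha) * (T ** 2) * kl(teacher_soft, student_soft))

# Hypothetical outputs for a 3-class problem, true class = 0
loss = distillation_objective(
    student_probs=[0.8, 0.15, 0.05],  # student softmax at T=1
    student_soft=[0.5, 0.3, 0.2],     # student softmax at T=5
    teacher_soft=[0.6, 0.25, 0.15],   # teacher softmax at T=5
    label=0,
)
```

With alpha = 0.1, the ground truth contributes 10% of the objective and the teacher’s soft targets the remaining 90%, matching the split described above.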

Paper: “Distilling the Knowledge in a Neural Network” by Hinton et al.

Do share your wonderful feedback.