Knowledge Distillation in Neural Networks

Anas R.
1 min read · Nov 7, 2021


Problem

Your model is too big to deploy on an edge device, and it's difficult to get similar performance from a smaller model by training it from scratch.

Solution

Knowledge distillation

In knowledge distillation, you teach a smaller model using what a bigger model has already learned.
Here is how the larger model (often called the teacher model) is used to train a smaller one (the student model).

What we have: a pre-trained teacher model and data
What we need to do: train the student model

How

Forward pass the data through both the teacher model and the student model. The objective used for backpropagation combines:

  • A student loss: the deviation of the student's prediction from the ground truth
  • A distillation loss: the deviation of the student's soft targets from the teacher's soft targets. The soft targets are produced by scaling the logits with a factor called the temperature, T, before the softmax.

The soft targets:

Teacher model: softmax(teacher logits / T)

Student model: softmax(student logits / T)

Distillation loss: the deviation/difference between the soft targets of the student and the teacher model (see the sketch below).
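Here is a minimal sketch of the soft targets and the distillation loss in TensorFlow/Keras. The function names, the temperature value, and the choice of KL divergence as the distance are assumptions for illustration; the models are assumed to output raw logits.

```python
import tensorflow as tf

T = 10.0  # temperature (an assumed value)

def soft_targets(logits, temperature):
    # Soft targets: softmax of the logits divided by the temperature
    return tf.nn.softmax(logits / temperature, axis=-1)

def distillation_loss(teacher_logits, student_logits, temperature=T):
    # Deviation between the teacher's and the student's soft targets,
    # measured here with KL divergence (other distances also work)
    return tf.keras.losses.kl_divergence(
        soft_targets(teacher_logits, temperature),
        soft_targets(student_logits, temperature),
    )
```

A higher temperature flattens the softmax distribution, so the student also learns from the relative probabilities the teacher assigns to the wrong classes.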

Example

In the example on keras.io, the student model is trained with a weighted sum of losses: a fraction (0.1) of the loss comes from the ground truth and the remaining part (0.9) from the soft targets of the teacher model. KL divergence is used as the distillation loss between the soft targets.
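Below is a condensed training-step sketch in the spirit of that keras.io example (not the verbatim code). The `teacher` and `student` models, the optimizer choice, and the temperature value are assumptions; the 0.1/0.9 weighting and the KL-divergence distillation loss follow the description above.

```python
import tensorflow as tf

alpha, T = 0.1, 10.0  # weight on the ground-truth loss, and the temperature

student_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
distillation_loss_fn = tf.keras.losses.KLDivergence()
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y, teacher, student):
    teacher_logits = teacher(x, training=False)  # the teacher stays frozen
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        # 0.1 of the loss comes from the ground truth ...
        student_loss = student_loss_fn(y, student_logits)
        # ... and 0.9 from the KL divergence between the soft targets
        distillation_loss = distillation_loss_fn(
            tf.nn.softmax(teacher_logits / T, axis=1),
            tf.nn.softmax(student_logits / T, axis=1),
        )
        loss = alpha * student_loss + (1.0 - alpha) * distillation_loss
    # Only the student's weights are updated
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```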

Paper: Distilling the Knowledge in a Neural Network, by Hinton et al. (2015).

Do share your wonderful feedback.

