Problem
Your model is too big to run on an edge device, and it is difficult to match its performance with a smaller model trained from scratch.
Solution
Knowledge distillation
In knowledge distillation, a smaller model learns from what a bigger model has already learned.
Here is how to use the larger model (often called the teacher model) to train a smaller one (the student model).
What we have: a pre-trained teacher model and training data
What we need to do: Train the student model
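For concreteness, here is a minimal sketch of such a teacher/student pair in Keras. The architectures, layer sizes, and the 10-class, MNIST-like input shape are illustrative assumptions, not part of the recipe; the only requirement is that the teacher is larger and already trained, and both output raw logits.

```python
import keras
from keras import layers

# Larger, pre-trained teacher model (architecture is illustrative)
teacher = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(256, 3, strides=2, activation="relu"),
    layers.Conv2D(512, 3, strides=2, activation="relu"),
    layers.Flatten(),
    layers.Dense(10),  # raw logits, no softmax
], name="teacher")

# Smaller student model we want to train
student = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, 3, strides=2, activation="relu"),
    layers.Flatten(),
    layers.Dense(10),  # raw logits, no softmax
], name="student")
```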
How
Forward pass the data through both the teacher model and the student model. For backpropagation, the training objective combines:
- A student loss function: deviation of student prediction from the ground truth
- A distillation loss function: deviation of the student's soft targets from the teacher's soft targets. The soft targets are produced by scaling the logits with a factor (called the temperature, T) before applying softmax.
The soft target of:
Teacher model: Softmax(logits[teacher model] / T)
Student model: Softmax(logits[student model] / T)
Distillation loss: the difference between the soft targets of the student and teacher models (see the sketch below).
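As a rough sketch (assuming both models output raw logits and using TensorFlow/Keras ops; the temperature value is an arbitrary choice you would tune), the soft targets and the distillation loss can be computed like this:

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, T=3.0):
    """KL divergence between the temperature-softened teacher and student distributions."""
    soft_teacher = tf.nn.softmax(teacher_logits / T, axis=-1)
    soft_student = tf.nn.softmax(student_logits / T, axis=-1)
    # Teacher's soft targets act as the "true" distribution for the KL term
    return tf.keras.losses.KLDivergence()(soft_teacher, soft_student)
```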
Example
In the keras.io example, the student model is trained with a weighted sum of the two losses: a weight of 0.1 on the student loss (against the ground truth) and 0.9 on the distillation loss (against the teacher's soft targets). KL divergence is used as the distillation loss between the soft targets.
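The training step below is a hedged paraphrase of that idea, not the exact Distiller class from the keras.io tutorial; the optimizer, the temperature value, integer class labels, and the `teacher`/`student` models from the earlier sketch are all assumptions.

```python
import tensorflow as tf
from tensorflow import keras

alpha, T = 0.1, 3.0  # 0.1 / 0.9 weighting as in the keras.io example; T is illustrative
student_loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
distill_loss_fn = keras.losses.KLDivergence()
optimizer = keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    teacher_logits = teacher(x, training=False)  # teacher stays frozen
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        # Loss against the ground-truth labels
        student_loss = student_loss_fn(y, student_logits)
        # Loss against the teacher's temperature-softened predictions
        distill_loss = distill_loss_fn(
            tf.nn.softmax(teacher_logits / T, axis=-1),
            tf.nn.softmax(student_logits / T, axis=-1),
        )
        loss = alpha * student_loss + (1.0 - alpha) * distill_loss
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```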
Paper: "Distilling the Knowledge in a Neural Network" by Hinton et al. (2015).
Do share your wonderful feedback.