**Problem**

Your model is too big to run on an edge device, and it’s difficult to get comparable performance from a smaller model trained from scratch.

**Solution**

*Knowledge distillation*

In knowledge distillation, you teach a smaller model using what a bigger model has already learned.

Here is how to use the larger model (often called **the teacher model**) to train a smaller one (the **student model**).

What we have: a **pre-trained** teacher model, data

What we need to do: train the student model

**How**

Forward pass the data through both the teacher model and the student model. For backpropagation, the objective combines:


* *A student loss function:* the deviation of the student’s prediction from the ground truth.
* *A distillation loss function:* the deviation from the soft targets of the teacher model. The soft targets are produced by scaling the logits with a factor called the temperature, T.

The soft targets:

**Teacher model:** Softmax(Prediction[teacher model] / T)

**Student model:** Softmax(Prediction[student model] / T)

**Distillation loss:** the deviation/difference between the soft targets of the student and the teacher model.
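
As a minimal sketch of these two quantities (assuming TensorFlow/Keras, in line with the keras.io example referenced below; the function names and the default temperature here are illustrative):

```python
import tensorflow as tf

def soft_targets(logits, temperature):
    # Soft targets: softmax of the logits scaled down by the temperature T
    return tf.nn.softmax(logits / temperature, axis=-1)

def distillation_loss(teacher_logits, student_logits, temperature=3.0):
    # Deviation between the two soft-target distributions,
    # measured here with KL divergence (as in the keras.io example)
    kl = tf.keras.losses.KLDivergence()
    return kl(
        soft_targets(teacher_logits, temperature),
        soft_targets(student_logits, temperature),
    )
```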

**Example**

In the example on keras.io, the student model is trained with a weighted objective: a fraction (0.1) of the loss comes from the ground truth, and the remaining part (0.9) comes from the soft targets of the teacher model. KL divergence is used as the distillation loss between the soft targets. A condensed sketch of that setup follows.
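
This sketch is adapted from the keras.io knowledge-distillation example; the `Distiller` wrapper and the default `alpha=0.1` / `temperature=3.0` values follow that example, so treat it as an illustration rather than a drop-in implementation:

```python
import tensorflow as tf
from tensorflow import keras

class Distiller(keras.Model):
    """Trains a student against both ground truth and a frozen teacher."""

    def __init__(self, student, teacher, alpha=0.1, temperature=3.0):
        super().__init__()
        self.student = student
        self.teacher = teacher
        self.alpha = alpha              # weight of the ground-truth (student) loss
        self.temperature = temperature  # T used to soften the logits
        self.student_loss_fn = keras.losses.SparseCategoricalCrossentropy(
            from_logits=True
        )
        self.distillation_loss_fn = keras.losses.KLDivergence()

    def train_step(self, data):
        x, y = data
        # Forward pass through the teacher; its weights are not updated
        teacher_logits = self.teacher(x, training=False)

        with tf.GradientTape() as tape:
            student_logits = self.student(x, training=True)
            # Student loss: deviation from the ground truth
            student_loss = self.student_loss_fn(y, student_logits)
            # Distillation loss: KL divergence between the soft targets;
            # the T**2 factor keeps gradients comparable across temperatures
            distillation_loss = self.distillation_loss_fn(
                tf.nn.softmax(teacher_logits / self.temperature, axis=-1),
                tf.nn.softmax(student_logits / self.temperature, axis=-1),
            ) * self.temperature**2
            loss = (
                self.alpha * student_loss
                + (1 - self.alpha) * distillation_loss
            )

        # Backpropagate through the student only
        grads = tape.gradient(loss, self.student.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.student.trainable_variables))
        return {"loss": loss}
```

After `distiller = Distiller(student, teacher)` and `distiller.compile(optimizer=keras.optimizers.Adam())`, calling `distiller.fit(x_train, y_train)` updates only the student’s weights; the teacher is used purely to produce soft targets.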

Reference: Hinton et al., *Distilling the Knowledge in a Neural Network* (2015).

Do share your feedback.