When deploying an object detection system in an application, model size is one of the key issues. We distill the knowledge of a teacher, a full-scale architecture with good performance, into our small-scale student network. In addition to a KL divergence loss, we propose a cosine similarity loss on foreground features that encourages the student to learn the direction of the teacher’s features. This leads to efficient and robust learning from the teacher model. We also propose an adaptive learning criterion under which the student learns from the teacher only when the teacher performs better than the student. The proposed student model achieves an improvement of 34.85% with a ResNet34 backbone and 39.67% with a ResNet18 backbone when the teacher model uses ResNet50.
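As a rough illustration of how the three ingredients above might fit together, the following minimal sketch combines a KL divergence term, a cosine similarity term restricted to foreground features, and an adaptive gate. It is not the implementation used in this work. All function names, the loss weights `w_kl` and `w_cos`, and the choice to compare teacher and student by their task loss on the same sample are assumptions made for illustration only.

```python
import math

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student) between two discrete class distributions."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(p_teacher, p_student))

def cosine_loss(f_teacher, f_student, eps=1e-12):
    """1 - cosine similarity; depends only on feature direction, not magnitude."""
    dot = sum(t * s for t, s in zip(f_teacher, f_student))
    norm_t = math.sqrt(sum(t * t for t in f_teacher))
    norm_s = math.sqrt(sum(s * s for s in f_student))
    return 1.0 - dot / (norm_t * norm_s + eps)

def foreground_cosine_loss(teacher_feats, student_feats, fg_mask):
    """Average the cosine loss over locations marked as foreground."""
    losses = [cosine_loss(t, s)
              for t, s, is_fg in zip(teacher_feats, student_feats, fg_mask)
              if is_fg]
    return sum(losses) / len(losses) if losses else 0.0

def total_loss(student_task_loss, teacher_task_loss,
               p_teacher, p_student,
               teacher_feats, student_feats, fg_mask,
               w_kl=1.0, w_cos=1.0):
    """Student loss with an adaptive gate on the distillation terms.

    Assumption: "teacher performs better" is approximated here by the
    teacher having a lower task loss than the student on the same sample.
    """
    if teacher_task_loss >= student_task_loss:
        # Teacher is not better here, so the student ignores it.
        return student_task_loss
    kd = (w_kl * kl_divergence(p_teacher, p_student)
          + w_cos * foreground_cosine_loss(teacher_feats, student_feats, fg_mask))
    return student_task_loss + kd
```

Because the cosine term ignores feature magnitude, a student with smaller or differently scaled activations is penalized only for pointing in a different direction than the teacher, which is one plausible reading of why such a loss suits a smaller backbone.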