Knowledge distillation is an effective tool for compressing large pre-trained convolutional neural networks (CNNs), or ensembles of them, into models suitable for mobile and embedded devices. However, given an expected FLOPs or latency budget, existing methods rely on hand-crafted heuristics. They pre-define the target student network for knowledge distillation, which may be sub-optimal because finding a powerful student in the large design space requires much effort. In this paper, we develop a novel teacher-guided neural architecture search method that directly searches for a student network with flexible channel and layer sizes. Specifically, we define the search space as the number of channels and layers; each size is sampled from a probability distribution that is learned by minimizing the search objective of the student network. The size with the maximum probability in each distribution serves as the final searched width and depth of the target student network. Extensive experiments on a variety of face recognition benchmarks demonstrate the superiority of our method over state-of-the-art alternatives.
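The sampling and selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate sizes, the logit values, and the helper names `softmax`, `sample_size`, and `select_size` are all hypothetical, and the step that learns the logits by minimizing the search objective is omitted. The sketch only shows how a size might be drawn from a categorical distribution during search and how the most probable size is kept at the end.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_size(candidates, logits, rng):
    """Draw one candidate size according to the current distribution."""
    probs = softmax(logits)
    return rng.choices(candidates, weights=probs, k=1)[0]


def select_size(candidates, logits):
    """Return the candidate size with the maximum probability."""
    probs = softmax(logits)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best]


# Hypothetical search space for one layer group (assumed values).
channel_candidates = [16, 32, 48, 64]
depth_candidates = [1, 2, 3]

# Hypothetical logits; in a real search these would be learned.
channel_logits = [0.1, 1.2, 0.3, -0.5]
depth_logits = [0.0, 0.4, 0.9]

rng = random.Random(0)

# During search: sample a width to build a candidate student.
sampled_width = sample_size(channel_candidates, channel_logits, rng)

# After search: keep the most probable width and depth.
final_width = select_size(channel_candidates, channel_logits)
final_depth = select_size(depth_candidates, depth_logits)
```

With these assumed logits, the selection step keeps the width and depth whose logits are largest, since softmax preserves the ordering of its inputs.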