Due to the large variation and richness of visual inputs, statistical learning gets more and more concerned in the practice of visual processing such as visual tracking and recognition. Statistical models can be trained from a large set of training data. However, in many cases, since it is not trivial to obtain a large labeled and representative training data set, it would be difficult to obtain a satisfactory generalization. Another difficulty is how to automatically select good features for representation. By combining both labeled and unlabeled training data, this paper proposes a new learning paradigm, self-supervised learning, to investigate the issues of learning bootstrapping and model transduction. Inductive learning and transductive learning are the two main cases of self-supervised learning, in which the proposed algorithm, Discriminant-EM (D-EM), is a specific learning technique. Vision-based gesture interface is employed as a testbed in our research.