Nowadays images on social networking websites (e.g., Flickr) are mostly accompanied with user-contributed tags, which help cast a new light on the conventional content-based image analysis tasks such as image classification and retrieval. In order to establish a scalable social image analysis system, two issues need to be considered: 1) Supervised learning is a futile task in modeling the enormous number of concepts in the world, whereas unsupervised approaches overcome this hurdle; 2) Algorithms are required to be both spatially and temporally efficient to handle large-scale datasets. In this paper, we propose a cross-view feature learning (CVFL) framework to handle the problem of social image analysis effectively and efficiently. Through explicitly modeling the relevance between image content and tags (which is empirically shown to be visually and semantically meaningful), CVFL yields more promising results than existing methods in the experiments. More importantly, being general and descriptive, CVFL and its variants can be readily applied to other large-scale multi-view tasks in unsupervised setting.