How do people describe clothing? Words such as "formal" or "casual" are commonly used. However, recent work mostly focuses on accurately recognizing or extracting visual features (e.g., sleeve length, color distribution, and clothing pattern) from clothing images. How can we bridge the gap between visual features and aesthetic words? In this paper, we formulate this task as a novel three-level framework: visual features (VF) - image-scale space (ISS) - aesthetic words space (AWS). Using the image-scale space from the art field as an intermediate layer, we first propose a Stacked Denoising Autoencoder Guided by Correlative Labels (SDAE-GCL) to map the visual features to the image-scale space; then, according to the semantic distances computed by WordNet::Similarity, we also map the aesthetic words most often used in online clothing shops to the image-scale space. Experiments on upper-body menswear images downloaded from several global online clothing shops indicate that the proposed three-level framework captures the subtle relationship between visual features and aesthetic words better than several baselines. To demonstrate that our three-level framework and its implementation methods are universally applicable, we finally present some interesting analyses of menswear fashion trends over the last 10 years.