AAAI Publications, The Thirty-Second International Flairs Conference

Effect of Domain Corpus Size and LSA Vector Dimension: A Study in Assessing Student Generated Short Texts in Virtual Internships Without Participant Data
Dipesh Gautam, Zhiqiang Cai, Vasile Rus

Last modified: 2019-05-04


Semantic similarity is a widely used automated approach to tasks such as essay grading, answer assessment, text summarization, and information retrieval. Many semantic similarity methods rely on a semantic representation such as Latent Semantic Analysis (LSA), an unsupervised method for inferring vectorial semantic representations of words or of larger texts such as documents. Two key ingredients in obtaining LSA vectorial representations are the corpus of texts from which the vectors are derived and the dimensionality of the resulting space. In this work, we investigate the effect of corpus size and vector dimensionality on assessing student-generated content in advanced learning systems, namely virtual internships. Automating the assessment of student-generated content would greatly increase the scalability of virtual internships to millions of learners at reasonable cost. Prior work on automated assessment of notebook entries relied on classifiers trained on participant data. However, when a virtual internship is created for a new domain, no participant data is available a priori. Here, we report on our effort to develop an LSA-based assessment method that requires no student data. Furthermore, we investigate the optimum corpus size and vector dimensionality for these LSA-based methods.
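To make the two ingredients concrete, the following is a minimal sketch of LSA-style similarity scoring, not the authors' implementation. The toy corpus, the stopword list, the additive composition of word vectors into a text vector, the cosine comparison, and all function names are illustrative assumptions; the abstract does not specify these details. A pure-Python power iteration stands in for a library SVD so that the example has no dependencies.

```python
import math
import random
from collections import Counter

# Hypothetical stopword list and toy domain corpus (no student data involved).
STOPWORDS = {"a", "an", "and", "the", "with", "of"}
CORPUS = [
    "engineers design the prototype",
    "engineers test the prototype",
    "students design and test the prototype",
    "the chef cooks pasta",
    "the chef serves pasta with sauce",
]

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def term_doc_matrix(corpus):
    """Return (term -> row index, matrix of raw term counts per document)."""
    vocab = sorted({w for doc in corpus for w in tokenize(doc)})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(corpus) for _ in vocab]
    for j, doc in enumerate(corpus):
        for w, count in Counter(tokenize(doc)).items():
            matrix[index[w]][j] = float(count)
    return index, matrix

def right_singular_vectors(matrix, k, iters=300, seed=0):
    """Approximate the top-k right singular vectors of `matrix` by power
    iteration on the document Gram matrix, orthogonalizing each iterate
    against the vectors already found."""
    n_docs = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n_docs)]
            for i in range(n_docs)]
    rng = random.Random(seed)
    basis = []
    for _ in range(min(k, n_docs)):
        v = [rng.random() for _ in range(n_docs)]
        usable = True
        for _ in range(iters):
            v = [sum(g * x for g, x in zip(row, v)) for row in gram]
            for b in basis:
                proj = sum(x * y for x, y in zip(v, b))
                v = [x - proj * y for x, y in zip(v, b)]
            norm = math.sqrt(sum(x * x for x in v))
            if norm < 1e-12:  # no further direction left (rank exhausted)
                usable = False
                break
            v = [x / norm for x in v]
        if not usable:
            break
        basis.append(v)
    return basis

def lsa_term_vectors(corpus, k):
    """Map each term to its k-dimensional vector (a row of A V_k = U_k S_k)."""
    index, matrix = term_doc_matrix(corpus)
    basis = right_singular_vectors(matrix, k)
    return {w: [sum(a * x for a, x in zip(matrix[i], v)) for v in basis]
            for w, i in index.items()}

def text_vector(text, term_vectors):
    """Compose a text vector by summing the vectors of its known words."""
    dim = len(next(iter(term_vectors.values())))
    vec = [0.0] * dim
    for w in tokenize(text):
        if w in term_vectors:
            vec = [a + b for a, b in zip(vec, term_vectors[w])]
    return vec

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

vectors = lsa_term_vectors(CORPUS, k=2)  # k is the LSA vector dimension
expected = text_vector("students design a prototype", vectors)
related = cosine(expected, text_vector("engineers test the design", vectors))
unrelated = cosine(expected, text_vector("the chef serves sauce", vectors))
print(related, unrelated)
```

In this sketch, `CORPUS` and `k` are the two knobs the abstract refers to: enlarging or changing the corpus alters which words receive vectors, and changing `k` alters how many latent dimensions are retained. On this toy corpus the on-topic entry should score well above the off-topic one.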
