AAAI Publications, Thirtieth AAAI Conference on Artificial Intelligence

Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching
Michael Wick, Pallika Kanani, Adam Pocock

Last modified: 2016-03-05


We present a method that consumes a large corpus of multilingual text and produces a single, unified word embedding in which the word vectors generalize across languages. In contrast to current approaches, which require language identification, our method is agnostic about the languages in which the documents of the corpus are expressed, and it does not rely on parallel corpora to constrain the spaces. Instead, we utilize a small set of human-provided word translations, which are often freely and readily available. We could encode such word translations as hard constraints in the model's objective function; however, we find that we can more naturally constrain the space by allowing words in one language to borrow distributional statistics from context words in another language. We achieve this via a process we term artificial code-switching. As the name suggests, we induce code-switching so that words from multiple languages appear in contexts together. Not only do embedding models trained on code-switched data learn common cross-lingual structure, but this common structure also allows an NLP model trained in a source language to generalize to multiple target languages, achieving up to 80% of the accuracy of models trained with target-language data.
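The core idea of artificial code-switching can be sketched as a simple corpus transformation: given a small bilingual dictionary, randomly replace words in the training text with their translations so that words from both languages occur in shared contexts. The sketch below is illustrative only (the dictionary entries, the `code_switch` function, and the replacement rate are hypothetical choices, not the authors' implementation).

```python
import random

# Toy bilingual dictionary (hypothetical entries for illustration).
TRANSLATIONS = {
    "house": ["casa"],
    "dog": ["perro"],
    "good": ["bueno"],
}

def code_switch(tokens, translations, rate=0.3, rng=None):
    """Randomly replace tokens that have a dictionary translation,
    so that words from both languages share contexts in the output.

    rate: probability of switching a token that has a translation.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    out = []
    for tok in tokens:
        if tok in translations and rng.random() < rate:
            out.append(rng.choice(translations[tok]))
        else:
            out.append(tok)
    return out

sentence = "the dog sat by the house".split()
# With rate=1.0, every dictionary word is switched:
print(code_switch(sentence, TRANSLATIONS, rate=1.0))
# → ['the', 'perro', 'sat', 'by', 'the', 'casa']
```

An embedding model (e.g. a standard skip-gram trainer) run on such mixed text then sees "perro" and "dog" in overlapping contexts, which pulls their vectors together without any parallel corpus.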


NLP; word embeddings; multilingual; sentiment analysis; artificial code switching
