Abstract: Distributional semantic models represent the meanings of words as vectors. We introduce a selection method that learns a vector space in which each dimension corresponds to a natural word. The method starts from the most frequent words in the corpus and selects the subset that yields the best performance. Because every dimension is itself a word, the resulting space is directly interpretable; this is the main advantage of the method over matrix factorization methods such as NMF and over neural embedding models. We apply the method to the ukWaC corpus and train a vector space with N = 1500 basis words. We report test results on the MEN, RG-65, SimLex-999, and WordSim353 word similarity gold datasets. The results also show that reducing the number of basis vectors from 5000 to 1500 lowers accuracy by only about 1.5-2%, so good interpretability is achieved without a large penalty. Interpretability evaluations indicate that the word vectors obtained by the proposed method with N = 1500 are more interpretable than those of word embedding models and of the baseline method. We also list the top 15 of the 1500 selected basis words in this paper.
Keywords: distributional semantic vectors; basis vectors; basis words; interpretability; word selection method; projection function
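As a rough illustration of the selection idea summarized in the abstract, the sketch below shows a greedy forward selection over frequency-ranked candidate words. This is a minimal reconstruction, not the authors' exact procedure: the `evaluate` callable (standing in for a word-similarity benchmark score) and all names here are illustrative assumptions.

```python
import random


def select_basis_words(candidates, evaluate, target_size):
    """Greedily grow a basis from frequency-ranked candidate words.

    candidates  -- words ordered by corpus frequency, most frequent first
    evaluate    -- callable mapping a list of basis words to a task score
                   (hypothetical stand-in for accuracy on MEN, SimLex-999, etc.)
    target_size -- desired number of basis dimensions (e.g. 1500)
    """
    basis = []
    best_score = float("-inf")
    for word in candidates:
        trial = basis + [word]
        trial_score = evaluate(trial)
        # Keep the candidate only if it does not hurt task performance.
        if trial_score >= best_score:
            basis = trial
            best_score = trial_score
        if len(basis) == target_size:
            break
    return basis


if __name__ == "__main__":
    # Toy usage with a dummy scorer in place of a real benchmark.
    rng = random.Random(0)
    words = [f"w{i}" for i in range(50)]
    print(select_basis_words(words, lambda basis: rng.random(), target_size=10))
```

Under these assumptions the loop stops once the basis reaches the target size, so the cost is bounded by one evaluation per candidate word considered.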