Monday, 16 March, 2020 - 14:00

Improving Word Representations by Introducing Atomic Subwords in fastText -- CANCELLED

Ebrahim Ansari (ÚFAL MFF UK)


In this seminar, we review part of our recent work on improving fastText. In recent years, word embeddings have been applied successfully to many NLP tasks. fastText improved on the accuracy of word2vec, the well-known continuous word representation approach, by learning vectors for the character n-grams of each word, so that word vectors also carry subword information. In the first part of our work, we present a modification of the fastText model in which selected n-grams are treated as atomic, in the sense that they are not decomposed further into smaller n-grams. We find these morpheme-like segments using a scoring function that relies on word vectors computed by a fastText model trained on a small portion of our corpus. We then retrain the fastText model on the whole corpus, this time treating the top-scoring segments as unigrams. In the second part of this work, we propose a new version of fastText that implicitly distinguishes prefix and suffix n-grams by taking into account their position within the word.
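
To make the idea of atomic subwords concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation presented in the talk: the function names, the greedy longest-match splitting, and the example segment "un" are all hypothetical, and the scoring function that selects atomic segments is not reproduced because the abstract does not specify it. The parameters minn and maxn are named after fastText's n-gram length options.

```python
def unit_ngrams(units, minn, maxn):
    """Return all n-grams of length minn..maxn over a sequence of units."""
    grams = []
    for i in range(len(units)):
        for n in range(minn, maxn + 1):
            if i + n > len(units):
                break
            grams.append("".join(units[i:i + n]))
    return grams


def char_ngrams(word, minn=3, maxn=6):
    """Plain character n-grams of a word wrapped in the boundary marks < and >."""
    return unit_ngrams(["<"] + list(word) + [">"], minn, maxn)


def split_atomic(word, atomic):
    """Split a word into units, keeping each atomic segment in one piece.

    Assumption: greedy longest-match from left to right. The talk does not
    say how atomic segments are located inside a word.
    """
    by_length = sorted(atomic, key=len, reverse=True)
    units, i = [], 0
    while i < len(word):
        for segment in by_length:
            if word.startswith(segment, i):
                units.append(segment)
                i += len(segment)
                break
        else:
            # No atomic segment starts here, so fall back to one character.
            units.append(word[i])
            i += 1
    return units


def atomic_ngrams(word, atomic, minn=3, maxn=6):
    """n-grams in which every atomic segment counts as a single unit."""
    units = ["<"] + split_atomic(word, atomic) + [">"]
    return unit_ngrams(units, minn, maxn)


print(char_ngrams("where", 3, 3))
print(atomic_ngrams("unhappy", {"un"}, 3, 3))
```

In this sketch, the segment "un" is never split, so an n-gram such as "nha" cannot occur for "unhappy", while "unha" (three units: "un", "h", "a") can. How such n-grams are then embedded and trained is left to the model itself.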