
Creating Custom Word Embeddings with t-SNE 2D Visualizations, and Retraining GloVe Vectors on Your Own Data (with Code)

Rana singh
6 min read · Jan 31, 2021


Steps:

  1. Introduction
  2. Train our own word embedding (code)
  3. Phrases (bigrams)
  4. t-SNE visualizations in 2D
  5. Retrain GloVe vectors on top of our own data

Introduction:

Word embedding is one of the most popular representations of document vocabulary. It is capable of capturing the context of a word in a document, semantic and syntactic similarity, relation with other words, etc.

Word2Vec is one of the most popular techniques to learn word embeddings using a shallow neural network. It was developed by Tomas Mikolov in 2013 at Google.

After playing around with GloVe, you will quickly find that certain words in your training data are not present in its vocabulary. These are typically replaced with a zero vector of the same shape, which essentially means you are ‘sacrificing’ that word as an input feature, even though it may be important for correct prediction. Another way to deal with this is to train your own word embeddings on your training data, so that the semantic relationships of your own corpus are better represented.

Why do we need Word2Vec?

Word2Vec is a method to construct such an embedding. It can be trained using two methods (both involving neural networks): Skip-gram and Continuous Bag of Words (CBOW). For more detail, see Section 10 of the article below.

Who wins?

Both have their own advantages and disadvantages. According to Mikolov, Skip-gram works well with a small amount of data and is found to represent rare words well.

On the other hand, CBOW is faster and has better representations for more frequent words.
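In Gensim's Word2Vec this choice is made with the sg flag (sg=0 gives CBOW, the default; sg=1 gives Skip-gram). A minimal sketch, with placeholder toy sentences:

    from gensim.models import Word2Vec

    toy_sentences = [["students", "need", "books"], ["teachers", "love", "science"]]  # placeholder corpus

    # CBOW (sg=0, the default): faster, better for frequent words
    cbow_model = Word2Vec(toy_sentences, sg=0, vector_size=100, min_count=1)  # `size` in Gensim 3.x

    # Skip-gram (sg=1): slower, but represents rare words better
    sg_model = Word2Vec(toy_sentences, sg=1, vector_size=100, min_count=1)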

Let's create our own word embedding (code):

Reading the DonorsChoose text data and cleaning it for embedding

Word2vec is a self-supervised method (sort of unsupervised, but not quite, since it provides its own labels; check out this Quora thread for a more detailed explanation), so we can make full use of the entire dataset (including the test data) to obtain a richer word-embedding representation.
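A rough sketch of this reading and cleaning step is below. The file and column names are assumptions (adjust them to your own copy of the DonorsChoose data); spaCy is used here for lemmatization and stop-word removal:

    import pandas as pd
    import spacy

    # spaCy pipeline for lemmatization and stop-word removal (NER/parser disabled for speed)
    nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

    # Hypothetical file/column names -- replace with your own DonorsChoose files
    df = pd.read_csv("train.csv")
    raw_texts = df["project_essay_1"].astype(str)

    def clean(doc):
        # keep lower-cased lemmas of alphabetic, non-stop-word tokens
        return [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]

    # nlp.pipe streams the documents through spaCy efficiently
    sentences = [clean(doc) for doc in nlp.pipe(raw_texts, batch_size=500)]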

Phrases (bigrams):

We are using the Gensim Phrases package to automatically detect common phrases (bigrams) from a list of sentences (a short usage sketch follows the parameter list below): https://radimrehurek.com/gensim/models/phrases.html

The main reason we do this is to catch words like “mr_burns” or “bart_simpson”!

Phrases parameters:

  • sentences (iterable of list of str, optional) — The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network, See BrownCorpus, Text8Corpus or LineSentence for such examples.
  • min_count (float, optional) — Ignore all words and bigrams with total collected count lower than this value.
  • threshold (float, optional) — Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter.
  • max_vocab_size (int, optional) — Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.
  • delimiter (str, optional) — Glue character used to join collocation tokens.
  • scoring ({‘default’, ‘npmi’, function}, optional) — Specify how potential phrases are scored. scoring can be set either with a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available: ‘default’ (the original scoring from Mikolov et al.) and ‘npmi’ (normalized pointwise mutual information).
  • connector_words (set of str, optional) — Set of words that may be included within a phrase, without affecting its scoring. No phrase can start nor end with a connector word; a phrase may contain any number of connector words in the middle.
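A minimal usage sketch, assuming sentences is the list of token lists produced during cleaning (the min_count and threshold values are just examples):

    from gensim.models.phrases import Phrases, Phraser

    # learn bigram statistics from the tokenised corpus
    phrases = Phrases(sentences, min_count=30, threshold=10)

    # freeze into a lighter, faster object used only for transformation
    bigram = Phraser(phrases)

    # apply: frequent pairs such as ("high", "school") become "high_school"
    sentences = [bigram[sent] for sent in sentences]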

Let's do a sanity check of the effectiveness of the lemmatization, removal of stopwords, and addition of bigrams.
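One quick way to do this sanity check is to look at the most frequent tokens after preprocessing, for example:

    from collections import Counter

    word_freq = Counter(tok for sent in sentences for tok in sent)
    # stop words should be gone, and joined bigrams (e.g. "high_school") should show up
    print(word_freq.most_common(10))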

Training the model

We use the Gensim implementation of Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html

The parameters:

  • min_count = int - Ignores all words with total absolute frequency lower than this - (2, 100)
  • window = int - The maximum distance between the current and predicted word within a sentence. E.g. window words on the left and window words on the right of our target - (2, 10)
  • size = int - Dimensionality of the feature vectors (renamed vector_size in Gensim 4) - (50, 300)
  • sample = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
  • alpha = float - The initial learning rate - (0.01, 0.05)
  • min_alpha = float - Learning rate will linearly drop to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
  • negative = int - If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)
  • workers = int - Use this many worker threads to train the model (= faster training with multicore machines)

Word2Vec()

STEP 1. Word2Vec() initialization:

In this first step, I set up the parameters of the model one-by-one.
I do not supply the parameter sentences, and therefore leave the model uninitialized, purposefully.
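A sketch of this first step, using values inside the ranges listed above (the exact numbers are placeholders, not tuned results):

    import multiprocessing
    from gensim.models import Word2Vec

    cores = multiprocessing.cpu_count()

    w2v_model = Word2Vec(
        min_count=20,
        window=2,
        vector_size=300,   # called `size` in Gensim 3.x
        sample=6e-5,
        alpha=0.03,
        min_alpha=0.0007,
        negative=20,
        workers=cores - 1,
    )  # no `sentences` passed, so the model is deliberately left untrained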

STEP 2. Building the Vocabulary and Training of the model:

Word2Vec requires us to build the vocabulary table (simply digesting all the words, keeping the unique ones, and doing some basic counts on them) before training.
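Roughly, the two calls look like this (the epoch count is just an example):

    # build the vocabulary table from the (bigrammed) sentences
    w2v_model.build_vocab(sentences, progress_per=10000)

    # train; total_examples and epochs must be passed explicitly
    w2v_model.train(
        sentences,
        total_examples=w2v_model.corpus_count,
        epochs=30,
        report_delay=1,
    )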

Exploring the model

Most similar and Similarities:

Here, we will ask our model to find the words most similar to some of the most iconic characters in the text. With similarity, we will see how similar two words are to each other.
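For example (the query words here are illustrative, not taken from the article's actual output):

    # words closest to a query word
    print(w2v_model.wv.most_similar(positive=["student"], topn=10))

    # cosine similarity between two words
    print(w2v_model.wv.similarity("teacher", "student"))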

t-SNE visualizations:

t-SNE is a non-linear dimensionality reduction algorithm that attempts to represent high-dimensional data and the underlying relationships between vectors in a lower-dimensional space.

Our goal in this section is to plot our 300-dimensional vectors in a two-dimensional graph and see if we can spot interesting patterns. For that we are going to use the t-SNE implementation from scikit-learn.

To make the visualizations more relevant, we will look at the relationships between a query word (in red), its most similar words in the model (in blue), and other words from the vocabulary (in green).

This will compare where the vector representation of Homer, his 10 most similar words from the model, as well as 8 random ones, lie in a 2D graph.
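A sketch of such a plotting helper with scikit-learn's TSNE is below (colour scheme as described above; the perplexity value is just a small-sample choice):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def tsnescatterplot(model, word, other_words):
        """Plot a query word (red), its 10 most similar words (blue),
        and a few other words (green) in 2D using t-SNE."""
        vectors = [model.wv[word]]
        labels, colors = [word], ["red"]

        for similar_word, _ in model.wv.most_similar(word, topn=10):
            vectors.append(model.wv[similar_word])
            labels.append(similar_word)
            colors.append("blue")

        for other in other_words:
            vectors.append(model.wv[other])
            labels.append(other)
            colors.append("green")

        # t-SNE needs perplexity < number of points (here 11 + len(other_words))
        coords = TSNE(n_components=2, random_state=0, perplexity=5).fit_transform(np.array(vectors))

        plt.figure(figsize=(8, 8))
        plt.scatter(coords[:, 0], coords[:, 1], c=colors)
        for (x, y), label in zip(coords, labels):
            plt.annotate(label, (x, y))
        plt.show()

Calling it with a query word and a handful of random vocabulary words, e.g. tsnescatterplot(w2v_model, "homer", random_words), produces the kind of 2D plot discussed next.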

Interestingly, the 10 most similar words to Homer end up around him, as do Quite and (Sideshow) Homework, two recurrent characters.

How to retrain GloVe vectors on top of your own data?

Suppose you have a corpus (say, mydata.txt) that contains new words which are not in the existing GloVe vocabulary. How do we retrain GloVe so that the pre-trained vectors are extended to cover the new words in mydata.txt?

  • Create a new instance of a GloVe model with the old_words and new_words as vocabulary.
  • Replace the initial vectors/biases of the old_words with the ones you have already.
  • Train this model on mydata.txt.

The new old_words representations won’t be the same but will be highly influenced by the old ones.
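Gensim does not retrain GloVe directly, so one practical way to follow the three steps above is the third-party mittens package, which fine-tunes GloVe vectors from a warm start. The sketch below is an assumption-laden outline, not the article's original code: the GloVe file name is a placeholder, and the co-occurrence matrix is built in a crude document-level way (proper GloVe training uses windowed co-occurrence counts):

    import numpy as np
    from mittens import Mittens                      # pip install mittens
    from sklearn.feature_extraction.text import CountVectorizer

    # 1. Load the pre-trained GloVe vectors into a {word: vector} dict
    def load_glove(path):
        emb = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                emb[parts[0]] = np.array(parts[1:], dtype="float32")
        return emb

    pretrained = load_glove("glove.6B.100d.txt")      # placeholder file name

    # 2. Build the vocabulary (old + new words) and a co-occurrence matrix from mydata.txt.
    #    This document-level count is only a rough stand-in for windowed co-occurrences.
    with open("mydata.txt", encoding="utf-8") as f:
        docs = f.read().splitlines()
    cv = CountVectorizer()
    counts = cv.fit_transform(docs)
    cooccurrence = (counts.T @ counts).toarray().astype(float)
    vocab = cv.get_feature_names_out().tolist()       # get_feature_names() on older scikit-learn

    # 3. Fine-tune: words already in GloVe start from their pre-trained vectors,
    #    new words are learned from scratch on mydata.txt
    mittens_model = Mittens(n=100, max_iter=1000)     # n must match the GloVe dimension
    new_embeddings = mittens_model.fit(
        cooccurrence,
        vocab=vocab,
        initial_embedding_dict=pretrained,
    )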

Thanks!

Reference:

Mikolov, T., Chen, K., Corrado, G.S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781.

https://www.kaggle.com/chewzy/tutorial-how-to-train-your-custom-word-embedding

https://datascience.stackexchange.com/questions/33792/how-to-retrain-glove-vectors-on-top-of-my-own-data
