Introduction to Word Embeddings

Word embeddings are one of the most important breakthroughs in Natural Language Processing (NLP). They transform words from discrete symbols into continuous vectors of real numbers, capturing semantic meaning and relationships between words in a way that computers can understand and process.

Before embeddings, words were represented using one-hot encoding, where each word is a vector of zeros with a single 1. For a vocabulary of 10,000 words, this creates sparse vectors of 10,000 dimensions. Embeddings compress this into dense vectors of typically 100-300 dimensions while capturing meaning.
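
As a quick illustration of the difference (a minimal NumPy sketch with made-up values rather than learned ones):

import numpy as np

# Toy vocabulary of four words
vocab = ["king", "queen", "apple", "banana"]

# One-hot: one dimension per vocabulary word, a single 1, all other entries 0
one_hot_king = np.zeros(len(vocab))
one_hot_king[vocab.index("king")] = 1
print(one_hot_king)        # [1. 0. 0. 0.] -- the vector grows with the vocabulary

# Dense embedding: a fixed, small number of dimensions whose values are learned;
# random numbers are used here purely for illustration
dense_king = np.random.rand(100)
print(dense_king.shape)    # (100,) regardless of vocabulary size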

The three most popular word embedding techniques are Word2Vec (Google, 2013), GloVe (Stanford, 2014), and FastText (Facebook, 2016), each with unique strengths and use cases.

Why Use Word Embeddings?

Word embeddings solve fundamental problems in NLP:

  • Semantic similarity: Similar words have similar vectors (king ≈ queen, Paris ≈ London)
  • Reduced dimensionality: Dense 300-dimensional vectors vs. sparse 10,000+ dimensions
  • Transfer learning: Pre-trained embeddings capture general language understanding
  • Arithmetic relationships: Vector math captures relationships (king - man + woman ≈ queen)
  • Better ML performance: Improved accuracy in classification, clustering, and generation tasks
  • Out-of-vocabulary handling: FastText can handle unseen words using subword information

Word2Vec: Skip-gram and CBOW

Word2Vec, developed by Tomas Mikolov at Google in 2013, introduced two efficient architectures for learning word embeddings from large text corpora:

1. CBOW (Continuous Bag of Words)

CBOW predicts a target word from its surrounding context words. It's faster to train and works slightly better for frequent words, making it a good default for larger datasets.

# Example sentence: "I love machine learning models"
# Context: "I love machine [TARGET] models"
# The model learns to predict the middle word "learning" from the surrounding words

2. Skip-gram

Skip-gram does the opposite: it predicts context words from a target word. It's slower but works better with rare words and smaller datasets.

# Example sentence: "I love machine learning models"
# Target: "learning"
# Predictions (context words): "I", "love", "machine", "models"

Training Word2Vec

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')  # tokenizer data required by word_tokenize (downloads once if missing)

# Sample corpus
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning is a subset of artificial intelligence",
    "deep learning uses neural networks with many layers",
    "natural language processing enables computers to understand text"
]

# Tokenize sentences
tokenized = [word_tokenize(sent.lower()) for sent in sentences]

# Train Word2Vec model
# vector_size: dimension of word vectors
# window: maximum distance between current and predicted word
# min_count: ignore words with frequency lower than this
# sg: 0 for CBOW, 1 for Skip-gram
model = Word2Vec(
    sentences=tokenized,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,  # Use Skip-gram
    workers=4
)

# Save and load model
model.save("word2vec.model")
# model = Word2Vec.load("word2vec.model")

# Get vector for a word
vector = model.wv['learning']
print(f"Vector shape: {vector.shape}")
print(f"First 10 dimensions: {vector[:10]}")

Using Pre-trained Word2Vec

import gensim.downloader as api

# Load pre-trained Google News vectors (100B words, 3M vocab)
# This will download ~1.6GB on first use
model = api.load('word2vec-google-news-300')

# Find similar words
similar = model.most_similar('python', topn=5)
print("Words similar to 'python':")
for word, score in similar:
    print(f"  {word}: {score:.4f}")

# Output:
#   scripting: 0.6899
#   perl: 0.6814
#   java: 0.6489
#   programming: 0.6315
#   ruby: 0.6281

# Word arithmetic
result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=1
)
print(f"king - man + woman = {result[0][0]}")
# Output: queen

# Calculate similarity
similarity = model.similarity('dog', 'cat')
print(f"Similarity between dog and cat: {similarity:.4f}")
# Output: 0.7609

GloVe: Global Vectors for Word Representation

GloVe (Global Vectors), developed at Stanford in 2014, takes a different approach than Word2Vec. Instead of using a prediction-based model, GloVe uses matrix factorization on global word co-occurrence statistics.

The key insight: the ratio of co-occurrence probabilities encodes meaning better than raw probabilities.
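
For reference, GloVe fits word vectors by weighted least squares on log co-occurrence counts; the objective from the original paper can be written as

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} counts how often word j appears in the context of word i, V is the vocabulary size, and f is a weighting function that damps the influence of very rare and very frequent co-occurrences.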

Using Pre-trained GloVe

import gensim.downloader as api
import numpy as np

# Load pre-trained GloVe vectors
# Options: glove-wiki-gigaword-50, 100, 200, 300
#          glove-twitter-25, 50, 100, 200
glove = api.load('glove-wiki-gigaword-100')

# Get word vector
vector = glove['computer']
print(f"Vector shape: {vector.shape}")

# Find similar words
similar = glove.most_similar('algorithm', topn=5)
print("Words similar to 'algorithm':")
for word, score in similar:
    print(f"  {word}: {score:.4f}")

# Analogy task
result = glove.most_similar(
    positive=['paris', 'germany'],
    negative=['france'],
    topn=1
)
print(f"paris - france + germany = {result[0][0]}")
# Output: berlin

Loading GloVe from Text Files

import numpy as np

def load_glove_vectors(file_path):
    """Load GloVe vectors from text file"""
    embeddings_dict = {}

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_dict[word] = vector

    return embeddings_dict

# Load GloVe vectors
# Download from: https://nlp.stanford.edu/projects/glove/
glove_dict = load_glove_vectors('glove.6B.100d.txt')

# Use the vectors
vector = glove_dict.get('learning', None)
if vector is not None:
    print(f"Vector for 'learning': {vector[:10]}")

# Calculate cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

sim = cosine_similarity(glove_dict['king'], glove_dict['queen'])
print(f"Similarity between king and queen: {sim:.4f}")

FastText: Subword Embeddings

FastText, developed by Facebook AI Research in 2016, improves on Word2Vec by representing words as bags of character n-grams. This allows it to generate embeddings for out-of-vocabulary words and better handle rare words and morphologically rich languages.

How FastText Works

Instead of treating "learning" as a single token, FastText breaks it into:

  • Character n-grams (here n = 3, with < and > marking the word boundaries): "<le", "lea", "ear", "arn", "rni", "nin", "ing", "ng>"
  • The full word itself: "<learning>"

The final embedding is the sum of all of its n-gram embeddings, so words that share many n-grams, such as "learner" and "learning", end up with similar vectors.
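
To make this concrete, here is a simplified sketch of how the character n-grams can be extracted (an illustration of the scheme only, not Gensim's internal implementation; the helper name char_ngrams is invented for this example):

def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, with < and > marking its boundaries."""
    token = f"<{word}>"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            ngrams.add(token[i:i + n])
    return ngrams

print(sorted(char_ngrams("learning", min_n=3, max_n=3)))
# ['<le', 'arn', 'ear', 'ing', 'lea', 'ng>', 'nin', 'rni']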

Training FastText

from gensim.models import FastText
from nltk.tokenize import word_tokenize

# Sample corpus
sentences = [
    "machine learning algorithms learn from data",
    "deep learning is a powerful technique",
    "learners must practice consistently",
    "neural networks learn patterns automatically"
]

# Tokenize
tokenized = [word_tokenize(sent.lower()) for sent in sentences]

# Train FastText model
# min_n: min length of char n-grams
# max_n: max length of char n-grams
fasttext_model = FastText(
    sentences=tokenized,
    vector_size=100,
    window=5,
    min_count=1,
    min_n=3,      # Min character n-gram length
    max_n=6,      # Max character n-gram length
    sg=1,         # Skip-gram
    workers=4
)

# Get vector for in-vocabulary word
vector = fasttext_model.wv['learning']
print(f"Vector for 'learning': {vector[:5]}")

# Get vector for out-of-vocabulary word!
# This is the key advantage of FastText
oov_vector = fasttext_model.wv['learnings']  # Word not in training data
print(f"Vector for 'learnings' (OOV): {oov_vector[:5]}")

# FastText can generate meaningful embeddings for typos too
typo_vector = fasttext_model.wv['lerning']  # Typo
print(f"Vector for 'lerning' (typo): {typo_vector[:5]}")

Using Pre-trained FastText

import gensim.downloader as api

# Load pre-trained FastText model
# Available models: fasttext-wiki-news-subwords-300
fasttext = api.load('fasttext-wiki-news-subwords-300')

# Regular word
similar = fasttext.most_similar('science', topn=3)
print("Similar to 'science':")
for word, score in similar:
    print(f"  {word}: {score:.4f}")

# Note: the vectors loaded through gensim.downloader are plain word vectors and do
# not include the subword n-gram model, so they cannot build embeddings for unseen
# words. For true OOV support, use a full FastText model -- for example the
# fasttext_model trained above, or a Facebook .bin file loaded with
# gensim.models.fasttext.load_facebook_vectors("cc.en.300.bin")

# Out-of-vocabulary word (made-up brand name) via the full FastText model
oov_vector = fasttext_model.wv['TechnoAI2024']
print(f"OOV vector shape: {oov_vector.shape}")

# Typo handling: check the downloaded vocabulary before querying it
if 'definately' in fasttext and 'definitely' in fasttext:
    similarity = fasttext.similarity('definately', 'definitely')
    print(f"Similarity between typo and correct: {similarity:.4f}")
    # Subword-aware training keeps such pairs close despite the typo

Practical Applications

1. Text Classification with Embeddings

import numpy as np
from sklearn.linear_model import LogisticRegression
from gensim.models import Word2Vec

# Sample data
texts = [
    "this movie is excellent and amazing",
    "worst film ever made terrible acting",
    "great story wonderful cinematography",
    "boring and predictable waste of time"
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Train Word2Vec
tokenized = [text.split() for text in texts]
w2v = Word2Vec(tokenized, vector_size=100, window=5, min_count=1)

# Create document embeddings by averaging word vectors
def document_vector(text, model):
    """Average word vectors to get document vector"""
    words = text.split()
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)

# Convert texts to vectors
X = np.array([document_vector(text, w2v) for text in texts])
y = np.array(labels)

# Train classifier
classifier = LogisticRegression()
classifier.fit(X, y)

# Predict on new text
new_text = "amazing movie wonderful"
new_vector = document_vector(new_text, w2v).reshape(1, -1)
prediction = classifier.predict(new_vector)
print(f"Sentiment: {'Positive' if prediction[0] == 1 else 'Negative'}")

2. Semantic Search

import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load pre-trained model
model = api.load('glove-wiki-gigaword-100')

# Document corpus
documents = [
    "machine learning algorithms for data analysis",
    "deep neural networks and artificial intelligence",
    "cooking recipes for italian pasta dishes",
    "python programming and software development",
    "quantum physics and particle mechanics"
]

# Convert documents to vectors
def doc_vector(text, model):
    words = text.lower().split()
    vectors = [model[word] for word in words if word in model]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vectors = np.array([doc_vector(doc, model) for doc in documents])

# Search query
query = "AI and neural nets"
query_vector = doc_vector(query, model).reshape(1, -1)

# Calculate similarities
similarities = cosine_similarity(query_vector, doc_vectors)[0]

# Rank documents
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

print(f"Search results for: '{query}'")
for idx, score in ranked:
    print(f"{score:.4f}: {documents[idx]}")

# Output:
# 0.8721: deep neural networks and artificial intelligence
# 0.7234: machine learning algorithms for data analysis
# 0.5012: python programming and software development
# ...

3. Word Analogy Tasks

import gensim.downloader as api

model = api.load('word2vec-google-news-300')

# Define analogy function
def analogy(word1, word2, word3, model, topn=3):
    """
    Solve analogy: word1 is to word2 as word3 is to ?
    Example: king is to queen as man is to woman
    """
    result = model.most_similar(
        positive=[word2, word3],
        negative=[word1],
        topn=topn
    )
    return result

# Examples
analogies = [
    ("king", "queen", "man"),
    ("france", "paris", "germany"),
    ("good", "better", "bad"),
    ("walking", "walked", "swimming")
]

for word1, word2, word3 in analogies:
    result = analogy(word1, word2, word3, model, topn=1)
    answer = result[0][0]
    score = result[0][1]
    print(f"{word1}:{word2} :: {word3}:{answer} (score: {score:.4f})")

# Output:
# king:queen :: man:woman (score: 0.7698)
# france:paris :: germany:berlin (score: 0.7845)
# good:better :: bad:worse (score: 0.7234)
# walking:walked :: swimming:swam (score: 0.6891)

Comparing Word2Vec, GloVe, and FastText

import gensim.downloader as api
import time

# Load all three models
print("Loading models...")
w2v = api.load('word2vec-google-news-300')
glove = api.load('glove-wiki-gigaword-300')
fasttext = api.load('fasttext-wiki-news-subwords-300')

# Compare similarity scores
words = [('king', 'queen'), ('dog', 'cat'), ('python', 'java')]

print("\nSimilarity Comparison:")
print(f"{'Pair':<20} {'Word2Vec':<12} {'GloVe':<12} {'FastText'}")
print("-" * 56)

for word1, word2 in words:
    w2v_sim = w2v.similarity(word1, word2)
    glove_sim = glove.similarity(word1, word2)
    ft_sim = fasttext.similarity(word1, word2)
    pair = f"{word1}-{word2}"
    print(f"{pair:<20} {w2v_sim:<12.4f} {glove_sim:<12.4f} {ft_sim:.4f}")

# Test OOV handling
oov_word = "COVID-19"
print(f"\nOut-of-Vocabulary Test: '{oov_word}'")
print(f"Word2Vec: {oov_word in w2v}")
print(f"GloVe: {oov_word in glove}")
print(f"FastText (downloaded word vectors): {oov_word in fasttext}")

# Only a FastText model that retains its subword n-grams can build vectors for
# unseen words; the vectors loaded via gensim.downloader do not include the
# n-gram model. Train a FastText model natively or load a Facebook .bin file
# for guaranteed OOV coverage (see the FastText section above).

Choosing the Right Embedding

Feature          | Word2Vec                       | GloVe                               | FastText
Training Method  | Prediction-based               | Count-based (matrix factorization)  | Prediction + subword
OOV Handling     | No                             | No                                  | Yes
Training Speed   | Fast                           | Slower                              | Moderate
Rare Words       | Poor                           | Moderate                            | Excellent
Best For         | General purpose, large corpus  | Semantic tasks, analogies           | Morphologically rich languages, typos

Best Practices

  • Use pre-trained embeddings: Start with pre-trained models unless you have domain-specific text
  • Choose appropriate dimensions: 100-300 dimensions is typical; more isn't always better
  • Fine-tune for your domain: Continue training on your specific corpus for better performance (see the sketch after this list)
  • Normalize vectors: Normalize embeddings to unit length before similarity calculations for consistency (also shown in the sketch below)
  • Handle OOV words: Use FastText for applications with many rare or out-of-vocabulary words
  • Consider context: Modern transformers (BERT, GPT) provide contextual embeddings that may work better
  • Evaluate on your task: Different embeddings perform differently; test on your specific use case
  • Combine with other features: Embeddings work well combined with traditional features
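
Two of these points lend themselves to a short sketch. The example below reloads the small Word2Vec model saved earlier as "word2vec.model", continues training it on a couple of made-up domain sentences (purely illustrative), and then unit-normalizes vectors before a cosine comparison.

from gensim.models import Word2Vec
import numpy as np

# Reload the small model trained in the "Training Word2Vec" section
w2v_model = Word2Vec.load("word2vec.model")

# Hypothetical domain sentences, invented for this illustration
domain_sentences = [
    ["gradient", "boosting", "improves", "tabular", "model", "accuracy"],
    ["transformer", "models", "dominate", "modern", "nlp", "benchmarks"],
]

# Add any new words to the vocabulary, then continue training
w2v_model.build_vocab(domain_sentences, update=True)
w2v_model.train(domain_sentences, total_examples=len(domain_sentences), epochs=w2v_model.epochs)

# Unit-normalize vectors so dot products equal cosine similarity
def normalize(vec):
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

v1 = normalize(w2v_model.wv["learning"])
v2 = normalize(w2v_model.wv["networks"])
print(float(np.dot(v1, v2)))  # same value as w2v_model.wv.similarity("learning", "networks")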

Limitations and Considerations

  • No context awareness: Each word has one embedding regardless of context (overcome by transformers)
  • Bias in embeddings: Pre-trained embeddings can reflect societal biases in training data
  • Memory requirements: Large vocabulary embeddings can consume significant memory
  • Language specific: Most pre-trained embeddings are language-specific
  • Static representations: Can't capture evolving word meanings over time

Master NLP and Text Embeddings

Our Data Science program provides comprehensive coverage of NLP techniques, from classical word embeddings to modern transformer models. Get hands-on experience with real-world projects and expert mentorship.
