General Instructions for the ML Coding Problems¶
Please follow these instructions carefully to ensure a smooth evaluation process.
1. Notebook Submission¶
- You must make a copy of this notebook and append your full name to the filename before submitting (e.g., [OriginalNotebookName]_[YourName].ipynb).
- Share your notebook copy with inaio@acmindia.org [This is for your own safety so that you do not accidentally lose any changes while editing the notebook]
- After solving the questions, ensure you mention the correct URL of your modified notebook in the test form
- In the main test form, also answer the questions on external resources used and link to the LLM chats used for each problem
2. Attempting the Questions¶
- Carefully read each problem statement before attempting.
- Attempt all parts of each question.
- Each question is organized into the following parts:
- DATA, TASK, HELPER CODE [Optional], and ANSWER
- Follow the function signatures provided. Do not modify them.
- You only need to edit the cells in the ANSWER sections
- If required, you may also add other modules under IMPORTS and INSTALLATION INSTRUCTIONS
- Do not edit the other cells, especially those marked with DO NOT MODIFY which are meant for evaluation
- You may add new cells to the notebook with extra code as desired
3. Scoring Criteria¶
Your score will be based on the following factors, with the distribution varying across problems.
- Soundness & Creativity of your approach.
- Include a clear description and rationale of your solution methodology in the notebook (in markdown cells)
- Solutions that showcase your understanding of data and ML will garner more points
- Code Implementation & Readability
- Ensure your implementation is correct and works
- Incomplete or non-working code will be awarded partial marks based on the problem-wise rubric
- If you have a solution but are unsure about some aspect, you can define a stub function for that aspect and present the rest of the solution
- Use comments to explain important parts of your code.
- Performance of Your Model:
- Each task will be assessed based on specified performance metrics both on shared datasets and secret datasets
- Different performance ranges will receive different scores.
- Secret datasets used for the last section will be shared along with the final results
Points associated with each cell are marked at the beginning of the cell
4. Dataset Usage¶
- Only use the datasets provided in this test.
- Do not use the provided test dataset for training.
- Do not use external datasets for training or testing.
- If the submitted performance metrics cannot be reproduced with your code and original datasets, then you will lose all the points associated with model performance.
Problem 1: Analogy Oracle: Cracking the Code of Word Relationships [14 pts]¶
Analogical reasoning is a crucial skill tested in scholastic aptitude exams, where word relationships define logical patterns. Given a pair A:B, the goal is to predict the missing word C:? using learned word relationships. For example:
- height : tall :: weight : ? (Answer: heavy)
- cat : kitten :: dog : ? (Answer: puppy)
Unfortunately, you are not a native English speaker, but you aim to use ML to ace this aptitude test.
Your challenge: develop an AI-powered Analogy Oracle that is as good as an English expert.
This problem consists of 4 questions (3 must be attempted, the 4th is private for INAIO evaluation).
- Q1: Zero-Shot Decoding – Solving Analogies Without Training [5 pts]
- Q2: Train an Analogy Prediction Model [5 pts]
- Q3: Test Analogy Model on Public Dataset [2 pts]
- Q4: Test Analogy Model on Private Dataset [2 pts] [NOT FOR STUDENTS TO ATTEMPT]
INSTALLATION¶
!pip install uv
!uv pip install pandas numpy scikit-learn scipy matplotlib seaborn torch nltk transformers sentence_transformers wget
IMPORTS¶
# EDIT: [0 pts]
# You may add any other free python packages along with comments
# Data Types
from typing import Any
# Data handling
import pandas as pd # Data manipulation and analysis
import numpy as np # Numerical computations and array handling
import nltk
nltk.download('words')
nltk_vocab = set(nltk.corpus.words.words()) # Use only valid words
import os
import random
from tqdm import tqdm
import gensim.downloader as api
import gzip
import shutil
# Machine Learning - Process
from sklearn.model_selection import train_test_split # Splitting dataset
from sklearn.pipeline import Pipeline, make_pipeline # Combining multiple processing steps
# Machine Learning - Models
import torch
import torch.nn.functional as F
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import gensim
from gensim.models import KeyedVectors, Word2Vec
from sentence_transformers import SentenceTransformer
import sentence_transformers.util as st_utils
from transformers import BertModel, BertTokenizer
from torch.optim import AdamW  # transformers.AdamW is deprecated/removed in recent versions
# Machine Learning - Feature Transformations
from sklearn.preprocessing import OneHotEncoder, StandardScaler # Feature transformations if needed
from sklearn.compose import ColumnTransformer #Transforming columns
# Model evaluation
from sklearn.metrics import (
mean_squared_error, # Mean squared Error
r2_score, # R² Score
mean_absolute_percentage_error, # MAPE
)
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
# Statistical Analysis
from scipy.stats import pearsonr # Pearson correlation coefficient
# Visualization
import matplotlib.pyplot as plt # Plotting graphs
import seaborn as sns # Enhanced data visualization
COPY DATA¶
# Copy data
!mkdir -p /content/data
!wget https://raw.githubusercontent.com/inaiogit/stage2test/main/test/analogy_test_public.csv
!wget https://raw.githubusercontent.com/inaiogit/stage2test/main/test/analogy_train.csv
!wget https://raw.githubusercontent.com/inaiogit/stage2test/main/test/vocab.csv
!mv analogy_test_public.csv analogy_train.csv vocab.csv data/
def set_seed(seed):
"""
Set random seeds for reproducibility.
Args:
seed (int): The seed value to use.
"""
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
seed_value = 42 # Do not change this
set_seed(seed_value)
Q1: Zero-Shot Decoding – Solving Analogies Without Training [5 pts]¶
The first task is to solve word analogy problems without training any models. Given a dataset of analogies in the form A:B :: C:?, your goal is to predict the missing word D using only pre-trained word embeddings (such as BERT, Word2Vec, or GloVe). No additional model training is allowed; just smart use of existing embeddings.
Note: You can look at the full set of (A,B,C) tuples while solving for the missing D's
DATA¶
You are provided with an analogy dataset
analogy_train_path: analogy dataset with each row corresponding to two analogous pairs
Columns
- A (first word in analogy)
- B (second word in analogy, related to A)
- C (third word, forming the analogy with the missing word)
- D (ground truth answer, only for evaluation)
# Training datasets
analogy_train_path = "data/analogy_train.csv" # analogy data with columns A, B, C, D
# Vocabulary
# We can use this restricted vocab for the problem to keep inference and training time in check
RES_VOCAB = pd.read_csv('data/vocab.csv', header=None)[0].to_list()
len(RES_VOCAB)
TASK¶
Analyze the data and record your observations below:
- (a) Perform some exploratory analysis of the data and embeddings to come up with a solution approach
- (b) Create a function predict_analogy as per the signature defined below. If you scroll down, you will see cells with the skeletal code that you need to flesh out.
Function 1: predict_analogy¶
def predict_analogy(
analogy_df: pd.DataFrame,
model: Any = None,
top_k: int = 5
) -> pd.DataFrame:
"""
Predicts the missing word (D) in an analogy of the form A:B :: C:D using pre-trained embeddings.
Parameters:
- analogy_df (pd.DataFrame): A DataFrame containing:
- "A" - First word in analogy
- "B" - Second word, related to A
- "C" - Third word, forming an analogy with the missing word D
    - model (Any, optional): Could be a word embedding model (e.g., Word2Vec, GloVe, or BERT). If None, a hard-coded default implementation is used.
- top_k (int, optional): The number of top closest predictions to return. Defaults to 5.
Returns:
- pd.DataFrame: A DataFrame with predictions, containing:
- "Predicted_D" - The top predicted word
- "Top_K_Predictions" - List of top-k predictions (all lowercase)
"""
- (c) Evaluate your strategy using the provided code (no modifications)
HELPER CODE¶
glove_vectors = None
word2vec_vectors = None
def load_embedding_model(model_name="word2vec"):
"""
Loads a pre-trained word embedding model.
Parameters:
- model_name (str): Name of the model to load. Options: 'word2vec', 'glove', 'bert'
Returns:
- model: Loaded embedding model
"""
global glove_vectors, word2vec_vectors
if model_name == "word2vec":
if word2vec_vectors is None:
print("Loading Word2Vec (Google News 300)...")
word2vec_vectors = api.load("word2vec-google-news-300")
return word2vec_vectors
elif model_name == "glove":
if glove_vectors is None:
print("Loading GloVe (6B, 300d)...")
glove_vectors = api.load("glove-wiki-gigaword-300")
return glove_vectors
elif model_name == "bert":
print("Loading BERT (Sentence Transformers)...")
return SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
else:
raise ValueError("Unsupported model. Choose from 'word2vec', 'glove', or 'bert'.")
def get_word_embedding(word, model, method="word2vec"):
"""
Fetches the vector embedding of a word using the specified model.
Parameters:
- word (str): The word to get the embedding for.
- model (Any): Loaded word embedding model.
- method (str): One of 'word2vec', 'glove', 'bert'
Returns:
- np.array: Word embedding vector (or zero vector if word is missing).
"""
word = word.lower()
if method in ["word2vec", "glove"]:
return model[word] if word in model else np.zeros(300)
elif method == "bert":
return model.encode(word, convert_to_tensor=True)
else:
raise ValueError("Unsupported method. Choose from 'word2vec', 'glove', or 'bert'.")
def compute_similarity(vec1, vec2, metric="cosine"):
"""
Computes similarity between two vectors.
Parameters:
- vec1 (np.array or torch.Tensor): First vector
- vec2 (np.array or torch.Tensor): Second vector
- metric (str): Similarity metric ('cosine' or 'euclidean')
Returns:
- float: Similarity score
"""
if isinstance(vec1, torch.Tensor):
vec1 = vec1.cpu().numpy()
if isinstance(vec2, torch.Tensor):
vec2 = vec2.cpu().numpy()
vec1 = vec1.reshape(1, -1)
vec2 = vec2.reshape(1, -1)
if metric == "cosine":
return cosine_similarity(vec1, vec2)[0][0]
elif metric == "euclidean":
return -euclidean_distances(vec1, vec2)[0][0] # Negative for consistency (higher = better)
else:
raise ValueError("Unsupported metric. Choose 'cosine' or 'euclidean'.")
# to save on inference time, we'll encode vocabulary only once for embedding model
st_encoded_vocabulary = None
def find_closest_words(target_vec, model, method="word2vec", vocab=RES_VOCAB, top_k=5):
"""
Finds the closest words to a given vector using cosine similarity.
Parameters:
- target_vec (np.array): The vector representation of the target word.
- model (Any): The pre-trained embedding model.
- method (str): Embedding method ('word2vec', 'glove', 'bert')
- vocab (set): Restrict to a vocabulary set (e.g., nltk words, RES_VOCAB)
- top_k (int): Number of top similar words to return
Returns:
- list: Top-k closest words (lowercased)
"""
global st_encoded_vocabulary
best_matches = []
if method in ["word2vec", "glove"]:
for word in model.key_to_index:
if vocab and word.lower() not in vocab:
continue # Skip words not in vocabulary
word_vec = model[word]
similarity = compute_similarity(target_vec, word_vec)
best_matches.append((word, similarity))
elif method == "bert":
vocabulary = RES_VOCAB
if st_encoded_vocabulary is None:
print('\n Encoding the vocab..')
st_encoded_vocabulary = model.encode(list(vocabulary), convert_to_tensor=True)
best_matches = st_utils.semantic_search(target_vec, st_encoded_vocabulary, top_k=top_k)
best_matches = [(list(vocabulary)[match['corpus_id']], match['score']) for match in best_matches[0]]
best_matches = sorted(best_matches, key=lambda x: x[1], reverse=True)[:top_k]
return [word.lower() for word, _ in best_matches]
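For reference, a hedged sketch of how these helpers compose for a single analogy (slow, since find_closest_words scans the whole vocabulary; the word choices are just the example from the problem statement):

# Hedged usage sketch: solve height : tall :: weight : ? with the helpers above.
emb = load_embedding_model("glove")
target = (get_word_embedding("tall", emb, method="glove")
          - get_word_embedding("height", emb, method="glove")
          + get_word_embedding("weight", emb, method="glove"))
print(find_closest_words(target, emb, method="glove", top_k=5))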
def evaluate_analogy_predictions(
file_path: str,
model: Any = None,
top_k: int = 3
) -> dict:
"""
Evaluates analogy prediction accuracy using Precision@1 and Precision@K from a given file.
Parameters:
- file_path (str): Path to the CSV file containing test analogy data.
Expected columns: "A", "B", "C", "D" (ground truth)
- model (Any, optional): If None, we rely on predict_analogy to use the default predictions
    - top_k (int, optional): The number of top closest predictions to consider for Precision@K. Defaults to 3.
Returns:
- dict: Dictionary containing:
- "Precision@1": Fraction of cases where the top predicted word matches D exactly.
- "Precision@K": Fraction of cases where the correct word appears in the top-K predictions.
"""
# Load test data
try:
test_df = pd.read_csv(file_path)
except Exception as e:
print(f"Error loading file: {e}")
return None
# Validate required columns
required_columns = {"A", "B", "C", "D"}
if not required_columns.issubset(test_df.columns):
print(f"Error: Missing required columns. Expected {required_columns}, found {set(test_df.columns)}")
return None
# Extract only A, B, C columns for prediction
analogy_df = test_df[['A', 'B', 'C']]
# Get predictions (adds "Predicted_D" and "Top_K_Predictions" columns)
predictions_df = predict_analogy(analogy_df, model=model, top_k=top_k)
# Convert actual D and predicted values to lowercase for case-insensitive comparison
test_df["D"] = test_df["D"].str.lower()
predictions_df["Predicted_D"] = predictions_df["Predicted_D"].str.lower()
# Convert lists of top-K predictions to sets for efficient lookup
predictions_df["Top_K_Predictions"] = predictions_df["Top_K_Predictions"].apply(lambda x: set(map(str.lower, x)))
# Vectorized precision calculations
precision_1 = (predictions_df["Predicted_D"] == test_df["D"]).mean()
    precision_k = predictions_df.apply(lambda row: test_df.loc[row.name, "D"] in row["Top_K_Predictions"], axis=1).mean()
return {
"Precision@1": precision_1,
"Precision@K": precision_k
}
ANSWER¶
# EDIT: [0.5 pt]
# Add your data and exploration of embeddings code here
data = pd.read_csv(analogy_train_path)
data.sample(6)
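A small exploration sketch along these lines could check vocabulary coverage; this is a hedged example (it assumes load_embedding_model from the HELPER CODE and mostly lowercase entries in vocab.csv), not part of the original answer:

# Exploration sketch: how well do RES_VOCAB and GloVe cover the analogy words?
emb_model = load_embedding_model("glove")          # cached GloVe vectors
res_set = set(str(w).lower() for w in RES_VOCAB)   # assumes vocab.csv is (mostly) lowercase
words = pd.unique(data[["A", "B", "C", "D"]].values.ravel())
in_res = np.mean([str(w).lower() in res_set for w in words])
in_emb = np.mean([str(w).lower() in emb_model for w in words])
print(f"{len(words)} unique words; {in_res:.1%} in RES_VOCAB, {in_emb:.1%} in GloVe")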
EDIT: [1 pt]¶
Describe Your Solution Approach¶
• Data Exploration Notes [0.5 pt]¶
- A and C appear to be uncorrelated.
- The relationship between A and B differs from row to row; the dataset mixes several analogy types.
• Zero Shot Modeling Strategy & Choices [0.5 pt]¶
- I could vectorize A, B, and C and then set D = C + B - A in embedding space.
- To produce the top-k choices, I can take the k nearest neighbours of D in the vocabulary.
- I am using a pretrained GloVe model for this (see the sanity-check sketch below).
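As a quick sanity check of this strategy before the full implementation, one can run the vector-arithmetic query directly (a sketch assuming the GloVe vectors from load_embedding_model; passing words via positive/negative also excludes the inputs from the results):

# Sanity-check sketch: height : tall :: weight : ? should rank "heavy" highly.
glove = load_embedding_model("glove")
print(glove.most_similar(positive=["tall", "weight"], negative=["height"], topn=5))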
# EDIT: [2.5 pts]
# Implement the zero-shot prediction in **predict_analogy** for the case where model is None
# NOTE: For this task, you can choose to ONLY implement the model=None case and hardcode your embedding choice
# Do not modify the function signature
def predict_analogy(
analogy_df: pd.DataFrame,
model: Any = None,
top_k: int = 5
) -> pd.DataFrame:
"""
Predicts the missing word (D) in an analogy of the form A:B :: C:D using pre-trained embeddings.
Parameters:
- analogy_df (pd.DataFrame): A DataFrame containing:
- "A" - First word in analogy
- "B" - Second word, related to A
- "C" - Third word, forming an analogy with the missing word D
    - model (Any, optional): A pre-trained word embedding model (e.g., Word2Vec, GloVe, or BERT). If None, a default GloVe model is used (hard-coded below).
- top_k (int, optional): The number of top closest predictions to return. Defaults to 5.
Returns:
- pd.DataFrame: A DataFrame with predictions, containing:
- "Predicted_D" - The top predicted word
- "Top_K_Predictions" - List of top-k predictions (all lowercase)
"""
    # Use the cached pre-trained GloVe model from the helper (per the NOTE above,
    # only the model=None case is implemented, so the `model` argument is ignored).
    glove = load_embedding_model("glove")
    def predict_row(row):
        # GloVe keys are lowercase, so normalize the inputs.
        a_word = str(row["A"]).lower()
        b_word = str(row["B"]).lower()
        c_word = str(row["C"]).lower()
        # Retrieve the embeddings if available, otherwise use a zero vector.
        A = glove[a_word] if a_word in glove else np.zeros(glove.vector_size)
        B = glove[b_word] if b_word in glove else np.zeros(glove.vector_size)
        C = glove[c_word] if c_word in glove else np.zeros(glove.vector_size)
# Compute the predicted vector for D: B - A + C.
pred_vector = B - A + C
        # Retrieve the top-k most similar words. Note: most_similar searches the
        # full GloVe vocabulary (not RES_VOCAB) and may include the input words.
        predictions = glove.most_similar([pred_vector], topn=top_k)
predicted_d = predictions[0][0] # Top prediction.
top_k_predictions = [word for word, _ in predictions]
return pd.Series({
"Predicted_D": predicted_d,
"Top_K_Predictions": top_k_predictions
})
# Apply the prediction row-wise.
results_df = analogy_df.apply(predict_row, axis=1)
return results_df
# DO NOT MODIFY
# Run this code to observe the Precision@1 and Precision@3
# [pts depend on performance range]
print(evaluate_analogy_predictions(analogy_train_path, model=None, top_k=3))
Q2: Train an Analogy Prediction Model [5 pts]¶
DATA¶
Use the same analogy dataset as in Q1
analogy_train_path: analogy dataset with each row corresponding to two analogous pairs
Columns
- A (first word in analogy)
- B (second word in analogy, related to A)
- C (third word, forming the analogy with the missing word)
- D (ground truth answer, only for evaluation)
# Training datasets
analogy_train_path = "data/analogy_train.csv" # analogy data with columns A, B, C, D
TASK¶
Create two functions learn_analogy_model and predict_analogy as per the signatures defined below.
If you scroll down, you will see cells with the skeletal code that you need to flesh out.
Function 1: learn_analogy_model¶
def learn_analogy_model(
train_file_path: str
) -> Any:
"""
Loads an analogy dataset from a CSV file, processes it, and trains a machine learning model to predict
the missing word (D) in an analogy of the form A:B :: C:D.
Parameters:
- train_file_path (str): Path to the CSV file containing analogy data with columns:
- "A" - First word in analogy
- "B" - Second word, related to A
- "C" - Third word, forming an analogy with the missing word D
- "D" - The ground truth answer (target variable)
Returns:
- model: A trained machine learning model capable of predicting D given A, B, and C.
"""
Function 2: predict_analogy¶
def predict_analogy(
model: Any,
analogy_df: pd.DataFrame,
top_k: int = 3
) -> pd.DataFrame:
"""
Predicts the missing word (D) in an analogy of the form A:B :: C:D using a trained model.
Parameters:
- model (Any): A trained analogy prediction model from `learn_analogy_model`.
- analogy_df (pd.DataFrame): A DataFrame containing:
- "A" - First word in analogy
- "B" - Second word, related to A
- "C" - Third word, forming an analogy with the missing word D
    - top_k (int, optional): The number of top closest predictions to return. Defaults to 3.
Returns:
- pd.DataFrame: A DataFrame with predictions, containing:
- "Predicted_D" - The top predicted word
- "Top_K_Predictions" - List of top-k predictions (all lowercase)
"""
HELPER CODE¶
# HELPER CODE
# You may choose to use or modify any of the below code in your solution, but it is NOT mandatory
def train_word2vec_on_analogy(train_file_path: str, vector_size=300, window=5, min_count=1, epochs=50):
"""
Trains a Word2Vec model on analogy data to learn custom word embeddings.
Parameters:
- train_file_path (str): Path to the training CSV file with columns A, B, C, D.
- vector_size (int): Dimensionality of word vectors.
- window (int): Maximum distance between current and predicted word in a sentence.
- min_count (int): Minimum count for a word to be included in training.
- epochs (int): Number of training iterations.
Returns:
- model: Trained Word2Vec model.
"""
df = pd.read_csv(train_file_path)
# Convert analogy pairs into training sentences (treat each analogy as a "sentence")
sentences = df[["A", "B", "C", "D"]].values.tolist()
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=vector_size, window=window, min_count=min_count, workers=4, sg=1, epochs=epochs)
# Save trained model
model.save("word2vec_analogy.model")
print("Custom Word2Vec model trained and saved!")
return model
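A possible usage sketch for this helper (defaults only, nothing tuned); probing with a word taken from the data guarantees it is in the model's vocabulary, since min_count=1:

# Usage sketch: train custom embeddings on the analogy rows and probe them.
custom_w2v = train_word2vec_on_analogy(analogy_train_path)
probe = pd.read_csv(analogy_train_path)["A"].iloc[0]
print(probe, "->", custom_w2v.wv.most_similar(probe, topn=3))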
def get_trained_word2vec_embedding(
model: KeyedVectors,
word: str,
vector_size: int = 300
) -> np.ndarray:
"""
Retrieves the vector embedding of a word from a trained Word2Vec model.
Parameters:
- model (KeyedVectors): The trained Word2Vec model.
- word (str): The input word to retrieve the embedding for.
- vector_size (int, optional): The size of the word vector. Default is 300.
Returns:
- np.ndarray: The word embedding vector, or a zero vector if the word is not in the model.
"""
word = word.lower() # Ensure lowercase lookup
if word in model:
return model[word]
else:
return np.zeros(vector_size) # Return zero vector if word is not found
# Code for training BERT model
# Custom Dataset class to handle analogy data
class AnalogyDataset(Dataset):
def __init__(self, df, tokenizer):
self.df = df
self.tokenizer = tokenizer
def __len__(self):
return len(self.df)
def __getitem__(self, idx):
# Get the row from the DataFrame
A, B, C, D = self.df.iloc[idx]["A"], self.df.iloc[idx]["B"], self.df.iloc[idx]["C"], self.df.iloc[idx]["D"]
        # Tokenize the inputs (truncation guards against rare multi-token words)
        tokens_A = self.tokenizer(A, return_tensors="pt", padding='max_length', truncation=True, max_length=8)
        tokens_B = self.tokenizer(B, return_tensors="pt", padding='max_length', truncation=True, max_length=8)
        tokens_C = self.tokenizer(C, return_tensors="pt", padding='max_length', truncation=True, max_length=8)
        tokens_D = self.tokenizer(D, return_tensors="pt", padding='max_length', truncation=True, max_length=8)
        # The model combines the embeddings of A, B, C; the training target is the embedding of D
        return tokens_A, tokens_B, tokens_C, tokens_D
# BERT-based Model for Analogy Prediction
class AnalogyBertModel(nn.Module):
def __init__(self, bert_model):
super(AnalogyBertModel, self).__init__()
self.bert_model = bert_model
def forward(self, tokens_A, tokens_B, tokens_C):
# Pass tokens A, B, and C through the BERT model to get embeddings
output_A = self.bert_model(**tokens_A) # Output: (last_hidden_state, pooler_output)
output_B = self.bert_model(**tokens_B)
output_C = self.bert_model(**tokens_C)
# Extract the [CLS] token embedding (or alternatively, mean-pooling) as the representation
emb_A = output_A.last_hidden_state[:, 0, :] # [CLS] token embedding for A
emb_B = output_B.last_hidden_state[:, 0, :] # [CLS] token embedding for B
emb_C = output_C.last_hidden_state[:, 0, :] # [CLS] token embedding for C
        # The analogy equation for A:B :: C:D, i.e. D ≈ B - A + C
        analogy_vector = emb_B - emb_A + emb_C
return analogy_vector
def train_bert_on_analogy(train_file_path: str, model_name="bert-base-uncased", epochs=1, batch_size=16):
"""
Train a BERT model to predict the missing word (D) in analogy tasks.
Parameters:
- train_file_path (str): Path to CSV with analogies (columns: A, B, C, D).
- model_name (str): Pre-trained BERT model to fine-tune.
- epochs (int): Number of training epochs.
- batch_size (int): Batch size for training.
Returns:
- model: Fine-tuned BERT model for analogy prediction.
"""
# Load tokenizer and BERT model
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)
# Load analogy dataset
df = pd.read_csv(train_file_path)
# Create dataset and dataloader
dataset = AnalogyDataset(df, tokenizer)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
# Initialize the analogy model
model = AnalogyBertModel(bert_model)
# Define optimizer and loss function
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = nn.MSELoss() # Mean Squared Error for embedding prediction
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Training loop
for epoch in tqdm(range(epochs)):
model.train()
total_loss = 0
for step, (tokens_A, tokens_B, tokens_C, tokens_D) in enumerate(dataloader):
tokens_A = {key: val.squeeze(1).to(device) for key, val in tokens_A.items()}
tokens_B = {key: val.squeeze(1).to(device) for key, val in tokens_B.items()}
tokens_C = {key: val.squeeze(1).to(device) for key, val in tokens_C.items()}
tokens_D = {key: val.squeeze(1).to(device) for key, val in tokens_D.items()}
optimizer.zero_grad()
# Get the predicted embedding
predicted_embedding = model(tokens_A, tokens_B, tokens_C)
# Extract the true embedding for D
true_embedding = model.bert_model(**tokens_D).last_hidden_state[:, 0, :].squeeze()
# Compute the loss (Mean Squared Error between predicted and true embeddings)
loss = criterion(predicted_embedding, true_embedding)
# Backpropagate the loss
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(dataloader)}")
print("Model trained successfully!")
return model
def get_bert_predicted_embedding(model: AnalogyBertModel, tokenizer: BertTokenizer, A: str, B: str, C: str) -> np.ndarray:
"""
Returns the predicted embedding for a given analogy using the fine-tuned BERT model.
Parameters:
- model: Fine-tuned BERT model for analogy prediction.
- tokenizer: BERT tokenizer.
- A, B, C: Words in the analogy.
Returns:
- np.ndarray: The predicted embedding vector.
"""
    model.eval()
    device = next(model.parameters()).device  # keep inputs on the model's device
    with torch.no_grad():
        tokens_A = tokenizer(A, return_tensors="pt", padding='max_length', truncation=True, max_length=8).to(device)
        tokens_B = tokenizer(B, return_tensors="pt", padding='max_length', truncation=True, max_length=8).to(device)
        tokens_C = tokenizer(C, return_tensors="pt", padding='max_length', truncation=True, max_length=8).to(device)
        predicted_embedding = model(tokens_A, tokens_B, tokens_C)
    return predicted_embedding.cpu().numpy()
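If you take the BERT route, here is a hedged end-to-end sketch of the helpers above (one epoch only; mapping the predicted embedding back to a word would additionally require encoding RES_VOCAB with the same model and running a nearest-neighbour search):

# Hedged sketch: fine-tune BERT on the analogies and predict one embedding.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_analogy_model = train_bert_on_analogy(analogy_train_path, epochs=1, batch_size=16)
pred_vec = get_bert_predicted_embedding(bert_analogy_model, tokenizer, "height", "tall", "weight")
print(pred_vec.shape)  # (1, 768) for bert-base-uncased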
ANSWER¶
EDIT: [1.5 pts]¶
You can jot down initial notes here and flesh this out in more detail after the implementation.¶
Describe Your Solution Approach¶
Understanding of Available Options:
- Use pre-trained embeddings (GloVe, Word2Vec, or BERT) to capture semantic relationships.
- Leverage vector arithmetic (B - A + C) to infer the missing word.
- Utilize libraries like Gensim and scikit‑learn for efficient implementation.
Modeling Strategy & Choices:
- Embedding Selection: Prefer static embeddings (GloVe/Word2Vec) for simplicity.
- Analogy Calculation: Compute D = B - A + C.
- Pipeline Design: Build a scikit‑learn pipeline with a custom transformer for embedding extraction and a classifier or nearest-neighbor search.
- Robustness: Handle out-of-vocabulary words by substituting with zero vectors.
# EDIT: [0 pts]
# Add any additional code that you need for your modeling
# Points for any code in this cell will be assigned to the learn_analogy_model and predict_analogy cells
# EDIT: [2.5 pts]
# Implement the training of analogy model
# Do not change the signature
# Note: just loading a pre-trained model will NOT fetch points - You have to train it on the provided data
def learn_analogy_model(
train_file_path: str
) -> Any:
"""
Loads an analogy dataset from a CSV file, processes it, and trains a machine learning model to predict
the missing word (D) in an analogy of the form A:B :: C:D.
Parameters:
- train_file_path (str): Path to the CSV file containing analogy data with columns:
- "A" - First word in analogy
- "B" - Second word, related to A
- "C" - Third word, forming an analogy with the missing word D
- "D" - The ground truth answer (target variable)
Returns:
- model: A trained machine learning model capable of predicting D given A, B, and C.
"""
pass # Your implementation here
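One possible sketch, not the required solution: reuse train_word2vec_on_analogy from the HELPER CODE and return the keyed vectors (embeddings trained on such a small corpus will be rough, but they are trained on the provided data as required):

# Sketch (assumes the Word2Vec helper above); returns KeyedVectors, which
# support word lookup and most_similar at prediction time.
def learn_analogy_model(train_file_path: str) -> Any:
    w2v = train_word2vec_on_analogy(train_file_path)
    return w2v.wv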
# EDIT: [1 pt]
# Implement the prediction of analogy
# NOTE: This is similar to the Q1 task, but we will now run it with the model learned from the *learn_analogy_model* function
# Keep the signature the same
def predict_analogy(
model: Any,
analogy_df: pd.DataFrame,
top_k: int = 3
) -> pd.DataFrame:
"""
Predicts the missing word (D) in an analogy of the form A:B :: C:D using a trained model.
Parameters:
- model (Any): A trained analogy prediction model from `learn_analogy_model`.
- analogy_df (pd.DataFrame): A DataFrame containing:
- "A" - First word in analogy
- "B" - Second word, related to A
- "C" - Third word, forming an analogy with the missing word D
- top_k (int, optional): The number of top closest predictions to return. Defaults to 3.
Returns:
- pd.DataFrame: A DataFrame with predictions, containing:
- "Predicted_D" - The top predicted word
- "Top_K_Predictions" - List of top-k predictions (all lowercase)
"""
pass # Your implementation here
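A minimal sketch of one way to flesh this out, assuming the model is gensim KeyedVectors as in the sketch above (out-of-vocabulary words fall back to zero vectors via the helper):

# Sketch: vector arithmetic (D ≈ B - A + C) with the trained embeddings.
def predict_analogy(model: Any, analogy_df: pd.DataFrame, top_k: int = 3) -> pd.DataFrame:
    def predict_row(row):
        a, b, c = (get_trained_word2vec_embedding(model, row[col]) for col in ("A", "B", "C"))
        preds = [w.lower() for w, _ in model.most_similar([b - a + c], topn=top_k)]
        return pd.Series({"Predicted_D": preds[0], "Top_K_Predictions": preds})
    return analogy_df.apply(predict_row, axis=1)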
Q3: Test Analogy Model on Public Dataset [2 pts]¶
DATA¶
You now get to demonstrate that you can ace analogies on an unseen test set with the same columns as before
analogy_test_public_path: analogy dataset with each row corresponding to two analogous pairs
Columns
- A (first word in analogy)
- B (second word in analogy, related to A)
- C (third word, forming the analogy with the missing word)
- D (ground truth answer, only for evaluation)
# Public Test Dataset
analogy_test_public_path = "data/analogy_test_public.csv"
TASK¶
Execute the code below as is with your implementation of learn_analogy_model and predict_analogy to test your model
- Evaluate your model on this test set.
- Compute Precision@1 and Precision@K
HELPER CODE¶
# DO NOT MODIFY
# HELPER CODE
# This is the same evaluation function as in Q1, adapted to the Q2/Q3 predict_analogy signature
# Use these functions directly since these are meant for evaluation
def evaluate_analogy_predictions(
file_path: str,
model: Any = None,
top_k: int = 3
) -> dict:
"""
Evaluates analogy prediction accuracy using Precision@1 and Precision@K from a given file.
Parameters:
- file_path (str): Path to the CSV file containing test analogy data.
Expected columns: "A", "B", "C", "D" (ground truth)
- model (Any, optional): If None, we rely on predict_analogy to use the default predictions
    - top_k (int, optional): The number of top closest predictions to consider for Precision@K. Defaults to 3.
Returns:
- dict: Dictionary containing:
- "Precision@1": Fraction of cases where the top predicted word matches D exactly.
- "Precision@K": Fraction of cases where the correct word appears in the top-K predictions.
"""
# Load test data
try:
test_df = pd.read_csv(file_path)
except Exception as e:
print(f"Error loading file: {e}")
return None
# Validate required columns
required_columns = {"A", "B", "C", "D"}
if not required_columns.issubset(test_df.columns):
print(f"Error: Missing required columns. Expected {required_columns}, found {set(test_df.columns)}")
return None
# Extract only A, B, C columns for prediction
analogy_df = test_df[['A', 'B', 'C']]
# Get predictions (adds "Predicted_D" and "Top_K_Predictions" columns)
    predictions_df = predict_analogy(model, analogy_df, top_k=top_k)
# Convert actual D and predicted values to lowercase for case-insensitive comparison
test_df["D"] = test_df["D"].str.lower()
predictions_df["Predicted_D"] = predictions_df["Predicted_D"].str.lower()
# Convert lists of top-K predictions to sets for efficient lookup
predictions_df["Top_K_Predictions"] = predictions_df["Top_K_Predictions"].apply(lambda x: set(map(str.lower, x)))
# Vectorized precision calculations
precision_1 = (predictions_df["Predicted_D"] == test_df["D"]).mean()
    precision_k = predictions_df.apply(lambda row: test_df.loc[row.name, "D"] in row["Top_K_Predictions"], axis=1).mean()
return {
"Precision@1": precision_1,
"Precision@K": precision_k
}
# DO NOT MODIFY
# Run this code to observe the Precision@1 and Precision@3
# [pts depend on performance range]
trained_model = learn_analogy_model(analogy_train_path)
print(evaluate_analogy_predictions(analogy_train_path, model=trained_model, top_k=3))
print(evaluate_analogy_predictions(analogy_test_public_path, model=trained_model, top_k=3))
ANSWER¶
YOU CAN STOP THE TEST HERE -- THE EVALUATION BELOW IS TO BE PERFORMED BY INAIO¶
Q4: Test Analogy Model on Private Dataset [2 pts]¶
- Same metrics and lead time as Public Dataset