Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset

In this tutorial, we work with the amphora/ResearchMath-14k dataset, a collection of research-level mathematics problems mined from arXiv. We load the dataset, inspect its structure, and explore how the problems are distributed across mathematical fields and open-status categories. We then move beyond basic analysis by extracting field-specific keywords, generating semantic embeddings, visualizing the problem landscape, clustering related problems, and building a simple search engine over the dataset. Finally, we train a classifier to predict problem status from embeddings and detect closely related or near-duplicate problems.

!pip -q install -U datasets sentence-transformers scikit-learn umap-learn pandas matplotlib seaborn wordcloud 2>/dev/null
import warnings, numpy as np, pandas as pd
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", palette="deep")
SAMPLE_SIZE = 4000
RANDOM_STATE = 42
EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

We begin by installing the required libraries and importing the tools needed for analysis, visualization, embeddings, and data handling. We also set the main configuration values, including sample size, random seed, and embedding model. This gives us a clean setup before we start working with the ResearchMath dataset.

from datasets import load_dataset
ds = load_dataset("amphora/ResearchMath-14k", split="test")
df = ds.to_pandas()
print("Rows:", len(df))
print("Columns:", list(df.columns))
print(df.head(3))  # a bare df.head(3) only renders when it is the last line of a cell
TEXT_COL = "self_contained_problem"
# Keep only problem statements longer than 20 characters
df = df[df[TEXT_COL].astype(str).str.len() > 20].reset_index(drop=True)

We load the amphora/ResearchMath-14k dataset from Hugging Face and convert it into a pandas DataFrame. We inspect the number of rows, available columns, and a few sample records to understand the dataset structure. We then keep only problem statements longer than 20 characters so that subsequent analysis works on useful text.

print("\n--- open_status distribution ---")
print(df["open_status"].value_counts(dropna=False))
print("\n--- taxonomy_level_1 (math fields) ---")
print(df["taxonomy_level_1"].value_counts())
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
df["open_status"].value_counts().plot(
    kind="bar", ax=axes[0], color="steelblue")
axes[0].set_title("Problem status"); axes[0].tick_params(axis="x", rotation=30)
df["taxonomy_level_1"].value_counts().plot(
    kind="barh", ax=axes[1], color="seagreen")
axes[1].set_title("Top-level math field"); axes[1].invert_yaxis()
df["doc_len"] = df[TEXT_COL].str.split().apply(len)
axes[2].hist(df["doc_len"].clip(upper=400), bins=40, color="indianred")
axes[2].set_title("Problem length (words, clipped @400)")
plt.tight_layout(); plt.show()
ct = pd.crosstab(df["taxonomy_level_1"], df["open_status"], normalize="index")
plt.figure(figsize=(10, 6))
sns.heatmap(ct, annot=True, fmt=".2f", cmap="rocket_r")
plt.title("Fraction of each status within each field")
plt.tight_layout(); plt.show()

We explore the dataset by checking how problems are distributed across open-status labels and mathematical fields. We visualize the status counts, field counts, and problem lengths to quickly get an overview of the corpus. We also create a heatmap to see how open-status categories vary across different math fields.
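To make the heatmap's normalization concrete, here is a small pure-Python sketch of what a row-normalized cross-tabulation computes. The helper name `row_normalized_crosstab` and the toy labels are hypothetical and only serve as illustration; the tutorial itself relies on `pd.crosstab(..., normalize="index")`.

```python
from collections import Counter, defaultdict


def row_normalized_crosstab(row_labels, col_labels):
    """Fraction of each column label within each row label (each row sums to 1)."""
    counts = defaultdict(Counter)
    for r, c in zip(row_labels, col_labels):
        counts[r][c] += 1
    table = {}
    for r, cnt in counts.items():
        total = sum(cnt.values())
        table[r] = {c: n / total for c, n in cnt.items()}
    return table


# Hypothetical toy labels, not taken from the dataset.
fields = ["Algebra", "Algebra", "Algebra", "Analysis"]
statuses = ["open", "open", "resolved", "open"]

table = row_normalized_crosstab(fields, statuses)
for field, fractions in table.items():
    print(field, fractions)
```

Because every row sums to 1, a field with only a handful of problems looks just as emphatic in the heatmap as a large one, so it helps to read the fractions alongside the raw field counts.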

from sklearn.feature_extraction.text import TfidfVectorizer
def top_terms_per_group(frame, group_col, text_col, k=8):
    out = {}
    for g, sub in frame.groupby(group_col):
        if len(sub) < 20:
            continue
        vec = TfidfVectorizer(max_features=3000, stop_words="english",
                              ngram_range=(1, 2), min_df=3)
        X = vec.fit_transform(sub[text_col])
        scores = np.asarray(X.mean(axis=0)).ravel()
        terms = np.array(vec.get_feature_names_out())
        out[g] = terms[scores.argsort()[::-1][:k]].tolist()
    return out
for field, terms in top_terms_per_group(df, "taxonomy_level_1", TEXT_COL).items():
    print(f"\n{field:35s} -> {', '.join(terms)}")

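We use TF-IDF to find the most important terms within each top-level mathematical field. We group the dataset by field, skip fields with fewer than 20 problems, fit a separate vectorizer on each remaining group, and rank unigrams and bigrams by their mean TF-IDF weight. Because the vectorizer is fit per field, the top terms are those that carry the most weight inside that field rather than those that distinguish it from other fields. This helps us understand what topics and terminology dominate different areas of research in mathematics.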
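The ranking by mean TF-IDF weight can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not scikit-learn's implementation: it splits on whitespace and ignores stop words, n-grams, and `min_df`, but it uses the same smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and L2 row normalization that `TfidfVectorizer` applies by default. The function name `mean_tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def mean_tfidf(docs):
    """Mean TF-IDF weight per term across docs (smoothed IDF, L2-normalized rows)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many docs does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    totals = Counter()
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            totals[t] += w / norm
    return {t: s / n for t, s in totals.items()}


# Hypothetical toy "field" with three short problem statements.
docs = [
    "elliptic curve rank bound",
    "elliptic curve torsion",
    "elliptic curve rank conjecture",
]
scores = mean_tfidf(docs)
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(top)
```

In this toy group, the terms shared by every document end up on top even though their IDF is the lowest, because they contribute to the mean in every row. The same effect explains why per-field keyword lists tend to surface a field's common vocabulary.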

We sample the dataset and convert each mathematical problem into a semantic embedding using a SentenceTransformer model. We reduce the embeddings to two dimensions using UMAP, or PCA if UMAP is unavailable, and visualize the problem landscape by field. We then apply K-Means clustering and compare the resulting clusters with the human-labeled taxonomy using the adjusted Rand index (ARI) and normalized mutual information (NMI).
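The step described above is what produces the `work`, `model`, and `emb` names used by the later search and classifier code. A minimal sketch of how it could look follows. The function names, `batch_size=64`, `n_init=10`, the cosine metric for UMAP, and the choice of one K-Means cluster per taxonomy field are all assumptions made for illustration, not settings confirmed by the tutorial. Heavy imports are kept inside the functions so the small sampling helper can be used on its own.

```python
def resolve_sample_size(n_rows, sample_size):
    """Number of rows to embed; None means use every row."""
    if sample_size is None:
        return n_rows
    return min(sample_size, n_rows)


def embed_sample(df, text_col, sample_size=4000, random_state=42,
                 model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Sample rows and encode them; returns (work, model, emb)."""
    from sentence_transformers import SentenceTransformer

    n = resolve_sample_size(len(df), sample_size)
    work = df.sample(n=n, random_state=random_state).reset_index(drop=True)
    model = SentenceTransformer(model_name)
    # Unit-length vectors make dot products equal to cosine similarities.
    emb = model.encode(work[text_col].tolist(), batch_size=64,
                       show_progress_bar=True, normalize_embeddings=True)
    return work, model, emb


def project_and_cluster(emb, field_labels, random_state=42):
    """2-D projection (UMAP, else PCA) plus K-Means scored against the taxonomy."""
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import (adjusted_rand_score,
                                 normalized_mutual_info_score)

    try:
        import umap
        xy = umap.UMAP(n_components=2, metric="cosine",
                       random_state=random_state).fit_transform(emb)
    except ImportError:
        xy = PCA(n_components=2, random_state=random_state).fit_transform(emb)

    k = len(set(field_labels))  # assumption: one cluster per taxonomy field
    clusters = KMeans(n_clusters=k, n_init=10,
                      random_state=random_state).fit_predict(emb)
    ari = adjusted_rand_score(field_labels, clusters)
    nmi = normalized_mutual_info_score(field_labels, clusters)
    return xy, clusters, ari, nmi
```

With the configuration from the setup cell, a call such as `work, model, emb = embed_sample(df, TEXT_COL, SAMPLE_SIZE, RANDOM_STATE, EMB_MODEL)` followed by `project_and_cluster(emb, work["taxonomy_level_1"].values, RANDOM_STATE)` would supply everything the remaining cells expect. The `xy` coordinates can then be passed to a scatter plot colored by field.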

# `work`, `model`, and `emb` come from the embedding step described above.
from sentence_transformers import util

# --- Semantic search ---
def search(query, k=5):
    q = model.encode([query], normalize_embeddings=True)
    sims = util.cos_sim(q, emb)[0].cpu().numpy()
    idx = sims.argsort()[::-1][:k]
    print(f'\n=== Query: "{query}" ===')
    for rank, i in enumerate(idx, 1):
        row = work.iloc[i]
        print(f"\n[{rank}] sim={sims[i]:.3f} | {row['taxonomy_level_1']} "
              f"| status={row['open_status']}")
        print("   ", row[TEXT_COL][:260].replace("\n", " "), "...")
search("rational points on hyperelliptic curves")
search("multiplicativity of maximal output p-norm of a quantum channel")

# --- open_status classifier ---
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
y = work["open_status"].values
Xtr, Xte, ytr, yte = train_test_split(
    emb, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)
clf = LogisticRegression(max_iter=2000, class_weight="balanced", C=2.0)
clf.fit(Xtr, ytr)
pred = clf.predict(Xte)
print("\n=== open_status classifier (embeddings + logistic regression) ===")
print(classification_report(yte, pred))
fig, ax = plt.subplots(figsize=(7, 6))
ConfusionMatrixDisplay.from_predictions(
    yte, pred, ax=ax, cmap="Blues", xticks_rotation=45,
    normalize="true", values_format=".2f")
ax.set_title("open_status confusion matrix (row-normalized)")
plt.tight_layout(); plt.show()

# --- Near-duplicate detection ---
sims = util.cos_sim(emb, emb).cpu().numpy()
np.fill_diagonal(sims, 0)  # exclude each problem's similarity to itself
i, j = np.unravel_index(sims.argmax(), sims.shape)
print(f"\nMost similar pair (cos={sims[i, j]:.3f}):")
for n in (i, j):
    print(f"\n  paper_id={work.iloc[n]['paper_id']} | "
          f"{work.iloc[n]['taxonomy_level_1']}")
    print("   ", work.iloc[n][TEXT_COL][:240].replace("\n", " "), "...")
print("\nDone. Set SAMPLE_SIZE=None at the top to run on the full 14.1k rows.")

We build a semantic search function that retrieves the most similar research problems for a given query. We then train a classifier on the embeddings to predict each problem’s open-status label. Finally, we compute similarity across all embedded problems to detect the closest pair and identify near-duplicate or strongly related problem statements.
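Both the search function and the near-duplicate check reduce to cosine similarity between vectors. The following dependency-free sketch shows the two operations on toy 2-D vectors. The helper names `normalize`, `top_k`, and `closest_pair` and the vectors themselves are hypothetical; the tutorial's own code uses `util.cos_sim` on the real embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, vectors, k=2):
    """Indices of the k vectors most similar to the query, best first."""
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: sims[i], reverse=True)
    return order[:k], sims


def closest_pair(vectors):
    """Return (i, j, sim) for the most similar pair with i < j."""
    unit = [normalize(v) for v in vectors]
    best = (-1, -1, -2.0)  # cosine is never below -1, so any pair beats this
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if sim > best[2]:
                best = (i, j, sim)
    return best


# Hypothetical toy "embeddings".
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
idx, sims = top_k([1.0, 0.05], vectors, k=2)
pair = closest_pair(vectors)
print(idx, pair)
```

Looping over `i < j` plays the same role as zeroing the diagonal of the full similarity matrix: a problem is never matched with itself. The full-matrix version is convenient at the 4,000-row sample size, but the matrix grows quadratically. At the full 14.1k rows it holds roughly 200 million entries, so a chunked or thresholded comparison may be worth considering there.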

In conclusion, we have a complete workflow for analyzing research-level mathematical problems using modern NLP and machine learning tools. We started with dataset exploration, then used TF-IDF, sentence embeddings, dimensionality reduction, clustering, semantic search, and classification to understand the corpus's structure from multiple angles. This workflow gives us a practical way to study how mathematical problems are grouped, how similar problems can be retrieved, and how embeddings can support both exploratory analysis and supervised prediction tasks.

The post Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset appeared first on MarkTechPost.