Semantic search has changed how we find information, from product recommendations to contextual document retrieval. It works by mapping queries and items into a shared latent space and returning the items whose vectors lie closest to the query in meaning.
Traditionally, we take the top-N most similar items (say, the top 10) and call it a day. But there’s an underlying assumption here: that those top 10 are always meaningfully distinct and relevant. In practice, that assumption breaks down quickly.
In semantic spaces, “top-N” is often arbitrary. Some queries deserve 2 results, others 20; it depends on how the similarity scores behave.
During our early experiments, we noticed that the similarity score distributions across queries were highly inconsistent. Sometimes there was a sharp drop between relevant and irrelevant items; other times, the scores were nearly flat. Synthetic product descriptions, for instance, often used repetitive language such as “available in X colors and Y materials,” collapsing distinctions between items.
As a result, a fixed-N cutoff returned inconsistent quality: sometimes too few results, sometimes too many. The solution? Let the data itself determine where to stop.
The idea: detect the drop-off point dynamically
Instead of choosing N in advance, we look at the pattern of differences between consecutive similarity scores. The goal is to find where the curve of scores changes behavior: the point that separates truly relevant items from everything else.
Formally, for a ranked list of similarity scores:
S = [s₁, s₂, ..., sₙ]
We compute the pairwise differences:
ΔS = [s₂ − s₁, s₃ − s₂, ..., sₙ − sₙ₋₁]
Plotted, these differences often reveal a sharp negative “jump” where semantic closeness suddenly drops; that jump is our potential cutoff.
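As a minimal sketch of this step (assuming the scores are already sorted in descending order and using NumPy, neither of which the text above specifies), the gaps can be computed in one call:

```python
import numpy as np

# Ranked similarity scores for one query, highest first (illustrative values).
scores = np.array([0.91, 0.89, 0.88, 0.62, 0.60, 0.59, 0.58])

# Consecutive differences: deltas[i] = s[i+1] - s[i]; large negative values mark drops.
deltas = np.diff(scores)
print(deltas)  # the gap 0.62 - 0.88 = -0.26 stands out as the candidate drop-off
```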
The method: adaptive thresholding via z-scores
We treat this as a statistical detection problem. First, compute the mean (μ) and standard deviation (σ) of the score differences:
μ = mean(ΔS)
σ = std(ΔS)
Then, calculate the z-score for each gap:
Zᵢ = (ΔSᵢ − μ) / σ
Large negative z-scores correspond to significant drops, i.e., candidate cutoff points. We choose all indices i where:
Zᵢ < z_threshold
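A small sketch of this detection step, where the function name and the default threshold of −1.5 are assumptions for illustration rather than values from the paper:

```python
import numpy as np

def candidate_cutoffs(scores, z_threshold=-1.5):
    """Indices i where the gap between scores[i] and scores[i+1] is an
    unusually large drop, i.e. its z-score falls below z_threshold."""
    deltas = np.diff(scores)                 # ΔS
    mu, sigma = deltas.mean(), deltas.std()  # μ and σ of the gaps
    if sigma == 0:                           # perfectly flat list: no drop-off
        return np.array([], dtype=int)
    z = (deltas - mu) / sigma                # Zᵢ
    return np.where(z < z_threshold)[0]      # candidate cutoff positions
```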
To ensure stability, we add a second condition using the relative percentage difference between consecutive scores:
Pᵢ = |sᵢ₊₁ − sᵢ| / sᵢ
Only if this relative drop also exceeds a predefined threshold t_D do we confirm the gap as a valid cutoff. This guards against small fluctuations in noisy embeddings, where a nearly flat score distribution can make even tiny gaps look statistically extreme.
In essence: we detect statistically significant changes in score gradients to let each query decide its own optimal cutoff rank.
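Putting both conditions together, here is one possible end-to-end sketch. The default thresholds and the direction of the relative-drop check reflect my reading of the description above, not the authors’ reference implementation:

```python
import numpy as np

def adaptive_cutoff(scores, z_threshold=-1.5, t_d=0.1):
    """Return the number of results to keep for one query.

    scores: similarity scores sorted in descending order.
    A cutoff after position i is accepted when the gap s[i+1] - s[i] is both
    statistically significant (z-score below z_threshold) and substantively
    large (relative drop above t_d).
    """
    scores = np.asarray(scores, dtype=float)
    if scores.size < 3:
        return scores.size                  # too few items to estimate μ and σ

    deltas = np.diff(scores)
    mu, sigma = deltas.mean(), deltas.std()
    if sigma == 0:
        return scores.size                  # flat distribution: keep everything

    z = (deltas - mu) / sigma               # statistical size of each gap
    rel = np.abs(deltas) / scores[:-1]      # relative drop Pᵢ

    valid = np.where((z < z_threshold) & (rel > t_d))[0]
    if valid.size == 0:
        return scores.size                  # no significant drop found
    return int(valid[0]) + 1                # keep items before the first drop
```

On the illustrative score list from the first snippet, this returns 3: the tight cluster at the top is kept and everything after the 0.88 → 0.62 gap is dropped.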
Why this approach is different
- Per-query adaptivity: Each query can return a different number of results depending on score behavior.
- Statistical grounding: Uses z-score deviation instead of arbitrary thresholds.
- Bias correction: Handles flattened similarity distributions from templated or synthetic text.
- Model-agnostic: Works with any embedding model or similarity metric.
While similar in spirit to “elbow detection” in clustering, this approach operates directly on semantic similarity rankings rather than on error metrics or variance. It’s a probabilistic, lightweight layer that integrates easily atop any ranking pipeline.
Results: relevance that adapts
Empirically, adaptive thresholding produces cleaner and more context-aware retrieval:
- Reduces irrelevant tail items in ambiguous queries.
- Preserves strong relevance for clear queries.
- Improves user satisfaction by returning variable-length, high-precision results.
Rather than asking “how many results should I show?”, we ask “where does relevance naturally stop?”
Comparison summary
| Approach | Behavior | Limitation |
|---|---|---|
| Fixed Top-N | Always returns same number of results | Fails for ambiguous or dense queries |
| Fixed Score Threshold | Cutoff by absolute similarity | Assumes uniform embedding scale |
| Adaptive Thresholding (ours) | Detects score drop-off per query | Requires computing μ, σ, z for ΔS |
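To make the per-query behavior in the table concrete, here is a hypothetical comparison of the three strategies on two queries with different score shapes, reusing the adaptive_cutoff sketch from above:

```python
# Query A: a sharp drop after three strong matches around 0.9.
query_a = [0.91, 0.89, 0.88, 0.62, 0.60, 0.59, 0.58]
# Query B: scores on a lower scale, with a drop after four matches.
query_b = [0.55, 0.54, 0.53, 0.52, 0.28, 0.27]

for name, scores in [("A", query_a), ("B", query_b)]:
    fixed_n = min(5, len(scores))                # fixed top-N always keeps 5
    fixed_score = sum(s > 0.70 for s in scores)  # fixed absolute threshold at 0.70
    adaptive = adaptive_cutoff(scores)           # per-query drop-off detection
    print(f"query {name}: top-5={fixed_n}, score>0.70={fixed_score}, adaptive={adaptive}")
```

The adaptive cutoff tracks each query’s own drop-off (3 for A, 4 for B), while fixed top-5 over-returns for both, and the absolute 0.70 threshold misses query B entirely because its scores sit on a different scale.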
Closing thoughts
Traditional retrieval treats ranking as a deterministic list: return the top-N and move on. But information retrieval is inherently probabilistic: every query carries its own ambiguity, structure, and score distribution.
Adaptive Thresholding for Optimal Ranking reframes this process as a signal detection problem. By statistically identifying significant drops in similarity, it lets the data, not arbitrary heuristics, decide what’s relevant enough to show.
In a world where embeddings are ever-improving, adaptivity isn’t optional; it’s how semantic search stays honest about relevance.
References
- Singh Gill, A., Patel, S., Varga, P., Miller, P., & Athanasiadis, S. (2025). Adaptive Thresholding for Optimal Ranking in Latent Semantic Spaces. SIGIR ’25, Padua, Italy.
- Satopaa, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior. (Knee point detection algorithms.)
- Croft, W. B., Metzler, D., & Strohman, T. (2015). Search Engines: Information Retrieval in Practice. (Dynamic cutoff methods in information retrieval.)