Config.GenerateSimilar

Path: build/Config/GenerateSimilar.hs | Language: Haskell | Lines: ~60

Configuration constants for embedding-based similar link recommendations


Overview

Config.GenerateSimilar centralizes tuning parameters for the similar links recommendation system. The system uses OpenAI text embeddings to find semantically related pages, presenting "Similar Links" sections that help readers discover relevant content.

The module contains three categories of settings: (1) embedding pipeline parameters like token limits and result counts, (2) similarity thresholds that control which matches are shown, and (3) blacklists that exclude problematic URLs from recommendations. These values were empirically tuned through observation of recommendation quality.

The configuration is consumed primarily by GenerateSimilar for computing recommendations and by generateDirectory for displaying auto-inferred tags on directory pages.


Public API

All exports are simple constants, with one exception: blackList is a small predicate function.

bestNEmbeddings :: Int

Number of similar results to retrieve per query. Set to 20.

Used by: findN in GenerateSimilar for k-nearest-neighbor lookup


maximumLength :: Int

Maximum character count for text sent to OpenAI's embedding API. Set to 32,700 characters.

Based on OpenAI's estimate of ~4 characters per BPE token, 32,700 characters is roughly 8,175 tokens, just under text-embedding-ada-002's 8,191-token limit. The shell script handles truncation and retries if a call fails, so this value can sit close to the theoretical maximum.

Used by: Text formatting before API calls
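
The character budget and the token arithmetic behind it can be illustrated with a minimal sketch. Only maximumLength and its value come from this module; estimatedTokens and truncateForEmbedding are hypothetical helper names used for illustration, and the real truncation may differ.

```haskell
-- Value documented for Config.GenerateSimilar.
maximumLength :: Int
maximumLength = 32700

-- Hypothetical helper: rough token estimate using the
-- ~4 characters per BPE token rule of thumb.
estimatedTokens :: Int -> Int
estimatedTokens chars = chars `div` 4

-- Hypothetical helper: cut the input down to the character budget
-- before it is sent for embedding. Shorter inputs pass through unchanged.
truncateForEmbedding :: String -> String
truncateForEmbedding = take maximumLength
```

Here estimatedTokens maximumLength evaluates to 8175, which leaves a small margin below the 8,191-token limit.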


minimumSuggestions :: Int

Minimum number of similar links required before showing the "Similar Links" section to readers. Set to 3.

If fewer matches pass all filters, the section is suppressed entirely—one or two weak suggestions aren't worth the UI overhead.

Used by: writeOutMatch in GenerateSimilar
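
A minimal sketch of the suppression rule, assuming matches are represented as a plain list of URLs. The function name similarLinksSection is hypothetical; writeOutMatch itself may be structured differently.

```haskell
-- Value documented for Config.GenerateSimilar.
minimumSuggestions :: Int
minimumSuggestions = 3

-- Hypothetical helper: return Nothing when too few matches survive
-- filtering, so the caller omits the "Similar Links" section entirely.
similarLinksSection :: [String] -> Maybe [String]
similarLinksSection matches
  | length matches < minimumSuggestions = Nothing
  | otherwise                           = Just matches
```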


iterationLimit :: Int

Maximum retry iterations for the k-NN search. Set to 6.

Prevents infinite loops when the search keeps failing to find enough valid results.

Used by: findN recursive search
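
One way a bounded retry loop of this kind could look is sketched below. This is not the actual findN: the function searchWithLimit, its arguments, and the strategy of doubling the neighbor count on each retry are all assumptions made for illustration. Only bestNEmbeddings and iterationLimit come from this module.

```haskell
-- Values documented for Config.GenerateSimilar.
bestNEmbeddings, iterationLimit :: Int
bestNEmbeddings = 20
iterationLimit  = 6

-- Hypothetical bounded search. 'lookupK k' stands in for a k-NN query
-- returning up to k candidates; 'valid' stands in for result filtering.
-- If too few candidates survive, retry with a doubled k (an assumed
-- widening strategy), stopping after iterationLimit attempts.
searchWithLimit :: (Int -> [a]) -> (a -> Bool) -> Int -> [a]
searchWithLimit lookupK valid wanted = go 1 wanted
  where
    go iteration k
      | length hits >= wanted       = take wanted hits
      | iteration >= iterationLimit = take wanted hits  -- give up, return what we have
      | otherwise                   = go (iteration + 1) (k * 2)
      where
        hits = filter valid (lookupK k)

-- Hypothetical entry point using the configured result count.
findSimilar :: (Int -> [a]) -> (a -> Bool) -> [a]
findSimilar lookupK valid = searchWithLimit lookupK valid bestNEmbeddings
```

The iteration cap guarantees termination even when the filter rejects every candidate, which is the failure mode the limit is meant to guard against.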


embeddingsPath :: String

File path for the binary-serialized embeddings database. Set to "metadata/embeddings.bin".

Used by: readEmbeddings, writeEmbeddings


minDistance, maxDistance :: Double

Cosine distance bounds for valid matches. These are distances, not similarities, so lower means more similar:

  • minDistance = 0.01: lower bound; excludes self-matches and near-duplicates
  • maxDistance = 0.95: upper bound; matches farther away than this are treated as irrelevant

These were empirically tuned. A comment in the source notes a "cliff of relevancy" around 0.60 with text-embedding-ada-002 and warns that the thresholds must be rechecked whenever the embedding model changes.

Used by: findNearest filtering
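
The bounds can be read as a simple window on each candidate's distance. The sketch below assumes candidates arrive as (URL, distance) pairs and that both bounds are inclusive; neither detail is confirmed by the source, and withinBounds and filterMatches are hypothetical names.

```haskell
-- Values documented for Config.GenerateSimilar.
minDistance, maxDistance :: Double
minDistance = 0.01
maxDistance = 0.95

-- Hypothetical helper: is this distance inside the accepted window?
-- Inclusive comparison is an assumption.
withinBounds :: Double -> Bool
withinBounds d = d >= minDistance && d <= maxDistance

-- Hypothetical helper: drop self-matches (distance near 0) and
-- candidates beyond the upper bound.
filterMatches :: [(String, Double)] -> [(String, Double)]
filterMatches = filter (withinBounds . snd)
```

For example, a page matched against itself at distance 0.0 is dropped by the lower bound, while a candidate at 0.3 is kept.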


blackList :: String -> Bool

Predicate function that returns True for URLs that should be excluded from recommendations.

Excludes:

  • Explicit blacklisted URLs (index, changelog, help pages)
  • Document index pages (/doc/*/index)
  • Newsletter archives (/newsletter/20*)
  • Lorem ipsum test pages (/lorem*)

Used by: findN result filtering


blackListURLs :: [String]

Explicit URL blacklist. Currently contains only ["/index", "/changelog", "/help"].

A large commented-out section preserves historical blacklist entries that exhibited pathological behavior (matching too many unrelated pages). The TODO suggests re-testing whether these are still needed.
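
The predicate and explicit list described above might be sketched as follows. The three explicit URLs come from this page; the exact pattern checks (prefix and suffix tests via Data.List) are an assumption about how the documented patterns are implemented.

```haskell
import Data.List (isPrefixOf, isSuffixOf)

-- Explicit entries as documented above.
blackListURLs :: [String]
blackListURLs = ["/index", "/changelog", "/help"]

-- Sketch of the predicate: True means "exclude from recommendations".
-- The pattern tests below are assumed, not copied from the source.
blackList :: String -> Bool
blackList url =
     url `elem` blackListURLs                                -- explicit entries
  || ("/doc/" `isPrefixOf` url && "/index" `isSuffixOf` url) -- /doc/*/index
  || "/newsletter/20" `isPrefixOf` url                       -- /newsletter/20*
  || "/lorem" `isPrefixOf` url                               -- /lorem*
```

A caller would typically apply it as filter (not . blackList) over candidate URLs.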


minTagAuto :: Int

Minimum number of auto-inferred tags required before they are displayed on directory pages. Set to 3.

If fewer than 3 tags are suggested, they are hidden, since sparse tagging isn't useful.

Used by: generateDirectory.hs for directory page rendering


maxTitlesForTagGuessing :: Int

Maximum page titles sent to the LLM for tag name inference. Set to 30.

Beyond 30 titles, additional context provides diminishing returns while wasting tokens.

Used by: Tag guessing in GenerateSimilar
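
The two tag-related constants, minTagAuto and maxTitlesForTagGuessing, both act as simple list limits. A minimal sketch, assuming titles and tags are plain string lists; titlesForPrompt and tagsToDisplay are hypothetical names.

```haskell
-- Values documented for Config.GenerateSimilar.
minTagAuto, maxTitlesForTagGuessing :: Int
minTagAuto              = 3
maxTitlesForTagGuessing = 30

-- Hypothetical helper: cap the titles included in the tag-guessing prompt.
titlesForPrompt :: [String] -> [String]
titlesForPrompt = take maxTitlesForTagGuessing

-- Hypothetical helper: show auto-inferred tags only when there are
-- at least minTagAuto of them; otherwise show none.
tagsToDisplay :: [String] -> [String]
tagsToDisplay tags
  | length tags < minTagAuto = []
  | otherwise                = tags
```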


randSeed :: Word64

Fixed random seed for reproducible RP-tree construction. Set to 23.

Ensures the embedding forest is deterministic across rebuilds.

Used by: RP-tree initialization


Internal Architecture

This module has no internal architecture—it's purely declarative constants. The only logic is blackList, which combines explicit membership testing with prefix/suffix pattern matching.


Key Patterns

Empirically-Tuned Magic Numbers

Every constant includes a comment explaining its rationale. The similarity thresholds carry warnings about model-specific calibration—a good practice for ML-adjacent code.

Defensive Blacklisting

The blacklist evolved from debugging pathological cases. One specific paper ("Estimating the effect-size of gene dosage...") is called out for somehow matching nearly every embedding, illustrating the kind of edge case this system encounters.

Conservative Minimums

Both minimumSuggestions and minTagAuto use 3 as a threshold—a recurring "magic 3" that represents the minimum useful information density.


Configuration

These constants are compile-time configuration. To change values, edit this file and rebuild. There's no runtime configuration mechanism.

| Constant | Value | Purpose |
| --- | --- | --- |
| bestNEmbeddings | 20 | Results per query |
| maximumLength | 32,700 | Max chars to embed |
| minimumSuggestions | 3 | Min results to show |
| iterationLimit | 6 | Max search retries |
| minDistance | 0.01 | Lower distance bound |
| maxDistance | 0.95 | Upper distance bound |
| minTagAuto | 3 | Min tags for display |
| maxTitlesForTagGuessing | 30 | Max titles for LLM |
| randSeed | 23 | RP-tree seed |
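
Because the configuration is compile-time, each value in the table above is an ordinary top-level definition. The sketch below shows what such definitions might look like using the documented types and values; the grouping and layout are assumptions, the module header is omitted, and the real file also carries explanatory comments and the blackList definitions.

```haskell
import Data.Word (Word64)

-- Embedding pipeline parameters.
bestNEmbeddings, maximumLength, minimumSuggestions, iterationLimit :: Int
bestNEmbeddings    = 20
maximumLength      = 32700
minimumSuggestions = 3
iterationLimit     = 6

-- Location of the binary-serialized embeddings database.
embeddingsPath :: String
embeddingsPath = "metadata/embeddings.bin"

-- Distance window for valid matches.
minDistance, maxDistance :: Double
minDistance = 0.01
maxDistance = 0.95

-- Tag inference limits.
minTagAuto, maxTitlesForTagGuessing :: Int
minTagAuto              = 3
maxTitlesForTagGuessing = 30

-- Fixed seed for reproducible RP-tree construction.
randSeed :: Word64
randSeed = 23
```

Changing a threshold means editing the literal and rebuilding; nothing is read from the environment or a config file at runtime.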

Integration Points

Consumers

| Module | Usage |
| --- | --- |
| GenerateSimilar | All embedding/similarity settings |
| generateDirectory | minTagAuto for tag display |

Data Dependencies

  • embeddings.bin — Binary-serialized embedding vectors at metadata/embeddings.bin

See Also