generateSimilarLinks.hs

Path: build/generateSimilarLinks.hs | Language: Haskell | Lines: ~106

CLI tool for generating embedding-based "similar links" recommendations as HTML fragments

Overview

generateSimilarLinks.hs is the command-line entry point for gwern.net's embedding-based recommendation system. It generates "See also" HTML fragments containing similar articles based on semantic similarity of OpenAI embeddings.

The tool orchestrates the complete pipeline: reading existing embeddings and metadata, identifying pages that need embeddings, calling the OpenAI API to generate new embeddings, building an RP-tree spatial index for fast nearest-neighbor search, computing similarity rankings, reordering results for coherence, and writing out HTML fragments to metadata/annotation/similar/.

The system is designed for incremental updates—it only embeds missing pages, supports watch-mode for immediate updates during annotation workflows, and can operate in several modes: full rebuild, embedding-only, or update-missing-only. This makes it practical for a large corpus (thousands of documents) without re-embedding everything on each run.

CLI Usage

Basic Invocation

# Full pipeline: embed missing items, compute all similarities, write HTML
runghc -istatic/build/ ./static/build/generateSimilarLinks.hs

# Embed new items only (no similarity computation)
runghc -istatic/build/ ./static/build/generateSimilarLinks.hs --only-embed

# Update only items missing similarity files (faster than full rebuild)
runghc -istatic/build/ ./static/build/generateSimilarLinks.hs --update-only-missing-embeddings

Command-Line Flags

--only-embed: Generate embeddings for new items, save to database, then exit. Does not compute similarities or write HTML fragments. Used for incremental embedding during annotation work.
--update-only-missing-embeddings: Embed new items (if any), then compute similarities only for URLs missing HTML fragments. Skips full rebuild of all existing similarity files.
(no arguments): Full rebuild—embed missing items, compute similarities for all annotated URLs, write all HTML fragments.

Watch Mode Example

The tool supports an inotifywait-based daemon for automatic embedding:

# In crontab
@reboot screen -d -m -S "embed" bash -c \
  'cd ~/wiki/; while true; do \
     inotifywait ~/wiki/metadata/*.gtx -e attrib && \
     sleep 10s && \
     runghc -istatic/build/ ./static/build/generateSimilarLinks.hs --only-embed; \
   done'

This watches annotation database files (.gtx) and immediately embeds new annotations, making them available for similarity matching without waiting for nightly rebuild.

Internal Architecture

Data Flow

Load databases (metadata, backlinks, embeddings)
Identify missing embeddings: Compare metadata URLs with embedding database and list missing entries (no date-based ordering)
Generate embeddings: Call OpenAI API for missing items (batch limit: 2000 at once)
Prune stale embeddings: Remove embeddings for URLs no longer in metadata
Build RP-tree forest: Spatial index for fast k-nearest-neighbor search
Compute similarities: For each URL, find N nearest neighbors via tree search
Reorder results: Sort matches by pairwise distance for internal coherence
Write HTML fragments: Generate metadata/annotation/similar/URL.html files
Expire reciprocal matches: Delete similarity files for newly-matched items to trigger rebuild

Key Data Structures

Embedding tuple (from GenerateSimilar):

type Embedding = (String, Integer, String, String, [Double])
-- (url/path, ModifiedJulianDay age, embedded text, model id, vector)

Forest: RP-tree spatial index built from embeddings for fast approximate nearest-neighbor search.

Databases loaded:

md :: Metadata — annotation database (titles, abstracts, dates)
bdb :: Backlinks — bidirectional link index
edb :: Embeddings — OpenAI embedding vectors
sortDB :: ListSortedMagic — cached sorted-by-similarity lists

Processing Phases

Phase 1: Embedding

Filter items without abstracts (can't embed empty content)
Embed entries missing from the embeddings DB (no modification-time prioritization)
Batch at most 2000 embeddings per run (API cost control)
Use embed from GenerateSimilar to create embeddings (includes backlinks context)

Phase 2: Similarity Computation

Build RP-tree forest via embeddings2Forest
For each URL, call findN to get top N nearest neighbors
Re-sort matches with sortSimilars for pairwise coherence (not just distance to target)
Filter out items already linked in abstract or body

Phase 3: HTML Generation

Call writeOutMatch to generate HTML fragment
Fragments stored in metadata/annotation/similar/URLENCODED.html (max 274 chars filename)
Search links (Google Scholar/Google/Context) are appended when metadata has a title/abstract; not conditional on "few" results

Key Patterns

Incremental Embedding Strategy

The tool uses missingEmbeddings to identify annotation URLs not yet embedded, then similaritemExistsP to check for existing HTML fragments. This two-level check allows:

New annotations to be embedded immediately
Existing items missing similarity files to be regenerated
Full rebuilds to refresh all similarity rankings

Batching and Parallelization

maxEmbedAtOnce = 2000  -- API cost control
let mdlMissingChunks = chunksOf 10 mdlMissing
mapM_ (mapM_ (...)) mdlMissingChunks

Processing happens in chunks of 10 for progress visibility. Full rewrites use Control.Monad.Parallel.mapM_ for parallel HTML generation.

Reciprocal Update Triggering

When a new embedding is created and matches existing items, those items should show the new page as a match too (similarity is symmetric). Rather than recompute immediately, the tool deletes their HTML fragments via expireMatches, forcing rebuild on next run:

when (f `elem` todoLinks) $ expireMatches (snd nmatchesSorted)

Reordering for Coherence

Raw nearest-neighbor results are reordered with sortSimilars, which minimizes pairwise distances among the result set. This produces more internally coherent recommendation lists—items that are not only similar to the target, but also similar to each other.

Stale Embedding Cleanup

pruneEmbeddings removes embeddings for URLs no longer in the metadata database (typically renamed/deleted). Prevents false positives and database bloat.

Configuration

Via Config.GenerateSimilar:

bestNEmbeddings: Number of similar items to find (default ~10-20)
iterationLimit: Max k-NN search iterations before giving up
maxDistance: Maximum embedding distance threshold
minimumSuggestions: Don't write HTML if fewer matches found
blackList: URLs to exclude from recommendations
embeddingsPath: Path to serialized embeddings database

Via maxEmbedAtOnce constant:

Limits batch size to 2000 embeddings per run (API cost control)

Integration Points

Inputs

metadata/full.gtx (via link-metadata-hs): Annotation database with abstracts for embedding
metadata/backlinks.hs (via link-backlink-hs): Bidirectional link index for filtering
metadata/embeddings.bin: Serialized OpenAI embedding vectors
metadata/listsortedmagic.hs: Cached sorted-by-similarity lists

Outputs

metadata/annotation/similar/*.html: HTML fragments with similar link lists
metadata/embeddings.bin: Updated embedding database (binary serialization)

Called By

sync-sh: Nightly rebuild phase
Watch daemon: inotifywait-based embedding-on-write

Calls

All core functionality from generate-similar-hs:

embed: Generate OpenAI embeddings with backlinks context
readEmbeddings / writeEmbeddings: Binary serialization
embeddings2Forest: Build RP-tree spatial index
findN: k-NN search with iteration-based expansion
sortSimilars: Pairwise distance minimization reordering
writeOutMatch: HTML fragment generation
pruneEmbeddings: Remove stale entries
expireMatches: Delete HTML to force rebuild

Also uses:

link-metadata-hs: readLinkMetadata, sortItemPathDateModified
link-backlink-hs: readBacklinksDB, getSimilarLink

HTML Fragment Integration

Generated HTML fragments are transcluded by the annotation system when popups/popovers are displayed. The similar links appear in a "See also" section at the bottom of annotations.

Common Failure Points

OpenAI API Errors

If embeddings fail (rate limit, network error, API key), the tool will error out. Embeddings are batched at 2000/run to avoid hitting rate limits too hard. Retry by re-running—already-embedded items are skipped.

Insufficient Matches

If fewer than minimumSuggestions matches are found, no HTML is written. This is expected for very niche content with no similar pages.

Corrupted Embedding Database

Binary deserialization can fail if embeddings.bin is corrupted. The writeEmbeddings function has a write-validate-rename pattern to prevent corruption, but if it happens, delete the file to force full rebuild.

Watch Mode Pitfalls

The inotifywait daemon example watches for attrib events on .gtx files. If annotation writes don't trigger this event type (filesystem-dependent), use -e modify instead. The 10s sleep prevents race conditions with concurrent annotation writes.

Filename Length Limits

Output paths are truncated to 274 characters to avoid filesystem limits:

let f = take 274 $ "metadata/annotation/similar/" ++ urlEncode p ++ ".html"

Very long URLs will have truncated filenames, which can cause collisions (rare but possible).

Overview​

CLI Usage​

Basic Invocation​

Command-Line Flags​

Watch Mode Example​

Internal Architecture​

Data Flow​

Key Data Structures​

Processing Phases​

Key Patterns​

Incremental Embedding Strategy​

Batching and Parallelization​

Reciprocal Update Triggering​

Reordering for Coherence​

Stale Embedding Cleanup​

Configuration​

Integration Points​

Inputs​

Outputs​

Called By​

Calls​

HTML Fragment Integration​

Common Failure Points​

OpenAI API Errors​

Insufficient Matches​

Corrupted Embedding Database​

Watch Mode Pitfalls​

Filename Length Limits​

See Also​