generateSimilarLinks.hs
Path: build/generateSimilarLinks.hs | Language: Haskell | Lines: ~106
CLI tool for generating embedding-based "similar links" recommendations as HTML fragments
Overview
generateSimilarLinks.hs is the command-line entry point for gwern.net's embedding-based recommendation system. It generates "See also" HTML fragments containing similar articles based on semantic similarity of OpenAI embeddings.
The tool orchestrates the complete pipeline: reading existing embeddings and metadata, identifying pages that need embeddings, calling the OpenAI API to generate new embeddings, building an RP-tree spatial index for fast nearest-neighbor search, computing similarity rankings, reordering results for coherence, and writing out HTML fragments to metadata/annotation/similar/.
The system is designed for incremental updates—it only embeds missing pages, supports watch-mode for immediate updates during annotation workflows, and can operate in several modes: full rebuild, embedding-only, or update-missing-only. This makes it practical for a large corpus (thousands of documents) without re-embedding everything on each run.
CLI Usage
Basic Invocation
# Full pipeline: embed missing items, compute all similarities, write HTML
runghc -istatic/build/ ./static/build/generateSimilarLinks.hs
# Embed new items only (no similarity computation)
runghc -istatic/build/ ./static/build/generateSimilarLinks.hs --only-embed
# Update only items missing similarity files (faster than full rebuild)
runghc -istatic/build/ ./static/build/generateSimilarLinks.hs --update-only-missing-embeddings
Command-Line Flags
- --only-embed: Generate embeddings for new items, save to database, then exit. Does not compute similarities or write HTML fragments. Used for incremental embedding during annotation work.
- --update-only-missing-embeddings: Embed new items (if any), then compute similarities only for URLs missing HTML fragments. Skips full rebuild of all existing similarity files.
- (no arguments): Full rebuild: embed missing items, compute similarities for all annotated URLs, write all HTML fragments.
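A minimal sketch of the flag dispatch, assuming a standard getArgs-based main; the branch bodies are placeholders, not the tool's actual definitions:
import System.Environment (getArgs)

-- Hypothetical sketch only: each branch stands in for the real pipeline stage.
main :: IO ()
main = do
  args <- getArgs
  case args of
    ["--only-embed"]                     -> putStrLn "embed missing items, save database, exit"
    ["--update-only-missing-embeddings"] -> putStrLn "embed, then rebuild only missing similarity fragments"
    []                                   -> putStrLn "full rebuild: embed, compute all similarities, write all HTML"
    _                                    -> error ("unrecognized arguments: " ++ show args)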
Watch Mode Example
The tool supports an inotifywait-based daemon for automatic embedding:
# In crontab
@reboot screen -d -m -S "embed" bash -c \
'cd ~/wiki/; while true; do \
inotifywait ~/wiki/metadata/*.gtx -e attrib && \
sleep 10s && \
runghc -istatic/build/ ./static/build/generateSimilarLinks.hs --only-embed; \
done'
This watches the annotation database files (.gtx) and immediately embeds new annotations, making them available for similarity matching without waiting for the nightly rebuild.
Internal Architecture
Data Flow
- Load databases (metadata, backlinks, embeddings)
- Identify missing embeddings: Compare metadata URLs with embedding database and list missing entries (no date-based ordering)
- Generate embeddings: Call OpenAI API for missing items (batch limit: 2000 at once)
- Prune stale embeddings: Remove embeddings for URLs no longer in metadata
- Build RP-tree forest: Spatial index for fast k-nearest-neighbor search
- Compute similarities: For each URL, find N nearest neighbors via tree search
- Reorder results: Sort matches by pairwise distance for internal coherence
- Write HTML fragments: Generate metadata/annotation/similar/URL.html files
- Expire reciprocal matches: Delete similarity files for newly-matched items to trigger rebuild
Key Data Structures
Embedding tuple (from GenerateSimilar):
type Embedding = (String, Integer, String, String, [Double])
-- (url/path, ModifiedJulianDay age, embedded text, model id, vector)
Forest: RP-tree spatial index built from embeddings for fast approximate nearest-neighbor search.
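For intuition, here is a self-contained brute-force version of what the forest approximates: rank every other embedding by its distance to a query vector. This is only an illustration (plain Euclidean distance, linear scan); the real code delegates to the RP-tree index.
import Data.List (sortOn)

-- Same tuple shape as the Embedding type above.
type Embedding = (String, Integer, String, String, [Double])

euclidean :: [Double] -> [Double] -> Double
euclidean a b = sqrt . sum $ zipWith (\x y -> (x - y) ** 2) a b

-- Brute-force k-nearest-neighbors: what the RP-tree forest computes approximately, but much faster.
nearestN :: Int -> Embedding -> [Embedding] -> [(Double, String)]
nearestN k (url, _, _, _, vec) db =
  take k $ sortOn fst [ (euclidean vec v, u) | (u, _, _, _, v) <- db, u /= url ]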
Databases loaded:
- md :: Metadata — annotation database (titles, abstracts, dates)
- bdb :: Backlinks — bidirectional link index
- edb :: Embeddings — OpenAI embedding vectors
- sortDB :: ListSortedMagic — cached sorted-by-similarity lists
Processing Phases
Phase 1: Embedding
- Filter items without abstracts (can't embed empty content)
- Embed entries missing from the embeddings DB (no modification-time prioritization)
- Batch at most 2000 embeddings per run (API cost control)
- Use embed from GenerateSimilar to create embeddings (includes backlinks context)
Phase 2: Similarity Computation
- Build RP-tree forest via embeddings2Forest
- For each URL, call findN to get top N nearest neighbors
- Re-sort matches with sortSimilars for pairwise coherence (not just distance to target)
- Filter out items already linked in abstract or body
Phase 3: HTML Generation
- Call writeOutMatch to generate HTML fragment
- Fragments stored in metadata/annotation/similar/URLENCODED.html (max 274-character filename)
- Search links (Google Scholar/Google/Context) are appended when metadata has a title/abstract; not conditional on "few" results
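To make the fragment-writing step concrete, here is a hedged sketch that reuses the truncated-path formula quoted under "Filename Length Limits" below. escapeURIString stands in for the real code's urlEncode, and the bare <ul> output is a simplification, not the actual writeOutMatch markup:
import Network.URI (escapeURIString, isUnreserved)

-- Output path per the formula quoted below (truncated to 274 characters);
-- escapeURIString is a stand-in for the real code's urlEncode.
similarFragmentPath :: String -> FilePath
similarFragmentPath p =
  take 274 $ "metadata/annotation/similar/" ++ escapeURIString isUnreserved p ++ ".html"

-- Simplified stand-in for writeOutMatch: emit a bare list of (url, title) links.
writeFragment :: String -> [(String, String)] -> IO ()
writeFragment target matches =
  writeFile (similarFragmentPath target) $
    "<ul>" ++ concat [ "<li><a href=\"" ++ u ++ "\">" ++ t ++ "</a></li>" | (u, t) <- matches ] ++ "</ul>"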
Key Patterns
Incremental Embedding Strategy
The tool uses missingEmbeddings to identify annotation URLs not yet embedded, then similaritemExistsP to check for existing HTML fragments. This two-level check allows:
- New annotations to be embedded immediately
- Existing items missing similarity files to be regenerated
- Full rebuilds to refresh all similarity rankings
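A sketch of the idea behind that two-level check, assuming the databases behave like maps keyed by URL (the real missingEmbeddings and similaritemExistsP in GenerateSimilar have their own types):
import qualified Data.Map.Strict as M
import System.Directory (doesFileExist)

-- Level 1: URLs present in the metadata but absent from the embeddings database,
-- capped at the per-run batch limit (2000 in the real tool).
missingToEmbed :: Int -> M.Map String meta -> M.Map String emb -> [String]
missingToEmbed cap md edb = take cap $ M.keys (md `M.difference` edb)

-- Level 2: does a similarity fragment already exist on disk?
-- (Path construction simplified; the real code URL-encodes and truncates.)
needsFragment :: String -> IO Bool
needsFragment url = fmap not $ doesFileExist ("metadata/annotation/similar/" ++ url ++ ".html")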
Batching and Parallelization
maxEmbedAtOnce = 2000 -- API cost control
let mdlMissingChunks = chunksOf 10 mdlMissing
mapM_ (mapM_ (...)) mdlMissingChunks
Processing happens in chunks of 10 for progress visibility. Full rewrites use Control.Monad.Parallel.mapM_ for parallel HTML generation.
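A self-contained sketch of the same pattern, assuming the split and monad-parallel packages; the chunk size and the work function are placeholders:
import Data.List.Split (chunksOf)              -- package: split
import qualified Control.Monad.Parallel as Par -- package: monad-parallel

-- Sequential, chunked processing: progress is visible after every 10 items.
processChunked :: (String -> IO ()) -> [String] -> IO ()
processChunked work items =
  mapM_ (\chunk -> do putStrLn ("processing chunk of " ++ show (length chunk))
                      mapM_ work chunk)
        (chunksOf 10 items)

-- Full-rebuild path: run the per-item work in parallel instead.
processParallel :: (String -> IO ()) -> [String] -> IO ()
processParallel = Par.mapM_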
Reciprocal Update Triggering
When a new embedding is created and matches existing items, those items should show the new page as a match too (similarity is symmetric). Rather than recompute immediately, the tool deletes their HTML fragments via expireMatches, forcing rebuild on next run:
when (f `elem` todoLinks) $ expireMatches (snd nmatchesSorted)
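"Expiring" here amounts to deleting the matched items' fragments so the next run regenerates them; a hedged sketch, assuming matches are identified by URL and a path-construction helper like the one above:
import Control.Monad (when)
import System.Directory (doesFileExist, removeFile)

-- Delete each matched URL's existing similarity fragment so the next run
-- rebuilds it and picks up the newly embedded page as a reciprocal match.
expireMatchesSketch :: (String -> FilePath) -> [String] -> IO ()
expireMatchesSketch pathFor =
  mapM_ (\u -> do let f = pathFor u
                  exists <- doesFileExist f
                  when exists (removeFile f))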
Reordering for Coherence
Raw nearest-neighbor results are reordered with sortSimilars, which minimizes pairwise distances among the result set. This produces more internally coherent recommendation lists—items that are not only similar to the target, but also similar to each other.
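One simple way to get that effect is a greedy chain: assuming the input list is already sorted by distance to the target (as findN returns it), start from the first match and repeatedly append the remaining match closest to the last item added. This is only an illustrative approximation; the actual sortSimilars algorithm lives in GenerateSimilar.
import Data.List (minimumBy, delete)
import Data.Ord (comparing)

sqDist :: [Double] -> [Double] -> Double
sqDist a b = sum $ zipWith (\x y -> (x - y) ^ (2 :: Int)) a b

-- Greedy pairwise reordering: each item follows its nearest remaining neighbor,
-- so adjacent recommendations tend to be about the same topic.
greedyChain :: [(String, [Double])] -> [(String, [Double])]
greedyChain []     = []
greedyChain (x:xs) = go x xs
  where
    go cur []   = [cur]
    go cur rest = cur : go next (delete next rest)
      where next = minimumBy (comparing (sqDist (snd cur) . snd)) rest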
Stale Embedding Cleanup
pruneEmbeddings removes embeddings for URLs no longer in the metadata database (typically renamed/deleted). Prevents false positives and database bloat.
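In map terms, pruning is just restricting the embeddings database to the metadata's key set; a one-line sketch assuming map-like databases keyed by URL:
import qualified Data.Map.Strict as M

-- Keep only embeddings whose URL still exists in the metadata database.
pruneSketch :: M.Map String meta -> M.Map String emb -> M.Map String emb
pruneSketch md edb = edb `M.restrictKeys` M.keysSet md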
Configuration
Via Config.GenerateSimilar:
- bestNEmbeddings: Number of similar items to find (default ~10-20)
- iterationLimit: Max k-NN search iterations before giving up
- maxDistance: Maximum embedding distance threshold
- minimumSuggestions: Don't write HTML if fewer matches found
- blackList: URLs to exclude from recommendations
- embeddingsPath: Path to serialized embeddings database
Via maxEmbedAtOnce constant:
- Limits batch size to 2000 embeddings per run (API cost control)
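As an illustration of how those knobs fit together as a module, a sketch of Config.GenerateSimilar with placeholder values (only embeddingsPath is taken from this page; the numbers are invented for the example):
module Config.GenerateSimilar where

-- Placeholder values for illustration only; the real module's values differ.
bestNEmbeddings :: Int
bestNEmbeddings = 20            -- how many similar items to look for

iterationLimit :: Int
iterationLimit = 16             -- max k-NN search iterations before giving up

maxDistance :: Double
maxDistance = 0.65              -- reject matches farther than this

minimumSuggestions :: Int
minimumSuggestions = 4          -- skip writing HTML below this many matches

blackList :: [String]
blackList = []                  -- URLs never offered as recommendations

embeddingsPath :: FilePath
embeddingsPath = "metadata/embeddings.bin"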
Integration Points
Inputs
- metadata/full.gtx (via link-metadata-hs): Annotation database with abstracts for embedding
- metadata/backlinks.hs (via link-backlink-hs): Bidirectional link index for filtering
- metadata/embeddings.bin: Serialized OpenAI embedding vectors
- metadata/listsortedmagic.hs: Cached sorted-by-similarity lists
Outputs
- metadata/annotation/similar/*.html: HTML fragments with similar link lists
- metadata/embeddings.bin: Updated embedding database (binary serialization)
Called By
- sync-sh: Nightly rebuild phase
- Watch daemon: inotifywait-based embedding-on-write
Calls
All core functionality from generate-similar-hs:
- embed: Generate OpenAI embeddings with backlinks context
- readEmbeddings / writeEmbeddings: Binary serialization
- embeddings2Forest: Build RP-tree spatial index
- findN: k-NN search with iteration-based expansion
- sortSimilars: Pairwise distance minimization reordering
- writeOutMatch: HTML fragment generation
- pruneEmbeddings: Remove stale entries
- expireMatches: Delete HTML to force rebuild
Also uses:
- link-metadata-hs: readLinkMetadata, sortItemPathDateModified
- link-backlink-hs: readBacklinksDB, getSimilarLink
HTML Fragment Integration
Generated HTML fragments are transcluded by the annotation system when popups/popovers are displayed. The similar links appear in a "See also" section at the bottom of annotations.
Common Failure Points
OpenAI API Errors
If embeddings fail (rate limit, network error, API key), the tool will error out. Embeddings are batched at 2000/run to avoid hitting rate limits too hard. Retry by re-running—already-embedded items are skipped.
Insufficient Matches
If fewer than minimumSuggestions matches are found, no HTML is written. This is expected for very niche content with no similar pages.
Corrupted Embedding Database
Binary deserialization can fail if embeddings.bin is corrupted. The writeEmbeddings function has a write-validate-rename pattern to prevent corruption, but if it happens, delete the file to force full rebuild.
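The general shape of such a write-validate-rename, sketched with Data.Binary (the real writeEmbeddings has its own serialization format and checks):
import Data.Binary (Binary, encodeFile, decodeFileOrFail)
import System.Directory (renameFile)

-- Write to a temporary file, re-read it to confirm it decodes, then rename it
-- over the target, so a crash mid-write cannot leave a corrupt database behind.
writeValidated :: Binary a => FilePath -> a -> IO ()
writeValidated target value = do
  let tmp = target ++ ".tmp"
  encodeFile tmp value
  checked <- decodeFileOrFail tmp
  case checked of
    Left (_, err) -> error ("validation of " ++ tmp ++ " failed: " ++ err)
    Right v       -> do
      _ <- return (v `asTypeOf` value)  -- ties the decoded type to what was written
      renameFile tmp target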
Watch Mode Pitfalls
The inotifywait daemon example watches for attrib events on .gtx files. If annotation writes don't trigger this event type (filesystem-dependent), use -e modify instead. The 10s sleep prevents race conditions with concurrent annotation writes.
Filename Length Limits
Output paths are truncated to 274 characters to avoid filesystem limits:
let f = take 274 $ "metadata/annotation/similar/" ++ urlEncode p ++ ".html"
Very long URLs will have truncated filenames, which can cause collisions (rare but possible).
See Also
- GenerateSimilar.hs - Core embedding and similarity logic used by this CLI tool
- Config.GenerateSimilar - Configuration constants for thresholds and limits
- hakyll.hs - Build system that uses sortTagByTopic for tag page generation
- LinkMetadata.hs - Annotation database providing content for embeddings
- LinkBacklink.hs - Bidirectional link index used for filtering and context
- sync.sh - Build orchestrator that invokes generateSimilarLinks.hs