LinkMetadata.hs
Path: build/LinkMetadata.hs | Language: Haskell | Lines: ~944
The brain of gwern.net's annotation system - loads, creates, validates, and caches link metadata that powers popup annotations.
Overview
LinkMetadata.hs orchestrates the annotation system that makes gwern.net's popups work. When you hover over a link and see a popup with title, author, date, abstract, and related links, this module is responsible for:
- Loading metadata from the GTX files (
me.gtx,full.gtx,half.gtx,auto.gtx) - Creating new annotations by dispatching to various scrapers (Arxiv, Wikipedia, Twitter, etc.)
- Writing annotation fragments as HTML files that the frontend fetches for popups
- Walking Pandoc ASTs to mark links as annotated and assign unique IDs
The module follows a tiered caching strategy: hand-curated annotations in me.gtx/full.gtx override semi-automated ones in half.gtx, which override fully automatic ones in auto.gtx. This left-biased union means you can always override machine-generated annotations with better human-written ones.
A key design decision is the "partial annotation" concept - links without full abstracts can still have useful metadata (tags, backlinks, similar links) that make a popup worthwhile. The module calculates a scoring function to decide whether a partial is worth displaying.
Public API
Core Types (re-exported from LinkMetadataTypes)
type Metadata = M.Map Path MetadataItem
type MetadataItem = (String, String, String, String, [(String,String)], [String], String)
-- (Title, Author, Date, DateModified, KeyValues, Tags, Abstract)
type MetadataList = [(Path, MetadataItem)]
type Path = String
The MetadataItem 7-tuple is the heart of the annotation system. Key-values ([(String,String)]) store DOIs, IDs, and custom CSS extensions.
readLinkMetadata → IO Metadata
Fast loading of all GTX files without validation. Used during normal site builds.
readLinkMetadata :: IO Metadata
readLinkMetadata = do
me <- readGTXFast "metadata/me.gtx"
full <- readGTXFast "metadata/full.gtx"
half <- readGTXFast "metadata/half.gtx"
auto <- readGTXFast "metadata/auto.gtx"
let final = M.union (M.fromList me) $ M.union (M.fromList full) $
M.union (M.fromList half) (M.fromList auto)
return final
Called by: link-prioritize.hs, generateLinkBibliography.hs, and other tooling Calls: GTX.readGTXFast
readLinkMetadataSlow → IO Metadata
Slow loading with postprocessing (tag validation, date parsing, author canonicalization).
Called by: hakyll.hs (main build) Calls: GTX.readGTXSlow
readLinkMetadataAndCheck → IO Metadata
Full validation pass that checks for:
- Duplicate URLs/abstracts
- Missing mandatory fields (title, author, abstract in full.gtx)
- Invalid DOIs (missing
/, wrong format) - Malformed dates
- Non-existent tags
- Broken local file references
- Unbalanced quotes/brackets in abstracts
Called by: Build scripts, updateGwernEntries Calls: readGTXSlow, duplicateAbstracts, rewriteLinkMetadata
hasAnnotation md block → Block
Walks a Pandoc Block, marking each Link/Image as annotated or not by adding CSS classes:
link-annotated- full annotation availablelink-annotated-partial- partial metadata only
Also assigns unique link IDs via generateID.
hasAnnotation :: Metadata -> Block -> Block
hasAnnotation md = walk (hasAnnotationOrIDInline md)
Called by: hakyll.hs pandocTransform, writeAnnotationFragment Calls: hasAnnotationOrIDInline, addHasAnnotation, addID
createAnnotations md pandoc → IO ()
Walks a Pandoc document extracting all links, then calls annotateLink on each to potentially create new annotations.
createAnnotations :: Metadata -> Pandoc -> IO ()
createAnnotations md (Pandoc _ markdown) =
Par.mapM_ (annotateLink md) $ extractLinksInlines (Pandoc nullMeta markdown)
Called by: writeAnnotationFragment Calls: annotateLink, extractLinksInlines
writeAnnotationFragments am md sizes writeOnlyMissing → IO ()
Writes HTML fragment files for each annotation to metadata/annotation/. These fragments are what the frontend fetches for popups.
Two-pass approach:
- First pass: write partials (short/empty abstracts) so they exist for cross-references
- Second pass: write full annotations
Called by: hakyll.hs Calls: writeAnnotationFragment, getBackLinkCheck, getSimilarLinkCheck
addPageLinkWalk pandoc → Pandoc
Adds link-page class to links pointing to gwern.net essays (local paths with no file extension).
addPageLinkWalk :: Pandoc -> Pandoc
addPageLinkWalk = walk addPageLink
Called by: hakyll.hs, writeAnnotationFragment Calls: isPagePath
annotateLink md inline → IO (Either Failure (Path, MetadataItem))
The main annotation creation entry point. Given a Link inline:
- Checks if annotation already exists in metadata
- If not, calls
linkDispatcherto try fetching from various sources - Appends new annotation to
auto.gtxviaappendLinkMetadata
Returns Left Temporary for transient failures, Left Permanent for permanent ones, or Right (path, item) on success.
Called by: createAnnotations Calls: linkDispatcher, appendLinkMetadata
Internal Architecture
GTX File Hierarchy
metadata/
├── me.gtx # Personal/about-me annotations (highest priority)
├── full.gtx # Hand-written complete annotations
├── half.gtx # Tagged but not fully written
└── auto.gtx # Machine-generated (lowest priority, auto-cleaned)
Priority flows left to right - me.gtx overrides everything, auto.gtx is overridden by everything.
Annotation Fragment Generation Flow
Metadata DB
│
▼
writeAnnotationFragments
│
├─► For each (path, item):
│ │
│ ▼
│ getBackLinkCheck, getSimilarLinkCheck, getLinkBibLinkCheck
│ │
│ ▼
│ partialScoring (tags + abstract + backlinks + similar)
│ │
│ ▼
│ generateAnnotationBlock (builds Pandoc AST)
│ │
│ ▼
│ Pandoc transforms (linkIcon, linkLive, inflation, etc.)
│ │
│ ▼
│ writeHtml5String → HTML fragment
│ │
│ ▼
│ metadata/annotation/{encoded-path}.html
Link Processing Pipeline
Link in document
│
▼
hasAnnotationOrIDInline
│
├─► Check if already annotated (has class)
│
├─► Look up in Metadata map
│ │
│ ├─► Nothing/Empty → addID only
│ │
│ └─► Just item → addHasAnnotation + addRecentlyChanged + addID
│
└─► Return modified Link with classes:
- link-annotated / link-annotated-partial / (none)
- link-modified-recently (if modified within current month)
- Unique ID in anchor attribute
Partial Annotation Scoring
A partial annotation is displayed if any of these are true:
- Has 3+ tags
- Has any abstract text
- Has 2+ backlinks
- Has 7+ similar links
let partialScoring = 0 < sum [length (drop 2 ts),
length abst,
if blN > 1 then 1 else 0,
if slN > 6 then 1 else 0]
Key Patterns
Left-Biased Map Union for Priority
let final = M.union (M.fromList me) $
M.union (M.fromList full) $
M.union (M.fromList half) (M.fromList auto)
M.union is left-biased, so keys in me override full override half override auto. This allows progressive refinement of annotations.
Parallel Processing with monad-parallel
import qualified Control.Monad.Parallel as Par
Par.mapM_ (annotateLink md) $ extractLinksInlines ...
Link annotation and file operations use parallel map to speed up builds.
unsafePerformIO for Caching Decisions
| not $ unsafePerformIO $ doesFileExist $ fst $ getAnnotationLink $ T.unpack f = x'
The annotation fragment existence check uses unsafePerformIO because it's a pure function that needs IO to check file existence. This is safe because file existence is effectively constant during a build.
URL Canonicalization
let target' = replace "https://gwern.net/" "/" target
let target'' = if head target' == '.' then drop 1 target' else target'
All gwern.net URLs are normalized to local paths (/doc/foo.pdf not https://gwern.net/doc/foo.pdf) for consistent lookup.
Fallback Prefix Matching
lookupFallback :: Metadata -> String -> (FilePath, MetadataItem)
For URLs where the stored metadata key is a longer URL (including fragments), this falls back to prefix matching using the provided URL as the prefix; it does not strip fragments to find a base URL.
Configuration
From Config.Misc
minimumAnnotationLength- threshold for "full" vs "partial" annotation (~characters)currentMonthAgo- date threshold for "recently modified" highlightinggtxKeyValueKeyNames- whitelist of valid key names in KV pairsminFileSizeWarning- when to show file size in MB
From Config.LinkID
affiliationAnchors- organization anchors for duplicate URL detection
Hardcoded Thresholds
- Backlinks > 1 for partial scoring
- Similar links > 6 for partial scoring
- File size > 10MB disables prefetch
- URL > 273 chars triggers truncation warning
Integration Points
Reads From
metadata/me.gtx,full.gtx,half.gtx,auto.gtx- annotation databasesmetadata/annotation/*.html- checks for existing fragmentsdoc/directories - for tag validation
Writes To
metadata/auto.gtx- new machine-generated annotations (via appendLinkMetadata)metadata/annotation/{encoded-path}.html- popup HTML fragments_site/metadata/annotation/- copies for Hakyll sync
Key Module Dependencies
- GTX.hs - GTX format parsing/writing
- LinkMetadataTypes.hs - type definitions
- Annotation.hs -
linkDispatcherfor fetching new annotations - Annotation.Gwernnet.hs - scraper for local gwern.net pages
- LinkBacklink.hs - backlinks, similar links, link bibliography
- LinkID.hs - unique ID generation
- LinkArchive.hs - local archive URL resolution
- LinkLive.hs - live link handling
- Tags.hs - tag validation and processing
Frontend Integration
The HTML fragments written to metadata/annotation/ are fetched by popups.js when users hover over annotated links. The link-annotated / link-annotated-partial classes tell the JS whether to attempt a fetch.
See Also
- Annotation.hs - URL-to-scraper dispatcher for fetching new annotations
- LinkMetadataTypes.hs - Core type definitions (MetadataItem, Path, Failure)
- GTX.hs - GTX format parser/writer for annotation databases
- annotations.js - Frontend annotation data layer and caching
- Annotation/Arxiv.hs - arXiv paper metadata scraper
- Annotation/Biorxiv.hs - bioRxiv/medRxiv scraper
- Annotation/Gwernnet.hs - Local gwern.net page scraper
- Metadata/Format.hs - Abstract HTML cleaning utilities