Annotation.hs
Path: build/Annotation.hs | Language: Haskell | Lines: ~290
URL-to-scraper dispatcher for metadata extraction
Overview
Annotation.hs is the routing layer that decides how to fetch metadata for a given URL. When LinkMetadata.hs encounters a URL not already in the metadata database, it calls linkDispatcher here, which pattern-matches the URL against known sources (arXiv, bioRxiv, OpenReview, local PDFs, etc.) and delegates to the appropriate scraper.
The module follows a guard-clause pattern: linkDispatcherURL is a series of pattern matches testing URL prefixes/infixes, returning early when one matches. If no specialized scraper applies, it falls back to either local gwern.net scraping or a generic HTML title fetch.
Key design decisions:
- Scrapers return Either Failure (Path, MetadataItem), where Failure is Temporary (retry later) or Permanent (cache as unfetchable)
- URL canonicalization happens before routing, via linkCanonicalize
- Post-processing applies to all results: date guessing, title reformatting (italics, em-dashes), author/date inference from file paths
Public API
linkDispatcher :: Metadata -> Inline -> IO (Either Failure (Path, MetadataItem))
Main entry point. Takes existing metadata DB and a Pandoc Link inline, returns scraped metadata.
-- Example:
-- > md <- LinkMetadata.readLinkMetadata
-- > linkDispatcher md (Link nullAttr [] ("https://arxiv.org/abs/2512.03750",""))
Called by: LinkMetadata.annotateLink
Calls: linkDispatcherURL, tooltipToMetadata, guessAuthorDateFromPath, reformatTitle
tooltipToMetadata :: Path -> String -> (String, String, String)
Parses an existing link tooltip to extract (title, author, date). Re-exported from Metadata.Title.
Called by: linkDispatcher (when scraping fails permanently, extracts what it can from tooltip)
htmlDownloadAndParseTitleClean :: String -> IO String
Fetches a URL and extracts its <title> tag. Re-exported from Metadata.Title.
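A hedged usage sketch (the URL is purely illustrative):
-- > title <- htmlDownloadAndParseTitleClean "https://example.com/some-page"
-- title is the page's <title> text after cleaning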
Called by: linkDispatcherURL (fallback for unknown URLs)
processItalicizer :: String -> IO String
Runs static/build/italicizer.py to auto-italicize work titles in strings.
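A hedged sketch (input and output strings are made up; the <em> markup shown is an assumption about what italicizer.py emits):
-- > processItalicizer "Review of Moby Dick"
-- might return "Review of <em>Moby Dick</em>"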
Called by: reformatTitle
testGuessAuthorDate :: [(FilePath, (String,String), (String,String))]
Test suite for guessAuthorDateFromString. Returns failing test cases (empty list = all pass).
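A hedged sketch of checking it (the surrounding harness is an assumption):
-- > unless (null testGuessAuthorDate) $ error (show testGuessAuthorDate)  -- unless is from Control.Monad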
Internal Architecture
URL Routing Flow
linkDispatcher
│
├─ canonicalize URL
│
└─ linkDispatcherURL ─┬─ "/metadata/...", "/doc/www/", etc. → Permanent failure (blacklist)
                      ├─ "en.wikipedia.org" → extract title from URL
                      ├─ "arxiv.org/abs/" → arxiv scraper
                      ├─ "openreview.net" → openreview scraper
                      ├─ "biorxiv.org" / "medrxiv.org" → biorxiv scraper
                      ├─ "x.com" → twitter (username only)
                      ├─ ".pdf" + local path → pdf scraper
                      ├─ local path (no ext) → gwern scraper
                      └─ external URL → htmlDownloadAndParseTitleClean
Failure Type
data Failure = Temporary | Permanent
- Temporary: Scraping failed but might succeed later (API down, rate-limited). Not cached—will retry.
- Permanent: URL is blacklisted or fundamentally unscrapable. Cached to avoid repeated failures.
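A minimal sketch of how a caller might branch on the two constructors (handleResult is a hypothetical name; Path is assumed to be a String alias):
handleResult :: Either Failure (Path, MetadataItem) -> IO ()
handleResult (Left Temporary)  = putStrLn "fetch failed; retry on a later pass"
handleResult (Left Permanent)  = putStrLn "unscrapable; cache the failure"
handleResult (Right (url, _))  = putStrLn ("scraped metadata for " ++ url)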
MetadataItem Type
type MetadataItem = (String, String, String, String, [(String,String)], [String], String)
-- Title Author Date DateCreated K-Vs Tags Abstract
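An illustrative value (all field contents are made up; the DOI key-value pair is shown only to suggest what the K-Vs slot can carry):
example :: MetadataItem
example = ("Example Title", "Jane Doe, John Roe", "2020-08-05", "2024-01-01",
           [("doi", "10.1234/example")], ["ai"], "<p>Example abstract.</p>")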
Key Patterns
Guard-Clause Routing
linkDispatcherURL uses Haskell guards with pattern matching:
linkDispatcherURL md l
| anyPrefix l ["/metadata/...", ...] = return (Left Permanent)
| "arxiv.org/abs/" `isInfixOf` l = arxiv md l
| ".pdf" `isInfixOf` l = if head l' == '/' then pdf md l' else return (Left Permanent)
| otherwise = ...
Order matters: more specific patterns come before general ones.
Post-Processing Pipeline
All successful metadata goes through:
1. guessDateFromString - infer date from title/URL if missing
2. reformatTitle - italicize work titles, convert hyphens to em-dashes
3. guessAuthorDateFromPath - extract author/date from local file paths
4. authorsUnknownPrint - warn about unrecognized authors
File Path Date/Author Inference
For local files like /doc/ai/2020-08-05-hippke-analysis.pdf, the module extracts:
- Date: finds valid YYYY-MM-DD patterns, takes the latest if multiple
- Author: token after date, validated against known author list
-- "/doc/traffic/2015-07-03-2016-01-03-gwern-analytics.pdf"
-- → date: "2016-01-03", author: "gwern"
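A minimal sketch of the date-extraction rule, assuming regex-tdfa; unlike the real module it does not validate the date beyond the pattern shape:
import Data.List (sort)
import Text.Regex.TDFA (getAllTextMatches, (=~))

latestDateInPath :: FilePath -> Maybe String
latestDateInPath p =
  case sort (getAllTextMatches (p =~ "[12][0-9]{3}-[0-9]{2}-[0-9]{2}") :: [String]) of
    []    -> Nothing
    dates -> Just (last dates)  -- YYYY-MM-DD strings sort chronologically, so 'last' is the latest

-- > latestDateInPath "/doc/traffic/2015-07-03-2016-01-03-gwern-analytics.pdf" == Just "2016-01-03"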
Scraper Modules
Annotation.Arxiv (~105 lines)
Fetches from arXiv's Atom API (export.arxiv.org/api/query).
- Handles both /abs/ and /pdf/ URLs
- Processes LaTeX in titles/abstracts via Pandoc
- Constructs a DOI if none is provided (10.48550/arXiv.XXXX.XXXXX; see the example below)
- Returns a Temporary failure if the API lags (the abstract may appear later)
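For example, the DOI derived from the /abs/ URL used earlier:
-- "https://arxiv.org/abs/2512.03750" → DOI "10.48550/arXiv.2512.03750"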
Annotation.Biorxiv (~60 lines)
Scrapes bioRxiv/medRxiv HTML pages.
- Extracts Dublin Core metadata (DC.Title, DC.Contributor, etc.)
- Skips direct PDF links (returns empty metadata)
- Known bug: broken quote encoding (workaround rewrites a mis-encoded ’s back to 's)
Annotation.OpenReview (~40 lines)
Calls external shell script openReviewAbstract.sh.
- Combines TL;DR, description, and keywords into abstract
- Reuses processArxivAbstract for LaTeX handling
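A hedged sketch of that shell-out (the script path and argument convention are assumptions, not the module's actual call):
import System.Process (readProcessWithExitCode)
import System.Exit (ExitCode(..))

fetchOpenReviewRaw :: String -> IO (Maybe String)
fetchOpenReviewRaw url = do
  (status, out, _err) <- readProcessWithExitCode "./static/build/openReviewAbstract.sh" [url] ""
  return (if status == ExitSuccess then Just out else Nothing)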
Annotation.PDF (~75 lines)
Reads local PDF metadata via exiftool.
- Falls back to Crossref API for abstract if DOI present
- Handles Creator vs Author field ambiguity (uses the longer one; see the sketch below)
- Appends the page number to the title if the URL has #page=N
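The Creator-vs-Author rule as a one-line sketch (pickAuthorField is a hypothetical name):
pickAuthorField :: String -> String -> String
pickAuthorField creator author = if length creator > length author then creator else author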
Annotation.Gwernnet (~280 lines)
Scrapes gwern.net pages.
- Large extension blacklist (images, archives, data files)
- Extracts: title, description, abstract, thumbnail, ToC, tags
- Special handling for /doc/index (generates a directory listing)
- Section anchors get their own section-specific abstracts
Configuration
URL Blacklists
Hardcoded in linkDispatcherURL:
anyPrefix l ["/metadata/annotation/backlink/", "/metadata/annotation/similar/",
"/doc/www/", "/ref/", "/blog/", "irc://", "mailto:"]
Extension Blacklist (Gwernnet)
Files with these extensions are skipped:
ext `elem` ["avi", "bmp", "csv", "doc", "epub", "gif", "jpg", "json",
"mp3", "mp4", "pdf", "png", "svg", "txt", "zip", ...]
Italicizer Whitelist
Titles that should not be auto-italicized:
whitelist = ["Guys and Dolls", "A critique of pure reason"]
Config Module References
- Config.Misc.todayDayString - current date for validation
- Config.Misc.cd - change to project root before running external tools
- Config.Misc.arxivAbstractRegexps / Config.Misc.arxivAbstractFixedRewrites - arXiv cleanup patterns
Integration Points
Upstream (callers)
- LinkMetadata.hs - annotateLink calls linkDispatcher for unknown URLs
- LinkMetadata.hs - imports gwern directly for forced re-scrapes
Downstream (callees)
- Metadata.Format - URL canonicalization, abstract cleaning, DOI processing
- Metadata.Author - author validation and canonicalization
- Metadata.Date - date guessing and validation
- Metadata.Title - title extraction and cleaning
- LinkAuto - auto-linking in abstracts
- Paragraph - paragraph splitting in abstracts
- Typography - typographic cleanup
External Tools
- exiftool - PDF metadata extraction
- curl - HTTP fetching (arXiv, bioRxiv, gwern.net)
- static/build/italicizer.py - title italicization
- openReviewAbstract.sh - OpenReview scraping
Adding a New Scraper
1. Create Annotation/NewSource.hs with the signature:
   newSource :: Metadata -> Path -> IO (Either Failure (Path, MetadataItem))
2. Import it in Annotation.hs:
   import Annotation.NewSource (newSource)
3. Add a guard clause in linkDispatcherURL (order matters):
   | "newsource.com/" `isInfixOf` l = newSource md l
4. Return Left Temporary for transient failures, Left Permanent for permanent ones
5. Use the standard cleaners: cleanAbstractsHTML, cleanAuthors, trimTitle
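A minimal skeleton for step 1 (a sketch only: it assumes Failure, Metadata, MetadataItem, and Path are all importable from LinkMetadataTypes, and the empty fields are placeholders):
module Annotation.NewSource (newSource) where

import LinkMetadataTypes (Failure(..), Metadata, MetadataItem, Path)

newSource :: Metadata -> Path -> IO (Either Failure (Path, MetadataItem))
newSource _md url = do
  -- fetch and parse `url` here; fill in the fields below
  let (title, author, date, abstract) = ("", "", "", "")
  if null title
     then return (Left Temporary)   -- transient failure: retry on a later pass
     else return (Right (url, (title, author, date, "", [], [], abstract)))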
See Also
- LinkMetadata.hs - Calls linkDispatcher via annotateLink; manages the metadata database
- LinkMetadataTypes.hs - Core type definitions (Failure, MetadataItem)
- Annotation/Arxiv.hs - arXiv paper scraper
- Annotation/Biorxiv.hs - bioRxiv/medRxiv scraper
- Annotation/OpenReview.hs - OpenReview conference paper scraper
- Annotation/PDF.hs - Local PDF metadata extraction
- Annotation/Gwernnet.hs - Local gwern.net page scraper
- popups.js - Frontend popup system that displays annotations
- extracts.js - Pop-frame extraction and rendering