Config.Tags
Path: build/Config/Tags.hs | Language: Haskell | Lines: ~993
Tag system configuration: aliases, hierarchy mappings, and display formatting
Overview
Config.Tags defines the gwern.net tag taxonomy and includes small bits of logic (predicates for blacklists and URL tagging, plus auto-generated short/long tag maps) alongside the data tables.
The module serves three primary purposes: (1) normalizing user-entered tags to canonical forms (e.g., "gpt-4" → "ai/nn/transformer/gpt/4"), (2) converting internal hierarchical paths to human-readable display names (e.g., "anime/eva" → <em>NGE</em>), and (3) guessing tags from URLs when scraping external links. The extensive alias tables (~200 entries in tagsShort2LongRewrites alone) reflect years of accumulated typo corrections and naming conventions.
A notable design decision is the use of hierarchical slash-separated paths as canonical tag identifiers. This enables both precise categorization (e.g., "ai/nn/transformer/gpt/4/poetry") and inheritance-based browsing. The display layer then collapses these verbose paths into concise labels.
Public API
tagTypoMaxDistance :: Int
Maximum Levenshtein edit distance for fuzzy tag matching. Set to 3.
Called by: Tags.guessTagFromShort (tag inference logic)
Calls: N/A (constant)
tagGuessBlacklist :: String -> Bool
Predicate that returns True for paths that should not have tags auto-inferred from their directory structure. Covers project archives and mirrors.
Called by: Tag guessing logic in Tags module
Calls: Utils.anyPrefix
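A minimal sketch of what such a predicate might look like. The local anyPrefix is a stand-in for Utils.anyPrefix (its argument order is assumed), and the directory prefixes are hypothetical placeholders rather than the module's real entries.

```haskell
import Data.List (isPrefixOf)

-- Local stand-in for Utils.anyPrefix; the real helper's signature is assumed.
anyPrefix :: String -> [String] -> Bool
anyPrefix s = any (`isPrefixOf` s)

-- Hypothetical blacklist: True means "do not infer tags from this path".
-- The prefixes below are illustrative only.
tagGuessBlacklistSketch :: String -> Bool
tagGuessBlacklistSketch path =
  anyPrefix path ["/doc/mirror/", "/doc/project-archive/"]
```

Keeping the check as a String -> Bool predicate rather than a plain list lets callers stay agnostic about whether matching is by prefix, infix, or something more elaborate.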
tagListBlacklist :: [String]
List of directory names to exclude from tag lists. Prevents archive directories from appearing as tags.
Called by: Tag listing/rendering
Calls: N/A (constant)
urlTagDB :: [(String -> Bool, String)]
Lookup table mapping URL patterns to tags. Enables automatic tagging of links to known sites (e.g., "https://publicdomainreview.org/" → "history/public-domain-review").
Called by: Annotation scraping pipeline
Calls: isPrefixOf, isInfixOf, Utils.anyInfix
wholeTagRewritesRegexes :: [(String, String)]
Regex-based tag transformations for display. Handles capitalization (cs → CS, ai → AI), semantic rewrites (genetics/selection → evolution), and HTML formatting (anime/eva → <em>NGE</em>).
Called by: Tag rendering
Calls: N/A (constant)
tagsShort2Long :: [(String, String)]
Master alias table mapping short/alternate forms to canonical paths. Combines manual rewrites with auto-generated inverses from tagsLong2Short.
Called by: Tag normalization, typo correction
Calls: tagsShort2LongRewrites, tagsLong2Short
tagsLong2Short :: [(String, String)]
Display name mappings from canonical paths to human-readable labels. First match wins; the list is reversed so that more specific paths precede their parents.
Called by: Tag rendering, HTML generation
Calls: N/A (constant)
shortTagBlacklist :: [String]
Common English words and HTML fragments that should never be interpreted as tags.
Called by: Tags.guessTagFromShort
Calls: N/A (constant)
shortTagTestSuite :: [(String, String)]
Test cases for tag resolution. Used by the build system's test suite to verify alias mappings work correctly.
Called by: Test harness
Calls: N/A (constant)
Internal Architecture
Data Flow
User input (annotation, URL, directory path)
│
▼
┌─────────────────────────┐
│ tagGuessBlacklist │ ─── Skip if path matches blacklist
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ tagsShort2Long │ ─── Normalize aliases/typos to canonical form
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ wholeTagRewritesRegexes │ ─── Apply display transformations
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ tagsLong2Short │ ─── Convert to human-readable label
└─────────────────────────┘
Alias Table Structure
tagsShort2LongRewrites handles several categories:
- Typo corrections: "gpt4" → "ai/nn/transformer/gpt/4"
- Synonyms: "dogs" → "dog", "psychedelics" → "psychedelic"
- Abbreviation expansion: "mr" → "genetics/heritable/correlation/mendelian-randomization"
- Legacy path migration: "ai/clip" → "ai/nn/transformer/clip"
- Common misspellings: Extensive anencephaly variants (~50 entries)
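The categories above all reduce to the same operation: look the input up in an association list and fall back to the input itself. A small sketch, using only entries quoted in this document; the names aliasSketch and normalizeTag are hypothetical and not part of the module.

```haskell
import Data.Maybe (fromMaybe)

-- Illustrative subset of an alias table; entries are the examples listed above.
aliasSketch :: [(String, String)]
aliasSketch =
  [ ("gpt4",    "ai/nn/transformer/gpt/4")                                -- typo correction
  , ("dogs",    "dog")                                                    -- synonym
  , ("mr",      "genetics/heritable/correlation/mendelian-randomization") -- abbreviation
  , ("ai/clip", "ai/nn/transformer/clip")                                 -- legacy path
  ]

-- Resolve an alias to its canonical path; unknown tags pass through unchanged.
normalizeTag :: String -> String
normalizeTag t = fromMaybe t (lookup t aliasSketch)
```

Because lookup returns the first matching pair, the order of entries matters if a key is ever listed twice.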
URL Tag Database
urlTagDB uses three matching strategies:
- Prefix matches: Domain-specific (e.g., "https://publicdomainreview.org/")
- Infix matches: Technology sites (e.g., "r-project.org" anywhere in URL)
- Special cases: Multi-domain patterns via predicates (e.g., EVA wiki sites)
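The three strategies fit the same (String -> Bool, String) shape, so they can sit in a single list. The sketch below is illustrative: only the publicdomainreview.org entry comes from this document, the "cs/r" tag and the example.org domains are hypothetical, and returning the first match is an assumption about how the table is consumed.

```haskell
import Data.List (isInfixOf, isPrefixOf)

-- Each entry pairs a URL predicate with the tag to assign.
urlTagDBSketch :: [(String -> Bool, String)]
urlTagDBSketch =
  [ -- Prefix match: the URL must start with this domain.
    (("https://publicdomainreview.org/" `isPrefixOf`), "history/public-domain-review")
    -- Infix match: the fragment may appear anywhere (tag name is hypothetical).
  , (("r-project.org" `isInfixOf`), "cs/r")
    -- Special case: one predicate covering several domains (domains are hypothetical).
  , (\u -> any (`isInfixOf` u) ["wiki.example.org", "forum.example.org"], "anime/eva")
  ]

-- Return the tag of the first predicate that accepts the URL, if any.
urlToTag :: String -> Maybe String
urlToTag url =
  case [tag | (matches, tag) <- urlTagDBSketch, matches url] of
    (t : _) -> Just t
    []      -> Nothing
```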
Key Patterns
Hierarchical tag paths: Tags use slash-separated paths that mirror both the site's directory structure and conceptual taxonomy. "ai/nn/transformer/gpt/4/poetry" encodes: domain (AI) → architecture (neural network) → family (transformer) → model (GPT) → version (4) → content type (poetry).
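To make the inheritance idea concrete, here is a small sketch that expands a tag path into its chain of ancestors. The helpers splitTag and tagAncestors are hypothetical and written for illustration; they are not functions from this module.

```haskell
import Data.List (inits, intercalate)

-- Split a tag path on '/' into its components.
splitTag :: String -> [String]
splitTag s = case break (== '/') s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitTag rest

-- All ancestor paths, from the top-level domain down to the tag itself.
tagAncestors :: String -> [String]
tagAncestors = map (intercalate "/") . drop 1 . inits . splitTag
```

For example, tagAncestors "ai/nn/transformer" yields ["ai", "ai/nn", "ai/nn/transformer"], which is the list of tag pages an item could be browsed under.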
First-match priority: tagsLong2Short is reversed so more specific paths match before their parents. Without this, "ai/nn/transformer/gpt/4" would match the generic "GPT" entry before "GPT-4".
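The effect of the reversal can be seen in a two-entry sketch. It assumes that keys are matched as path prefixes and that the table is written parent-first before being reversed; both are assumptions made for illustration, and the names long2ShortSketch and displayName are hypothetical.

```haskell
import Data.List (isPrefixOf)

-- Illustrative table written parent-first, then reversed so children come first.
long2ShortSketch :: [(String, String)]
long2ShortSketch = reverse
  [ ("ai/nn/transformer/gpt",   "GPT")
  , ("ai/nn/transformer/gpt/4", "GPT-4")
  ]

-- The first entry whose key is a prefix of the tag wins
-- (prefix matching is an assumption, not confirmed by the module).
displayName :: String -> String
displayName tag =
  case [short | (long, short) <- long2ShortSketch, long `isPrefixOf` tag] of
    (s : _) -> s
    []      -> tag
```

Dropping the reverse would make displayName "ai/nn/transformer/gpt/4" return "GPT", because the parent entry would be tested first.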
Bidirectional generation: tagsShort2Long auto-generates reverse mappings from display names that don't contain special characters (spaces, HTML tags, parentheses), reducing maintenance burden.
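One way such reverse mappings could be derived is sketched below. The filter follows the rule stated above (no spaces, HTML tags, or parentheses); the table contents are illustrative, the "psychology/cognitive-bias" entry is hypothetical, and any further processing the real module applies (such as lowercasing) is not modeled.

```haskell
-- Illustrative long-to-short table; not the module's real entries.
long2Short :: [(String, String)]
long2Short =
  [ ("ai/nn/transformer/gpt/4",   "GPT-4")
  , ("anime/eva",                 "<em>NGE</em>")
  , ("psychology/cognitive-bias", "cognitive bias")  -- hypothetical entry
  ]

-- A display name is invertible only if it has no spaces, angle brackets, or parentheses.
isPlain :: String -> Bool
isPlain = not . any (`elem` " <>()")

-- Manual rewrites come first so they take priority under first-match lookup.
short2Long :: [(String, String)]
short2Long = manual ++ [(short, long) | (long, short) <- long2Short, isPlain short]
  where
    manual = [("gpt4", "ai/nn/transformer/gpt/4")]
```

Here only "GPT-4" is inverted automatically; the HTML-formatted and multi-word labels are skipped and would need manual aliases if they should be typeable.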
Aggressive typo tolerance: The edit-distance threshold of 3 (tagTypoMaxDistance), combined with ~200 explicit alias mappings, means common errors like "gpt4" or "transfomer" resolve correctly.
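For reference, a self-contained sketch of threshold-based fuzzy matching. It uses the textbook row-by-row Levenshtein algorithm and picks the closest candidate; whether Tags.guessTagFromShort works this way is an assumption, and fuzzyMatch is a hypothetical name.

```haskell
import Data.List (foldl')

-- Levenshtein edit distance via the standard row-by-row dynamic program.
levenshtein :: String -> String -> Int
levenshtein a b = last (foldl' step [0 .. length b] a)
  where
    step [] _ = []  -- unreachable: every row has at least one cell
    step prev@(p : ps) c = scanl compute (p + 1) (zip3 b prev ps)
      where
        -- left: cell just computed; diag/up: cells from the previous row.
        compute left (bc, diag, up) =
          minimum [up + 1, left + 1, diag + (if bc == c then 0 else 1)]

-- Closest candidate within maxDist edits, if any (3 would mirror tagTypoMaxDistance).
fuzzyMatch :: Int -> [String] -> String -> Maybe String
fuzzyMatch maxDist candidates input =
  case filter ((<= maxDist) . fst) [(levenshtein input c, c) | c <- candidates] of
    [] -> Nothing
    ms -> Just (snd (minimum ms))
```

A threshold of 3 is generous for short tags: any two three-letter strings are within distance 3 of each other, which is one reason a blacklist such as shortTagBlacklist is useful alongside fuzzy matching.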
Configuration
All configuration is compile-time via the literal tables in this module. To add:
- New tag alias: Add entry to tagsShort2LongRewrites
- New display name: Add entry to tagsLong2Short (specific paths first)
- New URL auto-tag: Add to urlTagDB prefix/infix/special sections
- Block a tag: Add to tagGuessBlacklist or tagListBlacklist
Integration Points
Consumers
- link-metadata-hs: Uses urlTagDB for automatic tag inference from URLs
- annotation-hs: Applies tag normalization when processing metadata
- Tags.hs: Primary consumer of all exports for tag resolution logic
- HTML templates: Use tagsLong2Short for rendering tag links
Related Modules
- Utils.hs: Provides anyPrefix and anyInfix helpers
- Tags.hs: Contains business logic that uses these config tables
See Also
- Tags.hs - Core tag manipulation logic consuming this configuration
- guessTag.hs - CLI tool using tagsShort2Long for tag expansion
- changeTag.hs - CLI tool for tag operations using these mappings
- Annotation.hs - Annotation processing that uses tag inference
- LinkMetadata.hs - Metadata management with tag support
- Hakyll.hs - Site generator that renders tag links
- Utils.hs - Provides anyPrefix and anyInfix helpers