preprocess-markdown.hs

Path: build/preprocess-markdown.hs | Language: Haskell | Lines: ~50

Standalone preprocessor that transforms Markdown abstracts into annotated HTML with auto-links and recommendations


Overview

preprocess-markdown.hs is a small, focused command-line utility that reads Markdown from stdin, applies a series of Pandoc AST transformations, and outputs enriched HTML. It serves as a preprocessing step for annotations and abstracts before they enter the main Hakyll build pipeline.

The module performs three key transformations: (1) interwiki link expansion (converting shorthand like [foo](!W) to full Wikipedia URLs), (2) automatic term hyperlinking via regex patterns defined in LinkAuto, and (3) generation of "See Also" recommendations using embedding-based similarity matching. It also validates Wikipedia links to catch broken references early in the pipeline.

This tool is designed for single-document processing (annotations, abstracts) rather than full pages. It is invoked during annotation-creation workflows so that manually written abstracts receive the same processing as auto-generated ones.
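To make the interwiki expansion concrete, here is a minimal sketch of the idea for the `!W` prefix only. It is not the project's convertInterwikiLinks: the names expandWikipedia, wikipediaURL, expandAll, and exampleLink are hypothetical, and the URL scheme (link text with spaces replaced by underscores, no percent-encoding) is an assumption.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Text.Pandoc.Definition (Inline (Link, Space, Str), Pandoc, nullAttr)
import Text.Pandoc.Shared (stringify)
import Text.Pandoc.Walk (walk)

-- | Hypothetical stand-in for convertInterwikiLinks: rewrites only links
-- whose target is exactly "!W" and leaves every other inline untouched.
expandWikipedia :: Inline -> Inline
expandWikipedia (Link attr label ("!W", title)) =
  Link attr label (wikipediaURL (stringify label), title)
expandWikipedia x = x

-- | Assumed URL scheme: the link text becomes the article name.
wikipediaURL :: T.Text -> T.Text
wikipediaURL anchor = "https://en.wikipedia.org/wiki/" <> T.replace " " "_" anchor

-- | Apply the rewrite to every inline in a document.
expandAll :: Pandoc -> Pandoc
expandAll = walk expandWikipedia

-- | Example input: the inline that [Haskell Curry](!W) is expected to parse to.
exampleLink :: Inline
exampleLink = Link nullAttr [Str "Haskell", Space, Str "Curry"] ("!W", "")
```

Loading this in GHCi and evaluating `expandWikipedia exampleLink` shows the rewritten target. The real implementation covers more prefixes than this sketch.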


Public API

main :: IO ()

Entry point that orchestrates the preprocessing pipeline.

Pipeline:

  1. Read Markdown from stdin
  2. Parse to Pandoc AST with full Pandoc extensions
  3. Apply convertInterwikiLinks transformation
  4. Apply linkAuto transformation
  5. Validate Wikipedia links via checkWP
  6. Render to HTML5
  7. Clean abstracts via cleanAbstractsHTML
  8. Generate embedding-based recommendations
  9. Output HTML with optional "See Also" section

Called by: Shell scripts, annotation workflows

Calls: convertInterwikiLinks, linkAuto, checkWP, cleanAbstractsHTML, singleShotRecommendations
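The parse, transform, and render steps of the pipeline above can be sketched as a pure function using only Pandoc's public API. This is an illustration under stated assumptions, not the actual main: transformAST and markdownToHtml are hypothetical names, the project-specific passes are replaced by the identity, default writer options stand in for safeHtmlWriterOptions, and the IO steps (checkWP, recommendations) are left out.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Text.Pandoc
  ( Pandoc, PandocError, ReaderOptions (readerExtensions), def
  , pandocExtensions, readMarkdown, runPure, writeHtml5String )

-- | Placeholder for the AST passes (convertInterwikiLinks, linkAuto).
-- It is the identity here because those functions live in project modules.
transformAST :: Pandoc -> Pandoc
transformAST = id

-- | Steps 1-6 in miniature: parse Markdown with the full Pandoc extension
-- set, transform the AST, and render HTML5. Default writer options are an
-- assumption; the real tool uses its own settings.
markdownToHtml :: T.Text -> Either PandocError T.Text
markdownToHtml markdown = runPure $ do
  ast <- readMarkdown def { readerExtensions = pandocExtensions } markdown
  writeHtml5String def (transformAST ast)
```

Keeping this part pure makes it easy to exercise on individual snippets in GHCi, for example `markdownToHtml "*hi*"`, before wiring in the network and database steps.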


Internal Architecture

Processing Pipeline

stdin (Markdown)
             │
             ▼
┌───────────────────────────┐
│ readMarkdown              │  Parse with pandocExtensions
└───────────────────────────┘
             │
             ▼
┌───────────────────────────┐
│ convertInterwikiLinks     │  !W, !G, etc. → full URLs
└───────────────────────────┘
             │
             ▼
┌───────────────────────────┐
│ linkAuto                  │  Regex-based term linking
└───────────────────────────┘
             │
             ▼
┌───────────────────────────┐
│ checkWP                   │  Validate Wikipedia links
└───────────────────────────┘
             │
             ▼
┌───────────────────────────┐
│ writeHtml5String          │  Render to HTML
└───────────────────────────┘
             │
             ▼
┌───────────────────────────┐
│ cleanAbstractsHTML        │  Sanitize output
└───────────────────────────┘
             │
             ▼
┌───────────────────────────┐
│ singleShotRecommendations │  Embedding similarity
└───────────────────────────┘
             │
             ▼
stdout (HTML + See Also)

Wikipedia Validation

checkWP :: Pandoc -> IO ()
checkWP p = do
    let links = filter ("wikipedia.org" `T.isInfixOf`) $ extractURLs p
    mapM_ (isWPArticle True) links  -- Check existence
    mapM_ isWPDisambig links        -- Warn on disambiguation pages

This validation catches two common errors:

  • Links to non-existent Wikipedia articles (typos, deleted pages)
  • Links to disambiguation pages (should link to specific article)
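The link-collection half of this check can be sketched with Pandoc's generic query traversal. The helper names linkTargets, wikipediaLinks, and exampleDoc are hypothetical; linkTargets is only a stand-in for Query.extractURLs, whose actual behavior is not shown here. The network checks (isWPArticle, isWPDisambig) are omitted.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Text.Pandoc.Definition
  ( Block (Para), Inline (Link, Str), Pandoc (Pandoc), nullAttr, nullMeta )
import Text.Pandoc.Walk (query)

-- | Hypothetical stand-in for extractURLs: collect every link target.
linkTargets :: Pandoc -> [T.Text]
linkTargets = query go
  where
    go :: Inline -> [T.Text]
    go (Link _ _ (url, _)) = [url]
    go _                   = []

-- | Keep only targets that mention wikipedia.org, mirroring the
-- substring filter used in checkWP.
wikipediaLinks :: Pandoc -> [T.Text]
wikipediaLinks = filter ("wikipedia.org" `T.isInfixOf`) . linkTargets

-- | Small example document with one Wikipedia link and one other link.
exampleDoc :: Pandoc
exampleDoc = Pandoc nullMeta
  [ Para [ Link nullAttr [Str "Pandoc"] ("https://en.wikipedia.org/wiki/Pandoc", "")
         , Link nullAttr [Str "home"] ("https://example.com/", "") ] ]
```

Because the filter is a plain substring match, it selects any URL containing "wikipedia.org", including other language editions.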

Key Patterns

Standalone Document Processing

Unlike the main Hakyll build, which processes the whole site, this tool handles a single document in isolation:

main = do
    originalMarkdown <- TIO.getContents  -- Single document from stdin
    -- ... process ...
    putStrLn html                        -- Single document to stdout

This design enables:

  • Integration with Unix pipelines
  • Use in annotation creation workflows
  • Testing transformations on individual documents

Working Directory Management

The tool explicitly sets the working directory before loading databases:

C.cd  -- Ensure correct directory for metadata databases
matchList <- GS.singleShotRecommendations html

This is necessary because singleShotRecommendations needs to read the embeddings database, backlinks database, and metadata from specific file paths.
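The effect of relative paths can be illustrated with a small preflight check written against System.Directory. This is a sketch, not the behavior of C.cd: the helper name missingFrom and the idea of checking for files up front are assumptions made for illustration.

```haskell
import Control.Monad (filterM)
import System.Directory (doesFileExist, withCurrentDirectory)

-- | Hypothetical helper: resolve the given relative paths against a root
-- directory and return those that do not exist there. The previous working
-- directory is restored once the check finishes.
missingFrom :: FilePath -> [FilePath] -> IO [FilePath]
missingFrom root paths =
  withCurrentDirectory root $
    filterM (fmap not . doesFileExist) paths
```

Called as `missingFrom root ["metadata/embeddings.bin"]`, it returns an empty list only when the embeddings file is reachable from `root`. This is the property the directory change is meant to guarantee before the databases are loaded.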

See Also Formatting

Recommendations are wrapped in a collapsible div for consistent styling:

"<div class=\"aux-links-append see-also-append collapse\">\n\n" ++
"<p><strong>See Also</strong>:</p>\n\n" ++
matchList ++
"\n</div>"

The collapse class allows the recommendations to be hidden by default on pages where they might be distracting.
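A minimal sketch of how the wrapper might be applied follows. The function name appendSeeAlso is hypothetical, and two things are assumptions: both values are Strings, and the block is omitted when the recommendation list is empty. The second assumption is one reading of the "optional" See Also section.

```haskell
-- | Append the See Also block to rendered HTML, or return the HTML
-- unchanged when there are no recommendations (assumed behavior).
appendSeeAlso :: String -> String -> String
appendSeeAlso html matchList
  | null matchList = html
  | otherwise =
      html ++ "\n\n" ++
      "<div class=\"aux-links-append see-also-append collapse\">\n\n" ++
      "<p><strong>See Also</strong>:</p>\n\n" ++
      matchList ++
      "\n</div>"
```

Skipping the wrapper for an empty list avoids emitting a collapsible block that contains only its heading.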


Configuration

Setting               | Source           | Purpose
pandocExtensions      | Pandoc           | Full Markdown extension set
safeHtmlWriterOptions | Utils            | HTML output settings
Interwiki prefixes    | Config.Interwiki | Shorthand → URL mappings
LinkAuto patterns     | Config.LinkAuto  | Term → URL regex patterns
C.cd                  | Config.Misc      | Working directory path

Integration Points

Dependencies

Module          | Usage
LinkMetadata    | cleanAbstractsHTML for output sanitization
LinkAuto        | linkAuto for automatic term linking
Interwiki       | convertInterwikiLinks, isWPArticle, isWPDisambig
GenerateSimilar | singleShotRecommendations for See Also generation
Query           | extractURLs for Wikipedia link validation

Input/Output

  • Input: Markdown text via stdin
  • Output: HTML with optional See Also section via stdout
  • Side effects: HTTP requests to Wikipedia API for link validation

Database Access

Reads (via singleShotRecommendations):

  • Embeddings database (metadata/embeddings.bin)
  • Backlinks database
  • Metadata database (.gtx file)

See Also