Page Lifecycle

How a gwern.net page transforms from Markdown source to rendered HTML in the browser—and why each step exists.

The Big Picture

A single essay goes through 10 distinct phases before a reader sees it. Each phase exists because gwern.net optimizes for long-term reading, not authoring convenience. The complexity pays off in reader experience: hover any link to see a rich preview, every dollar amount adjusts for inflation, dead links automatically fall back to archived copies.

Phase | What happens
1. Source Files | Markdown + GTX annotation databases
2. Pre-processing | URL normalization, string fixes
3. Content Generation | Backlinks, similar links, directories
4. Pandoc Parse | Markdown → Pandoc AST
5. AST Transforms | 14 AST enrichments (the core)
6. Render to HTML | AST → HTML + templates
7. Post-processing | MathJax, syntax highlighting
8. Validation | 50+ automated checks
9. Deployment | rsync + cache purge
10. Browser Runtime | JS event system, popups

Let's walk through what happens at each phase.

1. Source Files

Input: Markdown files (.md) + GTX annotation databases

Where: *.md files in repo root, metadata/*.gtx files

Gwern writes essays in Markdown with YAML frontmatter:

---
title: "The Scaling Hypothesis"
created: 2020-04-15
modified: 2024-01-10
status: finished
confidence: likely
tags: [AI, scaling]
---

Deep learning's surprising effectiveness may be explained by...

Annotations live separately in GTX files (a custom tab-separated format):

https://arxiv.org/abs/2001.08361	Scaling Laws for Neural Language Models	Kaplan et al 2020	2020-01-23	We study empirical scaling laws...	AI/scaling

Why separate files? Annotations are reused across hundreds of essays. Storing them centrally enables:

  • Single source of truth (fix one annotation, fix everywhere)
  • Efficient lookup (20,000+ annotations in memory during build)
  • Priority layering (full.gtx hand-written overrides auto.gtx machine-generated)
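
To make the format concrete, here is a minimal Haskell sketch of reading one GTX record. The field order is assumed from the example line above, and the names (Annotation, parseGTXLine) are illustrative, not the actual GTX.hs API:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- One annotation record; fields follow the tab-separated example above.
data Annotation = Annotation
  { annUrl      :: T.Text
  , annTitle    :: T.Text
  , annAuthor   :: T.Text
  , annDate     :: T.Text
  , annAbstract :: T.Text
  , annTag      :: T.Text
  } deriving Show

-- Split a line on tabs; reject lines that don't have exactly six fields.
parseGTXLine :: T.Text -> Maybe Annotation
parseGTXLine line = case T.splitOn "\t" line of
  [url, title, author, date, abstract, tag] ->
    Just (Annotation url title author date abstract tag)
  _ -> Nothing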

Deeper dive: GTX.hs, LinkMetadata.hs

2. Pre-processing

Input: Raw Markdown files

Output: Cleaned Markdown files

Where: sync.sh lines 100-177

Before compilation, sync.sh runs ~50 search-and-replace operations across all Markdown:

Category | Example | Why
URL normalization | twitter.com/ → x.com/ | Canonical URLs for deduplication
Tracking removal | ?utm_source=... → (removed) | Clean links
Name fixes | Yann Le Cun → Yann LeCun | Consistent author names
Citation style | et al. → et al | No trailing period before year
Protocol upgrade | http://arxiv.org → https://arxiv.org | Security

Before:

See [Scaling Laws](http://arxiv.org/abs/2001.08361?utm_source=twitter)
by Kaplan et al. (2020).

After:

See [Scaling Laws](https://arxiv.org/abs/2001.08361)
by Kaplan et al 2020.

Why automated? These are corrections Gwern would make manually anyway. Automating them means the source stays clean without constant vigilance.
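
The same idea as a Haskell sketch, for illustration only: the real rewrites run as sed calls through the gwsed shell function, and the rule list below is just a sample from the table above.

{-# LANGUAGE OverloadedStrings #-}
import Data.List (foldl')
import qualified Data.Text as T

-- A few literal search/replace pairs mirroring the gwsed rules.
rewriteRules :: [(T.Text, T.Text)]
rewriteRules =
  [ ("http://arxiv.org", "https://arxiv.org")
  , ("twitter.com/",     "x.com/")
  , ("Yann Le Cun",      "Yann LeCun")
  ]

-- Apply every rule to the whole document, in order.
preprocess :: T.Text -> T.Text
preprocess doc = foldl' (\acc (from, to) -> T.replace from to acc) doc rewriteRules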

Deeper dive: bash.sh (gwsed function)

3. Content Generation

Input: Cleaned Markdown + metadata databases

Output: Generated Markdown files, annotation HTML fragments

Where: sync.sh lines 245-320

Several types of content must be generated before Hakyll runs:

Generated | Source | Why
Backlinks | Scan all pages for incoming links | "What links here" sections
Similar links | OpenAI embeddings + RP-tree search | Related content discovery
Tag directories | Tag metadata from frontmatter | Browsable topic indexes
Link bibliographies | Per-page link collection | Reference lists
Annotation HTML | GTX → rendered blockquotes | Pre-rendered for popup speed

Before (no backlinks section):

---
title: GPT-3
---

GPT-3 is a large language model...

After (backlinks generated):

<!-- metadata/backlinks/gpt-3.html contains -->
<section id="backlinks">
<h2>Backlinks</h2>
<ul>
<li><a href="/scaling">The Scaling Hypothesis</a></li>
<li><a href="/llm-economics">LLM Economics</a></li>
</ul>
</section>

Why pre-generate? Computing backlinks requires reading all pages. Doing this inside Hakyll would be slow and complex. Generating once and including is cleaner.
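
A minimal sketch of the inversion step, assuming each page's parsed Pandoc AST is already in memory; the function names are illustrative, not the generate-backlinks.hs API:

import Text.Pandoc.Definition (Pandoc, Inline (..))
import Text.Pandoc.Walk (query)
import qualified Data.Map.Strict as M
import qualified Data.Text as T

-- Every link target appearing anywhere in one document.
pageLinks :: Pandoc -> [T.Text]
pageLinks = query getLink
  where
    getLink (Link _ _ (url, _)) = [url]
    getLink _                   = []

-- Invert (page, its document) pairs into target -> pages that link to it.
backlinks :: [(T.Text, Pandoc)] -> M.Map T.Text [T.Text]
backlinks pages = M.fromListWith (++)
  [ (target, [page]) | (page, doc) <- pages, target <- pageLinks doc ]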

Deeper dive: generate-backlinks.hs, generate-similar.hs, generate-directory.hs

4. Pandoc Parse

Input: Pre-processed Markdown

Output: Pandoc AST (Abstract Syntax Tree)

Where: hakyll.hs via pandocCompilerWithTransformM

Pandoc parses Markdown into a structured tree representation:

Markdown:

# Introduction

See [GPT-3](https://arxiv.org/abs/2005.14165) for details.

AST (simplified):

Pandoc (Meta {...})
  [ Header 1 ("introduction", [], []) [Str "Introduction"]
  , Para [ Str "See "
         , Link ("", [], []) [Str "GPT-3"] ("https://arxiv.org/abs/2005.14165", "")
         , Str " for details."
         ]
  ]

Why an AST? Direct text manipulation is fragile—regex can't reliably distinguish code blocks from prose. The AST makes transformations safe: you can walk the tree, match specific node types, and transform without breaking structure.
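
The fragment above can be parsed with the pandoc library directly. This standalone sketch (not the hakyll.hs code, which goes through pandocCompilerWithTransformM) prints a more verbose version of the AST shown:

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc
import qualified Data.Text as T

main :: IO ()
main = do
  let src = "# Introduction\n\nSee [GPT-3](https://arxiv.org/abs/2005.14165) for details." :: T.Text
  -- Parse Markdown into the Pandoc AST using the full extension set.
  ast <- runIOorExplode $
           readMarkdown def { readerExtensions = pandocExtensions } src
  print ast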

Deeper dive: Query.hs, Utils.hs

5. AST Transforms

Input: Pandoc AST

Output: Enriched Pandoc AST

Where: hakyll.hs pandocTransform function

This is where gwern.net's magic happens. The AST passes through 14 transforms in sequence:

a. Interwiki Links

Expands shorthand wiki links to full URLs.

Before: [Scaling hypothesis](!W)

After: [Scaling hypothesis](https://en.wikipedia.org/wiki/Scaling_hypothesis)

Why? Faster to write, cleaner source. Interwiki.hs
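
A sketch of how such a transform looks against the AST, assuming the simple case where the link target is exactly "!W" and the article name comes from the link text; the real Interwiki.hs supports many wikis and edge cases:

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition (Pandoc, Inline (..))
import Text.Pandoc.Shared (stringify)
import Text.Pandoc.Walk (walk)
import qualified Data.Text as T

-- Rewrite "!W"-targeted links into full Wikipedia URLs.
expandInterwiki :: Pandoc -> Pandoc
expandInterwiki = walk go
  where
    go (Link attr inlines ("!W", title)) =
      Link attr inlines ("https://en.wikipedia.org/wiki/" <> slug inlines, title)
    go x = x
    -- Turn the visible link text into an underscore-separated article name.
    slug = T.intercalate "_" . T.words . stringify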

b. Auto-linking

Converts ~1,000 citation patterns into hyperlinks automatically.

Before: Kaplan et al 2020 found that...

After: [Kaplan et al 2020](https://arxiv.org/abs/2001.08361) found that...

Why? Academic citations are tedious to link manually. The system knows major papers. LinkAuto.hs

c. Annotation Creation

Triggers scraping for any URL lacking metadata. If you link to an arXiv paper, the system fetches title/authors/abstract.

Before: Link exists, no annotation in database

After: GTX entry created via scraper (arXiv API, PDF metadata, HTML parsing)

Why? Rich hover previews need metadata. Getting it automatically means every link can have a preview. Annotation.hs

d. Annotation Marking

Adds .link-annotated class to links that have annotations in the database.

Before: <a href="https://arxiv.org/abs/2001.08361">paper</a>

After: <a href="https://arxiv.org/abs/2001.08361" class="link-annotated">paper</a>

Why? The browser needs to know which links have popups available. LinkMetadata.hs
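
A sketch of the marking pass, with a plain Set of annotated URLs standing in for the real metadata database lookup in LinkMetadata.hs:

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition (Pandoc, Inline (..))
import Text.Pandoc.Walk (walk)
import qualified Data.Set as S
import qualified Data.Text as T

-- Add the .link-annotated class to any link whose URL has an annotation.
markAnnotated :: S.Set T.Text -> Pandoc -> Pandoc
markAnnotated annotated = walk mark
  where
    mark (Link (ident, classes, kvs) inlines (url, title))
      | url `S.member` annotated =
          Link (ident, "link-annotated" : classes, kvs) inlines (url, title)
    mark x = x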

e. File Sizes

Adds file size metadata to PDF/video links.

Before: <a href="/doc/ai/scaling.pdf">PDF</a>

After: <a href="/doc/ai/scaling.pdf" data-file-size="2.1MB">PDF</a>

Why? Readers should know download sizes before clicking. LinkMetadata.hs

f. Inflation Adjustment

Converts nominal dollar amounts to current dollars with subscript showing original.

Before: $100 in 1990

After: $240<sub>$100 in 1990</sub>

Why? Historical dollar amounts are misleading without inflation context. Inflation.hs
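
The arithmetic behind the adjustment is just a CPI ratio. The values below are approximate illustrative CPI-U annual averages, not the table Inflation.hs actually uses:

import qualified Data.Map.Strict as M

-- Approximate US CPI-U annual averages (illustrative values only).
cpi :: M.Map Int Double
cpi = M.fromList [(1990, 130.7), (2024, 313.7)]

-- adjust 1990 2024 100 ≈ 240: $100 in 1990 dollars is roughly $240 today.
adjust :: Int -> Int -> Double -> Maybe Double
adjust fromYear toYear amount = do
  old <- M.lookup fromYear cpi
  new <- M.lookup toYear cpi
  pure (amount * new / old)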

g. Link Archiving

Rewrites external URLs to local archived copies when available.

Before: <a href="https://example.com/article">article</a>

After: <a href="/doc/www/example.com/article.html" data-url-original="https://example.com/article">article</a>

Why? Links rot. Gwern archives important pages locally. If the original dies, the archived copy serves automatically. LinkArchive.hs

h. Link Icons

Assigns icons based on URL patterns (PDF icon, Wikipedia icon, etc.).

Before: <a href="https://arxiv.org/abs/2001.08361">paper</a>

After: <a href="..." class="link-icon-arxiv" data-link-icon-type="svg">paper</a>

Why? Visual cues help readers anticipate link destinations. LinkIcon.hs

i. Typography

Applies text polish: citation formatting, date subscripts, ruler cycling, title case.

Before: Kaplan et al 2020

After: <span class="cite"><span class="cite-author">Kaplan et al</span> <span class="cite-date">2020</span></span>

Why? Consistent typography improves readability. Semantic spans enable hover effects. Typography.hs

j. Image Dimensions

Adds width/height attributes to images to prevent layout shift.

Before: <img src="/image/gpt3.png">

After: <img src="/image/gpt3.png" width="800" height="600">

Why? Without dimensions, the page jumps as images load. Image.hs

k. Cleanup

Final passes that normalize the document:

  • Header self-links: Makes every <h2>/<h3> clickable to copy its anchor URL
  • Paragraph wrapping: Converts stray Plain blocks to Para for consistent <p> tags
  • Prefetch hints: Marks links eligible for instant.page prefetching
  • Class sanitization: Strips build-time-only classes like archive-not, link-annotated-not

6. Render to HTML

Input: Transformed AST

Output: HTML string

Where: hakyll.hs via Pandoc writer + templates

Pandoc renders the enriched AST to HTML, then Hakyll wraps it in templates:

<!DOCTYPE html>
<html>
<head>
<title>$title$</title>
<meta name="description" content="$description$">
</head>
<body class="page-$safe-url$">
<article>
<header>
<h1>$title$</h1>
<div class="page-metadata">$created$ · $status$</div>
</header>
<div id="markdownBody">
$body$
</div>
$if(backlinks-yes)$
<section id="backlinks">...</section>
$endif$
</article>
</body>
</html>

Why templates? Separation of content and presentation. Update the template once, all pages change. Template: default
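
In pandoc-library terms, the body rendering step is roughly the following sketch; Hakyll's pandocCompilerWithTransformM does the equivalent internally and then substitutes the result into the $body$ slot of the template:

import Text.Pandoc
import qualified Data.Text as T

-- Render an already-transformed AST to an HTML fragment.
renderBody :: Pandoc -> IO T.Text
renderBody doc = runIOorExplode (writeHtml5String def doc)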

7. Post-processing

Input: Raw HTML from Hakyll

Output: Production-ready HTML

Where: sync.sh lines 340-600

Several transforms happen after Hakyll:

Transform | What | Why
MathJax | LaTeX → CSS+HTML | Pre-render math server-side for fast load
Syntax highlighting | Code → colored spans | Consistent highlighting across languages
Class cleanup | Remove build-only classes | Smaller HTML, cleaner DOM
Document conversion | .docx → .html | Popup previews for Office files

Before (MathJax):

<span class="math">E = mc^2</span>

After (pre-rendered):

<span class="mjx-math"><span class="mjx-mi">E</span><span class="mjx-mo">=</span>...</span>

Why server-side? MathJax in the browser causes visible rendering delay. Pre-rendering eliminates it.

8. Validation

Input: Complete site in _site/

Output: Warnings, errors, or clean pass

Where: sync.sh lines 602-1247

Over 50 automated checks run in --slow mode:

Check | What | Why
HTML validity | Tidy HTML5 validation | Catch malformed markup
Anchor integrity | All #id links resolve | No broken in-page links
YAML completeness | Required fields present | Consistent metadata
Grammar | "a" vs "an" before vowels | Style consistency
Banned domains | No links to problematic sites | Quality control
Duplicate footnotes | No repeated #fn1 IDs | Valid HTML

Why so many checks? With 1,000+ pages, manual review is impossible. Automated checks catch regressions immediately.
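
As one example, the anchor-integrity check boils down to comparing declared ids against in-page #fragment links. A sketch using TagSoup (the production check is anchor-checker.php, not this Haskell):

import Text.HTML.TagSoup (Tag (..), parseTags)
import qualified Data.Set as S

-- Report in-page anchors ("#foo") with no matching id="foo" in the same file.
brokenAnchors :: String -> [String]
brokenAnchors html = filter (`S.notMember` ids) anchors
  where
    tags    = parseTags html
    ids     = S.fromList [ v | TagOpen _ attrs <- tags, (k, v) <- attrs, k == "id" ]
    anchors = [ drop 1 href | TagOpen "a" attrs <- tags
                            , ("href", href) <- attrs
                            , take 1 href == "#" ]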

Deeper dive: check-metadata.hs, anchor-checker.php

9. Deployment

Input: Validated _site/

Output: Live site at gwern.net

Where: sync.sh lines 1249-1283

Three rsync passes deploy the site:

  1. Infrastructure (static/): Force checksum sync for CSS/JS
  2. Pages: Checksum sync of HTML
  3. Assets: Size-only sync for large files

Deployment finishes with a Cloudflare cache purge for recently-changed URLs.

Why multiple passes? Different content types have different invalidation needs. Static assets change rarely but must sync perfectly. Pages change often but only need checksum comparison.

10. Browser Runtime

Input: Static HTML + JS

Output: Interactive reading experience

Where: initial.js, rewrite.js, popups.js

When the page loads, JavaScript enhances it further:

The Event System

  1. DOMContentLoaded
  2. GW.contentDidLoad (transclude phase): transclude handlers resolve include-links
  3. GW.contentDidLoad (rewrite phase): rewrite handlers transform the DOM
  4. GW.contentDidInject (eventListeners phase): event binding (hover, click handlers)

Runtime Transforms

Handler | What | Why
wrapImages | Wrap <img> in <figure> | Consistent figure structure
makeTablesSortable | Enable column sorting | Interactive data exploration
hyphenate | Add soft hyphens | Better text flow
Link icon management | Enable/disable icons | Context-dependent display

When you hover an annotated link:

  1. extracts.js detects the link type (annotation, local page, etc.)
  2. After a 750ms delay, popups.js spawns a popup frame
  3. content.js fetches annotation HTML from /metadata/annotation/
  4. transclude.js resolves any include-links in the annotation
  5. rewrite.js handlers transform the popup content
  6. Popup displays with smart positioning

Why client-side enhancement? Some transforms depend on viewport size, user preferences, or dynamic state that can't be known at build time.

Deeper dive: extracts.js, popups.js, transclude.js

Worked Example: One Citation

Let's trace what happens when you write this in Markdown:

Kaplan et al 2020 showed that loss scales as a power law.

2. Pre-processing

et al. → et al (if period present)

5b. Auto-linking

Pattern Kaplan et al 2020 matches database → wrapped in link:

[Kaplan et al 2020](https://arxiv.org/abs/2001.08361) showed...

5c. Annotation Creation

arXiv URL checked against GTX database → annotation exists (or created via API)

5d. Annotation Marking

Link gets class:

<a href="https://arxiv.org/abs/2001.08361" class="link-annotated">Kaplan et al 2020</a>

5h. Link Icons

arXiv domain → icon class:

<a href="..." class="link-annotated link-icon-arxiv">Kaplan et al 2020</a>

5i. Typography

Citation text wrapped in semantic spans:

<a href="..." class="link-annotated link-icon-arxiv">
<span class="cite">
<span class="cite-author">Kaplan et al</span>
<span class="cite-date">2020</span>
</span>
</a>

6. Render to HTML

Wrapped in page template with all other content.

10. Browser Runtime

On hover after 750ms:

  1. Popup spawns
  2. Annotation fetched: title, authors, abstract, tags
  3. Displayed in positioned popup with typography transforms applied

Final reader experience: Hover "Kaplan et al 2020" → see paper title, authors, publication date, abstract, and tags—all from a plain text citation in the source.

Why This Complexity?

Every phase serves gwern.net's core goal: optimize for readers, not writers.

Complexity | Reader Benefit
14 AST transforms | Rich, consistent formatting without manual markup
Annotation system | Instant previews for 20,000+ links
Link archiving | Content survives link rot
Pre-rendered MathJax | No rendering flicker
Inflation adjustment | Historical context without manual calculation
Backlinks | Discover related content
Validation suite | No broken links or malformed pages

The build takes 30-60 minutes. But readers spend far more cumulative time on the site. Investing authoring complexity for reading quality is the right tradeoff for long-form content meant to be read for decades.

See Also