Page Lifecycle
How a gwern.net page transforms from Markdown source to rendered HTML in the browser—and why each step exists.
The Big Picture
A single essay goes through 10 distinct phases before a reader sees it. Each phase exists because gwern.net optimizes for long-term reading, not authoring convenience. The complexity pays off in reader experience: hover any link to see a rich preview; every dollar amount adjusts for inflation; dead links automatically fall back to archived copies.
| Phase | What happens |
|---|---|
| 1. Source Files | Markdown + GTX annotation databases |
| 2. Pre-processing | URL normalization, string fixes |
| 3. Content Generation | Backlinks, similar links, directories |
| 4. Pandoc Parse | Markdown → Pandoc AST |
| 5. AST Transforms | 14 AST enrichments (the core) |
| 6. Render to HTML | AST → HTML + templates |
| 7. Post-processing | MathJax, syntax highlighting |
| 8. Validation | 50+ automated checks |
| 9. Deployment | rsync + cache purge |
| 10. Browser Runtime | JS event system, popups |
Let's walk through what happens at each phase.
1. Source Files
Input: Markdown files (.md) + GTX annotation databases
Where: *.md files in repo root, metadata/*.gtx files
Gwern writes essays in Markdown with YAML frontmatter:
---
title: "The Scaling Hypothesis"
created: 2020-04-15
modified: 2024-01-10
status: finished
confidence: likely
tags: [AI, scaling]
---
Deep learning's surprising effectiveness may be explained by...
Annotations live separately in GTX files (a custom tab-separated format):
https://arxiv.org/abs/2001.08361 Scaling Laws for Neural Language Models Kaplan et al 2020 2020-01-23 We study empirical scaling laws... AI/scaling
Why separate files? Annotations are reused across hundreds of essays. Storing them centrally enables:
- Single source of truth (fix one annotation, fix everywhere)
- Efficient lookup (20,000+ annotations in memory during build)
- Priority layering (hand-written full.gtx overrides machine-generated auto.gtx)
Deeper dive: GTX.hs, LinkMetadata.hs
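To make the priority layering concrete, here is a minimal Haskell sketch (the Annotation type and merge function are stand-ins; the real parsing and merging live in GTX.hs and LinkMetadata.hs) of hand-written full.gtx entries shadowing machine-generated auto.gtx entries:

```haskell
-- Sketch only: full.gtx (hand-written) overrides auto.gtx (machine-generated)
-- on URL collisions. The Annotation type is a stand-in for the real record.
import qualified Data.Map.Strict as M

type URL        = String
type Annotation = [String]  -- e.g. [title, author, date, abstract, tags]

layerAnnotations :: M.Map URL Annotation  -- full.gtx
                 -> M.Map URL Annotation  -- auto.gtx
                 -> M.Map URL Annotation
layerAnnotations full auto = M.union full auto  -- M.union is left-biased, so full wins

main :: IO ()
main = do
  let auto = M.fromList [("https://arxiv.org/abs/2001.08361", ["Scaling Laws (scraped title)"])]
      full = M.fromList [("https://arxiv.org/abs/2001.08361", ["Scaling Laws for Neural Language Models"])]
  print (M.lookup "https://arxiv.org/abs/2001.08361" (layerAnnotations full auto))
  -- Just ["Scaling Laws for Neural Language Models"] -- the hand-written entry wins
```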
2. Pre-processing
Input: Raw Markdown files
Output: Cleaned Markdown files
Where: sync.sh lines 100-177
Before compilation, sync.sh runs ~50 search-and-replace operations across all Markdown:
| Category | Example | Why |
|---|---|---|
| URL normalization | twitter.com/ → x.com/ | Canonical URLs for deduplication |
| Tracking removal | ?utm_source=... → (removed) | Clean links |
| Name fixes | Yann Le Cun → Yann LeCun | Consistent author names |
| Citation style | et al. → et al | No trailing period before year |
| Protocol upgrade | http://arxiv.org → https://arxiv.org | Security |
Before:
See [Scaling Laws](http://arxiv.org/abs/2001.08361?utm_source=twitter)
by Kaplan et al. (2020).
After:
See [Scaling Laws](https://arxiv.org/abs/2001.08361)
by Kaplan et al 2020.
Why automated? These are corrections Gwern would make manually anyway. Automating them means the source stays clean without constant vigilance.
Deeper dive: bash.sh (gwsed function)
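The same idea in a few lines of Haskell, as an illustration only (the real pass is the gwsed shell function with ~50 pairs; this sample list is not exhaustive):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Illustrative fixed-string rewrite pass in the spirit of gwsed.
import qualified Data.Text as T

fixes :: [(T.Text, T.Text)]
fixes =
  [ ("twitter.com/", "x.com/")                  -- URL normalization
  , ("http://arxiv.org", "https://arxiv.org")   -- protocol upgrade
  , ("Yann Le Cun", "Yann LeCun")               -- author-name consistency
  ]

preprocess :: T.Text -> T.Text
preprocess doc = foldl (\acc (from, to) -> T.replace from to acc) doc fixes

main :: IO ()
main = putStrLn (T.unpack (preprocess "See http://arxiv.org/abs/2001.08361 via twitter.com/foo"))
-- See https://arxiv.org/abs/2001.08361 via x.com/foo
```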
3. Content Generation
Input: Cleaned Markdown + metadata databases
Output: Generated Markdown files, annotation HTML fragments
Where: sync.sh lines 245-320
Several types of content must be generated before Hakyll runs:
| Generated | Source | Why |
|---|---|---|
| Backlinks | Scan all pages for incoming links | "What links here" sections |
| Similar links | OpenAI embeddings + RP-tree search | Related content discovery |
| Tag directories | Tag metadata from frontmatter | Browsable topic indexes |
| Link bibliographies | Per-page link collection | Reference lists |
| Annotation HTML | GTX → rendered blockquotes | Pre-rendered for popup speed |
Before (no backlinks section):
---
title: GPT-3
---
GPT-3 is a large language model...
After (backlinks generated):
<!-- metadata/backlinks/gpt-3.html contains -->
<section id="backlinks">
<h2>Backlinks</h2>
<ul>
<li><a href="/scaling">The Scaling Hypothesis</a></li>
<li><a href="/llm-economics">LLM Economics</a></li>
</ul>
</section>
Why pre-generate? Computing backlinks requires reading all pages. Doing this inside Hakyll would be slow and complex. Generating them once and including the results is cleaner.
Deeper dive: generate-backlinks.hs, generate-similar.hs, generate-directory.hs
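Conceptually, backlink generation is just inverting the "page → outgoing links" relation. A small sketch (the link-extraction step is assumed as input here; generate-backlinks.hs does the real scanning):

```haskell
import qualified Data.Map.Strict as M

type Page = FilePath
type URL  = String

-- Invert outgoing links into a "what links here" table.
backlinks :: [(Page, [URL])] -> M.Map URL [Page]
backlinks pages =
  M.fromListWith (++) [ (url, [page]) | (page, urls) <- pages, url <- urls ]

main :: IO ()
main = do
  let outgoing = [ ("/scaling",       ["/gpt-3", "/llm-economics"])
                 , ("/llm-economics", ["/gpt-3"]) ]
  print (M.lookup "/gpt-3" (backlinks outgoing))
  -- Just ["/llm-economics","/scaling"] -- the backlinks section for /gpt-3
```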
4. Pandoc Parse
Input: Pre-processed Markdown
Output: Pandoc AST (Abstract Syntax Tree)
Where: hakyll.hs via pandocCompilerWithTransformM
Pandoc parses Markdown into a structured tree representation:
Markdown:
# Introduction
See [GPT-3](https://arxiv.org/abs/2005.14165) for details.
AST (simplified):
Pandoc Meta
[ Header 1 ("introduction", [], []) [Str "Introduction"]
, Para [ Str "See "
, Link ("", [], []) [Str "GPT-3"]
("https://arxiv.org/abs/2005.14165", "")
, Str " for details."
]
]
Why an AST? Direct text manipulation is fragile—regex can't reliably distinguish code blocks from prose. The AST makes transformations safe: you can walk the tree, match specific node types, and transform without breaking structure.
Deeper dive: Query.hs, Utils.hs
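For reference, parsing Markdown to the AST directly with the pandoc library looks roughly like this (hakyll.hs goes through pandocCompilerWithTransformM instead, with more reader extensions enabled than the defaults used here):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc

main :: IO ()
main = do
  let md = "# Introduction\n\nSee [GPT-3](https://arxiv.org/abs/2005.14165) for details.\n"
  case runPure (readMarkdown def md) of
    Left err                -> print err
    Right (Pandoc _ blocks) -> mapM_ print blocks
    -- Prints the Header and Para blocks shown (simplified) above.
```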
5. AST Transforms
Input: Pandoc AST
Output: Enriched Pandoc AST
Where: hakyll.hs pandocTransform function
This is where gwern.net's magic happens. The AST passes through 14 transforms in sequence:
a. Interwiki Links
Expands shorthand wiki links to full URLs.
Before: [Scaling hypothesis](!W)
After: [Scaling hypothesis](https://en.wikipedia.org/wiki/Scaling_hypothesis)
Why? Faster to write, cleaner source. Interwiki.hs
b. Auto-linking
Converts ~1,000 citation patterns into hyperlinks automatically.
Before: Kaplan et al 2020 found that...
After: [Kaplan et al 2020](https://arxiv.org/abs/2001.08361) found that...
Why? Academic citations are tedious to link manually. The system knows major papers. LinkAuto.hs
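A toy version of the idea, walking the AST and wrapping exact matches in a Link (the real LinkAuto.hs matches multi-word patterns across runs of inlines; the hyphenated key and tiny database here are purely illustrative):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)
import qualified Data.Map.Strict as M
import Data.Text (Text)

-- Illustrative citation database: pattern -> URL.
citationDB :: M.Map Text Text
citationDB = M.fromList [("Kaplan-et-al-2020", "https://arxiv.org/abs/2001.08361")]

-- Wrap any Str inline that exactly matches a known citation key in a Link.
autoLink :: Pandoc -> Pandoc
autoLink = walk go
  where
    go s@(Str t) = maybe s (\url -> Link nullAttr [Str t] (url, "")) (M.lookup t citationDB)
    go x         = x

main :: IO ()
main = print (autoLink (Pandoc nullMeta [Para [Str "Kaplan-et-al-2020", Space, Str "found", Space, Str "that..."]]))
```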
c. Annotation Creation
Triggers scraping for any URL lacking metadata. If you link to an arXiv paper, the system fetches title/authors/abstract.
Before: Link exists, no annotation in database
After: GTX entry created via scraper (arXiv API, PDF metadata, HTML parsing)
Why? Rich hover previews need metadata. Getting it automatically means every link can have a preview. Annotation.hs
d. Annotation Marking
Adds .link-annotated class to links that have annotations in the database.
Before: <a href="https://arxiv.org/abs/2001.08361">paper</a>
After: <a href="https://arxiv.org/abs/2001.08361" class="link-annotated">paper</a>
Why? The browser needs to know which links have popups available. LinkMetadata.hs
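Annotation marking, file sizes, link icons, and image dimensions all share one shape: walk the AST and append a class or attribute when the URL appears in some lookup table. A sketch of that shape (set contents and names are illustrative, not LinkMetadata.hs's internals):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)
import qualified Data.Set as S
import Data.Text (Text)

-- Illustrative set of URLs that have annotations.
annotated :: S.Set Text
annotated = S.fromList ["https://arxiv.org/abs/2001.08361"]

-- Append the .link-annotated class to links found in the set.
markAnnotated :: Pandoc -> Pandoc
markAnnotated = walk go
  where
    go (Link (ident, cls, kvs) inlines (url, title))
      | url `S.member` annotated = Link (ident, cls ++ ["link-annotated"], kvs) inlines (url, title)
    go x = x

main :: IO ()
main = print (markAnnotated (Pandoc nullMeta
  [Para [Link nullAttr [Str "paper"] ("https://arxiv.org/abs/2001.08361", "")]]))
```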
e. File Sizes
Adds file size metadata to PDF/video links.
Before: <a href="/doc/ai/scaling.pdf">PDF</a>
After: <a href="/doc/ai/scaling.pdf" data-file-size="2.1MB">PDF</a>
Why? Readers should know download sizes before clicking. LinkMetadata.hs
f. Inflation Adjustment
Converts nominal dollar amounts to current dollars with subscript showing original.
Before: $100 in 1990
After: $240<sub>$100 in 1990</sub>
Why? Historical dollar amounts are misleading without inflation context. Inflation.hs
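The underlying arithmetic is just scaling by a ratio of price-index values. A back-of-the-envelope sketch (the index figures and target year here are placeholders, not the table Inflation.hs actually uses):

```haskell
import qualified Data.Map.Strict as M
import Text.Printf (printf)

-- Placeholder CPI-style index values, keyed by year.
cpi :: M.Map Int Double
cpi = M.fromList [(1990, 130.7), (2023, 304.7)]

-- Adjusted amount = nominal amount * (index in target year / index in source year).
adjust :: Int -> Int -> Double -> Maybe Double
adjust fromYear toYear amount = do
  from <- M.lookup fromYear cpi
  to   <- M.lookup toYear   cpi
  pure (amount * to / from)

main :: IO ()
main = case adjust 1990 2023 100 of
  Just x  -> printf "$100 in 1990 is roughly $%.0f in 2023\n" x
  Nothing -> putStrLn "year not in index table"
```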
g. Link Archiving
Rewrites external URLs to local archived copies when available.
Before: <a href="https://example.com/article">article</a>
After: <a href="/doc/www/example.com/article.html" data-url-original="https://example.com/article">article</a>
Why? Links rot. Gwern archives important pages locally. If the original dies, the archived copy serves automatically. LinkArchive.hs
h. Link Icons
Assigns icons based on URL patterns (PDF icon, Wikipedia icon, etc.).
Before: <a href="https://arxiv.org/abs/2001.08361">paper</a>
After: <a href="..." class="link-icon-arxiv" data-link-icon-type="svg">paper</a>
Why? Visual cues help readers anticipate link destinations. LinkIcon.hs
i. Typography
Applies text polish: citation formatting, date subscripts, ruler cycling, title case.
Before: Kaplan et al 2020
After: <span class="cite"><span class="cite-author">Kaplan et al</span> <span class="cite-date">2020</span></span>
Why? Consistent typography improves readability. Semantic spans enable hover effects. Typography.hs
j. Image Dimensions
Adds width/height attributes to images to prevent layout shift.
Before: <img src="/image/gpt3.png">
After: <img src="/image/gpt3.png" width="800" height="600">
Why? Without dimensions, the page jumps as images load. Image.hs
k. Cleanup
Final passes that normalize the document:
- Header self-links: Makes every <h2>/<h3> clickable to copy its anchor URL
- Paragraph wrapping: Converts stray Plain blocks to Para for consistent <p> tags
- Prefetch hints: Marks links eligible for instant.page prefetching
- Class sanitization: Strips build-time-only classes like archive-not, link-annotated-not
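At the top level, the whole phase is conceptually a composition of per-transform passes. A simplified, pure sketch (hakyll.hs's pandocTransform is monadic, since several transforms do IO such as scraping and archiving; the stand-in functions here are just id):

```haskell
import Text.Pandoc.Definition

-- Stand-ins for the transforms described above.
interwiki, autoLink, markAnnotated, typography :: Pandoc -> Pandoc
interwiki     = id
autoLink      = id
markAnnotated = id
typography    = id

-- Composed right-to-left: interwiki runs first, typography last.
pandocTransformSketch :: Pandoc -> Pandoc
pandocTransformSketch = foldr (.) id [typography, markAnnotated, autoLink, interwiki]

main :: IO ()
main = print (pandocTransformSketch (Pandoc nullMeta []))
```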
6. Render to HTML
Input: Transformed AST
Output: HTML string
Where: hakyll.hs via Pandoc writer + templates
Pandoc renders the enriched AST to HTML, then Hakyll wraps it in templates:
<!DOCTYPE html>
<html>
<head>
<title>$title$</title>
<meta name="description" content="$description$">
</head>
<body class="page-$safe-url$">
<article>
<header>
<h1>$title$</h1>
<div class="page-metadata">$created$ · $status$</div>
</header>
<div id="markdownBody">
$body$
</div>
$if(backlinks-yes)$
<section id="backlinks">...</section>
$endif$
</article>
</body>
</html>
Why templates? Separation of content and presentation. Update the template once, all pages change. Template: default
7. Post-processing
Input: Raw HTML from Hakyll
Output: Production-ready HTML
Where: sync.sh lines 340-600
Several transforms happen after Hakyll:
| Transform | What | Why |
|---|---|---|
| MathJax | LaTeX → CSS+HTML | Pre-render math server-side for fast load |
| Syntax highlighting | Code → colored spans | Consistent highlighting across languages |
| Class cleanup | Remove build-only classes | Smaller HTML, cleaner DOM |
| Document conversion | .docx → .html | Popup previews for Office files |
Before (MathJax):
<span class="math">E = mc^2</span>
After (pre-rendered):
<span class="mjx-math"><span class="mjx-mi">E</span><span class="mjx-mo">=</span>...</span>
Why server-side? MathJax in the browser causes visible rendering delay. Pre-rendering eliminates it.
8. Validation
Input: Complete site in _site/
Output: Warnings, errors, or clean pass
Where: sync.sh lines 602-1247
Over 50 automated checks run in --slow mode:
| Check | What | Why |
|---|---|---|
| HTML validity | Tidy HTML5 validation | Catch malformed markup |
| Anchor integrity | All #id links resolve | No broken in-page links |
| YAML completeness | Required fields present | Consistent metadata |
| Grammar | "a" vs "an" before vowels | Style consistency |
| Banned domains | No links to problematic sites | Quality control |
| Duplicate footnotes | No repeated #fn1 IDs | Valid HTML |
Why so many checks? With 1,000+ pages, manual review is impossible. Automated checks catch regressions immediately.
Deeper dive: check-metadata.hs, anchor-checker.php
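As one example of what a check looks like, here is a toy anchor-integrity pass (the production check is anchor-checker.php; this only shows the idea of collecting defined ids and flagging unresolved #fragment links):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Set as S
import qualified Data.Text as T

checkAnchors :: S.Set T.Text  -- ids defined in the page
             -> [T.Text]      -- hrefs found in the page
             -> [T.Text]      -- broken in-page links
checkAnchors ids hrefs =
  [ href | href <- hrefs
         , Just frag <- [T.stripPrefix "#" href]  -- only same-page "#..." links
         , not (frag `S.member` ids) ]

main :: IO ()
main = print (checkAnchors (S.fromList ["backlinks", "introduction"])
                           ["#introduction", "#missing-section", "/scaling"])
-- ["#missing-section"]
```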
9. Deployment
Input: Validated _site/
Output: Live site at gwern.net
Where: sync.sh lines 1249-1283
Three rsync passes deploy the site:
- Infrastructure (static/): Force checksum sync for CSS/JS
- Pages: Checksum sync of HTML
- Assets: Size-only sync for large files
Then Cloudflare cache purge for recently-changed URLs.
Why multiple passes? Different content types have different invalidation needs. Static assets change rarely but must sync perfectly. Pages change often but only need checksum comparison.
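A rough sketch of the idea (host, paths, and exact flags here are illustrative; the real invocations live in sync.sh):

```haskell
import System.Process (callCommand)

main :: IO ()
main = do
  -- Infrastructure: CSS/JS must match exactly, so compare by checksum.
  callCommand "rsync -r --checksum _site/static/ server:/var/www/static/"
  -- Pages: HTML changes often; checksum comparison skips unchanged files.
  callCommand "rsync -r --checksum _site/ server:/var/www/"
  -- Assets: large binary files; size-only comparison avoids expensive checksums.
  callCommand "rsync -r --size-only _site/doc/ server:/var/www/doc/"
```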
10. Browser Runtime
Input: Static HTML + JS
Output: Interactive reading experience
Where: initial.js, rewrite.js, popups.js
When the page loads, JavaScript enhances it further:
The Event System
DOMContentLoaded
↓
GW.contentDidLoad (transclude phase)
↓
Transclude handlers resolve include-links
↓
GW.contentDidLoad (rewrite phase)
↓
Rewrite handlers transform DOM
↓
GW.contentDidInject (eventListeners phase)
↓
Event binding (hover, click handlers)
Runtime Transforms
| Handler | What | Why |
|---|---|---|
| wrapImages | Wrap <img> in <figure> | Consistent figure structure |
| makeTablesSortable | Enable column sorting | Interactive data exploration |
| hyphenate | Add soft hyphens | Better text flow |
| Link icon management | Enable/disable icons | Context-dependent display |
Popup System
When you hover an annotated link:
- extracts.js detects the link type (annotation, local page, etc.)
- After 750ms delay, popups.js spawns a popup frame
- content.js fetches annotation HTML from /metadata/annotation/
- transclude.js resolves any include-links in the annotation
- rewrite.js handlers transform the popup content
- Popup displays with smart positioning
Why client-side enhancement? Some transforms depend on viewport size, user preferences, or dynamic state that can't be known at build time.
Deeper dive: extracts.js, popups.js, transclude.js
A Link's Complete Journey
Let's trace what happens when you write this in Markdown:
Kaplan et al 2020 showed that loss scales as a power law.
2. Pre-processing
et al. → et al (if period present)
5b. Auto-linking
Pattern Kaplan et al 2020 matches database → wrapped in link:
[Kaplan et al 2020](https://arxiv.org/abs/2001.08361) showed...
5c. Annotation Creation
arXiv URL checked against GTX database → annotation exists (or created via API)
5d. Annotation Marking
Link gets class:
<a href="https://arxiv.org/abs/2001.08361" class="link-annotated">Kaplan et al 2020</a>
5h. Link Icons
arXiv domain → icon class:
<a href="..." class="link-annotated link-icon-arxiv">Kaplan et al 2020</a>
5i. Typography
Citation text wrapped in semantic spans:
<a href="..." class="link-annotated link-icon-arxiv">
<span class="cite">
<span class="cite-author">Kaplan et al</span>
<span class="cite-date">2020</span>
</span>
</a>
6. Render to HTML
Wrapped in page template with all other content.
10. Browser Runtime
On hover after 750ms:
- Popup spawns
- Annotation fetched: title, authors, abstract, tags
- Displayed in positioned popup with typography transforms applied
Final reader experience: Hover "Kaplan et al 2020" → see paper title, authors, publication date, abstract, and tags—all from a plain text citation in the source.
Why This Complexity?
Every phase serves gwern.net's core goal: optimize for readers, not writers.
| Complexity | Reader Benefit |
|---|---|
| 14 AST transforms | Rich, consistent formatting without manual markup |
| Annotation system | Instant previews for 20,000+ links |
| Link archiving | Content survives link rot |
| Pre-rendered MathJax | No rendering flicker |
| Inflation adjustment | Historical context without manual calculation |
| Backlinks | Discover related content |
| Validation suite | No broken links or malformed pages |
The build takes 30-60 minutes. But readers spend far more cumulative time on the site. Investing authoring complexity for reading quality is the right tradeoff for long-form content meant to be read for decades.
See Also
- Component Taxonomy — Complete file listing by function
- sync.sh — Master build orchestrator
- hakyll.hs — Pandoc pipeline and template system
- LinkMetadata.hs — Annotation database manager
- Typography.hs — Text transform pipeline
- rewrite.js — Browser DOM transforms
- popups.js — Popup windowing system