linkArchive.sh
Path: build/linkArchive.sh | Language: Bash | Lines: ~190
Archives web pages and PDFs locally using SingleFile snapshots for preemptive link preservation.
Overview
This script implements gwern.net's preemptive local archiving system, which creates permanent local copies of linked web pages to prevent link rot. It uses SingleFile (via Docker) to create static, self-contained HTML snapshots of web pages, and downloads PDFs directly. All archived content is stored in a deterministic directory structure: doc/www/$DOMAIN/$SHA1.{html,pdf}.
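As an illustration (not the script's literal code), the deterministic archive path can be derived in a few lines of shell; per the variable descriptions below, the hash is computed on the URL with any fragment stripped:
# Sketch: derive the deterministic archive path for a URL (illustrative only).
URL="https://example.com/article?id=1#intro"
URL_NO_ANCHOR="${URL%%#*}"                                 # drop the fragment before hashing
DOMAIN=$(echo "$URL_NO_ANCHOR" | awk -F/ '{print $3}')     # crude domain extraction
HASH=$(echo -n "$URL_NO_ANCHOR" | sha1sum | cut -d' ' -f1)
echo "doc/www/$DOMAIN/$HASH.html"                          # .pdf instead when the target is a PDF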
The script supports multiple modes (normal archiving, existence checking, dry-run path calculation, no-preview), detects content types (HTML vs. PDF), handles special cases such as Substack sites that break with JavaScript and anchor fragments, and can preview the archived page in a browser for quality verification.
For large (>5MB) HTML files, it can optionally decompose the monolithic SingleFile snapshot back into a normal HTML+assets structure for more efficient loading.
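A rough sketch of that size check, assuming deconstruct_singlefile.php lives in build/ and takes the snapshot path as its sole argument (the real invocation may differ):
# Sketch: optionally split an oversized SingleFile snapshot (assumed helper interface).
SNAPSHOT="doc/www/example.com/a1b2c3d4.html"
SIZE=$(stat -c%s "$SNAPSHOT")                    # file size in bytes (GNU stat)
if [ "$SIZE" -gt $((5 * 1024 * 1024)) ]; then
    php ./build/deconstruct_singlefile.php "$SNAPSHOT"   # hypothetical argument form
fi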
Key Commands/Variables
- URL: The URL to archive
- CHECK: Mode flag - only check if archive exists, never create
- NO_PREVIEW: Mode flag - archive but don't open browser preview
- DRY_RUN: Mode flag - return hypothetical file path without any I/O
- HASH: SHA1 hash of URL (minus anchor) for deterministic filenames
- DOMAIN: Extracted domain for directory organization
- ANCHOR: Fragment identifier (#...) from URL, preserved in output path
- MIME_REMOTE: Content-Type from HTTP headers
- USER_AGENT: Firefox user agent string
- REMOVE_SCRIPTS: Boolean flag for sites that break with JavaScript (e.g., Substack)
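The mode flags map onto the boolean variables above roughly like this (a simplified sketch, not the script's literal argument parsing):
# Sketch: translate the CLI flag into the mode variables (simplified).
CHECK=false; NO_PREVIEW=false; DRY_RUN=false
case "$1" in
    --check)      CHECK=true;      shift ;;
    --no-preview) NO_PREVIEW=true; shift ;;
    --dry-run)    DRY_RUN=true;    shift ;;
esac
URL="$1"
[ -z "$URL" ] && exit 99            # no URL argument provided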
Usage
./linkArchive.sh [--check|--no-preview|--dry-run] <URL>
Modes:
- Normal mode (default): Archive URL and preview in browser
  $ linkArchive.sh 'http://example.com/article'
  doc/www/example.com/a1b2c3d4e5f6...html
  # Opens browser with original vs archived comparison
- --check: Only check if archive exists, never create
  $ linkArchive.sh --check 'http://example.com/article'
  doc/www/example.com/a1b2c3d4e5f6...html
  # Or empty string if not found
- --no-preview: Archive without browser preview
  $ linkArchive.sh --no-preview 'http://example.com/article'
  doc/www/example.com/a1b2c3d4e5f6...html
- --dry-run: Calculate hypothetical path without I/O
  $ linkArchive.sh --dry-run "https://x.com/user/status/123"
  doc/www/x.com/8e93ed7854e0d7d8323cdbc667f946fee6f98d3d.html
  # Note: returns .html when no archive exists (or .* = the existing archive's extension)
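The extension note above amounts to a glob test: if an archive already exists under any extension, that extension is reported, otherwise .html is assumed (a sketch, not the script's exact logic; the only filesystem access is the local existence check):
# Sketch: dry-run extension selection (illustrative).
BASE="doc/www/$DOMAIN/$HASH"        # DOMAIN and HASH as in Key Commands/Variables above
EXISTING=$(ls "$BASE".* 2>/dev/null | head -n1)
if [ -n "$EXISTING" ]; then
    echo "$EXISTING"                # existing archive, whatever its extension
else
    echo "$BASE.html"               # hypothetical path; nothing is downloaded or written
fi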
Examples:
# Archive with preview
$ linkArchive.sh 'http://www.jacurutu.com/viewtopic.php?p=101694'
/home/gwern/wiki/doc/www/www.jacurutu.com/718b0de585ef3dcd778a196fb2b8c842b42c7bc2.html
# Anchor fragments are preserved in output but not in filename
$ linkArchive.sh 'http://example.com/page.html#section'
doc/www/example.com/a1b2c3d4.html#section
Archive directory structure:
doc/www/
├── example.com/
│ ├── a1b2c3d4e5f6...html # SingleFile snapshot
│ └── 9f8e7d6c5b4a...pdf # Downloaded PDF
├── arxiv.org/
│ └── b2c3d4e5f6a1...pdf
└── twitter.com/
└── c3d4e5f6a1b2...html
Special handling:
- PDFs: Detected by MIME type or URL pattern, downloaded with wget, optimized with ocrmypdf and JBIG2
- Substack sites: JavaScript stripped to prevent error-page refreshes
- Export.arxiv.org: URLs rewritten to use the export mirror while preserving directory structure
- Nitter domains: Full request used instead of --head due to lying servers
- Large files (>5MB): Optionally deconstructed with deconstruct_singlefile.php
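For instance, the PDF-vs-HTML decision and the Substack special case could be sketched as follows (simplified; the real script has more URL patterns and fallbacks, and avoids --head for Nitter domains):
# Sketch: content-type detection and Substack handling (simplified).
# URL, USER_AGENT, DOMAIN, HASH as in Key Commands/Variables above.
MIME_REMOTE=$(curl --silent --head --location --user-agent "$USER_AGENT" "$URL" \
              | grep -i '^content-type:' | tail -n1)
case "$URL" in
    *substack.com*) REMOVE_SCRIPTS=true ;;    # JS on Substack refreshes into an error page
    *)              REMOVE_SCRIPTS=false ;;
esac
if echo "$MIME_REMOTE" | grep -qi 'application/pdf' || [[ "$URL" == *.pdf ]]; then
    wget --quiet --user-agent="$USER_AGENT" "$URL" -O "doc/www/$DOMAIN/$HASH.pdf"
else
    : # hand the URL to the SingleFile Docker container for an HTML snapshot
fi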
Exit codes:
- 1: curl request failed
- 2: 404 or 403 HTTP status
- 3: curl HEAD request failed
- 4: PDF download/validation failed
- 5: HTML snapshot contains error page markers
- 98: Wrong number of arguments
- 99: No URL argument provided
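A caller can branch on these codes, for example (illustrative only):
# Sketch: consuming the exit codes from a calling script.
ARCHIVE=$(./build/linkArchive.sh --no-preview "$URL"); STATUS=$?
case "$STATUS" in
    0) echo "archived at: $ARCHIVE" ;;
    2) echo "target returned 404/403; skipping" ;;
    5) echo "snapshot looks like an error page; retry later" ;;
    *) echo "archiving failed (exit $STATUS)" >&2 ;;
esac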
Dependencies:
- docker with SingleFile image
- chromium browser (for SingleFile)
- curl, wget: HTTP clients
- sha1sum: Hash calculation
- timeout: Command timeouts
- file: MIME type detection
- ocrmypdf, pdftk: PDF processing (optional)
- php with deconstruct_singlefile.php: Large file decomposition (optional)
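A quick pre-flight check for the hard dependencies might look like this (optional tools omitted; the chromium binary name varies by distribution):
# Sketch: verify required tools are on PATH before archiving.
for tool in docker chromium curl wget sha1sum timeout file; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing dependency: $tool" >&2; exit 1; }
done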
See Also
- LinkArchive.hs - Haskell integration and database management
- Config.LinkArchive - URL whitelists and transformation rules
- deconstruct_singlefile.php - Splits large archives into HTML + assets
- sync.sh - Build process coordination
- hakyll.hs - Build system that triggers archive operations
- gwsed.sh - URL rewriting that may trigger archiving