bash.sh
Path: build/bash.sh | Language: Bash | Lines: ~1,250
Shell helper function library for gwern.net wiki operations
Overview
bash.sh is a comprehensive shell helper library intended to be sourced in ~/.bashrc. It provides utility functions for the day-to-day maintenance and development of gwern.net, spanning file management, PDF manipulation, caching, search, and content migration.
The library takes a "batteries included" approach: instead of requiring separate tools, it bundles what common wiki operations need. Functions normalize paths between URLs and local files, wrap common operations with sensible defaults, and provide specialized workflows such as HTTP→HTTPS migration and annotation tagging.
The design emphasizes interactive use from the command line, with tab-completion support for common arguments (directory paths, tags), colored output for warnings/errors, and integration with external tools like exiftool, pdftk, and parallel.
Public API
Path Conversion
path2File(args...) -> string
Converts URLs or wiki paths to absolute file paths under ~/wiki/. Handles multiple input formats: relative paths, ~/ prefixes, https://gwern.net/ URLs, and /doc/ paths.
Called by: Most other functions in the library
Calls: sed, echo
file2Path(path) -> string
Converts a local file path to a website path (inverse of path2File).
Called by: Internal utilities
Calls: sed, echo
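As an illustration of the file-to-path direction, here is a minimal Bash sketch. It is not the library's code: the name file2Path_sketch and the WIKI_ROOT variable are hypothetical, and the only rule it assumes is that the wiki root maps to the site root.

```shell
# Hypothetical wiki root for this sketch; the real library uses its own location.
WIKI_ROOT="${WIKI_ROOT:-$HOME/wiki}"

# Map a local file path to a site path such as /doc/ai/x.pdf.
file2Path_sketch() {
    printf '%s\n' "$1" | sed \
        -e "s|^$WIKI_ROOT||" \
        -e 's|^~/wiki||' \
        -e 's|^/*|/|'    # ensure exactly one leading slash
}
```

With WIKI_ROOT set to /home/gwern/wiki, `file2Path_sketch /home/gwern/wiki/doc/ai/x.pdf` prints `/doc/ai/x.pdf`.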
Cache Management
cloudflare-expire(files...)
Purges specified files from Cloudflare's CDN cache. Converts file paths to URLs and calls the Cloudflare API.
Called by: Manual invocation after content updates
Calls: path2File, curl, jq
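A purge request of this kind could look roughly like the Bash sketch below. The function names cf_purge_payload and cloudflare_purge_sketch are hypothetical, and the sketch assumes its arguments are already site paths such as /doc/x.pdf. The endpoint and Bearer header follow Cloudflare's public purge-by-URL API; the two environment variables are the ones listed under Configuration.

```shell
# Build the JSON request body {"files":[...]} from the URLs given as arguments.
# Needs jq 1.6 or later for --args / $ARGS.positional.
cf_purge_payload() {
    jq -cn '{files: $ARGS.positional}' --args "$@"
}

# Purge the given site paths from the CDN cache (assumed input: /doc/x.pdf style paths).
cloudflare_purge_sketch() {
    local path urls=()
    for path in "$@"; do
        urls+=("https://gwern.net/${path#/}")
    done
    curl --silent --show-error --request POST \
        "https://api.cloudflare.com/client/v4/zones/${CLOUDFLARE_TARGET}/purge_cache" \
        --header "Authorization: Bearer ${CLOUDFLARE_CACHE_TOKEN}" \
        --header "Content-Type: application/json" \
        --data "$(cf_purge_payload "${urls[@]}")" |
        jq '.success'    # the API reports true or false in this field
}
```

Building the body with jq rather than string concatenation keeps URLs containing quotes or backslashes correctly escaped.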
cloudflare-expire-all()
Expires all Markdown and static files from the CDN cache. Used for major site updates.
Called by: Manual invocation
Calls: find, parallel, cloudflare-expire
PDF Operations
pdf(files...)
Opens PDFs in the evince viewer after normalizing their paths.
pdf-cut(files...)
Removes the first page of a PDF (spam removal from JSTOR, etc.). Preserves metadata via exiftool.
Alias: pdfcut
Calls: pdftk, exiftool, crossref
pdf-cut-last(file)
Removes the last page of a PDF. Similar to pdf-cut but for trailing pages.
Alias: pdfcut-last
Calls: pdftk, exiftool, crossref
pdf-cut-append(file)
Moves the first page to the end instead of deleting it. Useful for preserving cover pages while changing display order.
Alias: pdfcut-append
Calls: pdftk, exiftool, crossref
pdf-append(target_pdf, input_files...)
Concatenates multiple files (PDFs, documents, spreadsheets, images) into a single PDF. Converts non-PDF inputs automatically. Preserves metadata from the first file.
Calls: pdftk, exiftool, libreoffice, doc2pdf, img2pdf, convert, crossref
doc2pdf(input, [output])
Converts Word/ODT documents to PDF via LibreOffice headless mode.
Calls: soffice
e(file, exiftool_args...)
PDF/image metadata editor wrapper. Strips encryption, normalizes PDFs through pdftk, and cleans up title formatting.
Calls: exiftool, pdftk
Image Operations
crop(files...)
Trims whitespace from around JPG/PNG images using ImageMagick.
Calls: convert, identify, parallel
crop_one(file)
Single-file version of crop with size warning for large images.
pad(files...)
Adds 30px white border around images (for images cropped too tightly).
Alias: pad-white
Calls: mogrify
pad-black(files...)
Adds 30px dark gray border (#161616) matching the site's dark mode background.
Calls: mogrify
crop-pad(files...) / crop-pad-black(files...)
Combined crop and pad operations.
png2JPGQualityCheck(files...)
Checks if PNGs can be converted to JPG with minimal quality loss. Uses PSNR metric and size reduction threshold.
Calls: convert, compare, stat, bc
Search Functions
gw(query)
Primary site-wide search. Searches Markdown files, Haskell sources, JavaScript, CSS, and configs. Falls back to IRC logs if no results.
Called by: Manual invocation
Calls: find, grep, gwl
gwf(query)
Searches file names only (not content).
gwn(query)
Searches newsletter files only.
gwa(query) / gwal(query)
Searches annotations. gwal pipes through less with highlighting.
Defined in: ~/wiki/static/build/gwa
gwl(query)
Searches #lesswrong IRC logs.
Content Migration
gwsed(old, new) (referenced, defined elsewhere)
Site-wide search and replace.
gwhttp(domain)
HTTP→HTTPS migration helper. Extracts domain and calls gwsed.
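The domain extraction can be sketched in Bash as follows. This is a guess at the shape of the helper, not its actual code: gwhttp_sketch is a hypothetical name, and because gwsed is defined elsewhere, the block supplies a print-only stand-in when no gwsed is available.

```shell
# Print-only stand-in, used only if no real gwsed is already defined.
command -v gwsed >/dev/null 2>&1 || gwsed() { printf 'gwsed %s %s\n' "$1" "$2"; }

gwhttp_sketch() {
    local domain
    # Drop any scheme, then everything from the first slash onward.
    domain=$(printf '%s\n' "$1" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')
    gwsed "http://$domain" "https://$domain"
}
```

With the stand-in, `gwhttp_sketch http://www.example.com/page.html` prints `gwsed http://www.example.com https://www.example.com`, which shows the pair of strings a site-wide replace would receive.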
gwmv(old, new)
Moves files with full site integration:
- Git-tracks the move
- Syncs to server via rsync
- Updates all references via gwsed
- Updates nginx redirects
- Handles PNG→JPG conversion with history preservation
Called by: gwmvdir, manual invocation
Calls: git mv, rsync, gwsed, stringReplace, crossref
gwmvdir(old, new)
Moves entire directories by iterating gwmv over contents.
Calls: gwmv
gwtag(url, tags...)
Adds or modifies annotation tags via the changeTag Haskell tool.
Aliases: gwt, t
Calls: changeTag
Link Archiving
mvuri(encoded_path)
Moves a browser-saved HTML snapshot to the correct archive location. Handles:
- file:// URI decoding
- Direct URL insertion with database update
- Waiting for downloads to complete
- Large file optimization via deconstruct_singlefile.php
Calls: linkArchive.sh, ghci, inotifywait, is_downloading
is_downloading(file, [min_size_kb])
Waits for a file to become quiescent (finished downloading). Checks modification time and minimum size.
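One way to express "quiescent" is a size floor plus a quiet period since the last modification. The Bash sketch below polls for that condition. The names is_quiescent and wait_for_download, the quiet period, and the timeout are all assumptions, and the real function may use a different mechanism. It relies on GNU stat.

```shell
# True when FILE exists, is at least MIN_KB kilobytes, and has not been
# modified for QUIET_SECS seconds. Uses GNU stat (-c).
is_quiescent() {
    local file=$1 min_kb=${2:-0} quiet_secs=${3:-3}
    local size mtime now
    [ -f "$file" ] || return 1
    size=$(stat -c %s "$file")     # size in bytes
    mtime=$(stat -c %Y "$file")    # last modification, seconds since epoch
    now=$(date +%s)
    [ "$size" -ge $((min_kb * 1024)) ] && [ $((now - mtime)) -ge "$quiet_secs" ]
}

# Poll once per second until the file settles or TIMEOUT seconds pass.
wait_for_download() {
    local file=$1 min_kb=${2:-0} timeout=${3:-60} waited=0
    until is_quiescent "$file" "$min_kb"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}
```

The size floor guards against treating a browser's empty placeholder file as a finished download.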
Testing
lorem_update()
Updates gold test snapshots from live site.
Calls: lorem_download, get_lorem_pages
run_gold_test()
Compares current site against stored snapshots for regression detection.
Calls: compare_page, get_lorem_pages
Utility Functions
bold(text) / red(text) / green(text) / yellow(text)
ANSI color output helpers.
wrap(command, warning)
Executes command and displays red-highlighted warning if output is non-empty.
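The "warn only when there is output" behavior can be sketched in Bash as below. The names wrap_sketch and red_sketch are hypothetical, and the sketch assumes the first argument is a command or function name that takes no arguments; the real helper's output format may differ.

```shell
# Minimal stand-in for the library's red(): print the text in ANSI red.
red_sketch() { printf '\033[31m%s\033[0m\n' "$*"; }

# Run the command named by $1; if it prints anything (stdout or stderr),
# show the warning $2 in red followed by the captured output.
wrap_sketch() {
    local output
    output=$("$1" 2>&1)
    if [ -n "$output" ]; then
        red_sketch "$2"
        printf '%s\n' "$output"
    fi
}
```

A typical use is to define a small check function whose silence means success, for example `check() { grep -rl 'http://' ./doc; }` followed by `wrap_sketch check "Insecure links found"`.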
ge / gev / gf / gfv / gec / gfc
Grep wrappers: extended regex, fixed strings, inverted match, colored variants.
everyNDays(n, [offset])
Returns true every N days based on day-of-year modulus. Used for scheduling periodic tasks without randomness.
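The modulus test is small enough to sketch directly in Bash. The names day_matches and everyNDays_sketch are hypothetical; splitting out the pure arithmetic is a choice made here so it can be checked without depending on today's date.

```shell
# Pure helper: true when (day + offset) is a multiple of n.
day_matches() {
    local day=$1 n=$2 offset=${3:-0}
    [ $(( (day + offset) % n )) -eq 0 ]
}

everyNDays_sketch() {
    # %-j is GNU date's day-of-year without zero padding; the padded form
    # ("008") would be rejected as invalid octal in shell arithmetic.
    day_matches "$(date +%-j)" "$1" "${2:-0}"
}
```

For example, `everyNDays_sketch 7 && run_weekly_task` fires on days 7, 14, 21, and so on, while an offset of 1 shifts a second task to days 6, 13, 20, so the two never run on the same day. Because the day-of-year counter resets on January 1, the interval around the new year can be shorter than N.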
sort_by_lastmodified()
Sorts directories (from stdin) by most recently modified git-tracked file. Useful for prioritizing build operations.
length()
Sorts lines by character length.
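Sorting by length is commonly done with a decorate, sort, undecorate pipeline, sketched here in Bash. The name length_sketch is hypothetical and the real function may be written differently.

```shell
# Read lines on stdin and print them shortest first.
length_sketch() {
    awk '{ print length($0) "\t" $0 }' |   # prefix each line with its length
        sort -n -s -k1,1 |                 # numeric, stable sort on that prefix
        cut -f2-                           # strip the prefix again
}
```

The stable sort keeps lines of equal length in their original order.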
Internal Architecture
Path Normalization Pipeline
File-handling functions route their arguments through path2File so that every input format resolves to the same absolute path:
Input (any format)
↓
path2File()
↓ strips wiki/ prefix
↓ expands ~/
↓ removes https://gwern.net
↓ removes anchors (#...)
↓ cleans double-slashes
↓
/home/gwern/wiki/path/to/file
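The stages in the diagram above can be sketched in Bash as follows. This is an illustration, not the library's implementation: path2File_sketch and WIKI_ROOT are hypothetical names, and the sketch uses parameter expansion for most steps where the library is documented as calling sed.

```shell
# Hypothetical wiki root for this sketch.
WIKI_ROOT="${WIKI_ROOT:-$HOME/wiki}"

# Print one absolute file path per argument.
path2File_sketch() {
    local arg p
    for arg in "$@"; do
        p=${arg#https://gwern.net}   # remove the site origin
        p=${p%%'#'*}                 # remove any anchor
        p=${p#"$WIKI_ROOT"}          # already absolute: keep only the part below the root
        p=${p#'~/'}                  # a literal, unexpanded ~/
        p=${p#wiki/}                 # strip the wiki/ prefix
        p=${p#/}
        printf '%s\n' "$WIKI_ROOT/$p" | sed 's|//*|/|g'   # clean double slashes
    done
}
```

With WIKI_ROOT set to /home/gwern/wiki, the inputs `https://gwern.net/doc/ai/x.pdf#page=3`, `~/wiki/doc/ai/x.pdf`, and `/doc/ai/x.pdf` all resolve to `/home/gwern/wiki/doc/ai/x.pdf`.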
PDF Processing Pipeline
PDF operations follow a consistent pattern:
- Normalize path via path2File
- Create temp file
- Process with pdftk
- Preserve metadata with exiftool -TagsFromFile
- Move temp to original
- Trigger background crossref update
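For the first-page removal case, the pattern above might look like the Bash sketch below. The name pdf_cut_sketch is hypothetical, path normalization is omitted, and the exact flags the library passes are not known here; the pdftk and exiftool invocations shown are standard usage of those tools. It relies on GNU mktemp for --suffix.

```shell
pdf_cut_sketch() {
    local original=$1 tmp
    tmp=$(mktemp --suffix=.pdf) || return 1
    # Keep page 2 through the end, then copy the original's metadata
    # onto the trimmed file, then replace the original.
    pdftk "$original" cat 2-end output "$tmp" &&
        exiftool -overwrite_original -TagsFromFile "$original" "$tmp" &&
        mv "$tmp" "$original" || { rm -f "$tmp"; return 1; }
    # crossref is the site's own tool; run it detached if it is available.
    if command -v crossref >/dev/null 2>&1; then
        (crossref "$original" &)
    fi
    return 0
}
```

Writing to a temp file and moving it into place only after both tools succeed means a failed run leaves the original PDF untouched.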
File Move Integration
gwmv coordinates multiple subsystems:
gwmv old new
├─ git mv (version control)
├─ rsync (server sync, background)
├─ gwsed (reference updates)
├─ stringReplace (nginx config)
├─ echo >> nginx.conf (add redirect)
└─ rm annotations (cleanup)
Key Patterns
Parallel Processing
Functions use GNU parallel for batch operations:
find ... | parallel cloudflare-expire
ls $(path2File "$@") | parallel crop_one
Background Jobs
Long-running operations spawn background processes:
(crossref "$ORIGINAL" &)
rsync ... gwern@server:path &
Defensive Path Handling
Destructive operations check their targets before acting; for example, a move refuses to overwrite an existing file:
if [[ -a ~/wiki$NEW ]]; then
red "Target exists! Will not overwrite."
return 5
fi
Tab Completion
Dynamic completion based on directory structure:
GWERNNET_DIRS_FULL="$(find wiki/doc/ -type d | ...)"
complete -W "$GWERNNET_DIRS_FULL ..." gwtag upload
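A self-contained Bash sketch of directory-driven completion is shown below. The helper names list_tags and register_tag_completion are hypothetical, as is the rule that a tag is a directory path relative to the given root; the elided pipeline in the excerpt above is not reproduced here.

```shell
# Print every directory below ROOT as a relative path, one per line.
list_tags() {
    (cd "$1" && find . -mindepth 1 -type d | sed 's|^\./||' | sort)
}

# Register those paths as word completions for the given command names.
register_tag_completion() {
    local root=$1
    shift
    complete -W "$(list_tags "$root")" "$@"
}
```

Calling `register_tag_completion ~/wiki/doc gwtag` would then offer entries such as `ai/scaling` when completing arguments to gwtag. The word list is computed once, at registration time, so directories created later appear only after the file is sourced again.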
Configuration
Environment Variables
| Variable | Purpose |
|---|---|
| N | Default Haskell parallelism (default: 14) |
| CLOUDFLARE_TARGET | Zone ID for cache purging |
| CLOUDFLARE_CACHE_TOKEN | API auth token |
Shell Options
set -e # Exit on error (global)
shopt -s nocasematch # Case-insensitive matching (local to pdf-append)
External Dependencies
- pdftk - PDF manipulation
- exiftool - Metadata editing
- parallel - Parallel execution
- libreoffice - Document conversion
- img2pdf / convert - Image to PDF
- inotifywait - File system watching
- curl / jq - API calls
Integration Points
With sync.sh
Functions like everyNDays, sort_by_lastmodified, wrap, and color helpers are used by the build orchestrator.
With Haskell Tools
- changeTag - annotation management
- crossref - PDF metadata enrichment
- linkArchive.sh - archive database updates
- ghci LinkArchive.hs - direct database insertion
With External Services
- Cloudflare API for cache management
- InvertOrNot.com API for dark-mode image reporting
Shared State
- ~/wiki/ - wiki content directory
- ~/wiki/metadata/annotation/ - annotation files
- ~/wiki/static/redirect/*.conf - nginx redirect configs
- ~/.bashrc - sources this file
See Also
- sync.sh - Build orchestrator that sources this library
- hakyll.hs - Core site generator invoked by sync.sh
- LinkArchive.hs - Archive URL database used by mvuri
- changeTag.hs - Tag modification tool invoked by gwtag
- GTX.hs - GTX file format used by annotation functions