preprocess-annotation.sh

Path: build/preprocess-annotation.sh | Language: Bash | Lines: ~7

Simple stdin preprocessor for annotation editing workflow.


Overview

preprocess-annotation.sh is a minimal wrapper script that reads raw annotation content from stdin and pipes it through preprocess-markdown to prepare it for manual editing in Emacs. The preprocessed output is then stored in annotation database files (full.gtx or half.gtx).

This script is part of the manual annotation editing workflow on gwern.net. When adding or updating link annotations, the raw scraped or draft content is passed through this script to normalize formatting before being opened in an editor buffer. Historically, this also included Pandoc HTML conversion and HTML Tidy formatting (now commented out), but the current version only performs Markdown preprocessing.

The simplicity of this script reflects the design philosophy that most annotation processing should happen in dedicated tools (preprocess-markdown) rather than being scattered across multiple scripts.

Key Commands/Variables

Main pipeline:

cat - | preprocess-markdown

Components:

  • cat - - Read from stdin (explicit pipe-through idiom)
  • preprocess-markdown (from $PATH) - Core Markdown normalizer
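
Because a command inherits the stdin of the script that runs it, the explicit cat - stage is a readability choice rather than a requirement. The sketch below illustrates this with a hypothetical stand_in_filter (an uppercasing tr) in place of preprocess-markdown, which is not assumed to be installed; both wrapper names are likewise illustrative.

```shell
#!/bin/bash
# Hypothetical stand-in for preprocess-markdown: uppercases its input.
stand_in_filter() { tr '[:lower:]' '[:upper:]'; }

# Form used by the script: forward stdin explicitly through cat.
explicit_wrapper() { cat - | stand_in_filter; }

# Equivalent form: the filter inherits the wrapper's stdin directly.
implicit_wrapper() { stand_in_filter; }

explicit_out=$(printf 'raw *text*\n' | explicit_wrapper)
implicit_out=$(printf 'raw *text*\n' | implicit_wrapper)

echo "explicit: $explicit_out"
echo "implicit: $implicit_out"
```

Both forms produce the same output; the explicit version mainly signals to a reader that the script is a stdin-to-stdout filter.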

Commented-out legacy pipeline:

# pandoc --mathjax --metadata title='Annotation preview' --to=html5 --from=html
# tidy -quiet --show-warnings no --show-body-only auto -indent -wrap 0 \
# --clean yes --merge-divs no --break-before-br yes --logical-emphasis yes \
# --quote-nbsp no || true
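  • pandoc - Would re-render HTML as HTML5 with MathJax support (the flags are --from=html --to=html5, so the input is HTML, not Markdown)
  • tidy - Would format/clean HTML with specific options
  • || true - Suppress tidy's nonzero exit status (tidy exits 1 when it emits warnings and 2 on errors, 0 only when the input is clean)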

Purpose of legacy code: These were likely used when annotations were stored/edited as HTML rather than Markdown. The pipeline would render a preview of how the annotation would appear. Now annotations are kept in Markdown form, so HTML conversion is deferred to build time.
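
The effect of the || true guard can be sketched without tidy itself. The example below uses a hypothetical noisy_formatter that passes its input through and then exits 1, mimicking a formatter that reports warnings through its exit status. Whether the real script runs under set -e is not stated in the source; the guard matters most in that case, since an unguarded nonzero status would abort the script.

```shell
#!/bin/bash
# Hypothetical stand-in for a formatter such as tidy: passes input through,
# then exits 1 as if it had printed warnings.
noisy_formatter() {
    cat -
    return 1
}

# Unguarded: the formatter's status becomes the pipeline's status.
printf '<p>x</p>\n' | noisy_formatter > /dev/null
unguarded=$?

# Guarded: `|| true` applies to the whole pipeline and forces status 0.
printf '<p>x</p>\n' | noisy_formatter > /dev/null || true
guarded=$?

echo "unguarded=$unguarded guarded=$guarded"   # prints "unguarded=1 guarded=0"
```

The trade-off is that || true also hides genuine failures (tidy's exit status 2), so it only suits a preview step where the output is inspected by hand.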

Usage

Standard invocation (from stdin):

echo "Some *raw* annotation text" | ./preprocess-annotation.sh > processed.txt

In Emacs workflow:

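;; Hypothetical Emacs function; `fetch-annotation-draft' is an assumed helper
;; that returns the raw draft for URL as a string.
(defun edit-annotation (url)
  (let ((raw-content (fetch-annotation-draft url))
        (output (get-buffer-create "*annotation*")))
    ;; Put the draft in a scratch buffer so the region passed to the
    ;; script is the draft itself, not whatever buffer is current.
    (with-temp-buffer
      (insert raw-content)
      (shell-command-on-region
       (point-min) (point-max)
       "~/wiki/static/build/preprocess-annotation.sh"
       output))
    (pop-to-buffer output)))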

Manual testing:

# Test preprocessing on a sample annotation
cat <<EOF | ./preprocess-annotation.sh
A *neural* network paper about [transformers](https://example.com).

Contains **bold** and _italic_ text.
EOF

What preprocess-markdown likely does:

  • Normalize whitespace and line breaks
  • Standardize Markdown formatting (list indentation, emphasis markers)
  • Convert HTML entities to Unicode
  • Strip or normalize problematic characters
  • Apply site-specific Markdown conventions
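
The list above is a guess, so the following is purely illustrative: a hypothetical normalize_sketch filter that implements three of the guessed steps (stripping trailing whitespace, squeezing runs of blank lines, decoding two HTML entities) with sed and cat -s. It is not the real preprocess-markdown and makes no claim about that tool's actual behavior.

```shell
#!/bin/bash
# Hypothetical stand-in; NOT the real preprocess-markdown.
normalize_sketch() {
    # 1. Strip trailing whitespace from every line.
    # 2. Decode two common HTML entities (&amp; last, to avoid double-decoding).
    # 3. Squeeze consecutive blank lines into one (cat -s).
    sed -e 's/[[:space:]]*$//' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g' |
    cat -s
}

sample_out=$(printf 'a  \n\n\n\nb &amp; c\n' | normalize_sketch)
printf '%s\n' "$sample_out"
```

A real normalizer would need a proper entity table and Markdown-aware handling (for example, leaving code blocks untouched), which line-oriented sed rules cannot provide.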

No arguments:

  • Script takes no command-line arguments
  • All input via stdin, output to stdout (Unix filter pattern)

Exit status:

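  • Returns the exit status of preprocess-markdown: by default a shell pipeline reports the status of its last command, so a failure in an earlier stage is not reflected unless pipefail is enabled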
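
How Bash reports a pipeline's status can be sketched with two hypothetical stand-ins, failing_producer and ok_filter. The source does not indicate that the script enables pipefail, so the pipefail part shows an option, not the script's behavior.

```shell
#!/bin/bash
# Hypothetical stand-ins for the two pipeline stages.
failing_producer() { return 3; }   # first stage fails with status 3
ok_filter() { cat -; }             # last stage succeeds

# Default: the pipeline's status is that of the last command.
failing_producer | ok_filter
default_status=$?

# PIPESTATUS (Bash-specific) records every stage's status.
failing_producer | ok_filter
stage_statuses=("${PIPESTATUS[@]}")

# With pipefail, the rightmost nonzero status wins.
set -o pipefail
failing_producer | ok_filter
pipefail_status=$?
set +o pipefail

echo "default=$default_status stages=${stage_statuses[*]} pipefail=$pipefail_status"
```

A caller that needs to detect a failure in an earlier stage can therefore either inspect PIPESTATUS right after the pipeline or enable pipefail.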

See Also