annotation-dump.hs

Path: build/annotation-dump.hs | Language: Haskell | Lines: ~69

CLI utility for querying and grepping the annotation database

Overview

annotation-dump.hs is a command-line utility for dumping and querying gwern.net's annotation database. It reads the four annotation GTX files (me.gtx, full.gtx, half.gtx, auto.gtx), merges them, and outputs single-line formatted entries suitable for grep-based searching. The tool serves two primary use cases: (1) dumping the entire annotation database for inspection, and (2) looking up specific URLs or finding all annotations that reference those URLs as substrings.

The tool formats each annotation on a single line with color-coded fields (source label, authors, year, URL, tags, title, abstract) separated by semicolons. This format enables pipeline-based workflows where you can extract links from a markdown file, pass them through annotation-dump.hs, and quickly identify which links lack annotations or tags.

A key feature is its handling of "empty" annotations: when queried for a URL with no metadata, it outputs the bare URL with [] to indicate missing tags, enabling workflows that identify untagged links for manual curation.

Public API / CLI Usage

Basic Usage

Dump entire database:

./annotation-dump.hs

Outputs every annotation that passes the blacklist filter, one per line, sorted by path/date.

Query specific URLs:

echo "/doc/ai/2020-smith.pdf" | ./annotation-dump.hs
# Or with multiple URLs:
cat urls.txt | ./annotation-dump.hs

Returns:

Direct matches (the annotation for each queried URL)
Substring matches (any annotation containing the queried URL)
Empty markers (queried URLs with no annotation, shown as URL [])

Output Format

Each line contains semicolon-separated fields:

<label>; <authors_cite>; <year>; <authors_raw>; <colored_url>; <tags>; "<title>"; (<authors_truncated>); <date>; <abstract>; <generated_url>

Where:

label: Source file (m=me.gtx, f=full.gtx, h=half.gtx, a=auto.gtx)
authors_cite: Citation format (e.g., "Smith et al")
year: First 4 chars of date
authors_raw: Raw author string from metadata (c)
colored_url: Green ANSI-colored URL
tags: Array of tag strings
title: Purple ANSI-colored quoted title
authors_truncated: Author list (truncated with … if needed)
date: Full date string from metadata (year is its first 4 characters)
abstract: Single-line abstract (newlines/extra spaces collapsed)
generated_url: Optional canonical URL (green, if differs from path)
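As a rough illustration of how such a line could be assembled, the sketch below joins a subset of the fields with "; " and collapses the abstract's whitespace. The Entry record, formatEntry, and collapse are hypothetical names invented for this example, not the tool's actual definitions; colors and the author fields are left out.

```haskell
import Data.List (intercalate)

-- Hypothetical, simplified entry; the real MetadataItem has more fields.
data Entry = Entry
  { label    :: String    -- "m", "f", "h", or "a"
  , url      :: String
  , tags     :: [String]
  , title    :: String
  , date     :: String    -- e.g. "2020-05-01"
  , abstract :: String
  }

-- Collapse newlines and runs of spaces so the abstract fits on one line.
collapse :: String -> String
collapse = unwords . words

-- Join a subset of the documented fields with "; " (assumed separator).
formatEntry :: Entry -> String
formatEntry e = intercalate "; "
  [ label e
  , take 4 (date e)          -- year: first 4 chars of date
  , url e
  , show (tags e)            -- list syntax, so an entry without tags prints []
  , show (title e)           -- quoted title
  , date e
  , collapse (abstract e)
  ]
```

Rendering the tags with show is an assumption here; it is one simple way to get the [] that the grep patterns in this document look for.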

Integration Patterns

Find untagged links:

./annotation-dump.hs | grep " \[\]"

Check which links in a file need annotations:

./link-extracter.hs doc.md | ./annotation-dump.hs | grep " \[\]"

Search for specific topic:

./annotation-dump.hs | grep -i "machine learning"

Internal Architecture

Data Flow

Load GTX files: Reads four GTX databases using readGTXSlow
Merge databases: Creates incompleteDB (raw union) and finalDB (blacklist-filtered, labeled)
Sort and format: Sorts by path/date, converts to single-line format
Process input:
- If stdin empty → dump all formatted annotations
- If stdin has URLs → lookup each + filter for substring matches

Key Data Structures

type Path = String
type MetadataItem = (Title, CiteKey, Date, ...)
type MetadataList = [(Path, MetadataItem)]
type Metadata = Map Path MetadataItem

-- Labeled metadata includes source file indicator
(Path, (MetadataItem, String))  -- String is "m"/"f"/"h"/"a"

Database Hierarchy

Files merged with precedence (first wins):

me.gtx - Hand-curated definitions (highest priority)
full.gtx - Complete manually-written annotations
half.gtx - Tagged but not fully polished
auto.gtx - Auto-generated cached metadata (lowest priority)
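A minimal sketch of a "first wins" merge, assuming each database is held as a Data.Map keyed by path. The names labelDB and mergeLabeled are hypothetical; the point is only that M.unions is left-biased, so listing me.gtx first gives it priority.

```haskell
import qualified Data.Map.Strict as M

type Path = String

-- Tag every entry of one database with its source label ("m", "f", "h", "a").
labelDB :: String -> M.Map Path a -> M.Map Path (a, String)
labelDB l = M.map (\item -> (item, l))

-- M.unions is left-biased: for duplicate keys, the earliest map in the list wins.
mergeLabeled :: [(String, M.Map Path a)] -> M.Map Path (a, String)
mergeLabeled dbs = M.unions [ labelDB l db | (l, db) <- dbs ]
```

Passing the databases in the order me, full, half, auto would reproduce the precedence listed above.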

Blacklist Filter

The blacklist function excludes:

Empty titles (title=="")
Wikipedia pages (en.wikipedia.org)
Index pages (/doc/.../index)

This prevents noise from auto-generated stub entries.
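The three rules could be expressed as a predicate along these lines. This is a sketch under assumptions: isBlacklisted is a hypothetical name, it sees only the path and title, and the exact matching used for Wikipedia and index pages in the real blacklist function may differ.

```haskell
import Data.List (isInfixOf, isPrefixOf, isSuffixOf)

-- True if an entry should be left out of the full dump.
isBlacklisted :: String -> String -> Bool
isBlacklisted path title =
     null title                                               -- empty title
  || "en.wikipedia.org" `isInfixOf` path                      -- Wikipedia page
  || ("/doc/" `isPrefixOf` path && "/index" `isSuffixOf` path) -- index page
```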

Key Patterns

Dual Database Strategy

The tool maintains two versions of the merged database:

incompleteDB: Raw merge without filtering, used for lookups
finalDB: Blacklist-filtered, used for full dumps

This ensures that direct URL queries return results even for blacklisted items, while full dumps exclude them.
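One way to picture the two databases is a single merged map plus a filtered copy. In this sketch the values are just titles, splitDBs is a hypothetical helper, and blacklisted is a stand-in predicate that only checks for empty titles.

```haskell
import qualified Data.Map.Strict as M

type Path  = String
type Title = String

-- Stand-in predicate: here only empty titles count as blacklisted.
blacklisted :: Path -> Title -> Bool
blacklisted _ t = null t

-- Return (incompleteDB, finalDB): the raw merge and its filtered copy.
splitDBs :: M.Map Path Title -> (M.Map Path Title, M.Map Path Title)
splitDBs merged =
  (merged, M.filterWithKey (\p t -> not (blacklisted p t)) merged)
```

Lookups would go against the first component and full dumps against the second.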

Empty Annotation Marker

When a queried URL has no metadata, the tool outputs:

/path/to/file []

The [] marker is intentionally chosen to match the tags field format, enabling consistent grep patterns for "untagged links" regardless of whether they have auto-titles.
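The fallback can be sketched as a lookup with a default. Here the map's values stand in for already-formatted lines, which is a simplification, and lookupOrEmpty is a hypothetical name.

```haskell
import qualified Data.Map.Strict as M

-- Look up one queried URL; fall back to the "URL []" marker when it is absent.
lookupOrEmpty :: M.Map String String -> String -> String
lookupOrEmpty db u = M.findWithDefault (u ++ " []") u db
```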

ANSI Color Coding

Output uses terminal colors for readability:

Green (\x1b[32m): URLs and generated URLs
Purple (\x1b[35m): Titles
Reset (\x1b[0m): Ends color spans
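Wrapping a field in these escape codes takes one small helper per color. The helper names green and purple are hypothetical, and the tool may place spaces or the escape codes slightly differently.

```haskell
-- Wrap a string in an ANSI color code and reset the color afterwards.
green, purple :: String -> String
green  s = "\x1b[32m" ++ s ++ "\x1b[0m"
purple s = "\x1b[35m" ++ s ++ "\x1b[0m"
```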

Substring Matching

When URLs are provided via stdin, the tool performs two types of lookups:

Direct lookup: M.lookup url db for an exact match, run against incompleteDB so that blacklisted entries are still found
Substring filter: anyInfix stdin finalSingleLine for cross-references

This enables discovering annotations that reference a URL (e.g., finding all annotations that cite a paper).
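The substring filter amounts to keeping every formatted line that contains at least one queried URL. The sketch below uses hypothetical names (mentionsAny, substringMatches) rather than the tool's own anyInfix, whose exact signature is not shown here.

```haskell
import Data.List (isInfixOf)

-- True if any of the queried URLs occurs somewhere inside the line.
mentionsAny :: [String] -> String -> Bool
mentionsAny urls line = any (`isInfixOf` line) urls

-- Keep the formatted lines that mention at least one queried URL.
substringMatches :: [String] -> [String] -> [String]
substringMatches urls = filter (mentionsAny urls)
```

Because the match is a plain substring test over the whole line, a URL that appears inside another entry's abstract counts as a hit.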

Citation Formatting

Uses authorsToCite for smart author formatting:

Single author: "Smith"
Two authors: "Smith & Jones"
Three+ authors: "Smith et al"

Separately, authorsTruncateString shortens long author lists for the authors_truncated field, marking the cut with an ellipsis (…).
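The three citation cases map directly onto pattern matching. This sketch assumes the surnames have already been extracted into a list; citeAuthors is a hypothetical stand-in, and the real authorsToCite may also take the URL and date into account.

```haskell
-- Surnames in, short citation out; year handling is omitted.
citeAuthors :: [String] -> String
citeAuthors []      = ""
citeAuthors [a]     = a
citeAuthors [a, b]  = a ++ " & " ++ b
citeAuthors (a : _) = a ++ " et al"
```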

Integration Points

Dependencies

Modules:

GTX: readGTXSlow for parsing GTX files
LinkID: authorsToCite, generateURL for citation/URL generation
LinkMetadata: sortItemPathDate for sorting
Tags: validateTagsSyntax for tag validation
Metadata.Author: authorsTruncateString for author truncation

Data files:

metadata/me.gtx
metadata/full.gtx
metadata/half.gtx
metadata/auto.gtx

Common Workflows

With link-extracter.hs:

./link-extracter.hs doc.md | ./annotation-dump.hs

With grep for quality control:

./annotation-dump.hs | grep -E '( \[\]|TODO|FIXME)'

Audit specific source file:

./annotation-dump.hs | grep "^m;" # Only entries sourced from me.gtx
