Skip to main content

italicizer.py

Path: build/italicizer.py | Language: Python | Lines: ~1294

LLM-based script to add missing italics formatting to text using semantic understanding of English writing conventions.


Overview

italicizer.py solves a common problem in web publishing: loss of italic formatting when extracting titles and text from PDFs, HTML pages, or APIs. The script uses GPT-4.1 to intelligently add <em> tags to text based on standard English typographic conventions, such as italicizing book titles, movie names, scientific species, foreign words, and ship names.

The tool is designed for automated metadata enhancement in the gwern.net build pipeline. It handles the long tail of edge cases that would be impractical to solve with rule-based systems, leveraging the LLM's knowledge of titles, proper nouns, and stylistic conventions. The script includes extensive few-shot examples (over 1,200 lines) covering books, films, periodicals, scientific names, foreign terms, musical terminology, and ambiguous cases.

When uncertain, the script returns an empty string "" rather than risk incorrect formatting. This conservative approach prevents false positives while still catching most clear-cut cases.

Key Functions

  • Main processing loop: Reads input from stdin or command-line argument, sends to OpenAI API
  • Prompt construction: Massive few-shot prompt with 1,200+ examples demonstrating when to italicize and when not to
  • Response handling: Returns formatted text with <em> tags or empty string if no changes needed

Command Line Usage

Set your OpenAI API key and pipe text to the script:

# From stdin
echo "Moby-Dick" | OPENAI_API_KEY="sk-XXX" python italicizer.py
# Output: <em>Moby-Dick</em>

# As argument
OPENAI_API_KEY="sk-XXX" python italicizer.py "The USS Enterprise"
# Output: The <em>USS Enterprise</em>

# No italics needed
echo "Herman Melville" | OPENAI_API_KEY="sk-XXX" python italicizer.py
# Output: ""

Key rules encoded in the prompt:

  • Italicize: Book/movie/TV titles, periodicals, long poems, ship/plane/spaceship names, unfamiliar foreign words, scientific names, variables, musical terms
  • Don't italicize: Personal/location names, software, short works, loanwords, brands, punctuation outside phrases, acronyms

See Also