Metadata.Title

Path: build/Metadata/Title.hs | Language: Haskell | Lines: ~76

Title extraction and cleanup for web page annotations


Overview

Metadata.Title handles the extraction and cleaning of titles from external web pages. Since <title> tags are notoriously inconsistent—containing error pages, site branding, Unicode garbage, and boilerplate—this module implements a multi-stage filtering pipeline that combines shell script extraction, regular expression cleanup, and LLM-based refinement.

The module serves the annotation system by providing clean, human-readable titles for links. Titles flow through: (1) HTML download and parsing via download-title.sh, (2) separator-based truncation to remove site names, (3) blocklist filtering for known-bad patterns, and (4) optional AI cleanup via title-cleaner.py for edge cases that survive rule-based filtering.

A key design decision is the strict-substring requirement for AI-cleaned titles: any cleaned title must be a strict substring of the rule-cleaned title (title' in the code), not the raw original, which prevents LLM confabulation or rewrites. The module also provides tooltipToMetadata for reverse-parsing citation tooltips back into structured (title, author, date) tuples, and wikipediaURLToTitle for converting Wikipedia URLs into readable titles.


Public API

htmlDownloadAndParseTitleClean :: String -> IO String

Main entry point for fetching and cleaning a URL's title. Downloads HTML, extracts <title>, removes site separators and branding, filters against blocklists, and optionally invokes LLM cleanup.

Called by: LinkMetadata annotation scrapers
Calls: htmlDownloadAndParseTitle, cleanTitleWithAI, cleanAbstractsHTML
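
The overall flow condenses to a short sketch (hedged: ruleBasedClean is a hypothetical stand-in for the separator/replacement/deletion/blocklist stages, and the real function interleaves more error handling):

import Data.List (isInfixOf)

htmlDownloadAndParseTitleCleanSketch :: String -> IO String
htmlDownloadAndParseTitleCleanSketch url = do
  raw <- htmlDownloadAndParseTitle url
  let title' = ruleBasedClean raw    -- rule-based stages; "" = rejected
  if null title'
    then return ""
    else do
      titleCleaned <- cleanTitleWithAI title'
      -- anti-confabulation gate (same predicate as under Key Patterns below):
      return $ if titleCleaned /= title' && titleCleaned `isInfixOf` title'
                 then titleCleaned
                 else title'
  where ruleBasedClean = id -- stand-in only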

htmlDownloadAndParseTitle :: String -> IO String

Low-level title fetch. Shells out to download-title.sh which uses curl and Perl's HTML::TreeBuilder to extract the raw <title> content.

Called by: htmlDownloadAndParseTitleClean
Calls: runShellCommand (to download-title.sh)
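
A hedged sketch of the shell-out (working directory, environment, command, and argument list follow the Data.FileStore.Utils calling convention; the real code's error handling is richer):

import Data.FileStore.Utils (runShellCommand)
import qualified Data.ByteString.Lazy.UTF8 as U (toString)
import System.Exit (ExitCode(ExitSuccess))

htmlDownloadAndParseTitleSketch :: String -> IO String
htmlDownloadAndParseTitleSketch url = do
  -- result tuple: (exit code, stderr, stdout)
  (status, _stderr, out) <- runShellCommand "./" Nothing "build/download-title.sh" [url]
  case status of
    ExitSuccess -> return (U.toString out)
    _           -> return ""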

wikipediaURLToTitle :: String -> String

Converts Wikipedia URLs to human-readable titles. Handles URL decoding, underscore-to-space conversion, and section anchors (converting # to §).

wikipediaURLToTitle "https://en.wikipedia.org/wiki/Foo_Bar#Section"
-- → "Foo Bar § Section"

Called by: Wikipedia annotation handlers
Calls: urlDecode, trimTitle, cleanAbstractsHTML
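
A hedged sketch of the conversion, assuming English-Wikipedia URLs (the real function also applies trimTitle and cleanAbstractsHTML to the result):

import Network.HTTP (urlDecode)

wikipediaURLToTitleSketch :: String -> String
wikipediaURLToTitleSketch u =
  let page   = drop (length "https://en.wikipedia.org/wiki/") u
      spaced = map (\c -> if c == '_' then ' ' else c) (urlDecode page)
  in concatMap (\c -> if c == '#' then " § " else [c]) spaced
-- wikipediaURLToTitleSketch "https://en.wikipedia.org/wiki/Foo_Bar#Section"
-- → "Foo Bar § Section"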

tooltipToMetadata :: String -> String -> (String, String, String)

Reverse-parses citation tooltips (like "'Title', Author 2020") back into structured (title, author, date) tuples. Used when recovering metadata from existing annotations.

Called by: Metadata recovery routines
Calls: filterMeta, pageNumberParse, sed
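
A hypothetical round-trip, mirroring the tooltip format above (the first argument is presumably the annotated URL, unused here; the exact output shape is an assumption):

tooltipToMetadata "" "'Title', Author 2020"
-- → ("Title", "Author", "2020")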

cleanTitleWithAI :: String -> IO String

Invokes title-cleaner.py to clean titles using gpt-5-mini (temperature 1). Returns empty string on failure.

Called by: htmlDownloadAndParseTitleClean
Calls: runShellCommand (to title-cleaner.py)
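
A matching sketch, reusing the imports from the download sketch above (whether the title is passed via argv, as here, or via stdin is an assumption):

cleanTitleWithAISketch :: String -> IO String
cleanTitleWithAISketch title' = do
  (status, _stderr, out) <- runShellCommand "./" Nothing "build/title-cleaner.py" [title']
  case status of
    ExitSuccess -> return (U.toString out)  -- cleaned title (may equal input)
    _           -> return ""                -- empty string = failure sentinel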


Internal Architecture

Title Cleaning Pipeline

URL
             │
             ▼
┌─────────────────────────┐
│ download-title.sh       │  curl + HTML::TreeBuilder
│ (Perl HTML parsing)     │
└────────────┬────────────┘
             │ raw <title> content
             ▼
┌─────────────────────────┐
│ Separator truncation    │  remove text after —·|
│ (site name removal)     │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ String replacements     │  fix encoding issues (Â, â, etc.)
│ (C.stringReplace)       │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ Suffix/prefix deletion  │  remove " - YouTube", etc.
│ (C.stringDelete)        │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ Blocklist filter        │  reject "404", "Page Not Found", etc.
│ (C.badStrings)          │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ Length validation       │  reject if <5 or >500 chars
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ title-cleaner.py        │  gpt-5-mini for edge cases
│ (LLM cleanup)           │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ Substring validation    │  LLM result must be a strict
│ (anti-confabulation)    │  substring of `title'`
└─────────────────────────┘

Configuration Data Structures

All filtering rules live in Config.Metadata.Title:

-- Site separator characters (—·|)
separators :: String

-- Exact match blocklist (~180 entries)
badStrings :: [String]

-- Substring pattern blocklist
badStringPatterns :: [String]

-- Encoding fix replacements
stringReplace :: [(String, String)]

-- Suffix/prefix deletions (~200 entries)
stringDelete :: [String]

Key Patterns

Separator-Based Site Name Removal

Many sites append their name after a separator. The module finds the last separator and truncates:

-- "Article Title — Site Name" → "Article Title"
if any (`elem` C.separators) title
then reverse $ tail $ dropWhile (`notElem` C.separators) $ reverse title
else title

This handles patterns like:

  • "Article — New York Times"
  • "Post · Medium"
  • "Page | Site Name"

Strict Substring Anti-Confabulation

The LLM cleanup has a safety check: cleaned titles must be strict substrings of title' (the rule-cleaned title):

if titleCleaned /= title' && titleCleaned `isInfixOf` title'
then titleCleaned
else title' -- fall back to rule-based result

This prevents the LLM from rewriting or adding content—it can only delete.
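
To make the acceptance rule concrete, a hypothetical example:

-- title'   = "Attention Is All You Need - arXiv"
-- accepted: "Attention Is All You Need"          (strict substring of title')
-- rejected: "The Attention paper"                (not a substring → keep title')
-- rejected: "Attention Is All You Need - arXiv"  (equal to title' → keep title')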

Mixed Prefix/Suffix Deletion

stringDelete entries use trailing/leading spaces to indicate deletion type:

-- Trailing space = prefix deletion
"GitHub - " -- removes "GitHub - " from start

-- Leading space = suffix deletion
" - YouTube" -- removes " - YouTube" from end

This is processed by Utils.deleteMixedMany.
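
A minimal sketch of that rule, assuming one pass per pattern (the real Utils.deleteMixedMany may differ in details such as repeated application):

import Data.List (isPrefixOf, isSuffixOf)

deleteMixed :: String -> String -> String
deleteMixed pat s
  | not (null pat) && last pat == ' ' && pat `isPrefixOf` s
      = drop (length pat) s                -- trailing space ⇒ prefix deletion
  | not (null pat) && head pat == ' ' && pat `isSuffixOf` s
      = take (length s - length pat) s     -- leading space ⇒ suffix deletion
  | otherwise = s

deleteMixedManySketch :: [String] -> String -> String
deleteMixedManySketch pats s = foldl (flip deleteMixed) s pats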


Configuration

Config.Metadata.Title

Constant            Type                Description
separators          String              Characters used to split site names: "—·|"
badStrings          [String]            ~180 exact-match titles to reject (404 pages, error messages, site names)
badStringPatterns   [String]            Substring patterns to reject ("Redirecting to", "404 ", etc.)
stringReplace       [(String,String)]   Encoding fixes (Â° → °, etc.)
stringDelete        [String]            ~200 site-specific prefixes/suffixes to strip

title-cleaner.py

The LLM script uses gpt-5-mini with:

  • Temperature: 1
  • ~400 few-shot examples covering edge cases
  • Tasks: Remove boilerplate, fix encoding, convert *italic* to <em>, identify error pages

Integration Points

External Scripts

Script                    Purpose
build/download-title.sh   HTML fetch + Perl parsing
build/title-cleaner.py    LLM-based title cleanup

Dependencies

  • Metadata.Format: filterMeta, pageNumberParse, trimTitle, cleanAbstractsHTML
  • Utils: delete, replace, sed, anyInfix, trim, replaceMany, deleteMixedMany
  • Network.HTTP: urlDecode for Wikipedia URLs
  • Data.FileStore.Utils: runShellCommand for shell invocation

Shared State

  • Requires CM.cd to change to project root before shell commands
  • Uses OPENAI_API_KEY environment variable (via title-cleaner.py)

See Also