Utils.hs
Path: build/Utils.hs | Language: Haskell | Lines: ~774
Shared utility functions for file I/O, string manipulation, Pandoc AST transformations, and URL handling
Overview
Utils.hs is the central utility module for the gwern.net Haskell backend, providing approximately 150 helper functions used throughout the build system. It consolidates common operations that don't belong to any specific domain module, serving as a shared dependency for Typography.hs, Annotation.hs, LinkMetadata.hs, and other backend components.
The module covers five major areas: (1) file I/O with atomic writes and change detection, (2) string manipulation including regex-based search-replace and fixed-string operations, (3) Pandoc AST utilities for converting between HTML/Markdown/plaintext and manipulating inline elements, (4) URL parsing and validation with gwern.net style enforcement, and (5) statistical and date calculation helpers.
A key design decision is the heavy use of Text alongside String, with most functions having both variants (e.g., replace/replaceT, delete/deleteT). The module also emphasizes safety through checked variants like replaceChecked and comprehensive error messages that include context for debugging.
Public API
File I/O
writeUpdatedFile(template, target, contentsNew) -> IO ()
Writes content to a file only if changed, using atomic rename via temp file. Creates parent directories as needed.
Called by: Most build modules when writing output files
Calls: doesFileExist, createDirectoryIfMissing, emptySystemTempFile, renameFile
safeGetFileSize(path) -> IO Integer
Returns file size, or 0 on error (no exception thrown).
Called by: getDirectoryContentsSizeRecursive
Calls: getFileSize (wrapped in try/catch)
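The "0 on error" behaviour described above can be sketched with `try` from `Control.Exception`. This is a minimal illustration, not the module's actual definition: the name `safeGetFileSizeSketch` is hypothetical, and catching only `IOException` is an assumption.

```haskell
import Control.Exception (IOException, try)
import System.Directory (getFileSize)

-- Hypothetical sketch: return the file size, or 0 if the lookup fails
-- (missing file, permission error, ...). Assumes only IOException is caught.
safeGetFileSizeSketch :: FilePath -> IO Integer
safeGetFileSizeSketch path = do
  result <- try (getFileSize path) :: IO (Either IOException Integer)
  return (either (const 0) id result)
```

Returning 0 rather than throwing lets a recursive directory walk keep summing even when individual files disappear mid-scan.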
getDirectoryContentsSizeRecursive(dirPath) -> IO Integer
Recursively calculates total size of all files in a directory tree.
Called by: Build statistics gathering
Calls: safeGetFileSize, listDirectory, doesFileExist, doesDirectoryExist
getMostRecentlyModifiedDir(dir) -> IO String
Returns Unix timestamp of most recently modified file in a directory.
Called by: Cache invalidation logic
Calls: listDirectory, getModificationTime
String Manipulation
replace(before, after, str) -> str
Fixed-string replacement. Errors if before == after.
replace "foo" "bar" "foo baz foo" -- "bar baz bar"
Called by: Nearly all modules
Calls: split, intercalate
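A split-then-join replacement of this kind can be sketched using only `base`. The names `splitOnSketch` and `replaceSketch` are hypothetical, the local splitter stands in for whatever splitting function the module imports, and the extra guard against an empty search string is an assumption added so the sketch always terminates.

```haskell
import Data.List (intercalate, isPrefixOf)

-- Split a string on every occurrence of a non-empty delimiter
-- (local helper standing in for a library splitting function).
splitOnSketch :: String -> String -> [String]
splitOnSketch delim = go
  where
    n = length delim
    go s = case breakOn s of
      (chunk, Nothing)   -> [chunk]
      (chunk, Just rest) -> chunk : go rest
    -- Return the text before the first delimiter and, if found, the text after it.
    breakOn [] = ([], Nothing)
    breakOn s@(c:cs)
      | delim `isPrefixOf` s = ([], Just (drop n s))
      | otherwise            = let (chunk, rest) = breakOn cs in (c : chunk, rest)

-- Hypothetical sketch of fixed-string replacement: split on 'before', join with 'after'.
replaceSketch :: String -> String -> String -> String
replaceSketch before after str
  | before == after = error ("replaceSketch: before == after: " ++ show before)
  | null before     = error "replaceSketch: empty search string"  -- assumption, not from the source
  | otherwise       = intercalate after (splitOnSketch before str)
```

With these definitions, `replaceSketch "foo" "bar" "foo baz foo"` evaluates to `"bar baz bar"`.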
replaceChecked(before, after, str) -> str
Strict replacement that errors if: any argument is null, arguments aren't unique, or replacement didn't happen.
Called by: Code paths where replacement must succeed
Calls: replace
sed(before, after, str) -> str
Regex-based search and replace using POSIX extended regex syntax. Catches exceptions and provides debugging context.
sed "^<p>(.*)</p>$" "\\1" "<p>hello</p>" -- "hello"
Called by: toHTML, HTML cleanup functions
Calls: mkRegex, subRegex
sedMany(regexps) -> (String -> String)
Apply multiple regex replacements in sequence.
Called by: Batch text cleanup
Calls: sed, isUniqueList
delete(target) -> (String -> String)
Delete substring from string (specialized replace x "").
Called by: Content cleanup
Calls: replace
deleteMany(targets) -> (String -> String)
Delete multiple substrings.
Called by: inlinesToText, content cleanup
Calls: delete, isUniqueList
deleteMixed(target, str) -> String
Smart deletion: if target starts with space, remove as suffix; if ends with space, remove as prefix; otherwise delete anywhere.
deleteMixed " - Site Name" "Title - Site Name" -- "Title"
deleteMixed "Site: " "Site: Title" -- "Title"
Called by: Title cleanup in annotation scrapers
Calls: delete
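The three-way rule above can be sketched as follows. The names `deleteMixedSketch`, `stripSuffixSketch`, and `deleteAllSketch` are hypothetical, and returning the input unchanged for an empty target is an assumption.

```haskell
import Data.List (isPrefixOf, isSuffixOf, stripPrefix)
import Data.Maybe (fromMaybe)

-- Hypothetical sketch of the rule: a leading space marks a suffix,
-- a trailing space marks a prefix, anything else is deleted everywhere.
deleteMixedSketch :: String -> String -> String
deleteMixedSketch target str
  | null target             = str  -- assumption: nothing to delete
  | " " `isPrefixOf` target = stripSuffixSketch target str
  | " " `isSuffixOf` target = fromMaybe str (stripPrefix target str)
  | otherwise               = deleteAllSketch target str

-- Drop 'suffix' from the end of the string if present.
stripSuffixSketch :: String -> String -> String
stripSuffixSketch suffix s
  | suffix `isSuffixOf` s = take (length s - length suffix) s
  | otherwise             = s

-- Delete every occurrence of a non-empty target.
deleteAllSketch :: String -> String -> String
deleteAllSketch target = go
  where
    go [] = []
    go s@(c:cs)
      | target `isPrefixOf` s = go (drop (length target) s)
      | otherwise             = c : go cs
```

Anchoring space-delimited targets to one end avoids accidentally deleting a site name that also appears in the middle of a title.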
trim(str) -> String
Strip leading/trailing whitespace and hyphens.
Called by: simplifiedString, various cleanups
Calls: dropWhile, reverse
Pandoc Utilities
toPandoc(html) -> Pandoc
Parse HTML string to Pandoc AST.
Called by: simplifiedHtmlToString, content processing
Calls: readHtml
toHTML(inline) -> String
Render a single Pandoc Inline to HTML string, stripping <p> wrapper.
toHTML $ Span nullAttr [Str "foo"] -- "<span>foo</span>"
toHTML $ Str "foo" -- "foo"
Called by: HTML generation
Calls: writeHtml5String, sed
toMarkdown(html) -> String
Convert HTML to Markdown.
Called by: Content conversion
Calls: readHtml, writeMarkdown
simplified(block) -> Text
Render a Pandoc Block to plain text (no formatting).
Called by: Title/description extraction
Calls: simplifiedDoc
simplifiedDoc(pandoc) -> Text
Render entire Pandoc document to plain text with very wide column width (prevents unwanted line breaks).
Called by: simplified, simplifiedHtmlToString
Calls: writePlain
inlinesToText(inlines) -> Text
Extract plain text from Pandoc inlines, recursively unwrapping formatting.
inlinesToText [Emph [Str "hello"], Str " ", Strong [Str "world"]]
-- "hello world"
Called by: Link text extraction, title handling
Calls: Recursive pattern matching on Inline types
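The recursive unwrapping can be illustrated without the Pandoc library by using a toy inline type. `InlineSketch` below is a hypothetical stand-in covering only a few constructors; it is not the real `Text.Pandoc.Definition.Inline`, and the real function handles many more cases.

```haskell
import qualified Data.Text as T

-- Toy stand-in for a handful of Pandoc Inline constructors (assumption:
-- simplified shapes, no attributes).
data InlineSketch
  = Str T.Text
  | Space
  | Emph [InlineSketch]
  | Strong [InlineSketch]
  | Link [InlineSketch] T.Text  -- link text and target URL
  deriving (Eq, Show)

-- Concatenate the text of each inline, recursing through formatting wrappers.
inlinesToTextSketch :: [InlineSketch] -> T.Text
inlinesToTextSketch = T.concat . map go
  where
    go (Str t)     = t
    go Space       = T.pack " "
    go (Emph xs)   = inlinesToTextSketch xs
    go (Strong xs) = inlinesToTextSketch xs
    go (Link xs _) = inlinesToTextSketch xs  -- keep link text, drop the URL
```

Each wrapper constructor simply delegates to its children, so arbitrarily nested formatting collapses to the same plain text.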
parseRawAllClean(pandoc) -> Pandoc
Parse RawBlock HTML into proper Pandoc AST and clean up empty Divs/Spans.
Called by: Content normalization
Calls: parseRawBlock, cleanUpSpans, cleanUpDivsEmpty
addClass(class, inline) -> Inline
Add CSS class to Link, Span, Image, or Code inline.
Called by: Link annotation, styling
Calls: Pattern matching on Inline
removeClass(class, inline) -> Inline
Remove CSS class from inline element.
Called by: Cleanup passes
Calls: filter
hasClass(class, inline) -> Bool
Check if inline element has a CSS class.
Called by: Conditional processing
Calls: elem
addKey / removeKey / hasKey
Manipulate key-value attributes on inline elements.
URL Handling
host(url) -> Text
Extract hostname from URL. Enforces gwern.net style: root domains must have trailing slash.
host "https://example.com/path" -- "example.com"
host "https://example.com" -- prints warning, returns "example.com"
Called by: Link categorization, domain checks
Calls: parseURIReference, escapeUnicode
isURL(url) -> Bool
Check if string is valid HTTP/HTTPS URL.
Called by: Link validation
Calls: parseURI, escapeUnicode
isURLAny(url) -> Bool
Check any URL (local paths, mailto:, irc:, etc.).
Called by: General link validation
Calls: isURL, isURILocalT
isLocal(text) -> Bool
Check if URL is local (starts with /).
Called by: Link categorization
Calls: T.head
hasExtension(ext, path) -> Bool
Check file extension of URL/path.
Called by: File type detection
Calls: extension
escapeUnicode(text) -> Text
Percent-encode Unicode characters for URI compatibility.
Called by: host, isURLT
Calls: escapeURIString
Statistics & Dates
calculatePercentilesFromWholeNumbers(sizes) -> [Int]
Calculate percentile rank (0-100) for each positive integer in a list.
Called by: Size-based styling/ranking
Calls: sort, Map.fromList
calculateDateSpan(start, end) -> Int
Calculate the number of days spanned by two dates, counting both endpoints (supports YYYY, YYYY-MM, YYYY-MM-DD formats).
calculateDateSpan "1939-09-01" "1945-05-08" -- 2077
Called by: Duration calculations
Calls: parseDate, diffDays
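The date arithmetic can be sketched with `Data.Time.Calendar`. Several things here are assumptions: the names `parseDateSketch` and `calculateDateSpanSketch` are hypothetical, missing month or day parts default to 1, and the span counts both endpoints (inferred from the example above, since the plain difference between those two dates is 2076 days). The sketch returns `Maybe Int` so that malformed input is visible rather than fatal.

```haskell
import Data.Time.Calendar (Day, diffDays, fromGregorianValid)
import Text.Read (readMaybe)

-- Split "YYYY-MM-DD" style input on '-' characters.
splitOnDash :: String -> [String]
splitOnDash s = case break (== '-') s of
  (part, [])       -> [part]
  (part, _ : rest) -> part : splitOnDash rest

-- Parse "YYYY", "YYYY-MM", or "YYYY-MM-DD"; missing parts default to 1 (assumption).
parseDateSketch :: String -> Maybe Day
parseDateSketch s = do
  parts <- (mapM readMaybe (splitOnDash s) :: Maybe [Integer])
  case parts of
    [y]       -> fromGregorianValid y 1 1
    [y, m]    -> fromGregorianValid y (fromInteger m) 1
    [y, m, d] -> fromGregorianValid y (fromInteger m) (fromInteger d)
    _         -> Nothing

-- Inclusive day count between two dates (assumption: both endpoints counted).
calculateDateSpanSketch :: String -> String -> Maybe Int
calculateDateSpanSketch start end = do
  a <- parseDateSketch start
  b <- parseDateSketch end
  return (fromInteger (diffDays b a) + 1)
```

Under these assumptions, `calculateDateSpanSketch "1939-09-01" "1945-05-08"` evaluates to `Just 2077`.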
formatDaysInLargestUnit(days) -> String
Format day count as days/months/years.
formatDaysInLargestUnit 365 -- "1y"
formatDaysInLargestUnit 45 -- "1m"
formatDaysInLargestUnit 7 -- "7d"
Called by: Human-readable duration display
Calls: Arithmetic
formatIntWithCommas(n) -> String
Format integer with thousand separators.
formatIntWithCommas 1234567 -- "1,234,567"
Called by: Statistics display
Calls: intercalate, chunksOf
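One common way to build the separators is to reverse the digits, chunk them in threes, join with commas, and reverse back. The sketch below takes that approach with a local `chunksOfSketch` so it needs only `base`; both names are hypothetical, and the handling of negative numbers is an assumption.

```haskell
import Data.List (intercalate)

-- Split a list into consecutive chunks of length n (the last may be shorter).
chunksOfSketch :: Int -> [a] -> [[a]]
chunksOfSketch _ [] = []
chunksOfSketch n xs = let (h, t) = splitAt n xs in h : chunksOfSketch n t

-- Hypothetical sketch: group digits in threes from the right.
formatIntWithCommasSketch :: Int -> String
formatIntWithCommasSketch n
  | n < 0     = '-' : formatIntWithCommasSketch (negate n)  -- assumption: keep the sign in front
  | otherwise = reverse (intercalate "," (chunksOfSketch 3 (reverse (show n))))
```

For example, `formatIntWithCommasSketch 1234567` evaluates to `"1,234,567"`.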
Utility Helpers
fixedPoint(f, x) -> x
Repeatedly apply function until output stabilizes. Detects cycles and infinite loops (5000 iteration limit).
Called by: Normalization loops
Calls: Recursive with Set-based cycle detection
frequency(list) -> [(Int, a)]
Count occurrences, return sorted by frequency ascending.
frequency "hello" -- [(1,'e'),(1,'h'),(1,'o'),(2,'l')]
Called by: Analysis/statistics
Calls: group, sort
repeated(list) -> [a]
Find elements appearing more than once.
repeated "foo bar" -- "o"
Called by: Duplicate detection
Calls: M.filter, M.fromListWith
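Both counting helpers can be sketched from the functions listed in their "Calls" lines. The names `frequencySketch` and `repeatedSketch` are hypothetical, and the tie-breaking order among equal counts (by element) is an assumption that falls out of sorting the pairs.

```haskell
import Data.List (group, sort)
import qualified Data.Map.Strict as M

-- Count occurrences by sorting, grouping equal neighbours, then sorting the
-- (count, element) pairs so the least frequent come first.
frequencySketch :: Ord a => [a] -> [(Int, a)]
frequencySketch xs = sort [ (length g, x) | g@(x:_) <- group (sort xs) ]

-- Elements that occur more than once, via a count map.
repeatedSketch :: Ord a => [a] -> [a]
repeatedSketch xs =
  M.keys (M.filter (> 1) (M.fromListWith (+) [ (x, 1 :: Int) | x <- xs ]))
```

For example, `repeatedSketch "foo bar"` evaluates to `"o"`, since only `'o'` has a count above 1.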
ensure(location, description, predicate, list) -> list
Assert predicate holds for all list elements; fatal error with context if not. Forces evaluation via deepseq.
Called by: Validation passes
Calls: deepseq, error
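A minimal sketch of such an assertion helper, assuming the argument order given in the signature above. The name `ensureSketch` and the exact error wording are hypothetical.

```haskell
import Control.DeepSeq (NFData, deepseq)

-- Hypothetical sketch: return the list unchanged if every element passes,
-- otherwise abort with the call site, the check description, and the offenders.
ensureSketch :: (Show a, NFData a) => String -> String -> (a -> Bool) -> [a] -> [a]
ensureSketch location description predicate xs =
  case filter (not . predicate) xs of
    []  -> xs `deepseq` xs  -- force full evaluation before handing the list back
    bad -> error (location ++ ": failed check '" ++ description ++ "' on: " ++ show bad)
```

Forcing the list with `deepseq` means a lazily constructed configuration list is fully evaluated at the validation point instead of failing later somewhere unrelated.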
truncateString(maxLen, str) -> String
Truncate string at word boundary, append "…".
truncateString 20 "This is a long string" -- "This is a long…"
Called by: Title display, column fitting
Calls: elemIndices, take
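Word-boundary truncation can be sketched by cutting at the length limit and then backing up to the last space. The name `truncateStringSketch` is hypothetical, and the behaviour when the cut contains no space (hard cut plus ellipsis) is an assumption.

```haskell
import Data.List (elemIndices)

-- Hypothetical sketch: keep at most maxLen characters, back up to the last
-- space inside that window, and append an ellipsis.
truncateStringSketch :: Int -> String -> String
truncateStringSketch maxLen str
  | length str <= maxLen = str
  | otherwise =
      let cut = take maxLen str
      in case elemIndices ' ' cut of
           []     -> cut ++ "…"                     -- assumption: no space, hard cut
           spaces -> take (last spaces) cut ++ "…"  -- cut just before the last space
```

With this definition, `truncateStringSketch 20 "This is a long string"` evaluates to `"This is a long…"`.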
Internal Architecture
Data Flow Patterns
HTML Input → toPandoc → Pandoc AST → walk/topDown transforms → toHTML/toMarkdown → Output
                            ↑
                  parseRawAllClean (normalize Raw* blocks)
                            ↓
                  cleanUpSpans/cleanUpDivsEmpty (remove cruft)
Key Data Structures
Pandoc Attr: (id, [classes], [(key,value)]) - the attribute tuple used throughout
nullAttr = ("", [], [])
WriterOptions: safeHtmlWriterOptions sets column width to 9999 to prevent unwanted line breaks in tables/grids.
String vs Text
Most functions come in pairs:
- replace / replaceT
- delete / deleteT
- deleteMany / deleteManyT
- anyInfix / anyInfixT
- kvLookup / kvLookupT
The T suffix indicates Data.Text versions.
Key Patterns
Atomic File Writes
writeUpdatedFile template target contentsNew = do
  existsOld <- doesFileExist target
  if not existsOld then do
    createDirectoryIfMissing True (takeDirectory target)
    TIO.writeFile target contentsNew
  else do
    contentsOld <- TIO.readFile target
    if contentsNew /= contentsOld then do
      tempPath <- emptySystemTempFile ("hakyll-"++template)
      TIO.writeFile tempPath contentsNew
      renameFile tempPath target -- atomic on POSIX
    else return ()
Only writes when content changed. Uses temp file + rename for atomicity, preventing partial writes.
Safe Regex with Debug Context
sed before after s = unsafePerformIO $ do
  -- 'action' (the regex substitution itself) is not shown in this excerpt
  catch action handleExceptions
  where
    handleExceptions e = return $
      "Error occurred. Exception: " ++ show e ++
      "; arguments were: '" ++ before ++ "' : '" ++ after ++ "' : '" ++ s ++ "'"
Regex operations catch exceptions and include full context (pattern, replacement, input) in error messages.
Fixed-Point with Cycle Detection
fixedPoint' n seen f i
  | i `S.member` seen = error $ "Cycle detected! Last result: " ++ show i
  | otherwise = let i' = f i
                in if i' == i then i
                   else fixedPoint' (n-1) (S.insert i seen) f i'
Tracks all previous values in a Set to detect cycles. The counter n is decremented on every step; the clause that enforces the 5000-iteration safety limit is not shown in this excerpt.
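A self-contained sketch of the whole pattern, including an iteration-limit clause, might look like the following. The name `fixedPointSketch`, the wording of the limit error, and the placement of the limit check as a separate first clause are assumptions; only the cycle-detection branch mirrors the excerpt.

```haskell
import qualified Data.Set as S

-- Hypothetical sketch: apply f until the value stops changing, failing loudly
-- on a revisited value (cycle) or after 5000 steps (assumed limit clause).
fixedPointSketch :: (Show a, Ord a) => (a -> a) -> a -> a
fixedPointSketch = go (5000 :: Int) S.empty
  where
    go 0 _ _ i = error ("Iteration limit reached; last result: " ++ show i)
    go n seen f i
      | i `S.member` seen = error ("Cycle detected! Last result: " ++ show i)
      | otherwise =
          let i' = f i
          in if i' == i then i else go (n - 1) (S.insert i seen) f i'
```

For example, `fixedPointSketch (\n -> n `div` 2) (100 :: Int)` halves its way down to `0` and stops there, because `0 `div` 2` is again `0`.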
Pandoc Inline Class Manipulation
addClass clss x@(Link (i, clsses, ks) s (url, tt)) =
  if clss `elem` clsses
    then x
    else Link (i, clss:clsses, ks) s (url, tt)
Uniform handling across Link, Span, Image, Code with idempotent add (no duplicates).
Configuration
Writer Options
safeHtmlWriterOptions - Used for HTML output:
- writerColumns = 9999 - Prevents line-breaking in grid tables
- writerExtensions = enableExtension Ext_shortcut_reference_links pandocExtensions - Cleaner link syntax
Regex Library
Uses regex-compat-tdfa (not regex-compat) for Unicode support. This is critical: the wrong package causes silent failures with non-ASCII text.
Fixed-Point Limits
- Default iteration limit: 5000
- Can be tuned per-call if number of rewrite rules is known
Integration Points
Terminal Output
printGreen :: String -> IO () -- Normal progress (green text)
printRed :: String -> IO () -- Errors/warnings (red background)
Used throughout build system for status messages. Red messages truncated at 2048 chars.
Shared Predicates
Several predicates are used across modules:
- isInflationURL / isInflationLink - Detect $/₿ prefixed prices
- isLocal - Local vs external URL
- hasExtension - File type checking
Pandoc Ecosystem
Depends on:
- Text.Pandoc - Core AST types and readers/writers
- Text.Pandoc.Walk - AST traversal (walk, topDown)
Provides utilities consumed by Typography.hs, LinkMetadata.hs, and content processing.
External Tools
inlineMath2Text shells out to static/build/latex2unicode.py for LaTeX→Unicode conversion.
See Also
- Hakyll.hs - Site generator that imports Utils as a core dependency
- Query.hs - Uses Pandoc AST utilities for link extraction
- Typography.hs - Major consumer of string manipulation and Pandoc utilities
- Annotation.hs - Uses string manipulation and URL handling
- LinkMetadata.hs - Uses file I/O, URL validation, and text processing
- Unique.hs - Companion validation utilities for configuration lists
- sync.sh - Build orchestrator that coordinates modules using Utils