Skip to main content

markdown-length-checker.hs

Path: build/markdown-length-checker.hs | Language: Haskell (runghc script) | Lines: ~36

Validates that code blocks in Markdown files don't contain overly long lines


Overview

This is a simple linting tool that scans Markdown files for code blocks containing lines at least 110 characters long. The rationale is that excessively long lines in code examples indicate source code that could benefit from refactoring for clarity and readability.

The script uses Pandoc to parse Markdown into an AST, then queries all CodeBlock elements to find lines exceeding the threshold. It's designed for batch processing via find -exec, making it easy to scan an entire documentation tree.

This fits into gwern.net's quality assurance tooling—ensuring that code examples in essays and documentation remain readable without horizontal scrolling, especially important given the site's focus on long-form content consumption.


Public API

main :: IO ()

Entry point. Reads a single filename from command-line arguments, processes it, and prints any violations to stdout.

Usage:

runghc markdown-length-checker.hs path/to/file.md
# Or batch:
find ~/wiki/ -name "*.md" -exec runghc markdown-length-checker.hs {} \;

Output format: filepath: CodeBlock (attr) "offending content..."


processLint :: FilePath -> T.Text -> T.Text

Core processing function. Parses Markdown content and returns formatted violation messages.

Called by: main Calls: lineCheck, Pandoc's readMarkdown

Returns: Empty text if no violations, otherwise newline-separated list of violations prefixed with filename.


Internal Architecture

Processing Pipeline

File content

readMarkdown (Pandoc parser with full extensions)

topDown longCodeLines (transform: mark blocks with long lines)

queryWith clean (extract: collect only CodeBlocks)

Format violations with filename prefix

Key Functions

FunctionPurpose
lineCheckCombines transformation and extraction into single query
longCodeLinesTransforms CodeBlocks: keeps block if any line ≥110 chars, replaces with Plain [] otherwise
cleanFilter that extracts only CodeBlock elements from AST

Data Flow

The topDown longCodeLines traversal visits every Block in the document. For CodeBlock nodes, it checks each line's length—if any line reaches 110+ characters, the block is preserved; otherwise it's replaced with an empty Plain [] sentinel. The subsequent queryWith clean collects only the surviving CodeBlock elements.


Key Patterns

Pandoc AST Querying

Uses Pandoc's generic traversal combinators:

  • topDown for transformation (modifying nodes while traversing)
  • queryWith for extraction (collecting matching nodes)

This two-phase approach (transform then query) is slightly unusual—a single queryWith with filtering would be more direct—but it works correctly.

Threshold Hardcoding

The 110-character limit is hardcoded in longCodeLines. This value is a reasonable default for code readability, roughly matching common terminal widths and editor configurations.

Minimal Error Handling

Parse failures are reported but don't halt batch processing. The error message includes the problematic file path and Pandoc's error details.


Configuration

SettingLocationDefaultNotes
Line length thresholdlongCodeLines function110 charsHardcoded; edit source to change
Pandoc extensionsprocessLintpandocExtensionsFull extension set for gwern.net compatibility

Integration Points

Build System

Not directly integrated into sync-sh—this is a manual/ad-hoc linting tool rather than a build-blocking check.

Typical Usage

# Check all Markdown files
find ~/wiki/ -name "*.md" -exec runghc markdown-length-checker.hs {} \;

# Check single file
runghc markdown-length-checker.hs essay.md

Output Interpretation

Empty output means all code blocks pass. Non-empty output shows the full CodeBlock AST node for each violation, which includes:

  • Block attributes (language annotation, classes, key-value pairs)
  • The complete code content (not just the offending lines)

See Also