404-guesser.js

Path: js/404-guesser.js | Language: JavaScript | Lines: ~263

Enhances 404 error pages with intelligent URL suggestions based on Levenshtein distance

Overview

The 404 guesser provides a user-friendly enhancement to gwern.net's 404 error pages by automatically suggesting similar URLs that the user might have intended to visit. When a user lands on a non-existent page (such as /rubiiks-cube instead of /rubiks-cube), the script fetches the site's sitemap, computes basename matches and full-path matches separately (each sorted by edit distance), concatenates those lists, dedupes, and then presents the first 10 suggestions.

The implementation is optimized for performance despite gwern.net's large sitemap (37,000+ URLs, 4.7MB uncompressed). It uses a bounded edit distance calculation that typically completes in ~3 seconds, and the sitemap compresses well (to ~0.5MB) making the bandwidth cost comparable to loading a couple of images. The script intelligently handles both full-path matching and basename-only matching as a fallback, catching cases where directory paths are wrong or missing.

Legitimate 404 errors are rare on gwern.net (<10 per day) due to comprehensive redirect rules and maintained internal links, so this feature primarily benefits real users who make typos rather than bots. The script includes logic to skip suggestion generation when the current URL is /404 (a redirect case where the original URL is lost).

Public API

Main Entry Point

async function suggest404Alternatives()

Purpose: Main orchestration function Behavior: Fetches sitemap, finds similar URLs, injects suggestions into page Called by: window.load event listener Calls: fetchSitemap(), urlPathnamesFromSitemapString(), findSimilarUrlPaths(), injectSuggestions() Early exit: If current path ends with /404 (redirected case, not a real 404)

Internal Architecture

Data Flow

Sitemap Retrieval: fetchSitemap() fetches /sitemap.xml via fetch() API
XML Parsing: urlPathnamesFromSitemapString() uses DOMParser to extract loc elements and convert to pathnames
Similarity Matching: findSimilarUrlPaths() compares current path against all sitemap URLs
DOM Injection: injectSuggestions() creates a section element and inserts before #other-options

Core Functions

fetchSitemap()

async function fetchSitemap()

Fetches location.origin + "/sitemap.xml"
Returns sitemap text or null on error
Uses standard fetch() API with error handling

urlPathnamesFromSitemapString(sitemapText)

function urlPathnamesFromSitemapString(sitemapText)

Parses sitemap XML string using DOMParser
Extracts all loc elements from the sitemap XML
Returns array of pathname strings (just the path component, no domain)

findSimilarUrlPaths(urlPaths, targetPath, numResults, maxDistance)

function findSimilarUrlPaths(urlPaths, targetPath, numResults, maxDistance)

Full-path matching with bounded Levenshtein distance
Pre-filters by length difference (Math.abs(length_a - length_b) <= maxDistance)
Calculates edit distance for remaining candidates
Sorts by distance ascending, takes top numResults
Returns array of up to numResults pathname strings

findSimilarByBasename(urlPaths, targetPath, numResults, maxDistance)

function findSimilarByBasename(urlPaths, targetPath, numResults, maxDistance)

Fallback matcher for cases like /ai-slop → /blog/ai-slop
Extracts last path segment (basename) using URLFromString().pathSegments.last
Skips if basename has fewer than 3 characters or is a directory-only path
Filters out single-character basenames
Matches basenames only, returns full pathnames

injectSuggestions(currentPath, suggestedUrlStrings)

function injectSuggestions(currentPath, suggestedUrlStrings)

Builds HTML for suggestion list
Wraps in a columns div if more than 3 suggestions
Inserts before #other-options element in #markdownBody
Calls Extracts.addTargetsWithin() to enable popups on suggestion links

Key Patterns

Bounded Levenshtein Distance

The script implements a custom boundedLevenshteinDistance() function that aborts early if the distance exceeds a threshold:

function boundedLevenshteinDistance(a, b, maxDistance)

Optimizations:

Length pre-check: Returns maxDistance + 1 if absolute length difference > maxDistance
Row-wise early exit: After computing each row of the dynamic programming matrix, checks if minimum distance in that row exceeds maxDistance and aborts if so
Result: Avoids computing full O(n*m) matrix for dissimilar strings

This is critical for performance: with 37,000+ URLs, full Levenshtein distance would be prohibitively slow. The bounded version with maxDistance=8 completes in ~3 seconds.

Two-Stage Matching Strategy

The algorithm combines both matching approaches:

Basename matching first: Matches only the last path segment with stricter threshold (maxDistance * 0.5 = 4)
Full-path matching second: Matches entire pathnames with full threshold (maxDistance = 8)

Results are concatenated, deduplicated via .unique(), and limited to maxNumResults (10). This catches both wrong/missing directory structures (e.g., /ai-slop to /blog/ai-slop) and typos in the exact path.

Sitemap Caching Considerations

The sitemap is fetched client-side and may be cached by the browser. The author notes this is acceptable because:

Legitimate 404s are rare
Most 404s are to old URLs, not newly created ones
Probability of: (1) hitting 404, (2) caching sitemap, then (3) hitting 404 for brand-new URL is negligible

The server-side sitemap cache is flushed on every sync, ensuring server freshness.

Performance Justification

Despite the 4.7MB sitemap size, the feature is deemed worthwhile:

Compresses to ~0.5MB (comparable to a few images)
Legitimate users hit 404s rarely (fewer than 10 per day)
Ill-behaved bots typically don't execute JavaScript
If bots do load it, they waste their own bandwidth (self-DoS)
For legitimate users, 0.5MB + 3 seconds is faster than manually searching Google

Integration Points

Dependencies

utility.js:

elementFromHTML() - Creates DOM element from HTML string
URLFromString().pathSegments.last - Extracts basename from URL string

Extracts.js:

Extracts.addTargetsWithin(element) - Enables popup/annotation functionality on injected links

DOM Targets

Injection point:

Inserts before #other-options element within #markdownBody

Generated HTML structure:

The script injects a section with ID guessed-urls containing a header, description paragraph, and a list of suggestion links. If more than 3 suggestions, the list is wrapped in a columns div for multi-column display.

Event Hooks

window.load:

Registers suggest404Alternatives() to run after page load
Ensures DOM is ready before injection

Configuration

Constants

const sitemapURL = location.origin + "/sitemap.xml";

Tunable Parameters

In suggest404Alternatives():

numResults: 10 - Maximum number of suggestions to display
maxDistance: 8 - Maximum edit distance for full-path matching (chosen heuristically)

In findSimilarUrlPaths():

Basename fallback uses Math.ceil(maxDistance * 0.5) = 4 for stricter matching

In injectSuggestions():

wrapInColumnsIfMoreEntriesThan: 3 - Column layout threshold

Security Notes

The implementation trusts the sitemap.xml and does not sanitize XML or URLs. The author notes:

Security Note: the sitemap.xml is trusted and assumed to contain clean valid XML with well-behaved alphanumerical/punctuation URLs and maybe a bit of Unicode. We do not attempt to sanitize or validate the XML or the parsed URLs, so if used in settings where attackers may control the sitemap.xml (eg. user-generated content), the XML part should probably be rewritten to be ultra-paranoid & do things like ping the final suggested URLs to make sure they exist.

This is acceptable for gwern.net's controlled environment but would need hardening for user-generated content scenarios.

Overview​

Public API​

Main Entry Point​

Internal Architecture​

Data Flow​

Core Functions​

Key Patterns​

Bounded Levenshtein Distance​

Two-Stage Matching Strategy​

Sitemap Caching Considerations​

Performance Justification​

Integration Points​

Dependencies​

DOM Targets​

Event Hooks​

Configuration​

Constants​

Tunable Parameters​

Security Notes​

See Also​