Annotation.OpenReview

Path: build/Annotation/OpenReview.hs | Language: Haskell | Lines: ~42

Scrapes academic paper metadata from OpenReview.net conference papers

Overview

OpenReview.net is a platform for open peer review of academic papers, hosting many machine learning conference submissions (ICLR, NeurIPS workshops, etc.). This module extracts structured metadata from OpenReview paper URLs for use in gwern.net's annotation system.

The module delegates the actual HTTP fetching and JSON parsing to an external shell script (openReviewAbstract.sh), then processes and normalizes the returned data. It handles both /forum?id= (discussion page) and /pdf?id= (direct PDF) URL formats, converting the latter to the former for scraping since the forum page contains the structured metadata.

A notable feature is the combination of multiple abstract-like fields: OpenReview papers often have a "TL;DR" summary in addition to the full abstract, plus keywords. All three are combined into a unified abstract field with the TL;DR first, then full abstract, then keywords as an HTML comment.

Public API

`openreview :: Metadata -> Path -> IO (Either Failure (Path, MetadataItem))`

Main entry point for scraping an OpenReview paper URL.

Parameters:

md - The global metadata database (passed through for processParagraphizer)
p - The OpenReview URL to scrape

Returns:

Right (path, metadataItem) on success with extracted metadata
Left Permanent on failure (download error or parse failure)

Called by: Annotation.hs (URL dispatcher, pattern matches on openreview.net/forum?id= or openreview.net/pdf?id=)

Calls:

runShellCommand - Executes openReviewAbstract.sh with the URL
processArxivAbstract - Cleans up LaTeX math notation in abstracts
cleanAbstractsHTML - Sanitizes HTML in abstract text
processParagraphizer - Adds paragraph breaks to long abstracts
linkAutoHtml5String - Auto-links recognized terms in keywords
cleanAuthors / trimTitle - Normalizes author and title strings

Internal Architecture

Data Flow

URL (forum or pdf)
        │
        ▼
openReviewAbstract.sh (curl + jq)
        │
        ▼
  Line-separated output:
  [title, author, date, tldr, abstract, keywords...]
        │
        ▼
  Parse & clean each field
        │
        ▼
  Combine: tldr + abstract + "<!-- Keywords: ... -->"
        │
        ▼
  MetadataItem tuple

Shell Script Behavior

The companion script build/openReviewAbstract.sh does the heavy lifting:

Fetches the URL via curl with redirects followed
Pipes HTML through tidy to normalize it
Greps for the pageProps JSON blob embedded in the page
Uses jq to extract: title, authors (joined), timestamp (converted to date), TL;DR, abstract, keywords (joined)
Cleans up null values and LaTeX artifacts

The script handles two OpenReview JSON formats (with and without .value nesting) for compatibility across different paper types.

MetadataItem Structure

Returns a 7-tuple:

(title, authors, date, "", [], [], abstract)
--  ^      ^      ^    ^   ^   ^     ^
--  |      |      |    |   |   |     Combined TL;DR + abstract + keywords
--  |      |      |    |   |   Tags (empty)
--  |      |      |    |   Key-value pairs (empty)
--  |      |      |    DOI/created date (empty)
--  |      |      Publication date (YYYY-MM-DD)
--  |      Comma-separated author list
--  Cleaned title

Key Patterns

URL Normalization

PDF URLs are converted to forum URLs before scraping:

let p' = replace "/pdf?id=" "/forum?id=" p

This ensures consistent metadata extraction since the PDF endpoint doesn't contain the structured metadata JSON.

Keyword Embedding as HTML Comment

Keywords are preserved but hidden from display using HTML comments:

let keywords'
  | null keywords || keywords == [""] = ""
  | length keywords > 1 =
      unlines (init keywords) ++ "\n<!-- [Keywords: " ++ last keywords ++ "] -->"
  | otherwise = "<!-- [Keywords: " ++ concat keywords ++ "] -->"

This keeps keywords searchable in the raw HTML while not cluttering the visible abstract.

Arxiv Abstract Processing Reuse

OpenReview papers often contain LaTeX math notation similar to arXiv papers, so the module reuses processArxivAbstract for cleanup:

let tldr' = cleanAbstractsHTML $ processArxivAbstract tldr
let desc' = cleanAbstractsHTML $ processArxivAbstract desc

Configuration

No direct configuration options. Behavior is controlled by:

The openReviewAbstract.sh script's jq selectors
Shared abstract cleaning functions from Metadata.Format
Author normalization rules in Metadata.Author

Integration Points

URL Dispatch (Annotation.hs)

The annotation dispatcher routes to this module based on URL prefix matching:

| "https://openreview.net/forum?id=" `isPrefixOf` l
  || "https://openreview.net/pdf?id=" `isPrefixOf` l =
    openreview md l

Link Archiving (Config/LinkArchive.hs)

OpenReview forum URLs are converted to PDF URLs for archiving:

replace "https://openreview.net/forum" "https://openreview.net/pdf"

This complements this module's inverse transformation, ensuring users see metadata from the forum page while archived PDFs provide permanence.

Link Icons (Config/LinkIcon.hs)

OpenReview links display a distinctive "OR" icon in red (#8c1b13), matching the site's branding.

Overview​

Public API​

openreview :: Metadata -> Path -> IO (Either Failure (Path, MetadataItem))​

Internal Architecture​

Data Flow​

Shell Script Behavior​

MetadataItem Structure​

Key Patterns​

URL Normalization​

Keyword Embedding as HTML Comment​

Arxiv Abstract Processing Reuse​

Configuration​

Integration Points​

URL Dispatch (Annotation.hs)​

Link Archiving (Config/LinkArchive.hs)​

Link Icons (Config/LinkIcon.hs)​

See Also​