upload.sh
Path: build/upload.sh | Language: Bash | Lines: ~279
Comprehensive file upload manager with automatic optimization, naming, and deployment.
Overview
upload.sh is the primary interface for adding files to gwern.net, handling everything from file naming and format conversion to optimization, metadata extraction, and deployment. It serves as a quality gate ensuring all uploaded content follows site conventions: lowercase filenames, standardized extensions, globally unique names, and optimized formats.
The script distinguishes between two upload modes: temporary uploads (one argument) go to /doc/www/misc/, while permanent uploads (two arguments) are placed in topic-specific directories like /doc/statistics/decision/. Files undergo format-specific processing: PDFs are OCR'd and compressed, images are metadata-stripped and checked for optimization opportunities, and text files are reformatted for readability. (No automatic deletion is scheduled in the script.)
A key feature is the large-file strategy: files over 200MB are uploaded normally but added to .gitignore to avoid bloating the Git repository, providing the benefits of Git-based deployment without the overhead of Git-LFS or Git-annex. The script integrates with the broader gwern.net infrastructure by calling external tools (compressPdf, crossref, cloudflare-expire) and opening uploaded files in a browser for verification.
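A minimal sketch of the two-mode dispatch described above (the variable names match the snippets below, but the real argument handling is more involved):
# Sketch only: choose the destination directory from the argument count.
if [[ $# -eq 1 ]]; then
    TARGET_DIR="doc/www/misc"   # temporary upload
else
    TARGET_DIR="doc/$2"         # permanent, topic-specific upload
fi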
Key Commands/Variables
Main Function - _upload():
Filename normalization:
FILENAME="$(echo "$1" | tr '[:upper:]' '[:lower:]' | sed -e 's/\.jpeg$/.jpg/')"
# Always lowercase, standardize .jpeg → .jpg
Extension validation:
ALLOWED_EXTENSIONS=$(find ~/wiki/ -type f -printf '%f\n' | awk -F. 'NF>1 {print $NF}' | sort -iu)
# Reject any extension never used before (prevents typos/malformed files)
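A hedged sketch of how the rejection could work (the EXT variable and error message are assumed, not taken from the script; red comes from the sourced bash.sh):
EXT="${FILENAME##*.}"
if ! grep -qxF "$EXT" <<< "$ALLOWED_EXTENSIONS"; then
    red "Unsupported file extension: $EXT"
    exit 3   # exit code 3: unsupported file extension (see below)
fi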
Global uniqueness check:
rename_file() {
    for ((i=2; i<=100; i++)); do
        new_filename="${base_name}-${i}.${extension}"
        # Loop until unique: 2023-liu.pdf → 2023-liu-2.pdf → 2023-liu-3.pdf ...
        [[ -e "$new_filename" ]] || return 0   # first unused name wins
    done
    return 1   # no unique name after 100 attempts → script exits with code 5
}
Format conversions:
- .md → .txt (avoid compiling random Markdown snippets)
- .jpeg → .jpg (enforce consistency)
- .webp → .png (avoid exotic formats)
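The .webp case presumably relies on the ImageMagick convert tool listed under Key Dependencies; a minimal sketch of the conversion dispatch (exact commands assumed):
case "$FILENAME" in
    *.md)   mv "$FILENAME" "${FILENAME%.md}.txt" ;;   # plain rename, no conversion needed
    *.webp) convert "$FILENAME" "${FILENAME%.webp}.png" && rm "$FILENAME" ;;
esac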
PDF handling:
compressPdf "$TARGET" || true # JBIG2 compression, OCR, PDF/A conversion
METADATA=$(crossref "$TARGET") && echo "$METADATA" & # Extract DOI metadata
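Note that the || true guard makes a failed compression non-fatal, and the trailing & runs the CrossRef lookup in the background so metadata extraction does not block the rest of the upload.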
Large file strategy:
SIZE_THRESHOLD=200000000 # 200MB
if [[ "$FILESIZE" -gt "$SIZE_THRESHOLD" ]]; then
    echo "$TARGET" >> ./.gitignore # Skip git versioning
    bold "Added large file to .gitignore (size: ...)"
fi
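FILESIZE is not shown being computed here; one plausible way, assuming GNU coreutils:
FILESIZE=$(stat -c%s "$TARGET")   # byte count (assumed; the script may obtain it differently)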
Image optimization:
exiftool -overwrite_original -All="" "$FILENAME" # Strip metadata from temp images
png2JPGQualityCheck ~/wiki/"$TARGET" # Suggest PNG→JPG if appropriate
Deployment:
rsync --chmod='a+r' --mkpath -q "$TARGET" gwern@176.9.41.242:"/home/gwern/gwern.net/$TARGET_DIR/"
cloudflare-expire "$TARGET_DIR/$(basename "$FILE")" > /dev/null # Clear CDN cache
"$BROWSER" "$URL" 2> /dev/null & # Open for verification
Usage
Temporary upload (no automatic expiration in script):
./upload.sh screenshot.png
# → https://gwern.net/doc/www/misc/2026-01-07-screenshot.png
# Automatically prefixed with today's date
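The prefix is presumably generated with something like the following (exact format string assumed):
FILENAME="$(date '+%F')-$FILENAME"   # e.g. screenshot.png → 2026-01-07-screenshot.png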
Permanent upload with topic directory:
./upload.sh 2023-smith.pdf statistics
# → /doc/statistics/2023-smith.pdf
# → https://gwern.net/doc/statistics/2023-smith.pdf
Automatic filename rewriting:
./upload.sh smith2023.pdf reinforcement-learning
# Automatically renamed: smith2023.pdf → 2023-smith.pdf
# → /doc/reinforcement-learning/2023-smith.pdf
Tag-based directory guessing:
./upload.sh paper.pdf "deep learning"
# Calls: ./static/build/guessTag "deep learning"
# → Might resolve to /doc/ai/ or /doc/neural-network/
Multiple file upload:
./upload.sh *.pdf statistics/decision
# Last argument (if not a file) = directory
# All files uploaded to that directory
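A sketch of how the last-argument-as-directory convention could be implemented (hypothetical; the script's actual loop may differ):
args=("$@")
dir=""
if [[ ! -f "${args[-1]}" ]]; then   # last argument is not a file → treat it as the directory
    dir="${args[-1]}"
    unset 'args[-1]'
fi
for f in "${args[@]}"; do
    _upload "$f" "$dir"              # _upload is the main function named above
done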
Large file handling:
./upload.sh large-dataset.pkl embeddings
# File size: 450MB
# → Uploaded to /doc/embeddings/
# → Added to .gitignore automatically
# → Not tracked in git history
Workflow example:
# Download a paper
wget https://arxiv.org/pdf/2301.12345.pdf -O transformer-paper.pdf
# Upload to appropriate topic directory
./upload.sh transformer-paper.pdf ai
# Script automatically:
# 1. Renames to match year-author.pdf pattern
# 2. Runs OCR if needed
# 3. Compresses with JBIG2
# 4. Extracts metadata via CrossRef
# 5. Deploys to server
# 6. Opens in browser for verification
Environment variables:
- BROWSER - Prefers Firefox if running, else Chromium/Chrome/Brave
Exit codes:
- 1 - File missing/empty
- 2 - Missing required dependencies
- 3 - Unsupported file extension
- 4 - Could not change to wiki directory (temp upload path)
- 5 - Failed to find unique filename after 100 attempts
- 6 - Rename failure after uniqueness attempt
- 8 - Target directory invalid and guess failed
- 9 - Could not change to wiki directory (permanent upload path)
- 10 - File already exists at exact target path
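These codes let callers distinguish failure modes, e.g.:
./upload.sh 2023-smith.pdf statistics || echo "upload failed with exit code $?"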
Key Dependencies
Required external tools:
- firefox/chromium - Browser for verification
- exiftool - Metadata stripping
- rsync - Server upload
- curl - URL validation
- git - Version control
- compressPdf - PDF optimization wrapper
- cloudflare-expire - CDN cache invalidation
- png2JPGQualityCheck - Image format recommendation
- convert (ImageMagick) - WebP conversion
- locate - Filename search
- crossref - DOI metadata extraction
- guessTag - Tag-to-directory mapper
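Exit code 2 above implies a startup dependency check; a minimal sketch (tool list abbreviated, message text assumed):
for tool in exiftool rsync curl git compressPdf cloudflare-expire crossref; do
    command -v "$tool" >/dev/null || { echo "missing dependency: $tool" >&2; exit 2; }
done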
Sourced utilities:
. ~/wiki/static/build/bash.sh # Common functions (red, bold, etc.)
See Also
- sync.sh - Build orchestrator that may trigger bulk uploads
- gwsed.sh - Site-wide string replacement for updating URLs
- download-title.sh - Title extraction for uploaded files
- embed.sh - Embedding generation for uploaded documents
- LinkMetadata.hs - Stores metadata for uploaded papers
- Annotation.hs - Annotation system that references uploaded files
- clean-pdf.py - PDF text cleaning for uploaded documents