Initial commit: HathiTrust package automation pipeline
Automated pipeline to process TIFF images into HathiTrust-compliant
submission packages (SIP/ZIP archives).

Features implemented (Steps 1-3):
- Per-package metadata collection via interactive prompts
- Volume discovery and organization from input directory
- OCR processing (plain text + hOCR coordinate data)

Components:
- collect_metadata.py: Interactive metadata collection for variable
  capture settings (DPI, color mode, compression, etc.)
- volume_discovery.py: Scans input directory, groups files by
  barcode/ARK identifier, validates sequential numbering
- ocr_processor.py: Processes TIFFs with Tesseract OCR, generates
  plain text (.txt) and hOCR coordinate data (.html)

Testing:
- Unit tests for volume discovery (7 tests)
- Unit tests for OCR processing
- Test data generators included

Configuration:
- config.yaml: Global settings (paths, patterns, OCR config)
- metadata_template.json: Per-package metadata structure
- requirements.txt: Python dependencies

Built for content digitized via CaptureOne Cultural Heritage Edition,
supporting variable capture settings per submission package.

Next steps: File validation, YAML generation, MD5 checksums, package
assembly, and ZIP creation.

schipp0 committed Sep 30, 2025 · 0 parents · commit 40ce797
Showing 11 changed files with 1,432 additions and 0 deletions.
91 changes: 91 additions & 0 deletions .gitignore
@@ -0,0 +1,91 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual Environments
venv/
env/
ENV/
env.bak/
venv.bak/
pyvenv.cfg
bin/
include/

# PyInstaller
*.manifest
*.spec

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.log
.pytest_cache/

# Project-specific working directories
input/
output/
temp/
logs/

# Per-package metadata files (these are generated per submission)
metadata_*.json

# IDE and Editor files
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# OS-specific
Thumbs.db
Desktop.ini

# Jupyter Notebooks
.ipynb_checkpoints

# PyCharm
.idea/

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Memory bank (optional - uncomment if you don't want to track memory)
# .memory-bank/
# External dependencies (clone separately)
HathiTrustYAMLgenerator/
41 changes: 41 additions & 0 deletions DEMO_step2.md
@@ -0,0 +1,41 @@
## Step 2: Directory Discovery - DEMO

### Create test files:
```bash
cd /home/schipp0/Digitization/HathiTrust

# Create 5 test TIFF files with barcode 39015012345678
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 5

# Create another volume with different barcode
python3 volume_discovery.py --create-test --barcode 39015099887766 --num-files 3
```

### Discover volumes:
```bash
python3 volume_discovery.py input/
```

Expected output:
```
============================================================
VOLUME DISCOVERY SUMMARY
============================================================
📦 Volume: 39015012345678
Files: 5
Range: 00000001 to 00000005
Status: ✓ Valid
📦 Volume: 39015099887766
Files: 3
Range: 00000001 to 00000003
Status: ✓ Valid
```

### Run tests:
```bash
python3 test_volume_discovery.py -v
```

All 7 tests should pass ✓
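
### How discovery works (illustrative sketch)
The discovery step groups files by their identifier prefix and checks that the 8-digit sequence numbers start at 1 with no gaps. The snippet below is a minimal sketch under that assumption; the function names are hypothetical and not necessarily those used in `volume_discovery.py`.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical sketch: group TIFFs by identifier and check sequence continuity.
FILENAME_RE = re.compile(r"^(?P<volume_id>.+)_(?P<seq>\d{8})\.tif$")

def discover_volumes(input_dir):
    volumes = defaultdict(list)
    for path in sorted(Path(input_dir).glob("*.tif")):
        match = FILENAME_RE.match(path.name)
        if match:
            volumes[match.group("volume_id")].append(int(match.group("seq")))
    return volumes

def is_sequential(sequence_numbers):
    # Valid only if numbering starts at 00000001 and has no gaps.
    ordered = sorted(sequence_numbers)
    return ordered == list(range(1, len(ordered) + 1))

if __name__ == "__main__":
    for volume_id, seqs in discover_volumes("input/").items():
        status = "Valid" if is_sequential(seqs) else "Gap or offset detected"
        print(f"{volume_id}: {len(seqs)} files, {status}")
```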
81 changes: 81 additions & 0 deletions DEMO_step3.md
@@ -0,0 +1,81 @@
## Step 3: OCR Processing Pipeline - DEMO

### Prerequisites
Ensure Tesseract is installed:
```bash
# Check if tesseract is installed
tesseract --version

# If not installed:
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### Test Setup

#### 1. Create test TIFF files (if not already done):
```bash
cd /home/schipp0/Digitization/HathiTrust
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 3
```

#### 2. Run OCR on all discovered volumes:
```bash
python3 ocr_processor.py input/
```

Expected output:
```
📂 Discovering volumes...
Found 1 volume(s)
============================================================
Processing Volume: 39015012345678
============================================================
Processing 3 files with OCR
[1/3] 39015012345678_00000001.tif
[2/3] 39015012345678_00000002.tif
[3/3] 39015012345678_00000003.tif
✓ OCR Results:
Successful: 3
Failed: 0
Output: temp/39015012345678
```

#### 3. Process specific volume only:
```bash
python3 ocr_processor.py input/ --volume-id 39015012345678
```

#### 4. Check output files:
```bash
ls -l temp/39015012345678/
```

Should show:
```
00000001.txt # Plain text OCR
00000001.html # hOCR coordinate data
00000002.txt
00000002.html
00000003.txt
00000003.html
```

### Run Tests
```bash
python3 test_ocr_processor.py -v
```

### Output Format

**Plain Text (.txt):**
- UTF-8 encoded
- Control characters removed (except tab, CR, LF)
- Raw text from Tesseract

**hOCR (.html):**
- XML/HTML format with coordinate data
- Contains bounding box information for each word
- Compatible with HathiTrust requirements
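
### How a page is processed (illustrative sketch)
Conceptually, each TIFF yields two outputs: sanitized plain text and an hOCR file with word-level bounding boxes. The sketch below shows one way to do this with `pytesseract` and Pillow; it is an assumption-laden illustration, not the actual code in `ocr_processor.py`.

```python
import re
from pathlib import Path

import pytesseract
from PIL import Image

# Hypothetical sketch of a single page's OCR pass.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keep tab, LF, CR

def ocr_page(tiff_path, out_dir, lang="eng"):
    image = Image.open(tiff_path)
    seq = Path(tiff_path).stem.split("_")[-1]             # e.g. "00000001"

    text = pytesseract.image_to_string(image, lang=lang)
    text = CONTROL_CHARS.sub("", text)                     # strip disallowed control characters
    (Path(out_dir) / f"{seq}.txt").write_text(text, encoding="utf-8")

    hocr = pytesseract.image_to_pdf_or_hocr(image, extension="hocr", lang=lang)
    (Path(out_dir) / f"{seq}.html").write_bytes(hocr)
```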
158 changes: 158 additions & 0 deletions README.md
@@ -0,0 +1,158 @@
# HathiTrust Package Automation Pipeline

## Project Structure
```
HathiTrust/
├── .memory-bank/ # Project memory storage
├── input/ # Source TIFF files (organized by barcode/ARK)
├── output/ # Final ZIP packages
├── temp/ # Intermediate processing files
├── logs/ # Processing logs
├── config.yaml # Global configuration
├── metadata_template.json # Template for package metadata
├── collect_metadata.py # Interactive metadata collection
├── requirements.txt # Python dependencies
└── README.md # This file
```

## Setup Instructions

### 1. Install System Dependencies
```bash
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### 2. Install Python Dependencies
```bash
pip install -r requirements.txt
```

### 3. Clone YAML Generator
```bash
cd /home/schipp0/Digitization/HathiTrust
git clone https://github.com/moriahcaruso/HathiTrustYAMLgenerator.git
```

## Workflow: Creating a Submission Package

### Step 1: Prepare TIFF Files
Place digitized TIFF files in `input/` directory:
- Files should follow the naming pattern: `<barcode>_00000001.tif`, `<barcode>_00000002.tif`, etc.
- Or: `<ark_id>_00000001.tif`, `<ark_id>_00000002.tif`, etc.

### Step 2: Collect Package Metadata
Run the interactive metadata collection tool:
```bash
./collect_metadata.py
```

This will prompt you for:
- **Volume identifier** (barcode or ARK)
- **Capture info** (date, operator, CaptureOne version)
- **Image specs** (DPI, color mode, compression)
- **Page order** (scanning/reading order)
- **Content type** (book, journal, manuscript, etc.)

Metadata is saved as: `metadata_<identifier>.json`
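
As an illustration, a saved metadata file might look like the sketch below. The field names here are assumptions chosen for readability; the authoritative structure lives in `metadata_template.json`.

```json
{
  "volume_identifier": "39015012345678",
  "capture": {
    "capture_date": "2025-09-30",
    "operator": "J. Smith",
    "captureone_version": "16.x Cultural Heritage Edition"
  },
  "image_specs": {
    "dpi": 400,
    "color_mode": "color",
    "compression": "none"
  },
  "page_order": "left-to-right",
  "content_type": "book"
}
```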

### Step 3: Process Package
(Main processing script to be implemented)
```bash
./process_package.py --metadata metadata_<identifier>.json
```

This will:
1. Validate TIFF files
2. Run OCR (text + hOCR coordinates)
3. Generate meta.yml
4. Create checksum.md5 (see the sketch after this list)
5. Package into ZIP
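
For instance, step 4 (fixity) could look like the following. This is a hypothetical sketch, since `process_package.py` is not yet implemented, and it assumes `checksum.md5` lists one `<md5>  <filename>` pair per package file.

```python
import hashlib
from pathlib import Path

def write_checksum_file(package_dir):
    """Hypothetical sketch: write checksum.md5 with one '<md5>  <filename>' line per file."""
    lines = []
    for path in sorted(Path(package_dir).iterdir()):
        if not path.is_file() or path.name == "checksum.md5":
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.name}")
    target = Path(package_dir) / "checksum.md5"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
```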

## Key Features

### Per-Package Metadata
Unlike scanner-based workflows with static settings, this pipeline supports **variable capture settings** per submission:
- Different DPI (300, 400, 600, etc.)
- Various color modes (bitonal, grayscale, color)
- Multiple compression types
- Flexible reading orders

### CaptureOne Integration
Designed for content digitized via **CaptureOne Cultural Heritage Edition**, not physical scanners.

### HathiTrust Compliance
Output packages meet all HathiTrust requirements:
- 8-digit sequential file naming
- Plain text OCR (.txt)
- Coordinate OCR (.html hOCR format)
- meta.yml metadata
- checksum.md5 fixity file
- Proper ZIP structure (no subdirectories)
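
A flat, no-subdirectory archive can be produced with the standard library alone; the sketch below is illustrative, not the pipeline's actual packaging code.

```python
import zipfile
from pathlib import Path

def build_zip(package_dir, zip_path):
    """Hypothetical sketch: add every package file at the archive root (no subdirectories)."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(package_dir).iterdir()):
            if path.is_file():
                zf.write(path, arcname=path.name)  # arcname keeps entries at the root

# Example: build_zip("temp/39015012345678", "output/39015012345678.zip")
```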

## Next Development Steps
- [ ] Implement main processing script
- [ ] Integrate with HathiTrustYAMLgenerator
- [ ] Add validation checks
- [ ] Test with sample packages
- [ ] Add batch processing support


## Implementation Status

### ✅ Step 1: Configuration & Setup
- Directory structure created
- Per-package metadata collection (`collect_metadata.py`)
- Configuration files (`config.yaml`, `metadata_template.json`)

### ✅ Step 2: Directory Discovery & Organization
- Volume discovery module (`volume_discovery.py`)
- Barcode and ARK identifier extraction
- Sequential file validation
- Test suite with 7 passing tests
- Test file generator for development

**Usage:**
```bash
# Discover volumes in input directory
python3 volume_discovery.py input/

# Create test files
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 5

# Run tests
python3 test_volume_discovery.py
```

### ✅ Step 3: OCR Processing Pipeline
- OCR processor module (`ocr_processor.py`)
- Plain text OCR generation (.txt files)
- Coordinate OCR generation (.html hOCR format)
- Text sanitization (control character removal)
- UTF-8 encoding enforcement
- Batch processing with error handling
- Test suite with Tesseract integration tests

**Usage:**
```bash
# Process all volumes with OCR
python3 ocr_processor.py input/

# Process specific volume
python3 ocr_processor.py input/ --volume-id 39015012345678

# Custom language/output
python3 ocr_processor.py input/ --language fra --output-dir /tmp/ocr

# Run tests
python3 test_ocr_processor.py
```

### 🔄 Next Steps
- Step 4: File Validation & Naming Convention
- Step 5: YAML Metadata Generation
- Step 6: MD5 Checksum Generation
- Step 7: Package Assembly
- Step 8: ZIP Archive Creation
- Step 9: Quality Control & Validation
- Step 10: Main Processing Pipeline