Initial commit: HathiTrust package automation pipeline
Automated pipeline to process TIFF images into HathiTrust-compliant
submission packages (SIP/ZIP archives).

Features implemented (Steps 1-3):
- Per-package metadata collection via interactive prompts
- Volume discovery and organization from input directory
- OCR processing (plain text + hOCR coordinate data)

Components:
- collect_metadata.py: Interactive metadata collection for variable
  capture settings (DPI, color mode, compression, etc.)
- volume_discovery.py: Scans input directory, groups files by
  barcode/ARK identifier, validates sequential numbering
- ocr_processor.py: Processes TIFFs with Tesseract OCR, generates
  plain text (.txt) and hOCR coordinate data (.html)

Testing:
- Unit tests for volume discovery (7 tests)
- Unit tests for OCR processing
- Test data generators included

Configuration:
- config.yaml: Global settings (paths, patterns, OCR config)
- metadata_template.json: Per-package metadata structure
- requirements.txt: Python dependencies

Built for content digitized via CaptureOne Cultural Heritage Edition,
supporting variable capture settings per submission package.

Next steps: File validation, YAML generation, MD5 checksums, package
assembly, and ZIP creation.

schipp0 committed Sep 30, 2025 · 0 parents · commit 40ce797
Showing 11 changed files with 1,432 additions and 0 deletions.
91 changes: 91 additions & 0 deletions .gitignore
@@ -0,0 +1,91 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual Environments
venv/
env/
ENV/
env.bak/
venv.bak/
pyvenv.cfg
bin/
include/

# PyInstaller
*.manifest
*.spec

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.log
.pytest_cache/

# Project-specific working directories
input/
output/
temp/
logs/

# Per-package metadata files (these are generated per submission)
metadata_*.json

# IDE and Editor files
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# OS-specific
Thumbs.db
Desktop.ini

# Jupyter Notebooks
.ipynb_checkpoints

# PyCharm
.idea/

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Memory bank (optional - uncomment if you don't want to track memory)
# .memory-bank/
# External dependencies (clone separately)
HathiTrustYAMLgenerator/
41 changes: 41 additions & 0 deletions DEMO_step2.md
@@ -0,0 +1,41 @@
## Step 2: Directory Discovery - DEMO

### Create test files:
```bash
cd /home/schipp0/Digitization/HathiTrust

# Create 5 test TIFF files with barcode 39015012345678
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 5

# Create another volume with different barcode
python3 volume_discovery.py --create-test --barcode 39015099887766 --num-files 3
```

### Discover volumes:
```bash
python3 volume_discovery.py input/
```

Expected output:
```
============================================================
VOLUME DISCOVERY SUMMARY
============================================================
📦 Volume: 39015012345678
Files: 5
Range: 00000001 to 00000005
Status: ✓ Valid
📦 Volume: 39015099887766
Files: 3
Range: 00000001 to 00000003
Status: ✓ Valid
```

### Run tests:
```bash
python3 test_volume_discovery.py -v
```

All 7 tests should pass ✓
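
### How discovery works (illustrative sketch)
The discovery step groups files by their identifier prefix and checks that the 8-digit sequence numbers start at 1 with no gaps. The snippet below is a minimal sketch under that assumption; the function names are hypothetical and not necessarily those used in `volume_discovery.py`.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical sketch: group TIFFs by identifier and check sequence continuity.
FILENAME_RE = re.compile(r"^(?P<volume_id>.+)_(?P<seq>\d{8})\.tif$")

def discover_volumes(input_dir):
    volumes = defaultdict(list)
    for path in sorted(Path(input_dir).glob("*.tif")):
        match = FILENAME_RE.match(path.name)
        if match:
            volumes[match.group("volume_id")].append(int(match.group("seq")))
    return volumes

def is_sequential(sequence_numbers):
    # Valid only if numbering starts at 00000001 and has no gaps.
    ordered = sorted(sequence_numbers)
    return ordered == list(range(1, len(ordered) + 1))

if __name__ == "__main__":
    for volume_id, seqs in discover_volumes("input/").items():
        status = "Valid" if is_sequential(seqs) else "Gap or offset detected"
        print(f"{volume_id}: {len(seqs)} files, {status}")
```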
81 changes: 81 additions & 0 deletions DEMO_step3.md
@@ -0,0 +1,81 @@
## Step 3: OCR Processing Pipeline - DEMO

### Prerequisites
Ensure Tesseract is installed:
```bash
# Check if tesseract is installed
tesseract --version

# If not installed:
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### Test Setup

#### 1. Create test TIFF files (if not already done):
```bash
cd /home/schipp0/Digitization/HathiTrust
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 3
```

#### 2. Run OCR on all discovered volumes:
```bash
python3 ocr_processor.py input/
```

Expected output:
```
📂 Discovering volumes...
Found 1 volume(s)
============================================================
Processing Volume: 39015012345678
============================================================
Processing 3 files with OCR
[1/3] 39015012345678_00000001.tif
[2/3] 39015012345678_00000002.tif
[3/3] 39015012345678_00000003.tif
✓ OCR Results:
Successful: 3
Failed: 0
Output: temp/39015012345678
```

#### 3. Process specific volume only:
```bash
python3 ocr_processor.py input/ --volume-id 39015012345678
```

#### 4. Check output files:
```bash
ls -l temp/39015012345678/
```

Should show:
```
00000001.txt # Plain text OCR
00000001.html # hOCR coordinate data
00000002.txt
00000002.html
00000003.txt
00000003.html
```

### Run Tests
```bash
python3 test_ocr_processor.py -v
```

### Output Format

**Plain Text (.txt):**
- UTF-8 encoded
- Control characters removed (except tab, CR, LF)
- Raw text from Tesseract

**hOCR (.html):**
- XML/HTML format with coordinate data
- Contains bounding box information for each word
- Compatible with HathiTrust requirements
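
### How a page is processed (illustrative sketch)
Conceptually, each TIFF yields two outputs: sanitized plain text and an hOCR file with word-level bounding boxes. The sketch below shows one way to do this with `pytesseract` and Pillow; it is an assumption-laden illustration, not the actual code in `ocr_processor.py`.

```python
import re
from pathlib import Path

import pytesseract
from PIL import Image

# Hypothetical sketch of a single page's OCR pass.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keep tab, LF, CR

def ocr_page(tiff_path, out_dir, lang="eng"):
    image = Image.open(tiff_path)
    seq = Path(tiff_path).stem.split("_")[-1]             # e.g. "00000001"

    text = pytesseract.image_to_string(image, lang=lang)
    text = CONTROL_CHARS.sub("", text)                     # strip disallowed control characters
    (Path(out_dir) / f"{seq}.txt").write_text(text, encoding="utf-8")

    hocr = pytesseract.image_to_pdf_or_hocr(image, extension="hocr", lang=lang)
    (Path(out_dir) / f"{seq}.html").write_bytes(hocr)
```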
158 changes: 158 additions & 0 deletions README.md
@@ -0,0 +1,158 @@
# HathiTrust Package Automation Pipeline

## Project Structure
```
HathiTrust/
├── .memory-bank/ # Project memory storage
├── input/ # Source TIFF files (organized by barcode/ARK)
├── output/ # Final ZIP packages
├── temp/ # Intermediate processing files
├── logs/ # Processing logs
├── config.yaml # Global configuration
├── metadata_template.json # Template for package metadata
├── collect_metadata.py # Interactive metadata collection
├── requirements.txt # Python dependencies
└── README.md # This file
```

## Setup Instructions

### 1. Install System Dependencies
```bash
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### 2. Install Python Dependencies
```bash
pip install -r requirements.txt
```

### 3. Clone YAML Generator
```bash
cd /home/schipp0/Digitization/HathiTrust
git clone https://github.com/moriahcaruso/HathiTrustYAMLgenerator.git
```

## Workflow: Creating a Submission Package

### Step 1: Prepare TIFF Files
Place digitized TIFF files in `input/` directory:
- Files should follow the naming pattern: `<barcode>_00000001.tif`, `<barcode>_00000002.tif`, etc.
- Or: `<ark_id>_00000001.tif`, `<ark_id>_00000002.tif`, etc.

### Step 2: Collect Package Metadata
Run the interactive metadata collection tool:
```bash
./collect_metadata.py
```

This will prompt you for:
- **Volume identifier** (barcode or ARK)
- **Capture info** (date, operator, CaptureOne version)
- **Image specs** (DPI, color mode, compression)
- **Page order** (scanning/reading order)
- **Content type** (book, journal, manuscript, etc.)

Metadata is saved as: `metadata_<identifier>.json`
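
As an illustration, a saved metadata file might look like the sketch below. The field names here are assumptions chosen for readability; the authoritative structure lives in `metadata_template.json`.

```json
{
  "volume_identifier": "39015012345678",
  "capture": {
    "capture_date": "2025-09-30",
    "operator": "J. Smith",
    "captureone_version": "16.x Cultural Heritage Edition"
  },
  "image_specs": {
    "dpi": 400,
    "color_mode": "color",
    "compression": "none"
  },
  "page_order": "left-to-right",
  "content_type": "book"
}
```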

### Step 3: Process Package
(Main processing script to be implemented)
```bash
./process_package.py --metadata metadata_<identifier>.json
```

This will:
1. Validate TIFF files
2. Run OCR (text + hOCR coordinates)
3. Generate meta.yml
4. Create checksum.md5 (see the sketch after this list)
5. Package into ZIP
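
For instance, step 4 (fixity) could look like the following. This is a hypothetical sketch, since `process_package.py` is not yet implemented, and it assumes `checksum.md5` lists one `<md5>  <filename>` pair per package file.

```python
import hashlib
from pathlib import Path

def write_checksum_file(package_dir):
    """Hypothetical sketch: write checksum.md5 with one '<md5>  <filename>' line per file."""
    lines = []
    for path in sorted(Path(package_dir).iterdir()):
        if not path.is_file() or path.name == "checksum.md5":
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.name}")
    target = Path(package_dir) / "checksum.md5"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
```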

## Key Features

### Per-Package Metadata
Unlike scanner-based workflows with static settings, this pipeline supports **variable capture settings** per submission:
- Different DPI (300, 400, 600, etc.)
- Various color modes (bitonal, grayscale, color)
- Multiple compression types
- Flexible reading orders

### CaptureOne Integration
Designed for content digitized via **CaptureOne Cultural Heritage Edition**, not physical scanners.

### HathiTrust Compliance
Output packages meet all HathiTrust requirements:
- 8-digit sequential file naming
- Plain text OCR (.txt)
- Coordinate OCR (.html hOCR format)
- meta.yml metadata
- checksum.md5 fixity file
- Proper ZIP structure (no subdirectories)
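
A flat, no-subdirectory archive can be produced with the standard library alone; the sketch below is illustrative, not the pipeline's actual packaging code.

```python
import zipfile
from pathlib import Path

def build_zip(package_dir, zip_path):
    """Hypothetical sketch: add every package file at the archive root (no subdirectories)."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(package_dir).iterdir()):
            if path.is_file():
                zf.write(path, arcname=path.name)  # arcname keeps entries at the root

# Example: build_zip("temp/39015012345678", "output/39015012345678.zip")
```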

## Next Development Steps
- [ ] Implement main processing script
- [ ] Integrate with HathiTrustYAMLgenerator
- [ ] Add validation checks
- [ ] Test with sample packages
- [ ] Add batch processing support


## Implementation Status

### ✅ Step 1: Configuration & Setup
- Directory structure created
- Per-package metadata collection (`collect_metadata.py`)
- Configuration files (`config.yaml`, `metadata_template.json`)

### ✅ Step 2: Directory Discovery & Organization
- Volume discovery module (`volume_discovery.py`)
- Barcode and ARK identifier extraction
- Sequential file validation
- Test suite with 7 passing tests
- Test file generator for development

**Usage:**
```bash
# Discover volumes in input directory
python3 volume_discovery.py input/

# Create test files
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 5

# Run tests
python3 test_volume_discovery.py
```

### ✅ Step 3: OCR Processing Pipeline
- OCR processor module (`ocr_processor.py`)
- Plain text OCR generation (.txt files)
- Coordinate OCR generation (.html hOCR format)
- Text sanitization (control character removal)
- UTF-8 encoding enforcement
- Batch processing with error handling
- Test suite with Tesseract integration tests

**Usage:**
```bash
# Process all volumes with OCR
python3 ocr_processor.py input/

# Process specific volume
python3 ocr_processor.py input/ --volume-id 39015012345678

# Custom language/output
python3 ocr_processor.py input/ --language fra --output-dir /tmp/ocr

# Run tests
python3 test_ocr_processor.py
```

### 🔄 Next Steps
- Step 4: File Validation & Naming Convention
- Step 5: YAML Metadata Generation
- Step 6: MD5 Checksum Generation
- Step 7: Package Assembly
- Step 8: ZIP Archive Creation
- Step 9: Quality Control & Validation
- Step 10: Main Processing Pipeline