Initial commit: HathiTrust package automation pipeline
Automated pipeline to process TIFF images into HathiTrust-compliant submission packages (SIP/ZIP archives).

Features implemented (Steps 1-3):
- Per-package metadata collection via interactive prompts
- Volume discovery and organization from input directory
- OCR processing (plain text + hOCR coordinate data)

Components:
- collect_metadata.py: Interactive metadata collection for variable capture settings (DPI, color mode, compression, etc.)
- volume_discovery.py: Scans input directory, groups files by barcode/ARK identifier, validates sequential numbering
- ocr_processor.py: Processes TIFFs with Tesseract OCR, generates plain text (.txt) and hOCR coordinate data (.html)

Testing:
- Unit tests for volume discovery (7 tests)
- Unit tests for OCR processing
- Test data generators included

Configuration:
- config.yaml: Global settings (paths, patterns, OCR config)
- metadata_template.json: Per-package metadata structure
- requirements.txt: Python dependencies

Built for content digitized via CaptureOne Cultural Heritage Edition, supporting variable capture settings per submission package.

Next steps: File validation, YAML generation, MD5 checksums, package assembly, and ZIP creation.
Commit 40ce797: 0 parents, 11 changed files, 1,432 additions, 0 deletions.
`.gitignore`:
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual Environments
venv/
env/
ENV/
env.bak/
venv.bak/
pyvenv.cfg
bin/
include/

# PyInstaller
*.manifest
*.spec

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.log
.pytest_cache/

# Project-specific working directories
input/
output/
temp/
logs/

# Per-package metadata files (these are generated per submission)
metadata_*.json

# IDE and Editor files
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# OS-specific
Thumbs.db
Desktop.ini

# Jupyter Notebooks
.ipynb_checkpoints

# PyCharm
.idea/

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Memory bank (optional - uncomment if you don't want to track memory)
# .memory-bank/
# External dependencies (clone separately)
HathiTrustYAMLgenerator/
## Step 2: Directory Discovery - DEMO

### Create test files:
```bash
cd /home/schipp0/Digitization/HathiTrust

# Create 5 test TIFF files with barcode 39015012345678
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 5

# Create another volume with different barcode
python3 volume_discovery.py --create-test --barcode 39015099887766 --num-files 3
```

### Discover volumes:
```bash
python3 volume_discovery.py input/
```

Expected output:
```
============================================================
VOLUME DISCOVERY SUMMARY
============================================================
📦 Volume: 39015012345678
   Files: 5
   Range: 00000001 to 00000005
   Status: ✓ Valid
📦 Volume: 39015099887766
   Files: 3
   Range: 00000001 to 00000003
   Status: ✓ Valid
```
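The grouping and sequence checks themselves live in `volume_discovery.py`, whose source is not reproduced on this page. As a rough sketch of that logic, assuming the `<identifier>_<8-digit sequence>.tif` naming convention described in the README (function names here are illustrative, not the module's actual API):

```python
# Sketch only: illustrates the barcode grouping and sequential-numbering check
# assumed to underlie volume_discovery.py; not the actual module code.
import re
from collections import defaultdict
from pathlib import Path

# Assumed filename pattern: <identifier>_<8-digit sequence>.tif
TIFF_PATTERN = re.compile(r"^(?P<volume_id>.+)_(?P<seq>\d{8})\.tif$")

def discover_volumes(input_dir: str) -> dict:
    """Group TIFF files by volume identifier and collect their sequence numbers."""
    volumes = defaultdict(list)
    for tiff in sorted(Path(input_dir).glob("*.tif")):
        match = TIFF_PATTERN.match(tiff.name)
        if match:
            volumes[match["volume_id"]].append(int(match["seq"]))
    return volumes

def is_sequential(seqs: list) -> bool:
    """A valid volume numbers its pages 1..N with no gaps or duplicates."""
    return sorted(seqs) == list(range(1, len(seqs) + 1))

if __name__ == "__main__":
    for volume_id, seqs in discover_volumes("input/").items():
        status = "Valid" if is_sequential(seqs) else "Gap or duplicate detected"
        print(f"{volume_id}: {len(seqs)} files, {status}")
```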
### Run tests:
```bash
python3 test_volume_discovery.py -v
```

All 7 tests should pass ✓
## Step 3: OCR Processing Pipeline - DEMO

### Prerequisites
Ensure Tesseract is installed:
```bash
# Check if tesseract is installed
tesseract --version

# If not installed:
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### Test Setup

#### 1. Create test TIFF files (if not already done):
```bash
cd /home/schipp0/Digitization/HathiTrust
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 3
```

#### 2. Run OCR on all discovered volumes:
```bash
python3 ocr_processor.py input/
```

Expected output:
```
📂 Discovering volumes...
Found 1 volume(s)
============================================================
Processing Volume: 39015012345678
============================================================
Processing 3 files with OCR
  [1/3] 39015012345678_00000001.tif
  [2/3] 39015012345678_00000002.tif
  [3/3] 39015012345678_00000003.tif
✓ OCR Results:
  Successful: 3
  Failed: 0
  Output: temp/39015012345678
```

#### 3. Process specific volume only:
```bash
python3 ocr_processor.py input/ --volume-id 39015012345678
```

#### 4. Check output files:
```bash
ls -l temp/39015012345678/
```

Should show:
```
00000001.txt    # Plain text OCR
00000001.html   # hOCR coordinate data
00000002.txt
00000002.html
00000003.txt
00000003.html
```

### Run Tests
```bash
python3 test_ocr_processor.py -v
```

### Output Format

**Plain Text (.txt):**
- UTF-8 encoded
- Control characters removed (except tab, CR, LF)
- Raw text from Tesseract

**hOCR (.html):**
- XML/HTML format with coordinate data
- Contains bounding box information for each word
- Compatible with HathiTrust requirements
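`ocr_processor.py` itself is not shown on this page. A minimal sketch of how a single page could produce the `.txt`/`.html` pair described above, assuming Tesseract is driven through the `pytesseract` package and that sanitization means exactly the control-character rule listed (function and path handling are illustrative, not the module's actual code):

```python
# Minimal sketch (assumed approach, not the actual ocr_processor.py code):
# OCR one TIFF into a UTF-8 .txt file plus an hOCR .html file via pytesseract.
import re
from pathlib import Path

import pytesseract
from PIL import Image

# Strip control characters except tab (\x09), LF (\x0A), and CR (\x0D).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")

def ocr_page(tiff_path: str, out_dir: str, language: str = "eng") -> None:
    image = Image.open(tiff_path)
    seq = Path(tiff_path).stem.split("_")[-1]   # e.g. "00000001"
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Plain text OCR, sanitized and written as UTF-8.
    text = pytesseract.image_to_string(image, lang=language)
    (out / f"{seq}.txt").write_text(CONTROL_CHARS.sub("", text), encoding="utf-8")

    # hOCR output: HTML/XML with per-word bounding boxes.
    hocr = pytesseract.image_to_pdf_or_hocr(image, lang=language, extension="hocr")
    (out / f"{seq}.html").write_bytes(hocr)

# Example: ocr_page("input/39015012345678_00000001.tif", "temp/39015012345678")
```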
# HathiTrust Package Automation Pipeline

## Project Structure
```
HathiTrust/
├── .memory-bank/            # Project memory storage
├── input/                   # Source TIFF files (organized by barcode/ARK)
├── output/                  # Final ZIP packages
├── temp/                    # Intermediate processing files
├── logs/                    # Processing logs
├── config.yaml              # Global configuration
├── metadata_template.json   # Template for package metadata
├── collect_metadata.py      # Interactive metadata collection
├── requirements.txt         # Python dependencies
└── README.md                # This file
```

## Setup Instructions

### 1. Install System Dependencies
```bash
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
```

### 2. Install Python Dependencies
```bash
pip install -r requirements.txt
```

### 3. Clone the YAML Generator
```bash
cd /home/schipp0/Digitization/HathiTrust
git clone https://github.com/moriahcaruso/HathiTrustYAMLgenerator.git
```

## Workflow: Creating a Submission Package

### Step 1: Prepare TIFF Files
Place digitized TIFF files in the `input/` directory:
- Files should follow the naming pattern `<barcode>_00000001.tif`, `<barcode>_00000002.tif`, etc.
- Or: `<ark_id>_00000001.tif`, `<ark_id>_00000002.tif`, etc.

### Step 2: Collect Package Metadata
Run the interactive metadata collection tool:
```bash
./collect_metadata.py
```

This will prompt you for:
- **Volume identifier** (barcode or ARK)
- **Capture info** (date, operator, CaptureOne version)
- **Image specs** (DPI, color mode, compression)
- **Page order** (scanning/reading order)
- **Content type** (book, journal, manuscript, etc.)

Metadata is saved as `metadata_<identifier>.json`.
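The real schema lives in `metadata_template.json`, which is not shown on this page, so the field names below are purely hypothetical; they simply mirror the prompt categories above to show roughly what a saved file might contain:

```python
# Hypothetical example only: field names mirror the prompt categories above,
# not the actual metadata_template.json schema (which is not shown here).
import json

metadata = {
    "volume_identifier": "39015012345678",          # barcode or ARK
    "capture": {
        "date": "2025-01-15",                       # illustrative values
        "operator": "jdoe",
        "captureone_version": "Cultural Heritage Edition 16",
    },
    "image_specs": {"dpi": 400, "color_mode": "grayscale", "compression": "none"},
    "page_order": {"scanning_order": "left-to-right", "reading_order": "left-to-right"},
    "content_type": "book",
}

with open(f"metadata_{metadata['volume_identifier']}.json", "w", encoding="utf-8") as fh:
    json.dump(metadata, fh, indent=2)
```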
### Step 3: Process Package
(Main processing script to be implemented)
```bash
./process_package.py --metadata metadata_<identifier>.json
```

This will:
1. Validate TIFF files
2. Run OCR (text + hOCR coordinates)
3. Generate meta.yml
4. Create checksum.md5 (see the sketch below)
5. Package into ZIP
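The processing script above is still to be implemented. As an illustration of step 4, here is a minimal sketch that writes an md5sum-style `checksum.md5` with one `<hash>  <filename>` line per package file; the exact format is an assumption, not the eventual implementation:

```python
# Sketch of checksum.md5 generation (step 4); assumes md5sum-style
# "<hash>  <filename>" lines, one per file in the package directory.
import hashlib
from pathlib import Path

def write_checksums(package_dir: str) -> None:
    package = Path(package_dir)
    lines = []
    for path in sorted(package.iterdir()):
        if not path.is_file() or path.name == "checksum.md5":
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.name}")
    (package / "checksum.md5").write_text("\n".join(lines) + "\n", encoding="utf-8")

# Example: write_checksums("temp/39015012345678")
```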
## Key Features

### Per-Package Metadata
Unlike scanner-based workflows with static settings, this pipeline supports **variable capture settings** per submission:
- Different DPI (300, 400, 600, etc.)
- Various color modes (bitonal, grayscale, color)
- Multiple compression types
- Flexible reading orders

### CaptureOne Integration
Designed for content digitized via **CaptureOne Cultural Heritage Edition**, not physical scanners.

### HathiTrust Compliance
Output packages meet all HathiTrust requirements:
- 8-digit sequential file naming
- Plain text OCR (.txt)
- Coordinate OCR (.html hOCR format)
- meta.yml metadata
- checksum.md5 fixity file
- Proper ZIP structure with no subdirectories (see the sketch below)
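Package assembly and ZIP creation are future steps; the sketch below only illustrates the flat-archive requirement (every file at the ZIP root, no subdirectories) using the standard library, and is not the eventual packaging code:

```python
# Sketch of a flat ZIP archive (all files at the archive root, no subdirectories),
# matching the compliance list above; not the eventual packaging implementation.
import zipfile
from pathlib import Path

def create_sip_zip(package_dir: str, zip_path: str) -> None:
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(package_dir).iterdir()):
            if path.is_file():
                # arcname drops the directory prefix so the archive stays flat.
                archive.write(path, arcname=path.name)

# Example: create_sip_zip("temp/39015012345678", "output/39015012345678.zip")
```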
## Next Development Steps
- [ ] Implement main processing script
- [ ] Integrate with HathiTrustYAMLgenerator
- [ ] Add validation checks
- [ ] Test with sample packages
- [ ] Add batch processing support

## Implementation Status

### ✅ Step 1: Configuration & Setup
- Directory structure created
- Per-package metadata collection (`collect_metadata.py`)
- Configuration files (`config.yaml`, `metadata_template.json`)

### ✅ Step 2: Directory Discovery & Organization
- Volume discovery module (`volume_discovery.py`)
- Barcode and ARK identifier extraction
- Sequential file validation
- Test suite with 7 passing tests
- Test file generator for development

**Usage:**
```bash
# Discover volumes in input directory
python3 volume_discovery.py input/

# Create test files
python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 5

# Run tests
python3 test_volume_discovery.py
```

### ✅ Step 3: OCR Processing Pipeline
- OCR processor module (`ocr_processor.py`)
- Plain text OCR generation (.txt files)
- Coordinate OCR generation (.html hOCR format)
- Text sanitization (control character removal)
- UTF-8 encoding enforcement
- Batch processing with error handling
- Test suite with Tesseract integration tests

**Usage:**
```bash
# Process all volumes with OCR
python3 ocr_processor.py input/

# Process specific volume
python3 ocr_processor.py input/ --volume-id 39015012345678

# Custom language/output
python3 ocr_processor.py input/ --language fra --output-dir /tmp/ocr

# Run tests
python3 test_ocr_processor.py
```

### 🔄 Next Steps
- Step 4: File Validation & Naming Convention
- Step 5: YAML Metadata Generation
- Step 6: MD5 Checksum Generation
- Step 7: Package Assembly
- Step 8: ZIP Archive Creation
- Step 9: Quality Control & Validation
- Step 10: Main Processing Pipeline