From b9209a56c1d157cc0ce2089e540e81a8048bba5f Mon Sep 17 00:00:00 2001
From: schipp0
Date: Tue, 30 Sep 2025 18:05:52 +0000
Subject: [PATCH] Remove DEMO documentation files from repo and add to .gitignore

---
 .gitignore    |   3 ++
 DEMO_step2.md |  41 ---------------
 DEMO_step3.md |  81 ----------------------------
 DEMO_step4.md |  96 ---------------------------------
 DEMO_step6.md | 143 --------------------------------------------------
 5 files changed, 3 insertions(+), 361 deletions(-)
 delete mode 100644 DEMO_step2.md
 delete mode 100644 DEMO_step3.md
 delete mode 100644 DEMO_step4.md
 delete mode 100644 DEMO_step6.md

diff --git a/.gitignore b/.gitignore
index 4d63574..6287725 100644
--- a/.gitignore
+++ b/.gitignore
@@ -89,3 +89,6 @@ dmypy.json
 # .memory-bank/
 # External dependencies (clone separately)
 HathiTrustYAMLgenerator/
+
+# Demo and documentation files (not for public repo)
+DEMO_*.md
diff --git a/DEMO_step2.md b/DEMO_step2.md
deleted file mode 100644
index c18c83e..0000000
--- a/DEMO_step2.md
+++ /dev/null
@@ -1,41 +0,0 @@
-## Step 2: Directory Discovery - DEMO
-
-### Create test files:
-```bash
-cd /home/schipp0/Digitization/HathiTrust
-
-# Create 5 test TIFF files with barcode 39015012345678
-python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 5
-
-# Create another volume with different barcode
-python3 volume_discovery.py --create-test --barcode 39015099887766 --num-files 3
-```
-
-### Discover volumes:
-```bash
-python3 volume_discovery.py input/
-```
-
-Expected output:
-```
-============================================================
-VOLUME DISCOVERY SUMMARY
-============================================================
-
-📦 Volume: 39015012345678
-   Files: 5
-   Range: 00000001 to 00000005
-   Status: ✓ Valid
-
-📦 Volume: 39015099887766
-   Files: 3
-   Range: 00000001 to 00000003
-   Status: ✓ Valid
-```
-
-### Run tests:
-```bash
-python3 test_volume_discovery.py -v
-```
-
-All 7 tests should pass ✓
diff --git a/DEMO_step3.md b/DEMO_step3.md
deleted file mode 100644
index 0c986db..0000000
--- a/DEMO_step3.md
+++ /dev/null
@@ -1,81 +0,0 @@
-## Step 3: OCR Processing Pipeline - DEMO
-
-### Prerequisites
-Ensure Tesseract is installed:
-```bash
-# Check if tesseract is installed
-tesseract --version
-
-# If not installed:
-sudo apt-get update
-sudo apt-get install tesseract-ocr tesseract-ocr-eng
-```
-
-### Test Setup
-
-#### 1. Create test TIFF files (if not already done):
-```bash
-cd /home/schipp0/Digitization/HathiTrust
-python3 volume_discovery.py --create-test --barcode 39015012345678 --num-files 3
-```
-
-#### 2. Run OCR on all discovered volumes:
-```bash
-python3 ocr_processor.py input/
-```
-
-Expected output:
-```
-📂 Discovering volumes...
-Found 1 volume(s)
-
-============================================================
-Processing Volume: 39015012345678
-============================================================
-Processing 3 files with OCR
-  [1/3] 39015012345678_00000001.tif
-  [2/3] 39015012345678_00000002.tif
-  [3/3] 39015012345678_00000003.tif
-
-✓ OCR Results:
-   Successful: 3
-   Failed: 0
-   Output: temp/39015012345678
-```
-
-#### 3. Process specific volume only:
-```bash
-python3 ocr_processor.py input/ --volume-id 39015012345678
-```
-
-#### 4. Check output files:
-```bash
-ls -l temp/39015012345678/
-```
-
-Should show:
-```
-00000001.txt # Plain text OCR
-00000001.html # hOCR coordinate data
-00000002.txt
-00000002.html
-00000003.txt
-00000003.html
-```
-
-### Run Tests
-```bash
-python3 test_ocr_processor.py -v
-```
-
-### Output Format
-
-**Plain Text (.txt):**
-- UTF-8 encoded
-- Control characters removed (except tab, CR, LF)
-- Raw text from Tesseract
-
-**hOCR (.html):**
-- XML/HTML format with coordinate data
-- Contains bounding box information for each word
-- Compatible with HathiTrust requirements
diff --git a/DEMO_step4.md b/DEMO_step4.md
deleted file mode 100644
index 7376d01..0000000
--- a/DEMO_step4.md
+++ /dev/null
@@ -1,96 +0,0 @@
-## Step 4: File Validation & Naming Convention - DEMO
-
-### Purpose
-Ensures all files follow HathiTrust's strict 8-digit sequential naming convention:
-- Format: `00000001.tif`, `00000001.txt`, `00000001.html`
-- Sequential: No gaps allowed (1, 2, 3... not 1, 2, 4)
-- Zero-padded: Always 8 digits
-
-### Test the Validator
-
-#### 1. Verify properly named files:
-```bash
-cd /home/schipp0/Digitization/HathiTrust
-
-# Check if files are properly named (no changes)
-python3 file_validator.py temp/39015012345678 --verify-only
-```
-
-Expected output:
-```
-✓ All files are properly named and sequential
-```
-
-#### 2. Validate and standardize files (dry run):
-```bash
-# See what would be renamed without actually renaming
-python3 file_validator.py input/ --extension tif --dry-run
-```
-
-#### 3. Actually rename files to standard format:
-```bash
-# Rename files to match HathiTrust convention
-python3 file_validator.py input/ --extension tif
-```
-
-Expected output:
-```
-============================================================
-VALIDATION SUMMARY
-============================================================
-Total files: 3
-Valid: 3
-Renamed: 3
-Errors: 0
-
-✓ All files validated successfully
-```
-
-### Programmatic Usage
-
-```python
-from pathlib import Path
-from file_validator import FileValidator
-
-# Initialize validator
-validator = FileValidator(dry_run=False)
-
-# Validate a list of files
-files = sorted(Path("input").glob("*.tif"))
-results = validator.validate_file_list(files, start_sequence=1)
-
-print(f"Valid: {results['valid']}/{results['total']}")
-print(f"Renamed: {results['renamed']}")
-
-# Verify sequential naming
-is_valid, error = FileValidator.verify_sequential_naming(files)
-if not is_valid:
-    print(f"Error: {error}")
-
-# Verify matching triplets (TIFF + TXT + HTML)
-tiff_files = sorted(Path("package").glob("*.tif"))
-txt_files = sorted(Path("package").glob("*.txt"))
-html_files = sorted(Path("package").glob("*.html"))
-
-is_valid, error = FileValidator.verify_matching_triplets(
-    tiff_files, txt_files, html_files
-)
-if not is_valid:
-    print(f"Triplet mismatch: {error}")
-```
-
-### Run Tests
-```bash
-python3 test_file_validator.py -v
-```
-
-All 8 tests should pass ✓
-
-### Key Features
-- ✅ Validates 8-digit zero-padded format
-- ✅ Detects gaps in sequences
-- ✅ Renames files to standard format
-- ✅ Dry-run mode for safe testing
-- ✅ Verifies TIFF/TXT/HTML triplet matching
-- ✅ Handles case-insensitive extensions
-- ✅ Detailed error reporting
diff --git a/DEMO_step6.md b/DEMO_step6.md
deleted file mode 100644
index ffc95c9..0000000
--- a/DEMO_step6.md
+++ /dev/null
@@ -1,143 +0,0 @@
-# Step 6: MD5 Checksum Generation - DEMO
-
-## Overview
-This step implements MD5 checksum generation and verification for HathiTrust package validation.
-
-## Key Components
-
-### ChecksumGenerator Class
-Located in `checksum_generator.py`, provides:
-- `compute_md5(file_path)` - Calculate MD5 hash for individual files
-- `generate_checksums(package_directory)` - Create checksum.md5 for all package files
-- `verify_checksums(checksum_file)` - Validate checksums against actual files
-
-### HathiTrust Compliance
-- **Format**: ` ` (two spaces between hash and filename)
-- **Exclusion**: checksum.md5 does not include itself
-- **Sorting**: Files listed in alphabetical order
-- **Coverage**: All package files (TIFF, TXT, HTML, meta.yml)
-
-## Usage Example
-
-### Generate Checksums
-```python
-from checksum_generator import ChecksumGenerator
-
-generator = ChecksumGenerator()
-result = generator.generate_checksums('/path/to/package')
-
-print(f"Generated checksums for {result['file_count']} files")
-print(f"Checksum file: {result['checksum_file']}")
-```
-
-### Verify Checksums
-```python
-verify_result = generator.verify_checksums('/path/to/package/checksum.md5')
-
-print(f"Valid: {len(verify_result['valid'])}")
-print(f"Invalid: {len(verify_result['invalid'])}")
-print(f"Missing: {len(verify_result['missing'])}")
-```
-
-## Test Results
-✅ **14 tests passed** (0.05s)
-
-### Test Coverage
-1. ✅ Basic MD5 computation
-2. ✅ MD5 consistency (same file → same hash)
-3. ✅ Error handling (missing files)
-4. ✅ Checksum.md5 file generation
-5. ✅ File format compliance (hash filename)
-6. ✅ Self-exclusion (checksum.md5 not in itself)
-7. ✅ Sorted order verification
-8. ✅ Validation of valid checksums
-9. ✅ Detection of modified files
-10. ✅ Detection of missing files
-11. ✅ Empty directory error handling
-12. ✅ Nonexistent directory error handling
-13. ✅ Convenience function
-14. ✅ Binary file (TIFF) checksums
-
-
-## Sample checksum.md5 File
-
-```
-00000001.html a3c1f5e9d4b2c8f7e6d5a4b3c2d1e0f9
-00000001.tif b2d3e4f5c6a7b8c9d0e1f2a3b4c5d6e7
-00000001.txt c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9
-00000002.html d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0
-00000002.tif e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1
-00000002.txt f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2
-meta.yml a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3
-```
-
-## Technical Implementation
-
-### MD5 Computation
-- **Chunk size**: 8KB for memory efficiency
-- **Encoding**: Works with both binary (TIFF) and text files
-- **Output**: Lowercase hexadecimal (32 characters)
-
-### Error Handling
-- `FileNotFoundError` - File doesn't exist
-- `IOError` - File cannot be read
-- `NotADirectoryError` - Invalid package directory
-- `ValueError` - No files found in directory
-
-### Verification Features
-- Detects modified files (checksum mismatch)
-- Identifies missing files (in checksum.md5 but not found)
-- Confirms valid files (checksums match)
-- Returns detailed results for reporting
-
-## Integration with Pipeline
-
-### Position in Workflow
-```
-Step 5: YAML Generation → Step 6: Checksum Generation → Step 7: Package Assembly
-```
-
-### When to Generate Checksums
-- **After** all package files are finalized (TIFF, TXT, HTML, meta.yml)
-- **Before** creating ZIP archive
-- **Last step** before packaging to ensure file integrity
-
-### Checksum Verification Use Cases
-1. **Pre-transfer**: Verify package integrity before upload
-2. **Post-transfer**: Validate files after network transfer
-3. **Archive validation**: Periodic checks on stored packages
-4. **Error recovery**: Identify corrupted files in batch processing
-
-## Next Steps
-
-### Step 7: Package Assembly
-Create `package_assembler.py` to:
-- Organize all files into flat directory structure
-- Copy/move TIFF, TXT, HTML, meta.yml into package directory
-- Validate file naming conventions
-- Prepare for ZIP creation
-
-### Integration Points
-```python
-# Step 7 will use checksum_generator like this:
-from checksum_generator import generate_package_checksums
-
-# After assembling package files...
-checksum_file = generate_package_checksums(package_dir)
-print(f"Package ready for ZIP: {checksum_file}")
-```
-
-## Dependencies Updated
-Added to `requirements.txt`:
-```
-pytest>=8.0.0 # Testing framework
-```
-
-## Files Created
-- `checksum_generator.py` - Main implementation (131 lines)
-- `test_checksum_generator.py` - Test suite (149 lines)
-- `DEMO_step6.md` - Documentation (this file)
-
----
-
-**Status**: ✅ Step 6 Complete | 14/14 Tests Passing | Ready for Step 7
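
Note on the removed Step 6 notes: `checksum_generator.py` itself is not part of this patch, so the following is only a minimal sketch of the behaviour those notes describe (8 KB chunked reads, lowercase hex output, alphabetical order, and a checksum file that does not list itself). The function `write_checksum_file` and its `name` parameter are hypothetical and are not taken from the repository. The line layout assumes hash first, then two spaces, then filename, as the "Format" bullet states; the sample file in the removed notes shows the opposite order, so the real implementation may differ.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 8 * 1024  # 8 KB, the chunk size mentioned in the removed notes


def compute_md5(file_path: Path) -> str:
    """Return the lowercase hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF)
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(package_dir: Path, name: str = "checksum.md5") -> Path:
    """Hypothetical helper: write one '<hash>  <filename>' line per package file."""
    if not package_dir.is_dir():
        raise NotADirectoryError(str(package_dir))
    # Sort by filename and leave the checksum file itself out of the listing
    files = sorted(
        (p for p in package_dir.iterdir() if p.is_file() and p.name != name),
        key=lambda p: p.name,
    )
    if not files:
        raise ValueError(f"No files found in {package_dir}")
    lines = [f"{compute_md5(p)}  {p.name}" for p in files]
    out_path = package_dir / name
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```

Reading in binary mode is what lets the same routine cover both TIFF and text files, and excluding the output file by name means the function can be re-run on the same directory without the listing growing.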