Add Step 4: File Validation & Naming Convention

Implements HathiTrust's 8-digit sequential naming standard and file validation to ensure compliance before package assembly.

New components:
- file_validator.py: Core validation and standardization module
  * FileValidator class with dry-run support
  * format_sequence_number(): Converts to 8-digit zero-padded format
  * validate_single_file(): Validates and renames individual files
  * validate_file_list(): Batch validation with statistics
  * verify_sequential_naming(): Detects gaps in sequences
  * verify_matching_triplets(): Ensures TIFF/TXT/HTML sets match
- test_file_validator.py: Comprehensive test suite (8 tests)
  * Tests formatting, extraction, validation, gap detection
  * Tests triplet matching for complete file sets
  * All tests passing
- DEMO_step4.md: Usage examples and documentation

Features:
- Enforces 8-digit zero-padded sequential naming (00000001.tif)
- Detects and reports gaps in file sequences
- Automatic file renaming to HathiTrust standard
- Dry-run mode for safe preview before changes
- Verify-only mode for validation without modifications
- Case-insensitive extension handling
- Detailed error reporting with FileValidationResult dataclass

CLI usage:
  python3 file_validator.py <directory> [--extension tif] [--dry-run] [--verify-only]

Updated README.md with Step 4 documentation.

Progress: Steps 1-4 complete (40% of pipeline)
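
The commit message describes zero-padded naming and gap detection, but the body of file_validator.py is not shown here. The following is only a minimal sketch of how those two checks could work. The names `pad_sequence` and `find_sequence_gap` are hypothetical stand-ins, not the functions in file_validator.py, and the sketch assumes sequences start at 1 and that the `(is_valid, error)` return shape matches the one used in the demo.

```python
import re
from pathlib import Path
from typing import Iterable, Optional, Tuple


def pad_sequence(number: int) -> str:
    """Return the 8-digit zero-padded form of a sequence number (hypothetical helper)."""
    if not 1 <= number <= 99_999_999:
        raise ValueError(f"sequence number out of range: {number}")
    return f"{number:08d}"


def find_sequence_gap(
    files: Iterable[Path], start: int = 1
) -> Tuple[bool, Optional[str]]:
    """Check that file stems form an unbroken 8-digit sequence (hypothetical helper).

    Returns (True, None) when the sequence is complete, otherwise
    (False, message) describing the first problem found.
    """
    numbers = []
    for path in files:
        stem = Path(path).stem
        # Assumption: a valid stem is exactly eight ASCII digits.
        if not re.fullmatch(r"[0-9]{8}", stem):
            return False, f"{Path(path).name} is not an 8-digit name"
        numbers.append(int(stem))

    # Compare the sorted numbers against the expected run start, start+1, ...
    for expected, actual in enumerate(sorted(numbers), start=start):
        if actual != expected:
            return False, (
                f"expected {pad_sequence(expected)}, found {pad_sequence(actual)}"
            )
    return True, None


ok, error = find_sequence_gap(
    [Path("00000001.tif"), Path("00000002.tif"), Path("00000004.tif")]
)
print(ok, error)
```

Sorting before comparing means a duplicate number is reported the same way as a gap, since either one breaks the expected run.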
schipp0 · Sep 30, 2025 · 9f0cf76
1 parent 40ce797

Showing 4 changed files with 646 additions and 1 deletion.
diff --git a/DEMO_step4.md b/DEMO_step4.md
@@ -0,0 +1,96 @@
+## Step 4: File Validation & Naming Convention - DEMO
+
+### Purpose
+Ensures all files follow HathiTrust's strict 8-digit sequential naming convention:
+- Format: `00000001.tif`, `00000001.txt`, `00000001.html`
+- Sequential: No gaps allowed (1, 2, 3... not 1, 2, 4)
+- Zero-padded: Always 8 digits
+
+### Test the Validator
+
+#### 1. Verify properly named files:
+```bash
+cd /home/schipp0/Digitization/HathiTrust
+
+# Check if files are properly named (no changes)
+python3 file_validator.py temp/39015012345678 --verify-only
+```
+
+Expected output:
+```
+✓ All files are properly named and sequential
+```
+
+#### 2. Validate and standardize files (dry run):
+```bash
+# See what would be renamed without actually renaming
+python3 file_validator.py input/ --extension tif --dry-run
+```
+
+#### 3. Actually rename files to standard format:
+```bash
+# Rename files to match HathiTrust convention
+python3 file_validator.py input/ --extension tif
+```
+
+Expected output:
+```
+============================================================
+VALIDATION SUMMARY
+============================================================
+Total files: 3
+Valid: 3
+Renamed: 3
+Errors: 0
+
+✓ All files validated successfully
+```
+
+### Programmatic Usage
+
+```python
+from pathlib import Path
+from file_validator import FileValidator
+
+# Initialize validator
+validator = FileValidator(dry_run=False)
+
+# Validate a list of files
+files = sorted(Path("input").glob("*.tif"))
+results = validator.validate_file_list(files, start_sequence=1)
+
+print(f"Valid: {results['valid']}/{results['total']}")
+print(f"Renamed: {results['renamed']}")
+
+# Verify sequential naming (re-glob, since renaming can leave the paths in `files` stale)
+is_valid, error = FileValidator.verify_sequential_naming(sorted(Path("input").glob("*.tif")))
+if not is_valid:
+    print(f"Error: {error}")
+
+# Verify matching triplets (TIFF + TXT + HTML)
+tiff_files = sorted(Path("package").glob("*.tif"))
+txt_files = sorted(Path("package").glob("*.txt"))
+html_files = sorted(Path("package").glob("*.html"))
+
+is_valid, error = FileValidator.verify_matching_triplets(
+    tiff_files, txt_files, html_files
+)
+if not is_valid:
+    print(f"Triplet mismatch: {error}")
+```
+
+### Run Tests
+```bash
+python3 test_file_validator.py -v
+```
+
+All 8 tests should pass ✓
+
+### Key Features
+- ✅ Validates 8-digit zero-padded format
+- ✅ Detects gaps in sequences
+- ✅ Renames files to standard format
+- ✅ Dry-run mode for safe testing
+- ✅ Verifies TIFF/TXT/HTML triplet matching
+- ✅ Handles case-insensitive extensions
+- ✅ Detailed error reporting
diff --git a/README.md b/README.md
@@ -148,8 +148,31 @@ python3 ocr_processor.py input/ --language fra --output-dir /tmp/ocr
 python3 test_ocr_processor.py
 ```
 
+### ✅ Step 4: File Validation & Naming Convention
+- File validator module (`file_validator.py`)
+- 8-digit zero-padded sequential naming enforcement
+- Gap detection in sequences
+- Automatic file renaming to HathiTrust standard
+- TIFF/TXT/HTML triplet verification
+- Dry-run mode for safe testing
+- Test suite with 8 passing tests
+
+**Usage:**
+```bash
+# Verify files are properly named
+python3 file_validator.py temp/39015012345678 --verify-only
+
+# Preview validation and renaming (dry run, no files changed)
+python3 file_validator.py input/ --extension tif --dry-run
+
+# Actually rename files
+python3 file_validator.py input/ --extension tif
+
+# Run tests
+python3 test_file_validator.py
+```
+
 ### 🔄 Next Steps
-- Step 4: File Validation & Naming Convention
 - Step 5: YAML Metadata Generation
 - Step 6: MD5 Checksum Generation
 - Step 7: Package Assembly
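
Step 4 lists TIFF/TXT/HTML triplet verification, but its implementation is not part of the files shown in this diff. One way such a check could work is sketched below. `check_triplets` is a hypothetical name, not the `verify_matching_triplets()` method from file_validator.py, and the sketch assumes that files belong together when their stems match and that the check returns the same `(is_valid, error)` pair as in the demo.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def check_triplets(
    tiff_files: Iterable[Path],
    txt_files: Iterable[Path],
    html_files: Iterable[Path],
) -> Tuple[bool, Optional[str]]:
    """Report stems that are missing from any of the three file sets (hypothetical helper)."""
    stems = {
        "TIFF": {Path(p).stem for p in tiff_files},
        "TXT": {Path(p).stem for p in txt_files},
        "HTML": {Path(p).stem for p in html_files},
    }
    # Every stem seen in any set should appear in all three.
    all_stems = set().union(*stems.values())

    problems = []
    for label, present in stems.items():
        missing = sorted(all_stems - present)
        if missing:
            problems.append(f"{label} missing: {', '.join(missing)}")

    if problems:
        return False, "; ".join(problems)
    return True, None


ok, error = check_triplets(
    [Path("00000001.tif"), Path("00000002.tif")],
    [Path("00000001.txt")],
    [Path("00000001.html"), Path("00000002.html")],
)
print(ok, error)
```

Comparing sets of stems rather than list lengths identifies which page lacks a companion file, not just that the counts differ.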