Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add Step 4: File Validation & Naming Convention
Implements HathiTrust's 8-digit sequential naming standard and file validation to ensure compliance before package assembly. New components: - file_validator.py: Core validation and standardization module * FileValidator class with dry-run support * format_sequence_number(): Converts to 8-digit zero-padded format * validate_single_file(): Validates and renames individual files * validate_file_list(): Batch validation with statistics * verify_sequential_naming(): Detects gaps in sequences * verify_matching_triplets(): Ensures TIFF/TXT/HTML sets match - test_file_validator.py: Comprehensive test suite (8 tests) * Tests formatting, extraction, validation, gap detection * Tests triplet matching for complete file sets * All tests passing - DEMO_step4.md: Usage examples and documentation Features: - Enforces 8-digit zero-padded sequential naming (00000001.tif) - Detects and reports gaps in file sequences - Automatic file renaming to HathiTrust standard - Dry-run mode for safe preview before changes - Verify-only mode for validation without modifications - Case-insensitive extension handling - Detailed error reporting with FileValidationResult dataclass CLI usage: python3 file_validator.py <directory> [--extension tif] [--dry-run] [--verify-only] Updated README.md with Step 4 documentation. Progress: Steps 1-4 complete (40% of pipeline)
- Loading branch information