Add Step 4: File Validation & Naming Convention
Implements HathiTrust's 8-digit sequential naming standard and file validation to ensure compliance before package assembly.

New components:
- file_validator.py: Core validation and standardization module
  * FileValidator class with dry-run support
  * format_sequence_number(): Converts to 8-digit zero-padded format
  * validate_single_file(): Validates and renames individual files
  * validate_file_list(): Batch validation with statistics
  * verify_sequential_naming(): Detects gaps in sequences
  * verify_matching_triplets(): Ensures TIFF/TXT/HTML sets match
- test_file_validator.py: Comprehensive test suite (8 tests)
  * Tests formatting, extraction, validation, gap detection
  * Tests triplet matching for complete file sets
  * All tests passing
- DEMO_step4.md: Usage examples and documentation

Features:
- Enforces 8-digit zero-padded sequential naming (00000001.tif)
- Detects and reports gaps in file sequences
- Automatic file renaming to HathiTrust standard
- Dry-run mode for safe preview before changes
- Verify-only mode for validation without modifications
- Case-insensitive extension handling
- Detailed error reporting with FileValidationResult dataclass

CLI usage:
  python3 file_validator.py <directory> [--extension tif] [--dry-run] [--verify-only]

Updated README.md with Step 4 documentation.

Progress: Steps 1-4 complete (40% of pipeline)
Showing 4 changed files with 646 additions and 1 deletion.
@@ -0,0 +1,96 @@
## Step 4: File Validation & Naming Convention - DEMO

### Purpose
Ensures all files follow HathiTrust's strict 8-digit sequential naming convention:
- Format: `00000001.tif`, `00000001.txt`, `00000001.html`
- Sequential: No gaps allowed (1, 2, 3... not 1, 2, 4)
- Zero-padded: Always 8 digits

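
The naming rules above can be sketched in a few lines. This is a minimal illustration, not the code in `file_validator.py`: the commit names `format_sequence_number()`, but its signature is not shown here, so the version below is a stand-in, and `find_gaps` is a hypothetical helper. It also assumes sequences start at 1, as in the example above.

```python
import re
from typing import List

# A valid stem is exactly eight ASCII digits, e.g. "00000001".
_STEM_RE = re.compile(r"^[0-9]{8}$")


def format_sequence_number(n: int) -> str:
    """Return n as an 8-digit, zero-padded string (stand-in implementation)."""
    if not 1 <= n <= 99_999_999:
        raise ValueError(f"sequence number out of range: {n}")
    return f"{n:08d}"


def find_gaps(stems: List[str]) -> List[int]:
    """Return the sequence numbers missing between 1 and the highest stem.

    Hypothetical helper: stems that are not 8 digits are ignored here.
    """
    numbers = {int(s) for s in stems if _STEM_RE.match(s)}
    if not numbers:
        return []
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]


print(format_sequence_number(1))                         # 00000001
print(find_gaps(["00000001", "00000002", "00000004"]))   # [3]
```
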
### Test the Validator

#### 1. Verify properly named files:
```bash
cd /home/schipp0/Digitization/HathiTrust

# Check if files are properly named (no changes)
python3 file_validator.py temp/39015012345678 --verify-only
```

Expected output:
```
✓ All files are properly named and sequential
```

#### 2. Validate and standardize files (dry run):
```bash
# See what would be renamed without actually renaming
python3 file_validator.py input/ --extension tif --dry-run
```

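
One way a dry run can work is to compute the full rename plan first and only touch the filesystem when asked to. The sketch below illustrates that idea under stated assumptions: `plan_renames` and `apply_renames` are hypothetical names, not functions from `file_validator.py`, and files are assumed to be numbered in sorted filename order.

```python
from pathlib import Path
from typing import List, Tuple


def plan_renames(files: List[Path], start_sequence: int = 1) -> List[Tuple[Path, Path]]:
    """Pair each file with its 8-digit target name, in sorted order.

    Files that already carry the right name are left out of the plan.
    """
    plan = []
    for offset, src in enumerate(sorted(files)):
        target = src.with_name(f"{start_sequence + offset:08d}{src.suffix.lower()}")
        if target != src:
            plan.append((src, target))
    return plan


def apply_renames(plan: List[Tuple[Path, Path]], dry_run: bool = True) -> None:
    """Print the plan in dry-run mode; otherwise rename, refusing to overwrite."""
    for src, dst in plan:
        if dry_run:
            print(f"[dry run] {src.name} -> {dst.name}")
            continue
        if dst.exists():
            raise FileExistsError(f"target already exists: {dst}")
        src.rename(dst)


# Preview only: nothing on disk is touched.
apply_renames(plan_renames([Path("input/page_b.tif"), Path("input/page_a.TIF")]))
```

This sketch does not resolve collisions where one file's target name is another file's current name; a fuller version would need a two-pass rename through temporary names.
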
#### 3. Actually rename files to standard format:
```bash
# Rename files to match HathiTrust convention
python3 file_validator.py input/ --extension tif
```

Expected output:
```
============================================================
VALIDATION SUMMARY
============================================================
Total files: 3
Valid: 3
Renamed: 3
Errors: 0
✓ All files validated successfully
```

### Programmatic Usage

```python
from pathlib import Path
from file_validator import FileValidator

# Initialize validator (dry_run=False means files are actually renamed)
validator = FileValidator(dry_run=False)

# Validate a list of files
files = sorted(Path("input").glob("*.tif"))
results = validator.validate_file_list(files, start_sequence=1)

print(f"Valid: {results['valid']}/{results['total']}")
print(f"Renamed: {results['renamed']}")

# Verify sequential naming. Re-list the directory first: the paths
# collected above may be stale once files have been renamed.
files = sorted(Path("input").glob("*.tif"))
is_valid, error = FileValidator.verify_sequential_naming(files)
if not is_valid:
    print(f"Error: {error}")

# Verify matching triplets (TIFF + TXT + HTML)
tiff_files = sorted(Path("package").glob("*.tif"))
txt_files = sorted(Path("package").glob("*.txt"))
html_files = sorted(Path("package").glob("*.html"))

is_valid, error = FileValidator.verify_matching_triplets(
    tiff_files, txt_files, html_files
)
if not is_valid:
    print(f"Triplet mismatch: {error}")
```

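
To make the triplet check concrete, here is a minimal sketch of how matching TIFF/TXT/HTML sets could be compared by filename stem. `check_triplets` is a hypothetical function written for illustration; it mirrors the `(is_valid, error)` return shape used above but is not the body of `verify_matching_triplets()`.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def check_triplets(
    tiffs: Iterable[Path], txts: Iterable[Path], htmls: Iterable[Path]
) -> Tuple[bool, Optional[str]]:
    """Check that every page stem appears in all three file sets."""
    stems = {
        "tif": {p.stem for p in tiffs},
        "txt": {p.stem for p in txts},
        "html": {p.stem for p in htmls},
    }
    # Every stem seen in any set is expected in all of them.
    expected = stems["tif"] | stems["txt"] | stems["html"]

    problems = []
    for ext, found in stems.items():
        missing = sorted(expected - found)
        if missing:
            problems.append(f"missing .{ext} for: {', '.join(missing)}")

    if problems:
        return False, "; ".join(problems)
    return True, None


ok, error = check_triplets(
    [Path("00000001.tif"), Path("00000002.tif")],
    [Path("00000001.txt")],
    [Path("00000001.html"), Path("00000002.html")],
)
print(ok, error)  # False missing .txt for: 00000002
```
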
### Run Tests
```bash
python3 test_file_validator.py -v
```

All 8 tests should pass ✓

### Key Features
- ✅ Validates 8-digit zero-padded format
- ✅ Detects gaps in sequences
- ✅ Renames files to standard format
- ✅ Dry-run mode for safe testing
- ✅ Verifies TIFF/TXT/HTML triplet matching
- ✅ Handles case-insensitive extensions
- ✅ Detailed error reporting
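
As an illustration of the case-insensitive extension handling listed above, the sketch below collects files whose suffix matches regardless of case (a plain `glob("*.tif")` would miss `.TIF` on case-sensitive filesystems). `list_by_extension` is a hypothetical helper and only assumes the behaviour described in this document, not how `file_validator.py` actually does it.

```python
import tempfile
from pathlib import Path
from typing import List


def list_by_extension(directory: Path, extension: str) -> List[Path]:
    """Return files in `directory` whose suffix matches `extension`, ignoring case."""
    # Accept both "tif" and ".tif" as input.
    wanted = "." + extension.lower().lstrip(".")
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == wanted
    )


# Demonstrate against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.TIF", "b.tif", "c.txt"):
        (root / name).touch()
    print([p.name for p in list_by_extension(root, "tif")])  # ['a.TIF', 'b.tif']
```
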