Add Step 4: File Validation & Naming Convention
Implements HathiTrust's 8-digit sequential naming standard and file
validation to ensure compliance before package assembly.

New components:
- file_validator.py: Core validation and standardization module
  * FileValidator class with dry-run support
  * format_sequence_number(): Converts to 8-digit zero-padded format
  * validate_single_file(): Validates and renames individual files
  * validate_file_list(): Batch validation with statistics
  * verify_sequential_naming(): Detects gaps in sequences
  * verify_matching_triplets(): Ensures TIFF/TXT/HTML sets match

- test_file_validator.py: Comprehensive test suite (8 tests)
  * Tests formatting, extraction, validation, gap detection
  * Tests triplet matching for complete file sets
  * All tests passing

- DEMO_step4.md: Usage examples and documentation

Features:
- Enforces 8-digit zero-padded sequential naming (00000001.tif)
- Detects and reports gaps in file sequences
- Automatic file renaming to HathiTrust standard
- Dry-run mode for safe preview before changes
- Verify-only mode for validation without modifications
- Case-insensitive extension handling
- Detailed error reporting with FileValidationResult dataclass

CLI usage:
  python3 file_validator.py <directory> [--extension tif] [--dry-run] [--verify-only]

Updated README.md with Step 4 documentation.

Progress: Steps 1-4 complete (40% of pipeline)
schipp0 committed Sep 30, 2025
1 parent 40ce797 commit 9f0cf76
Showing 4 changed files with 646 additions and 1 deletion.
96 changes: 96 additions & 0 deletions DEMO_step4.md
@@ -0,0 +1,96 @@
## Step 4: File Validation & Naming Convention - DEMO

### Purpose
Ensures all files follow HathiTrust's strict 8-digit sequential naming convention:
- Format: `00000001.tif`, `00000001.txt`, `00000001.html`
- Sequential: No gaps allowed (1, 2, 3... not 1, 2, 4)
- Zero-padded: Always 8 digits
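
As a rough illustration of the naming rule, the zero-padding can be sketched with Python's format specification. This is not the actual `format_sequence_number()` implementation, whose signature is not shown here; the helper name `to_hathitrust_name` and its range check are assumptions made for the example.

```python
def to_hathitrust_name(sequence: int, extension: str = "tif") -> str:
    """Return an 8-digit zero-padded filename (hypothetical helper)."""
    if not 1 <= sequence <= 99_999_999:
        raise ValueError(f"sequence out of range for 8 digits: {sequence}")
    # Normalize the extension: drop a leading dot and lowercase it.
    ext = extension.lstrip(".").lower()
    return f"{sequence:08d}.{ext}"


print(to_hathitrust_name(1))           # 00000001.tif
print(to_hathitrust_name(42, ".TXT"))  # 00000042.txt
```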

### Test the Validator

#### 1. Verify properly named files:
```bash
cd /home/schipp0/Digitization/HathiTrust

# Check if files are properly named (no changes)
python3 file_validator.py temp/39015012345678 --verify-only
```

Expected output:
```
✓ All files are properly named and sequential
```

#### 2. Validate and standardize files (dry run):
```bash
# See what would be renamed without actually renaming
python3 file_validator.py input/ --extension tif --dry-run
```
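
The idea behind a dry run can be sketched as two separate steps: compute the rename plan, then apply it only when asked. The sketch below is an assumption about how such a mode might work, not the code in `file_validator.py`; `plan_renames` and `apply_renames` are hypothetical names.

```python
from pathlib import Path


def plan_renames(directory: Path, extension: str = "tif") -> list[tuple[Path, Path]]:
    """Pair each matching file (sorted by name) with its 8-digit target name."""
    ext = extension.lower()
    files = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == f".{ext}"  # case-insensitive match
    )
    return [
        (src, src.with_name(f"{seq:08d}.{ext}"))
        for seq, src in enumerate(files, start=1)
    ]


def apply_renames(plan: list[tuple[Path, Path]], dry_run: bool = True) -> int:
    """Print each planned rename; touch the filesystem only when dry_run is False."""
    renamed = 0
    for src, dst in plan:
        if src == dst:
            continue  # already correctly named
        print(f"{'[dry-run] ' if dry_run else ''}{src.name} -> {dst.name}")
        if not dry_run:
            if dst.exists():
                raise FileExistsError(f"refusing to overwrite {dst}")
            src.rename(dst)
            renamed += 1
    return renamed
```

Calling `apply_renames(plan_renames(Path("input")))` only prints the plan; passing `dry_run=False` performs the renames. Keeping the plan separate from its execution means the preview and the real run follow the same logic.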

#### 3. Actually rename files to standard format:
```bash
# Rename files to match HathiTrust convention
python3 file_validator.py input/ --extension tif
```

Expected output:
```
============================================================
VALIDATION SUMMARY
============================================================
Total files: 3
Valid: 3
Renamed: 3
Errors: 0
✓ All files validated successfully
```

### Programmatic Usage

```python
from pathlib import Path
from file_validator import FileValidator

# Initialize validator
validator = FileValidator(dry_run=False)

# Validate a list of files
files = sorted(Path("input").glob("*.tif"))
results = validator.validate_file_list(files, start_sequence=1)

print(f"Valid: {results['valid']}/{results['total']}")
print(f"Renamed: {results['renamed']}")

# Verify sequential naming (re-glob first: the files may have been renamed above)
files = sorted(Path("input").glob("*.tif"))
is_valid, error = FileValidator.verify_sequential_naming(files)
if not is_valid:
    print(f"Error: {error}")

# Verify matching triplets (TIFF + TXT + HTML)
tiff_files = sorted(Path("package").glob("*.tif"))
txt_files = sorted(Path("package").glob("*.txt"))
html_files = sorted(Path("package").glob("*.html"))

is_valid, error = FileValidator.verify_matching_triplets(
    tiff_files, txt_files, html_files
)
if not is_valid:
    print(f"Triplet mismatch: {error}")
```
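
For readers curious how gap detection can work, here is a minimal sketch, assuming that sequences start at 1 and that only 8-digit numeric stems count. It is not the logic of `verify_sequential_naming()` itself; `find_sequence_gaps` is a hypothetical helper.

```python
from pathlib import Path


def find_sequence_gaps(paths: list[Path]) -> list[int]:
    """Return sequence numbers missing between 1 and the highest number found."""
    numbers = set()
    for p in paths:
        stem = p.stem
        # Accept only 8-character ASCII-digit stems such as "00000001".
        if len(stem) == 8 and stem.isascii() and stem.isdigit():
            numbers.add(int(stem))
    if not numbers:
        return []
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]


files = [Path("00000001.tif"), Path("00000002.tif"), Path("00000004.tif")]
print(find_sequence_gaps(files))  # [3]
```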

### Run Tests
```bash
python3 test_file_validator.py -v
```

All 8 tests should pass ✓

### Key Features
- ✅ Validates 8-digit zero-padded format
- ✅ Detects gaps in sequences
- ✅ Renames files to standard format
- ✅ Dry-run mode for safe testing
- ✅ Verifies TIFF/TXT/HTML triplet matching
- ✅ Handles case-insensitive extensions
- ✅ Detailed error reporting
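
The triplet check in the list above can be pictured as a comparison of filename stems across the three file kinds. The following is a sketch under that assumption, not the behavior of `verify_matching_triplets()`; `compare_triplets` is a hypothetical name.

```python
from pathlib import Path


def compare_triplets(
    tiff: list[Path], txt: list[Path], html: list[Path]
) -> dict[str, set[str]]:
    """Map each file kind to the stems found elsewhere but missing from it."""
    stems = {
        "tif": {p.stem for p in tiff},
        "txt": {p.stem for p in txt},
        "html": {p.stem for p in html},
    }
    all_stems = set().union(*stems.values())
    # Keep only the kinds that are actually missing something.
    return {
        kind: all_stems - found
        for kind, found in stems.items()
        if all_stems - found
    }


missing = compare_triplets(
    [Path("00000001.tif"), Path("00000002.tif")],
    [Path("00000001.txt")],
    [Path("00000001.html"), Path("00000002.html")],
)
print(missing)  # {'txt': {'00000002'}}
```

An empty dictionary means every page has its TIFF, TXT, and HTML file.
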
25 changes: 24 additions & 1 deletion README.md
@@ -148,8 +148,31 @@ python3 ocr_processor.py input/ --language fra --output-dir /tmp/ocr
python3 test_ocr_processor.py
```

### ✅ Step 4: File Validation & Naming Convention
- File validator module (`file_validator.py`)
- 8-digit zero-padded sequential naming enforcement
- Gap detection in sequences
- Automatic file renaming to HathiTrust standard
- TIFF/TXT/HTML triplet verification
- Dry-run mode for safe testing
- Test suite with 8 passing tests

**Usage:**
```bash
# Verify files are properly named
python3 file_validator.py temp/39015012345678 --verify-only

# Validate and rename files (dry-run)
python3 file_validator.py input/ --extension tif --dry-run

# Actually rename files
python3 file_validator.py input/ --extension tif

# Run tests
python3 test_file_validator.py
```

### 🔄 Next Steps
- Step 5: YAML Metadata Generation
- Step 6: MD5 Checksum Generation
- Step 7: Package Assembly