ICC Book Fetcher

Overview

The ICC book fetcher now verifies downloads more thoroughly than the previous shallow check. It reports the download status of each book in detail and detects partially downloaded or corrupted books.

Problem Solved

Previous Behavior

The original isBookAlreadyDownloaded() method only checked that two metadata files existed:

  • api/icc/content/info/{documentId}.json (or .raw.json)
  • api/icc/content/chapters/{documentId}.json (or .raw.json)

This was insufficient because:

  1. It did not verify that every expected chapter XML file was present
  2. It did not detect corrupted or truncated files
  3. As a result, a book whose chapters had partly failed to download was still reported as downloaded

New Behavior

The enhanced system now performs a comprehensive check that:

  1. Verifies metadata files exist (same as before)
  2. Parses the chapters metadata to determine expected chapter content IDs
  3. Checks each expected chapter file for existence
  4. Validates XML integrity by checking for well-formed structure
  5. Provides detailed reporting on missing, corrupted, and valid chapters
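
Steps 3 to 5 can be sketched roughly as follows. This is a hypothetical illustration, not the project's code: the class ChapterCheckSketch, its Result holder, and the readChapter and looksWellFormed parameters are assumed names, and file access is abstracted behind a function so the sketch runs without a real file system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of the chapter classification step; not the project's actual code. */
public class ChapterCheckSketch {

    /** Minimal result holder; the real BookDownloadStatus carries more fields. */
    public static final class Result {
        public final List<String> missing = new ArrayList<>();
        public final List<String> corrupted = new ArrayList<>();
        public final List<String> valid = new ArrayList<>();

        /** Complete only when at least one chapter is valid and none is missing or corrupted. */
        public boolean fullyDownloaded() {
            return !valid.isEmpty() && missing.isEmpty() && corrupted.isEmpty();
        }
    }

    /**
     * @param expectedIds     chapter content IDs taken from the chapters metadata
     * @param readChapter     returns a chapter's content, or null if the file is absent (assumed contract)
     * @param looksWellFormed integrity check applied to each chapter that exists
     */
    public static Result classify(List<String> expectedIds,
                                  Function<String, String> readChapter,
                                  Predicate<String> looksWellFormed) {
        Result result = new Result();
        for (String id : expectedIds) {
            String content = readChapter.apply(id);
            if (content == null) {
                result.missing.add(id);        // step 3: file does not exist
            } else if (!looksWellFormed.test(content)) {
                result.corrupted.add(id);      // step 4: file exists but fails the integrity check
            } else {
                result.valid.add(id);
            }
        }
        return result;                         // step 5: the three lists feed the report
    }
}
```

Passing the reader and the validator as parameters keeps the classification logic independent of where the files live, which is presumably the role fileSystemHandler plays in the real client.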

New Features

1. Thorough Download Status Checking

// New method for detailed status checking
BookDownloadStatus status = IccBookClient.checkBookDownloadStatus(documentId, fileSystemHandler);

// Legacy method still works but now uses the thorough check
boolean isDownloaded = IccBookClient.isBookAlreadyDownloaded(documentId, fileSystemHandler);

2. BookDownloadStatus Class

The new BookDownloadStatus class provides detailed information:

public class BookDownloadStatus {
    private String documentId;
    private boolean metadataFilesPresent;
    private boolean fullyDownloaded;
    private int expectedChapterCount;
    private int missingChapterCount;
    private int corruptedChapterCount;
    private int validChapterCount;
    private List<String> missingChapterIds;
    private List<String> corruptedChapterIds;
    private List<String> validChapterIds;
    private List<String> issues;
}
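
As an illustration of how these fields could map to a one-line report, here is a minimal sketch. The class StatusLineSketch and its summarize method are hypothetical, and the rule that missing metadata files mean "Not downloaded" is an assumption rather than confirmed behavior.

```java
/** Hypothetical sketch: renders one report line from status counts. */
public class StatusLineSketch {
    // Escaped forms of the report symbols: check mark, warning sign, ballot X.
    private static final String OK = "\u2713";
    private static final String WARN = "\u26A0";
    private static final String FAIL = "\u2717";

    public static String summarize(String documentId, boolean metadataFilesPresent,
                                   int expected, int valid, int missing, int corrupted) {
        String prefix = "Book " + documentId + ": ";
        if (!metadataFilesPresent) {
            // Assumption: without metadata there is nothing to compare chapters against.
            return prefix + FAIL + " Not downloaded";
        }
        if (expected > 0 && valid == expected && missing == 0 && corrupted == 0) {
            return prefix + OK + " Fully downloaded (" + expected + " chapters)";
        }
        return prefix + WARN + " Partially downloaded (" + valid + "/" + expected
                + " chapters valid, " + missing + " missing, " + corrupted + " corrupted)";
    }
}
```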

3. Enhanced CLI Output

The command-line interface now provides detailed status reports:

cli/codeproof.sh icc-book-fetcher --search-result-file search-results.json

# Example output:
=== Download Status Report ===
Book 2217: ✓ Fully downloaded (100 chapters)
Book 3757: ⚠ Partially downloaded (95/100 chapters valid, 3 missing, 2 corrupted)
Missing chapters: 35712407, 35712408, 35712409
Corrupted chapters: 35712410, 35712411
Book 3100: ✗ Not downloaded

✓ Skipped 1 fully downloaded book(s): 2217
⚠ Will re-download 1 partially downloaded book(s): 3757
Will download 2 book(s) (new or to fix partial downloads)
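
The last three lines of that report suggest a simple partition of books by status. A minimal sketch of such a partition, assuming a three-state classification; the DownloadPlanSketch class, the State enum, and the Plan holder are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: decides which books to skip and which to fetch. */
public class DownloadPlanSketch {
    public enum State { FULLY_DOWNLOADED, PARTIALLY_DOWNLOADED, NOT_DOWNLOADED }

    public static final class Plan {
        public final List<String> skipped = new ArrayList<>();
        public final List<String> redownload = new ArrayList<>();
        public final List<String> toDownload = new ArrayList<>();
    }

    public static Plan plan(Map<String, State> states) {
        Plan plan = new Plan();
        for (Map.Entry<String, State> entry : states.entrySet()) {
            String id = entry.getKey();
            switch (entry.getValue()) {
                case FULLY_DOWNLOADED:
                    plan.skipped.add(id);
                    break;
                case PARTIALLY_DOWNLOADED:
                    // Partial books are reported separately but also count toward the download total.
                    plan.redownload.add(id);
                    plan.toDownload.add(id);
                    break;
                default:
                    plan.toDownload.add(id);
            }
        }
        return plan;
    }
}
```

With the three books from the report, this yields one skipped book, one re-download, and two books to download in total.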

4. Status-Only Mode

The new --status-only option checks status without downloading anything:

cli/codeproof.sh icc-book-fetcher --search-result-file search-results.json --status-only

5. Example Outputs

Status Check Output

$ cli/codeproof.sh icc-book-fetcher 2217 3757 3100 --status-only

Checking download status for all books...

=== Download Status Report ===
Book 2217: ✓ Fully downloaded (100 chapters)
Book 3757: ⚠ Partially downloaded (95/100 chapters valid, 3 missing, 2 corrupted)
Missing chapters: 35712407, 35712408, 35712409
Corrupted chapters: 35712410, 35712411
Book 3100: ✗ Not downloaded

Status check completed. Use --status-only to check status without downloading.

Download with Status Report

$ cli/codeproof.sh icc-book-fetcher 2217 3757 3100

Checking download status for all books...

=== Download Status Report ===
Book 2217: ✓ Fully downloaded (100 chapters)
Book 3757: ⚠ Partially downloaded (95/100 chapters valid, 3 missing, 2 corrupted)
Missing chapters: 35712407, 35712408, 35712409
Corrupted chapters: 35712410, 35712411
Book 3100: ✗ Not downloaded

Starting ICC book fetch for 3 book(s) with pause range: 3000-5000 ms

✓ Skipped 1 fully downloaded book(s): 2217
⚠ Will re-download 1 partially downloaded book(s): 3757
Will download 2 book(s) (new or to fix partial downloads)

[1/2] Fetching book ID: 3757
Fetching chapter 1 of 100: Chapter 1: Scope and Administration
Pausing for 4.2 seconds before next chapter fetch
...
✓ Successfully fetched book ID: 3757

[2/2] Fetching book ID: 3100
...
✓ Successfully fetched book ID: 3100

Completed fetching 3 ICC book(s)

XML Validation

The system includes basic XML well-formedness checking that verifies:

  1. File structure: Files should start with < and end with >
  2. Root elements: The root element should be <section> or <html>
  3. Tag balance: Opening and closing tags should be reasonably balanced
  4. Proper endings: Files should not end abruptly with incomplete tags

Validation Examples

<!-- ✅ Well-formed -->
<section><div><p>Content</p></div></section>

<!-- ❌ Malformed (missing closing p tag) -->
<section><div><p>Content</div></section>

<!-- ❌ Malformed (ends abruptly) -->
<section><div><p>Content</p></div>
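
A lightweight check along these lines could look like the sketch below. It is an assumption-laden illustration, not the fetcher's implementation: ChapterXmlCheckSketch, the tag regex, and the list of void HTML elements are all hypothetical, and the strict stack-based matching used here is tighter than the "reasonably balanced" heuristic described above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a lightweight well-formedness check without a full XML parser. */
public class ChapterXmlCheckSketch {
    // Matches opening, closing, and self-closing tags; comments and declarations are skipped.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9:_.-]*)[^<>]*?(/?)>");
    // Assumed subset of HTML void elements that never take a closing tag.
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    public static boolean looksWellFormed(String content) {
        if (content == null) {
            return false;
        }
        String text = content.trim();
        // Checks 1 and 4: must start with '<' and must not end mid-tag or mid-text.
        if (!text.startsWith("<") || !text.endsWith(">")) {
            return false;
        }
        Deque<String> open = new ArrayDeque<>();
        boolean sawRoot = false;
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty() || VOID.contains(name);
            if (!sawRoot) {
                // Check 2: the first element must be <section> or <html>.
                if (closing || !(name.equals("section") || name.equals("html"))) {
                    return false;
                }
                sawRoot = true;
            }
            if (closing) {
                // Check 3: each closing tag must match the most recent open tag.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (!selfClosing) {
                open.push(name);
            }
        }
        return sawRoot && open.isEmpty();
    }
}
```

Applied to the three examples above, the sketch accepts the first and rejects the other two: the second fails when </div> does not match the open <p>, and the third ends with <section> still open.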

Usage Examples

Check Status Only

# Check status without downloading anything
cli/codeproof.sh icc-book-fetcher \
    --search-result-file api/icc/codes/united-states/california/search-results.json \
    --status-only

# Check status of specific books
cli/codeproof.sh icc-book-fetcher 2217 3757 3100 --status-only

# Check status of a non-existent book (for testing)
cli/codeproof.sh icc-book-fetcher 99999 --status-only

Normal Download with Enhanced Checking

# Download all books from search results with enhanced checking
cli/codeproof.sh icc-book-fetcher \
    --search-result-file api/icc/codes/united-states/california/search-results.json \
    --min-pause 3000 \
    --max-pause 5000 \
    --filesystem LOCAL

# Download specific books with enhanced checking
cli/codeproof.sh icc-book-fetcher 2217 3757 3100
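
The --min-pause and --max-pause values appear to bound a randomized delay between chapter fetches (the sample log shows a 4.2 second pause for a 3000-5000 ms range). A minimal sketch of picking such a delay, assuming a uniform choice over an inclusive range; PauseSketch and nextPauseMillis are hypothetical names:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: picks a random pause within an inclusive millisecond range. */
public class PauseSketch {
    public static long nextPauseMillis(long minPauseMs, long maxPauseMs) {
        if (minPauseMs < 0 || maxPauseMs < minPauseMs) {
            throw new IllegalArgumentException(
                    "Expected 0 <= min <= max, got " + minPauseMs + " and " + maxPauseMs);
        }
        // nextLong's upper bound is exclusive, so add 1 to make maxPauseMs reachable.
        return ThreadLocalRandom.current().nextLong(minPauseMs, maxPauseMs + 1);
    }
}
```

A caller would then sleep for the returned duration, for example with Thread.sleep(PauseSketch.nextPauseMillis(3000, 5000)).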

Demonstration Commands

# 1. Check what books are available in a search results file
cli/codeproof.sh icc-search --file api/icc/codes/united-states/california/search-results.json --document-ids-only

# 2. Check download status of those books
cli/codeproof.sh icc-book-fetcher \
    --search-result-file api/icc/codes/united-states/california/search-results.json \
    --status-only

# 3. Download only the books that need it
cli/codeproof.sh icc-book-fetcher \
    --search-result-file api/icc/codes/united-states/california/search-results.json \
    --min-pause 3000 \
    --max-pause 5000

# 4. Verify the download was successful
cli/codeproof.sh icc-book-fetcher \
    --search-result-file api/icc/codes/united-states/california/search-results.json \
    --status-only

Help and Options

# Show all available options
cli/codeproof.sh icc-book-fetcher --help

# Show help for the main CLI
cli/codeproof.sh --help

Benefits

  1. Reliability: Catches partial downloads and corrupted files
  2. Transparency: Clear reporting of what's missing or broken
  3. Efficiency: Skips fully downloaded books and re-downloads only those that are missing or incomplete
  4. Debugging: Detailed information helps identify download issues
  5. Backward Compatibility: Existing code continues to work

Technical Details

File Structure Checked

  • api/icc/content/info/{documentId}.json - Book metadata
  • api/icc/content/chapters/{documentId}.json - Chapter index
  • api/icc/content/chapter-xml/{documentId}/{chapterId}.html - Chapter content files
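
The layout above can be captured in a small path helper. This is a sketch only; IccPathsSketch and its method names are hypothetical, and the .raw.json variants mentioned earlier are left out.

```java
/** Hypothetical sketch: builds the relative paths inspected by the status check. */
public class IccPathsSketch {
    private static final String ROOT = "api/icc/content";

    /** Book metadata file. */
    public static String infoPath(String documentId) {
        return ROOT + "/info/" + documentId + ".json";
    }

    /** Chapter index file. */
    public static String chaptersPath(String documentId) {
        return ROOT + "/chapters/" + documentId + ".json";
    }

    /** Content file for a single chapter. */
    public static String chapterContentPath(String documentId, String chapterId) {
        return ROOT + "/chapter-xml/" + documentId + "/" + chapterId + ".html";
    }
}
```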

Performance Considerations

  • XML validation is lightweight and doesn't require full parsing
  • Status checking reads files but doesn't make network calls
  • Chapter count is limited to 100 by default (configurable in fetchBookChapters())

Error Handling

  • Graceful handling of missing or corrupted files
  • Detailed error messages for debugging
  • Fallback behavior for parsing errors