File Structure Reorganization
📋 Implementation Issue: Issue #167 - Reorganize Project Structure: Move from pages/ to files/{file_id}/pages/ with Rich Metadata
Executive Summary
This PRD defines the requirements for reorganizing the project file structure from a flat pages/ directory to a hierarchical files/{file_id}/pages/ structure with rich file metadata. This change improves file organization, makes it possible to track which pages came from which source document, and captures rich metadata about each input file.
Key Principle: Backward compatibility is paramount. Legacy projects must continue to function without disruption, with clear upgrade paths for users who want to adopt the new structure.
Problem Statement
Current State
Current Project Structure:
projects/{projectId}/
├── project-metadata.json # Project-level metadata
├── plan-metadata.json # Flat list of all pages
├── pages/ # ALL pages mixed together (no source file tracking)
│ ├── 001/
│ │ ├── page.pdf
│ │ ├── page.md
│ │ ├── page-summary-1000char.json
│ │ └── ...
│ ├── 002/
│ └── ...
├── inputs/ # Raw uploaded files
│ ├── architectural-plans.pdf
│ ├── electrical-plans.pdf
│ └── ...
├── review/ # Compliance review artifacts
└── overview.md
Problems with Current Structure
- Loss of Source File Context: Once pages are extracted from multiple PDFs, there's no way to determine which pages came from which source file
- No File-Level Metadata: Cannot track document type (architectural vs electrical), processing status, or page count per file
- Poor Organization: All pages from all files are mixed together in a flat structure
- Difficult File Management: Cannot easily delete, reprocess, or update individual source files
- No File Classification: Cannot distinguish between architectural plans, electrical plans, inspector feedback, etc.
- Limited Search/Filter: Cannot filter pages by source file or document type
- Scalability Issues: As projects grow with more input files, the flat structure becomes unwieldy
User Impact
- Architects: Cannot easily identify which pages came from which discipline (architectural, structural, MEP)
- Project Managers: Cannot track processing status per file or identify which files need attention
- Reviewers: Cannot focus review on specific file types (e.g., only architectural plans)
- System Administrators: Cannot efficiently troubleshoot file processing issues without source file context
Relationship to Issue #227 (Project Metadata)
Orthogonal Concerns - Both features are needed and complement each other:
| Feature | Purpose | Location | Scope |
|---|---|---|---|
| Issue #227 (Project Metadata) | Project-level information (name, description, address, building codes) | projects/{projectId}/project-metadata.json | Entire project |
| Issue #167 (File Metadata - THIS PRD) | File-level information (document type, pages, processing status) | projects/{projectId}/files/{file_id}/metadata.json | Individual input file |
Visual Relationship:
projects/{projectId}/
├── project-metadata.json ← Issue #227: WHO/WHAT/WHERE is this project?
├── files/ ← Issue #167: WHICH input files, document types, pages?
│ ├── {file_id_1}/
│ │ ├── metadata.json ← Issue #167: This file's metadata
│ │ └── pages/ ← Issue #167: Pages from this file
│ └── {file_id_2}/
│ ├── metadata.json
│ └── pages/
└── inputs/ ← Raw files
Implementation Order: Issue #227 should be implemented first (simpler, immediate value), followed by Issue #167 (more complex, requires migration).
Proposed Solution
New Project Structure
projects/{projectId}/
├── project-metadata.json # Project-level metadata (Issue #227)
├── plan-metadata.json # LEGACY - Deprecated, for backward compatibility only
├── files/ # NEW - Processed input files with rich metadata
│ ├── index.json # File ID counter + page-to-file mapping (NEW)
│ ├── 1/ # Auto-increment file IDs for readable URLs
│ │ ├── metadata.json # Rich file metadata (NEW)
│ │ └── pages/ # Pages extracted from this specific file
│ │ ├── 001/
│ │ │ ├── page.pdf
│ │ │ ├── page.md
│ │ │ ├── page-summary-1000char.json
│ │ │ └── ...
│ │ ├── 002/
│ │ └── ...
│ ├── 2/ # Second file uploaded
│ │ ├── metadata.json
│ │ └── pages/
│ │ ├── 001/
│ │ ├── 002/
│ │ └── ...
│ └── ...
├── pages/ # LEGACY - Preserved for backward compatibility
│ └── [existing pages unchanged]
├── inputs/ # Raw input files uploaded into the project
│ ├── architectural-plans.pdf
│ ├── electrical-plans.pdf
│ └── ...
├── review/ # Compliance review artifacts
├── overview.md
└── project-content.md
File Metadata Schema
The `files/{file_id}/metadata.json` file contains the `InputFileMetadata` proto message:

```proto
import "google/protobuf/timestamp.proto";

message InputFileMetadata {
  // Basic file information
  string file_id = 1;                             // Unique auto-increment ID (e.g., "1", "2", "3")
  string file_name = 2;                           // Original filename
  string file_path = 3;                           // Path relative to inputs/
  string mime_type = 4;                           // MIME type (e.g., "application/pdf")
  int64 file_size_bytes = 5;                      // File size in bytes
  google.protobuf.Timestamp upload_date = 6;      // When file was uploaded

  // Document classification
  DocumentType document_type = 7;                 // Classified document type
  int32 page_count = 8;                           // Number of pages (for PDFs)

  // Processing metadata
  ProcessingStatus processing_status = 9;         // Current processing state
  google.protobuf.Timestamp processed_date = 10;  // When processing completed
  repeated string extracted_pages = 11;           // List of extracted page IDs

  // Content insights
  string content_summary = 12;                    // AI-generated summary

  // Technical metadata
  string checksum_md5 = 13;                       // File integrity check
}
```
Note: Proto definitions already exist in api.proto (lines 225-274) - no new proto messages needed!
Enum Naming Note: The existing enums use prefixed values (e.g., DOCUMENT_TYPE_ARCHITECTURAL_PLAN) which is not aligned with our Protocol Buffers Best Practices (should be ARCHITECTURAL_PLAN in a dedicated package). This is acceptable for now since the enums already exist in production. Future refactoring could move these to src/main/proto/file_metadata.proto with clean enum values, but that's outside the scope of this issue.
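For illustration, populating this message from Java might look like the following minimal sketch. It assumes standard protoc-generated Java classes for api.proto; the `PROCESSING_STATUS_COMPLETED` value name is an assumption, while `DOCUMENT_TYPE_ARCHITECTURAL_PLAN` is quoted from the enum naming note above.

```java
import com.google.protobuf.Timestamp;
import java.time.Instant;

// Sketch only: assumes protoc-generated classes for the api.proto messages.
Instant now = Instant.now();
Timestamp uploadDate = Timestamp.newBuilder()
    .setSeconds(now.getEpochSecond())
    .setNanos(now.getNano())
    .build();

InputFileMetadata metadata = InputFileMetadata.newBuilder()
    .setFileId("1")
    .setFileName("architectural-plans.pdf")
    .setFilePath("architectural-plans.pdf")   // relative to inputs/
    .setMimeType("application/pdf")
    .setFileSizeBytes(4_718_592L)
    .setUploadDate(uploadDate)
    .setDocumentType(DocumentType.DOCUMENT_TYPE_ARCHITECTURAL_PLAN)
    .setPageCount(15)
    .setProcessingStatus(ProcessingStatus.PROCESSING_STATUS_COMPLETED) // assumed value name
    .build();
```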
Implementation Phases
Phase 1: Infrastructure & Dual-Read Support (Backward Compatible) ✅ COMPLETED
Goal: Add new file structure alongside legacy structure without breaking existing projects
Deliverables:
- ✅ New file storage handlers supporting the `files/{file_id}/` structure
- ✅ Dual-read logic: try the new structure first, fall back to legacy `pages/` if not found
- ✅ File ID generation and management utilities
- ✅ `GenerateInputFileMetadata` RPC implementation
- ✅ Legacy project detection logic
- ✅ Comprehensive backward compatibility tests
- ✅ `ProjectPathResolver` with intelligent path resolution and caching
- ✅ Atomic file operations with GCS generation-based Compare-and-Set (CAS)
- ✅ Race condition prevention for concurrent metadata updates
Success Criteria: ✅ ALL MET
- All existing functionality works unchanged
- New projects can use new structure
- Legacy projects continue to read from `pages/` without errors
- Zero downtime during deployment
Phase 2: Frontend Integration & User Migration Tools ✅ COMPLETED
Goal: Enable users to see file metadata and upgrade legacy projects
Deliverables:
- ✅ Enhanced project settings page showing rich file metadata
- ✅ File list with document type, page count, processing status
- ✅ Legacy project detection banner in UI
- ✅ User-initiated upgrade workflow (manual migration)
- ✅ File-level operations (view pages by file, reprocess file)
- ✅ CLI tool for admin bulk upgrades
- ✅ Hierarchical table of contents with expandable file containers
- ✅ File-aware page selection and highlighting
- ✅ File-aware URL structure: `/files/{file_id}/pages/{page_number}/{tab}`
- ✅ Page overlap detection scoped to individual files (not project-wide)
Success Criteria: ✅ ALL MET
- Users can see which files are in their projects
- Users can view metadata per file
- Clear upgrade path with user control
- No forced migrations - users opt-in
Phase 3: New File Processing Pipeline ✅ COMPLETED
Goal: Process new uploads directly into new structure
Deliverables:
- ✅ Update `IngestArchitecturalPlan` to create file metadata
- ✅ Process pages directly into `files/{file_id}/pages/`
- ✅ Update `plan-metadata.json` for backward compatibility
- ✅ Automatic metadata generation on upload
- ✅ Document type classification (LLM-based)
- ✅ Thread-safe metadata updates with atomic Compare-and-Set operations
- ✅ Comprehensive logging for debugging metadata operations
- ✅ File-aware gRPC API with `file_id` parameter in `GetArchitecturalPlanPageRequest`
Success Criteria: ✅ ALL MET
- New uploads go directly to new structure
- Legacy projects continue to work
- Metadata automatically generated
- No manual intervention required for new files
Phase 4: Migration & Deprecation (Optional Future)
Goal: Gradually migrate legacy projects and deprecate old structure
Deliverables:
- Automated migration scheduler
- Migration progress tracking
- Rollback capabilities
- Deprecation warnings in UI
- Final migration tool
Success Criteria:
- All projects migrated to new structure
- Legacy `pages/` can be safely removed
- `plan-metadata.json` deprecated
Timeline: Phase 4 is optional and low-priority. Legacy support can remain indefinitely.
✨ Additional Features Implemented
Beyond the original PRD scope, the following enhancements were implemented during development:
Multi-File UI Enhancements ✅ COMPLETED
- Hierarchical Drawer Navigation: Table of contents displays files as expandable containers with nested pages
- File-Aware Page Selection: Only the specific page from the specific file gets highlighted (prevents cross-file highlighting)
- File-Aware URLs: URLs include the file ID for proper page identification (`/files/{file_id}/pages/{page_number}/{tab}`)
- Enhanced File Headers: Two-line layout with document type, file ID, and visual emphasis (borders, shadows, background colors)
- Fade-out Text Truncation: Long filenames fade to transparency instead of using an ellipsis
- Responsive Spacing: Optimized padding and alignment for maximum real estate usage
Backend Robustness ✅ COMPLETED
- Race Condition Prevention: Atomic metadata updates using GCS object generations (see the sketch after this list)
- Retry Logic: Automatic retry with exponential backoff for concurrent modification conflicts
- Thread-Safe Operations: Multiple ingestion tasks can run simultaneously without data corruption
- Enhanced Error Handling: Comprehensive logging and error recovery mechanisms
- File-Aware PDF Loading: Backend correctly identifies and serves PDFs from specific files
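To make the generation-based CAS pattern concrete, here is a minimal sketch using the google-cloud-storage Java client. The class name, retry parameters, and string-based content handling are illustrative assumptions, not the actual implementation.

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

/** Sketch: atomic read-modify-write of a GCS object using generation preconditions. */
public final class AtomicMetadataUpdater {
  private static final int MAX_ATTEMPTS = 5;
  private final Storage storage;

  public AtomicMetadataUpdater(Storage storage) {
    this.storage = storage;
  }

  /** Applies {@code update} to the object's content, retrying on concurrent modification. */
  public void update(BlobId blobId, UnaryOperator<String> update) throws InterruptedException {
    long backoffMillis = 100;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      Blob current = storage.get(blobId); // snapshot the content and its generation number
      String updated = update.apply(new String(current.getContent(), StandardCharsets.UTF_8));
      // Pin the write to the generation we read; GCS rejects it if another writer got in between.
      BlobId pinned = BlobId.of(blobId.getBucket(), blobId.getName(), current.getGeneration());
      try {
        storage.create(BlobInfo.newBuilder(pinned).build(),
            updated.getBytes(StandardCharsets.UTF_8), Storage.BlobTargetOption.generationMatch());
        return;
      } catch (StorageException e) {
        // In practice, retry only on HTTP 412 (precondition failed); rethrow other errors.
        if (attempt == MAX_ATTEMPTS) throw e;
        Thread.sleep(backoffMillis); // exponential backoff before re-reading
        backoffMillis *= 2;
      }
    }
  }
}
```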
Developer Experience ✅ COMPLETED
- Comprehensive Debugging: Detailed logging throughout the system for troubleshooting
- Proto API Updates: Enhanced gRPC API with file-aware parameters
- Frontend Caching: Intelligent caching to prevent unnecessary data reloads
- Automatic Refresh: UI automatically updates after background ingestion tasks complete
Backward Compatibility Strategy
Core Principle: Dual-Read, Selective Write
Read Operations (Backward Compatible):
- Try new structure first: check `files/{file_id}/pages/{page_number}/`
- Fall back to legacy: if not found, check `pages/{page_number}/`
- Cache lookup result: avoid repeated filesystem checks

Write Operations (Selective):
- New projects: write to `files/{file_id}/pages/` only
- Legacy projects: continue writing to `pages/` until upgraded
- Upgraded projects: write to `files/{file_id}/pages/` only
Migration States
Projects exist in one of three states:
| State | pages/ | files/ | plan-metadata.json | Behavior |
|---|---|---|---|---|
| Legacy | ✅ Present | ❌ Absent | ✅ Present | Read from pages/, write to pages/ |
| Transitional | ✅ Present | ✅ Present | ✅ Present | Read from files/ first, fall back to pages/ |
| Modern | ❌ Empty/Absent | ✅ Present | ✅ Present (for compat) | Read/write to files/ only |
Detection Logic
```java
public enum ProjectStructureVersion {
  LEGACY,        // Only has pages/
  TRANSITIONAL,  // Has both pages/ and files/
  MODERN         // Only has files/
}

public ProjectStructureVersion detectProjectVersion(String projectId) {
  boolean hasLegacyPages = fileSystemHandler.exists("projects/" + projectId + "/pages/");
  boolean hasFiles = fileSystemHandler.exists("projects/" + projectId + "/files/");
  if (hasFiles && !hasLegacyPages) return ProjectStructureVersion.MODERN;
  if (hasFiles && hasLegacyPages) return ProjectStructureVersion.TRANSITIONAL;
  return ProjectStructureVersion.LEGACY;
}
```
Path Resolution with Fallback
```java
public String resolvePageFolderPath(String projectId, String fileId, int pageNumber) {
  // Try new structure first
  String newPath = String.format("projects/%s/files/%s/pages/%03d", projectId, fileId, pageNumber);
  if (fileSystemHandler.exists(newPath)) {
    return newPath;
  }

  // Fall back to legacy structure
  String legacyPath = String.format("projects/%s/pages/%03d", projectId, pageNumber);
  if (fileSystemHandler.exists(legacyPath)) {
    logger.info("Using legacy page path for project {}, page {}", projectId, pageNumber);
    return legacyPath;
  }

  throw new PageNotFoundException(projectId, pageNumber);
}
```
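The `ProjectPathResolver` deliverable also caches these lookups to avoid repeated filesystem checks. A minimal caching wrapper around `resolvePageFolderPath` (shown above) could look like this sketch; the field and method names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: memoize resolved paths so each page's existence checks run at most once.
private final Map<String, String> resolvedPaths = new ConcurrentHashMap<>();

public String resolvePageFolderPathCached(String projectId, String fileId, int pageNumber) {
  String key = projectId + "/" + fileId + "/" + pageNumber;
  return resolvedPaths.computeIfAbsent(key,
      k -> resolvePageFolderPath(projectId, fileId, pageNumber));
}
```

One caveat: cached entries for a project must be invalidated when it is migrated, since a page's resolved location moves from `pages/` to `files/{file_id}/pages/`.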
Ensuring Zero Breakage
Critical Guarantee: No existing functionality breaks
- All read operations have fallback logic
- Legacy projects write to old structure (no forced migration)
- Integration tests validate both structures
- Gradual rollout with feature flags
- Rollback plan if issues detected
User Stories
Story 1: View File Metadata in Project Settings
As a project owner
I want to view rich metadata about each input file in my project
So that I can understand what documents I've uploaded and their processing status
Acceptance Criteria:
- Project settings page shows list of input files
- Each file displays: name, document type, page count, upload date, processing status
- Files can be sorted by date, name, or document type
- File metadata is retrieved via the `ListInputFileMetadata` RPC
Story 2: Upgrade Legacy Project to New Structure
As a project owner
I want to upgrade my legacy project to use the new file structure
So that I can benefit from rich file metadata and better organization
Acceptance Criteria:
- When opening a legacy project, user sees an informational banner
- Banner message: "Upgrade your project to the new file structure for better organization and metadata."
- Banner has "Upgrade Project" button
- Clicking button shows upgrade dialog with:
- Explanation of benefits
- List of files that will be processed
- Estimated time and cost
- "Start Upgrade" and "Cancel" buttons
- Upgrade process:
  - Analyzes files in the `inputs/` folder
  - Generates file metadata for each file
  - Associates existing pages with source files (best effort)
  - Migrates pages to the `files/{file_id}/pages/` structure
  - Preserves the legacy `pages/` folder for rollback
  - Updates `plan-metadata.json` to reference the new structure
- On success, banner disappears and file metadata becomes visible
- User can dismiss banner, but it reappears on next visit until project is upgraded
- Upgrade is optional - legacy projects continue to work without upgrade
Story 3: Automatic Metadata for New Uploads
As a project owner
I want newly uploaded files to automatically get rich metadata
So that I don't have to manually classify or organize them
Acceptance Criteria:
- When uploading a new PDF file, the system automatically:
  - Generates a unique file ID
  - Creates the `files/{file_id}/` folder structure
  - Extracts file metadata (size, page count, checksum; see the sketch after this list)
  - Classifies the document type using an LLM (e.g., "Architectural Plan")
  - Generates an AI summary of the file contents
  - Saves metadata to `files/{file_id}/metadata.json`
- File appears in project settings with full metadata
- Pages are processed into the `files/{file_id}/pages/` structure
- Legacy `plan-metadata.json` is updated for backward compatibility
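Most of the basic fields can be derived locally without an LLM. As one hedged example of what "extracts file metadata" could involve, the `checksum_md5` field might be computed by streaming the upload through a digest; this is a JDK-only sketch (Java 17+ for `HexFormat`), not the actual ingestion code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Streams the file through an MD5 digest, as needed for the checksum_md5 field. */
static String computeMd5Checksum(Path file) throws IOException, NoSuchAlgorithmException {
  MessageDigest md5 = MessageDigest.getInstance("MD5");
  try (InputStream in = new DigestInputStream(Files.newInputStream(file), md5)) {
    in.transferTo(OutputStream.nullOutputStream()); // read fully, updating the digest
  }
  return HexFormat.of().formatHex(md5.digest()); // lowercase hex string
}
```

Page count would similarly come from a PDF library (e.g., PDFBox's `getNumberOfPages()`), while `document_type` and `content_summary` go through the asynchronous AI path.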
Story 4: Hierarchical Page Navigation by Source File ✅ COMPLETED
As a project reviewer
I want to view pages organized hierarchically by source file in the table of contents
So that I can easily navigate pages by discipline and understand which pages came from which document
Acceptance Criteria: ✅ ALL MET
- ✅ Table of contents (TOC) displays pages in a hierarchical tree structure:
📄 architectural-plans.pdf (Architectural Plan | ID: 1) - 15 pages
└─ Page 1: First Floor Plan
└─ Page 2: Second Floor Plan
└─ ...
📄 electrical-plans.pdf (Electrical Plan | ID: 2) - 8 pages
└─ Page 1: Electrical Panel Schedule
└─ Page 2: Lighting Plan
└─ ...
- ✅ Files are displayed as collapsible parent items using Angular Material expansion panels
- ✅ Each file shows:
- ✅ File name with fade-out truncation for long names
- ✅ Document type and file ID in subtitle format
- ✅ Red PDF icon for visual consistency
- ✅ Expand/collapse chevron icon with proper spacing
- ✅ Enhanced visual emphasis (background color, border, drop shadow)
- ✅ Pages are nested under their parent file without excessive indentation
- ✅ Clicking file header toggles expand/collapse of all pages in that file
- ✅ Selected page is highlighted only in the correct file (file-aware highlighting)
- ✅ File-aware URLs: `/files/{file_id}/pages/{page_number}/{tab}`
- ✅ Works for both legacy and modern projects:
- Legacy: Shows flat structure (backward compatibility)
- Modern: Shows hierarchical file structure
- Transitional: Shows hierarchical structure with dual-read support
Story 5: Admin Bulk Upgrade
As a system administrator
I want to bulk upgrade multiple legacy projects to the new structure via API
So that I can ensure consistency and enable new features system-wide
Acceptance Criteria:
API Requirements (Required):
- gRPC RPC: `MigrateProjectFileStructure` with support for (see the sketch after this list):
  - Single project migration
  - Dry-run mode (preview without applying)
  - Preserve legacy structure option
- REST API endpoint via gRPC-Gateway/ESPv2: `POST /v1/architectural-plans/{project_id}/migrate-file-structure`
  - JSON request body with `dry_run` and `preserve_legacy_structure` options
- Operation is idempotent (safe to call multiple times)
- Returns detailed migration result with success/failure status
- Proper error handling and validation
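For concreteness, a caller-side sketch of the RPC is shown below. The request/response message and stub names are assumptions following standard grpc-java codegen; the authoritative definitions live in the TDD.

```java
// Hypothetical caller-side sketch; names assume standard grpc-java codegen.
MigrateProjectFileStructureRequest request = MigrateProjectFileStructureRequest.newBuilder()
    .setProjectId("my-project-id")
    .setDryRun(true)                   // preview the migration without applying it
    .setPreserveLegacyStructure(true)  // keep pages/ intact for rollback
    .build();

// Idempotent: repeating the call on an already-migrated project is a no-op.
MigrateProjectFileStructureResponse response =
    blockingStub.migrateProjectFileStructure(request);
```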
CLI Tool (Required):
- Admin can identify legacy projects: `./cli/codeproof.sh list-legacy-projects --user-id=ADMIN`
- Admin can bulk upgrade: `./cli/codeproof.sh upgrade-file-structure --user-id=ADMIN --dry-run=true`
- Command supports:
  - `--dry-run`: Preview changes without applying
  - `--project-ids`: Specific projects to upgrade (comma-separated)
  - `--all`: Upgrade all legacy projects for user
  - `--concurrency`: Number of parallel upgrades (default: 1)
- CLI calls the gRPC API internally
- Logs success/failure for each project
Admin UI (Optional - Nice to Have):
- `/admin` page with legacy project management
- List of all legacy projects with upgrade status
- Bulk selection and upgrade actions
- Progress tracking for batch operations
- Migration history and logs
Operational Requirements:
- Existing project functionality is not disrupted during migration
- Users can still access their projects during upgrade
- Migration logs include timestamp, initiator, and results for audit trail
Technical Design
📄 See: Technical Design Document
The detailed technical design, including backend implementation, migration algorithms, frontend components, and CLI tools, has been moved to a separate Technical Design Document (TDD) for better organization.
Key Technical Components:
- Backend Services:
  - `InputFileMetadataService` - Generate and manage file metadata
  - `FileStructureMigrationService` - Migrate legacy projects to the new structure
  - `ProjectPathResolver` - Transparent path resolution across legacy and modern structures
- gRPC RPCs:
  - `GenerateInputFileMetadata` - Create metadata for uploaded files
  - `GetInputFileMetadata` - Retrieve file metadata
  - `ListInputFileMetadata` - List all files with metadata
  - `MigrateProjectFileStructure` - Upgrade legacy projects (admin/user-initiated)
- Frontend Components:
  - `FileMetadataListComponent` - Display file list with rich metadata (project settings)
  - `PageTocHierarchicalComponent` - Hierarchical page navigation with collapsible files (TOC sidebar)
  - `LegacyProjectUpgradeBannerComponent` - Prompt users to upgrade
  - `FileStructureMigrationDialogComponent` - User-initiated upgrade workflow
- CLI Tools:
  - `UpgradeFileStructureCommand` - Bulk upgrade for admins
  - `AnalyzeLegacyProjectsCommand` - Identify projects needing upgrade
For complete implementation details, refer to the TDD.
Success Metrics
User Adoption
- % of legacy projects upgraded within 3 months
- % of users who use hierarchical navigation features
- User feedback on file organization improvements
System Health
- Zero incidents caused by backward compatibility issues
- RPC latency for file metadata operations (target: < 500ms)
- Success rate of file structure migrations (target: > 99%)
Data Quality
- % of files with complete metadata
- % of files correctly classified by document type
- % of pages correctly associated with source files
Risks and Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Breaking existing projects | Critical | Medium | Comprehensive dual-read fallback logic, extensive backward compatibility tests |
| Data loss during migration | Critical | Low | Keep legacy pages/ intact during migration, enable rollback |
| Performance degradation | High | Medium | Efficient path caching, lazy metadata loading, index file structure |
| Incorrect page-to-file associations | Medium | Medium | Best-effort heuristics, allow manual corrections, validate with integration tests |
| Users confused by upgrade process | Medium | High | Clear UI messaging, optional upgrade, admin can force if needed |
| Migration fails mid-process | High | Low | Transactional migrations, checkpoints, retry logic, rollback capability |
| Inconsistent metadata across files | Low | Medium | Schema validation, default values, metadata regeneration tool |
Non-Goals
- Automatic forced migrations: Users and admins control when projects are upgraded
- Reorganizing inputs/ folder: Raw uploaded files remain in the flat `inputs/` structure
- Editing file metadata through UI: Phase 1 focuses on read-only display (editing in a future phase)
- Advanced file operations: Move, rename, merge files (future enhancements)
- Multi-file upload coordination: Upload files one at a time (batch upload is future work)
- Real-time progress tracking: File processing happens asynchronously (task tracking in Issue #XX)
Future Enhancements
- Admin UI for Bulk Operations:
  - Web-based `/admin` page for legacy project management
  - Visual list of all legacy projects with upgrade status
  - Bulk selection and upgrade actions
  - Progress tracking for batch operations
  - Migration history and audit logs
- Advanced File Management:
  - Move pages between files
  - Merge multiple files
  - Split files by document type
  - Delete files and associated pages
- Enhanced Metadata:
  - Edit file metadata through UI
  - Custom tags and labels
  - File versioning and history
  - Automatic re-classification
- Batch Operations:
  - Multi-file upload with coordination
  - Bulk reprocessing
  - Bulk document type re-classification
  - Scheduled upgrades for legacy projects
- Advanced Search and Filtering:
  - Full-text search across file metadata and page content
  - Filter by document type (Architectural, Electrical, etc.)
  - Filter by processing status or date range
  - Search by content summary
  - Saved search queries
- Analytics:
  - File processing time metrics
  - Document type distribution reports
  - Storage usage by file
  - Page extraction success rates
Open Questions
- File ID Generation: Use UUID v4, timestamp-based, or auto-increment?
  - Answer: Auto-increment integers (1, 2, 3...) for readable URLs and simplicity
  - Rationale:
    - ✅ Shortest possible IDs: `files/1/`, `files/2/`, etc.
    - ✅ Human-readable and easy to reference ("File #1")
    - ✅ Chronological by upload order
    - ✅ Simple to implement with a project-level counter in `files/index.json`
  - Implementation: Maintain a `next_file_id` counter in `projects/{projectId}/files/index.json` (see the sketch after this list)
  - Alternative considered: `{id}-{filename-slug}` hybrid (e.g., `1-architectural-plans`) for even more readability, but it adds complexity
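A minimal sketch of that counter logic follows. The `IndexFile` shape and the `readString`/`writeString` helpers are assumptions for illustration (only `exists()` appears in this PRD's own snippets), and `synchronized` only serializes allocations within a single server instance:

```java
import com.google.gson.Gson;

/** Sketch: allocates auto-increment file IDs from files/index.json. */
public class FileIdAllocator {
  /** Assumed JSON shape: {"next_file_id": 3, ...} */
  static class IndexFile {
    int next_file_id = 1;
  }

  private final Gson gson = new Gson();
  private final FileSystemHandler fileSystemHandler; // existing storage abstraction

  FileIdAllocator(FileSystemHandler fileSystemHandler) {
    this.fileSystemHandler = fileSystemHandler;
  }

  /** Hands out the next ID and persists the incremented counter. */
  synchronized String allocateFileId(String projectId) {
    String indexPath = "projects/" + projectId + "/files/index.json";
    IndexFile index = fileSystemHandler.exists(indexPath)
        ? gson.fromJson(fileSystemHandler.readString(indexPath), IndexFile.class)
        : new IndexFile();
    String fileId = String.valueOf(index.next_file_id);
    index.next_file_id++;
    fileSystemHandler.writeString(indexPath, gson.toJson(index));
    return fileId;
  }
}
```

On GCS, the write should additionally use the generation-based CAS pattern sketched earlier so concurrent uploads across instances cannot hand out the same ID.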
- Metadata Generation Timing: Generate metadata immediately on upload or asynchronously?
  - Answer: Hybrid - basic metadata immediately, AI analysis (document type, summary) asynchronously
- Legacy `plan-metadata.json`: Keep updating it for backward compatibility or deprecate immediately?
  - Answer: Keep updating indefinitely for maximum backward compatibility
- Page Number Continuity: Should page numbers be global (001, 002...) or per-file (file1/001, file2/001)?
  - Answer: Per-file for better organization. Global overview uses `{file_id}_{page_number}` composite IDs.
- Migration Rollback: How long to keep the legacy `pages/` folder after successful migration?
  - Answer: Keep indefinitely (disk space is cheap, safety is paramount)
- Document Type Classification: Use AI (expensive, accurate) or heuristics (cheap, less accurate)?
  - Answer: Start with heuristics (filename, page count), add AI classification as an optional enhancement
Related Documentation
- Project Metadata Management PRD: Complementary project-level metadata
- Developer Playbook: Build and deployment workflows
- Protocol Buffers & gRPC Best Practices: Proto-first design
- Copy Project Utility: Cross-environment project operations
Related Issues
- Issue #227: Project Metadata Management
  - Relationship: Complementary - Issue #227 covers project-level metadata, this PRD covers file-level metadata
  - Implementation Order: Issue #227 first (simpler), then this PRD
- Issue #117: Multi-tenant Support
  - Relationship: Builds on the multi-tenant structure proposed in #117
  - Status: Closed - multi-tenant structure already implemented
References
- Protocol Buffer Definitions: `src/main/proto/api.proto`
- InputFileMetadata Proto: `src/main/proto/api.proto` (see the `InputFileMetadata` message)
- ArchitecturalPlanReviewer: `src/main/java/org/codetricks/construction/code/assistant/ArchitecturalPlanReviewer.java`
- File Storage Handler: `src/main/java/org/codetricks/construction/code/assistant/FileSystemHandler.java`