Background Task Architecture Overview
Problem Statement
Executing lengthy background tasks within a Cloud Run Service presents a fundamental architectural challenge. The platform is designed to handle request-response cycles, and once the response is sent, the CPU is heavily throttled or completely frozen. As a result, background tasks initiated from gRPC service calls execute slowly, even when they run on separate threads spawned by the request handler. (See GitHub Issue #161 for detailed observations of this behavior.)
Solution: Hybrid Architecture
PermitProof implements a hybrid architecture that combines two execution strategies:
- Primary Strategy: Cloud Run Jobs (separate container execution)
- Fallback Strategy: ExecutorService thread pools (in-process execution)
Architecture Decision Flow
gRPC Request Received
        ↓
Create Task in Firestore
        ↓
Cloud Run Jobs Available?
        ↓
Yes ────→ Trigger Cloud Run Job ────→ Separate Container
│                                     Full CPU Allocation
│                                     Optimal for Long Tasks
│
No ─────→ ExecutorService Fallback ─→ Background Thread
                                      Same Container
                                      CPU Throttled After Response
                                      Best for Short Tasks (<60s)
When Each Approach is Used
Cloud Run Jobs (Primary)
- Trigger Condition: CloudRunTaskTrigger successfully initialized
- Environment: Production with proper GCP configuration
- CPU Allocation: Full, dedicated CPU resources
- Duration: Optimal for tasks > 60 seconds
- Isolation: Completely independent container
- Examples: Page ingestion, code applicability analysis, compliance report generation
ExecutorService Fallback
- Trigger Condition: Cloud Run Jobs initialization fails
- Environment: Development, testing, or degraded GCP setup
- CPU Allocation: Throttled after gRPC response sent
- Duration: Suitable for tasks < 60 seconds
- Isolation: Runs in same container as gRPC service
- Examples: Quick validations, metadata updates, short analyses
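The selection rule described above (use Cloud Run Jobs when the trigger initializes, otherwise fall back to in-process execution) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `JobTrigger` is a hypothetical stand-in for `CloudRunTaskTrigger`, whose real API is not shown in this document, and `ExecutionStrategySelector` is an invented name.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for CloudRunTaskTrigger; the real class's API is not shown here. */
interface JobTrigger {
    void trigger(String taskId);
}

/** Decides once, at startup, which execution strategy is available. */
final class ExecutionStrategySelector {
    enum Strategy { CLOUD_RUN_JOB, EXECUTOR_SERVICE }

    // Null when trigger initialization failed; callers then use the fallback.
    private final JobTrigger jobTrigger;

    ExecutionStrategySelector(Supplier<JobTrigger> triggerFactory) {
        JobTrigger trigger;
        try {
            trigger = triggerFactory.get();
        } catch (RuntimeException e) {
            // Initialization failed (for example, missing GCP configuration):
            // degrade to in-process execution instead of failing startup.
            trigger = null;
        }
        this.jobTrigger = trigger;
    }

    /** Returns the primary strategy if the trigger initialized, else the fallback. */
    Strategy strategy() {
        return jobTrigger != null ? Strategy.CLOUD_RUN_JOB : Strategy.EXECUTOR_SERVICE;
    }
}
```

Deciding at construction time keeps the per-request check down to a simple null test.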
Architecture Components
1. gRPC Service Layer
- Purpose: Request handler and job orchestrator
- Responsibility: Create Firestore task, trigger background execution, return immediately
- Implementations:
  - ArchitecturalPlanWriteAsyncServiceImpl - Page ingestion tasks
  - CodeApplicabilityServiceImpl - Code applicability analysis
  - ArchitecturalPlanReviewServiceImpl - Compliance report generation
  - TaskServiceImpl - Generic task management
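The "create task, trigger background execution, return immediately" responsibility might look roughly like the sketch below. It is illustrative only: the in-memory map stands in for the Firestore tasks collection, and `AsyncHandlerSketch`, `handleRequest`, and `shutdownAndAwait` are hypothetical names rather than the real service methods.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class AsyncHandlerSketch {
    // In-memory stand-in for the Firestore tasks collection (assumption for illustration).
    final Map<String, String> taskStatus = new ConcurrentHashMap<>();
    private final ExecutorService backgroundExecutor = Executors.newFixedThreadPool(2);

    /** Records the task, schedules the work, and returns the task ID without waiting. */
    String handleRequest(Runnable work) {
        String taskId = UUID.randomUUID().toString();
        taskStatus.put(taskId, "PENDING");
        backgroundExecutor.submit(() -> {
            taskStatus.put(taskId, "PROCESSING");
            try {
                work.run();
                taskStatus.put(taskId, "COMPLETE");
            } catch (RuntimeException e) {
                taskStatus.put(taskId, "FAILED");
            }
        });
        return taskId; // the gRPC response would carry this ID back to the client
    }

    /** Stops accepting work and waits for already-submitted tasks to finish. */
    boolean shutdownAndAwait(long timeoutSeconds) {
        backgroundExecutor.shutdown();
        try {
            return backgroundExecutor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The caller gets the task ID right away and follows progress through the task record instead of blocking on the call.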
2. Task Tracking Service
- Purpose: Generic task management infrastructure
- Storage: Firestore tasks collection
- Features: Real-time progress, step tracking, cost analysis
- Implementation: TaskServiceImpl
3. Background Execution Layer
- Cloud Run Jobs: CloudRunTaskTrigger → separate container instances
- ExecutorService: Thread pools within Cloud Run Service
- Selection: Automatic fallback if Cloud Run Jobs unavailable
4. Frontend Integration
- Real-time Updates: Firestore subscriptions via FirestoreTaskTrackingService
- Progress Display: AsyncTaskProgressComponent with step-by-step tracking
- User Experience: Non-blocking UI with live progress bars
Key Features
Graceful Degradation
The system works reliably even if Cloud Run Jobs setup fails:
// Example pattern used across all async services
// (CodeApplicabilityServiceImpl, ArchitecturalPlanWriteAsyncServiceImpl, ArchitecturalPlanReviewServiceImpl)
if (jobTrigger != null) {
    logger.info("🚀 Triggering Cloud Run Job for task: " + taskId);
    backgroundExecutor.submit(() -> triggerJob(taskId, request));
} else {
    logger.info("🔄 Using background processing for task: " + taskId);
    executeBackgroundTask(taskId, request);
}
Real-time Progress Tracking
- Firestore Integration: Tasks stored in the tasks collection (provisioned by TaskServiceImpl)
- Step History: Comprehensive progress tracking with timestamps
- Cost Metadata: Per-model LLM cost accumulation
- Status Enums: Type-safe status management (PENDING, PROCESSING, COMPLETE, FAILED)
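To make the tracked data concrete, here is a minimal in-memory sketch of a task record with a status enum, a timestamped step history, and per-model cost accumulation. Only the four status values come from this document. The class name, field names, and methods are assumptions for illustration, and the actual Firestore document schema may differ.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TrackedTask {
    enum Status { PENDING, PROCESSING, COMPLETE, FAILED }

    /** One entry in the step history. */
    static final class Step {
        final String description;
        final Instant timestamp;

        Step(String description, Instant timestamp) {
            this.description = description;
            this.timestamp = timestamp;
        }
    }

    private final String taskId;
    private Status status = Status.PENDING;
    private final List<Step> steps = new ArrayList<>();
    private final Map<String, BigDecimal> costByModel = new HashMap<>();

    TrackedTask(String taskId) { this.taskId = taskId; }

    /** Appends a timestamped step; the first step moves the task to PROCESSING. */
    synchronized void recordStep(String description) {
        if (status == Status.PENDING) status = Status.PROCESSING;
        steps.add(new Step(description, Instant.now()));
    }

    /** Adds a cost amount to the running total for the given model. */
    synchronized void addCost(String model, BigDecimal amount) {
        costByModel.merge(model, amount, BigDecimal::add);
    }

    synchronized void finish(boolean success) {
        status = success ? Status.COMPLETE : Status.FAILED;
    }

    String taskId() { return taskId; }
    synchronized Status status() { return status; }
    synchronized List<Step> steps() {
        return Collections.unmodifiableList(new ArrayList<>(steps));
    }
    synchronized BigDecimal costFor(String model) {
        return costByModel.getOrDefault(model, BigDecimal.ZERO);
    }
    synchronized BigDecimal totalCost() {
        return costByModel.values().stream().reduce(BigDecimal.ZERO, BigDecimal::add);
    }
}
```

The sketch uses `BigDecimal` rather than `double` so that accumulated costs do not pick up floating-point rounding drift.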
Parallelization Strategy
- One Task per Unit: Each page/chapter gets its own task for maximum parallelism
- Cloud Run Scaling: Multiple instances process tasks simultaneously
- Independent Progress: Each task tracks progress independently
- Fault Tolerance: Failure of one task doesn't affect others
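The one-task-per-unit idea and its fault isolation can be illustrated in-process with a thread pool. This sketch is only an analogue of the Cloud Run scaling described above, which spreads tasks across container instances; `PerUnitTaskRunner` and `runAll` are hypothetical names, and unit identifiers are assumed to be unique.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class PerUnitTaskRunner {
    /** Runs one task per unit in parallel and maps each unit to "COMPLETE" or "FAILED". */
    static Map<String, String> runAll(List<String> units, Consumer<String> processUnit, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every unit as its own task so the units progress independently.
            Map<String, Future<?>> futures = new LinkedHashMap<>();
            for (String unit : units) {
                futures.put(unit, pool.submit(() -> processUnit.accept(unit)));
            }
            // Collect outcomes; an exception in one task only marks that unit as failed.
            Map<String, String> outcomes = new LinkedHashMap<>();
            for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
                try {
                    entry.getValue().get();
                    outcomes.put(entry.getKey(), "COMPLETE");
                } catch (ExecutionException e) {
                    outcomes.put(entry.getKey(), "FAILED");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    outcomes.put(entry.getKey(), "FAILED");
                }
            }
            return outcomes;
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, calling `runAll` with three page identifiers and a processor that throws for one of them reports that page as FAILED while the other two still complete.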
Performance Characteristics Comparison
| Aspect | Cloud Run Jobs | ExecutorService |
|---|---|---|
| CPU Allocation | Full, dedicated | Throttled after response |
| Task Duration | Minutes to hours | Seconds to ~60s |
| Scalability | High (parallel jobs) | Limited (thread pool) |
| Isolation | Complete | Shared container |
| Overhead | Container startup | Minimal |
| Production Use | ✅ Recommended | ⚠️ Fallback only |
| Development | Optional | ✅ Convenient |
Implementation Status
Currently Implemented
- ✅ Hybrid architecture with automatic fallback
- ✅ Firestore task tracking with real-time updates
- ✅ Cloud Run Jobs integration for code applicability
- ✅ ExecutorService fallback for all async operations
- ✅ Step-by-step progress tracking
- ✅ Cost analysis metadata accumulation
- ✅ Frontend real-time progress display
Architecture Evolution
The system evolved from a pure ExecutorService design to a hybrid approach:
- Phase 1: ExecutorService only (CPU throttling issues discovered)
- Phase 2: Cloud Run Jobs added as primary strategy
- Phase 3: Hybrid approach with graceful fallback (current)
Security Considerations
- Authentication: All gRPC calls require valid Firebase authentication
- Authorization: RBAC service checks project permissions
- Firestore Rules: Secure access to task documents by user
- Data Validation: Validate all input parameters
Next Steps
For detailed implementation guides:
- Cloud Run Jobs Pattern: See Cloud Run Jobs
- ExecutorService Fallback: See ExecutorService Fallback
- Complete Implementation: See Implementation Guide