Chatbot Test Prompts & Evaluation Cases
Overview
This document tracks test prompts used during ad-hoc testing of the chatbot feature. These prompts will serve as the basis for future automated integration tests and evaluation metrics.
Purpose
- Capture real-world user queries encountered during development
- Document expected behavior for each prompt
- Track edge cases and failure modes
- Enable reproducible testing across iterations
- Feed automated test suites and evaluation frameworks
Test Case Format
Each test case should include:
- Prompt: The exact user query
- Test Project: Which project/files to use
- Expected Behavior: What the bot should do
- Verification Points: Specific things to check
- Status: ✅ Pass, ❌ Fail, ⏳ Pending
- Notes: Additional context or issues found
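A possible machine-readable shape for these fields, useful once the cases feed the automated suites described later in this document (the type and field names below are illustrative, not taken from an existing codebase):

```typescript
// Hypothetical structured representation of a test case record.
// Field names mirror the format above; nothing here is an existing API.
type TestStatus = "pass" | "fail" | "pending";

interface ChatbotTestCase {
  id: string;                    // e.g., "TC-001"
  prompt: string;                // the exact user query
  testProject: string;           // which project/files to use
  expectedBehavior: string[];    // what the bot should do
  verificationPoints: string[];  // specific things to check
  status: TestStatus;
  notes?: string[];              // additional context or issues found
}
```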
Test Cases
TC-001: Page Reference Query - Second Floor
Prompt: "Which files and pages have information about the Second Floor?"
Test Project: San Jose Sonora (3 files)
Expected Behavior:
- Bot should identify all files containing "Second Floor" references
- Bot should provide specific page numbers where information appears
- Response decorator should add clickable links to the referenced pages
Verification Points:
- Response mentions all relevant files
- Page numbers are accurate
- Links are properly formatted and clickable
- Links navigate to correct page in viewer
- Response is concise and well-organized
Status: ⏳ Pending
Notes:
- This tests basic document search and reference extraction
- Tests the response decorator's link generation
- Critical for user experience - links must work correctly
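A minimal sketch of how TC-001 might be automated with Jest, assuming a hypothetical `askChatbot` test client and response shape (neither exists yet; the link format in the assertions is also an assumption):

```typescript
// Sketch only: `askChatbot`, the response shape, and the link format are assumptions.
import { askChatbot } from "./chatbotTestClient"; // hypothetical test client

describe("TC-001: page reference query", () => {
  it("returns page references with clickable links for 'Second Floor'", async () => {
    const response = await askChatbot({
      project: "San Jose Sonora",
      prompt: "Which files and pages have information about the Second Floor?",
    });

    // The response should actually talk about the queried term.
    expect(response.text).toContain("Second Floor");

    // Links produced by the response decorator should point at concrete pages.
    for (const link of response.links) {
      expect(link.href).toMatch(/page=\d+/); // assumed link format
      expect(link.pageNumber).toBeGreaterThan(0);
    }
  });
});
```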
TC-002: Code Violations Check - Context-Aware
Prompt: "Do we have any violations of the code on this page?"
Test Project: San Jose Sonora (multi-file)
Test Page: File 2, Page 3 (Compliance tab with existing reports)
Expected Behavior:
- Bot should call `GetAvailableAnalysis` to check existing reports FIRST
- Bot should inform user about existing analysis (e.g., "CBC 2022: 3 violations found")
- Bot should list the violations or summarize findings
- Bot should ask if user wants analysis for other codes
- Bot should NOT run expensive analysis without confirmation
Verification Points:
- Calls `GetAvailableAnalysis` API first
- Mentions existing book (e.g., "CBC 2022")
- Reports violation count accurately
- Provides details about violations found
- Asks before running new expensive analysis
- Does NOT call `StartPageSectionComplianceReportTask` without confirmation
Status: ⏳ Pending
Notes:
- Tests cost-awareness and existing analysis detection
- Critical for avoiding redundant expensive operations
- Should reference screenshot showing existing compliance reports
- Agent must be smart about not re-running already-completed analysis
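A minimal sketch of how TC-002 might be automated, assuming the test harness records the agent's tool calls in order (`toolCalls`, like the `askChatbot` client itself, is an assumption; the tool names come from the verification points above):

```typescript
// Sketch only: the client and the captured tool-call trace are assumptions.
import { askChatbot } from "./chatbotTestClient"; // hypothetical test client

describe("TC-002: code violations check, context-aware", () => {
  it("checks existing reports first and does not start expensive analysis", async () => {
    const response = await askChatbot({
      project: "San Jose Sonora",
      file: 2,
      page: 3,
      prompt: "Do we have any violations of the code on this page?",
    });

    const toolNames = response.toolCalls.map((call) => call.name);

    // Existing analysis must be checked before anything else ...
    expect(toolNames[0]).toBe("GetAvailableAnalysis");
    // ... and no expensive compliance task may start without user confirmation.
    expect(toolNames).not.toContain("StartPageSectionComplianceReportTask");

    // The reply should surface the existing report (e.g., "CBC 2022").
    expect(response.text).toMatch(/CBC 2022/);
  });
});
```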
Test Categories
1. Document Search & Reference
Tests that verify the bot can find and reference specific information in documents.
Examples:
- TC-001: Page Reference Query - Second Floor
2. Multi-File Analysis
Tests that require synthesizing information across multiple files.
Examples:
- (To be added)
3. Technical Detail Extraction
Tests that require extracting specific technical details (dimensions, materials, codes).
Examples:
- (To be added)
4. Comparative Analysis
Tests that require comparing information across sections or documents.
Examples:
- (To be added)
5. Code Compliance Questions
Tests related to building codes, regulations, and compliance.
Examples:
- TC-002: Code Violations Check - Context-Aware
6. Clarification & Ambiguity
Tests with ambiguous queries that require clarification or intelligent interpretation.
Examples:
- (To be added)
7. Edge Cases & Error Handling
Tests for unusual inputs, missing data, or error conditions.
Examples:
- (To be added)
Adding New Test Cases
When adding a new test case:
- Assign a unique ID (TC-XXX format)
- Use the exact prompt you tested with
- Document the test project and its characteristics (number of files, size, etc.)
- Be specific about expected behavior - what should happen?
- List concrete verification points - how do you know it worked?
- Update the status as you test
- Add notes about interesting findings, bugs, or improvements needed
- Categorize the test case appropriately
Future Automation
This document will be used to generate:
- Jasmine/Jest integration tests for frontend chat interactions
- JUnit tests for backend RAG pipeline
- Evaluation metrics (precision, recall, link accuracy)
- Regression test suite for releases
- Performance benchmarks (response time, token usage)
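As one example of the evaluation metrics listed above, page-reference precision and recall could be computed roughly as follows (the function name and input shape are assumptions; the metric definitions are the standard ones):

```typescript
// Sketch of a page-reference precision/recall calculation.
function pageReferenceScores(
  expected: Set<string>, // ground-truth "file:page" references
  found: Set<string>,    // references extracted from the bot's response
): { precision: number; recall: number } {
  const truePositives = [...found].filter((ref) => expected.has(ref)).length;
  return {
    precision: found.size === 0 ? 0 : truePositives / found.size,
    recall: expected.size === 0 ? 0 : truePositives / expected.size,
  };
}

// Example with placeholder references: the bot cites 3 pages, 2 of which are in
// the 4-page ground truth, giving precision ≈ 0.67 and recall = 0.5.
const scores = pageReferenceScores(
  new Set(["file1:3", "file1:7", "file2:5", "file3:2"]),
  new Set(["file1:3", "file2:5", "file3:9"]),
);
console.log(scores);
```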
Test Data Requirements
For automated testing, we'll need:
- Sample projects with known content (ground truth)
- Expected response templates or validation rules
- Link validation test harness
- Response quality scoring rubric
- Performance baselines
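A sketch of what a ground-truth fixture for one sample project might look like (the structure, file names, and values below are placeholders, not real project data):

```typescript
// Hypothetical ground-truth fixture format; values are illustrative only.
interface GroundTruthFixture {
  project: string;
  // For each probe term, the "file:page" locations where it must be found.
  expectedReferences: Record<string, string[]>;
  // Upper bound on acceptable response latency for performance baselines (ms).
  maxResponseTimeMs: number;
}

const sampleFixture: GroundTruthFixture = {
  project: "San Jose Sonora",
  expectedReferences: {
    "Second Floor": ["file1:3", "file2:5"], // placeholder values
  },
  maxResponseTimeMs: 15_000,
};
```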
Changelog
- 2025-10-27: Initial document created with TC-001
- 2025-10-27: Added TC-002 - Code violations check with context awareness