Chatbot Test Prompts & Evaluation Cases

Overview

This document tracks test prompts used during ad-hoc testing of the chatbot feature. These prompts will serve as the basis for future automated integration tests and evaluation metrics.

Purpose

  • Capture real-world user queries encountered during development
  • Document expected behavior for each prompt
  • Track edge cases and failure modes
  • Enable reproducible testing across iterations
  • Feed automated test suites and evaluation frameworks

Test Case Format

Each test case should include:

  • Prompt: The exact user query
  • Test Project: Which project/files to use
  • Expected Behavior: What the bot should do
  • Verification Points: Specific things to check
  • Status: ✅ Pass, ❌ Fail, ⏳ Pending
  • Notes: Additional context or issues found
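
For eventual automation, the fields above could map onto a small typed structure. The sketch below is one possible shape, not an agreed schema; the type and field names are illustrative assumptions.

```typescript
// Hypothetical shape for a recorded test case; names are illustrative, not a fixed schema.
type TestStatus = 'pass' | 'fail' | 'pending';

interface ChatTestCase {
  id: string;                    // unique ID, e.g. "TC-001"
  prompt: string;                // the exact user query
  testProject: string;           // which project/files to use
  expectedBehavior: string[];    // what the bot should do
  verificationPoints: string[];  // specific things to check
  status: TestStatus;
  notes?: string[];              // additional context or issues found
}
```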

Test Cases

TC-001: Page Reference Query - Second Floor

Prompt: "Which files and pages have information about the Second Floor?"

Test Project: San Jose Sonora (3 files)

Expected Behavior:

  • Bot should identify all files containing "Second Floor" references
  • Bot should provide specific page numbers where information appears
  • Response decorator should add clickable links to the referenced pages

Verification Points:

  • Response mentions all relevant files
  • Page numbers are accurate
  • Links are properly formatted and clickable
  • Links navigate to correct page in viewer
  • Response is concise and well-organized

Status: ⏳ Pending

Notes:

  • This tests basic document search and reference extraction
  • Tests the response decorator's link generation
  • Critical for user experience - links must work correctly
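
Once a test harness exists, TC-001 could translate into a Jest integration test along these lines. This is a minimal sketch: the `sendChatPrompt` helper, its options, and the Markdown-style `[Page N](...)` link format are assumptions, not existing APIs.

```typescript
// Sketch of an automated TC-001 check. `sendChatPrompt` is a hypothetical test helper
// that sends a prompt to the chatbot and returns the decorated response.
import { sendChatPrompt } from './chatTestHarness'; // assumed helper module, not an existing one

describe('TC-001: Page Reference Query - Second Floor', () => {
  it('lists files and pages that mention the Second Floor, with clickable links', async () => {
    const response = await sendChatPrompt(
      'Which files and pages have information about the Second Floor?',
      { project: 'San Jose Sonora' },
    );

    // Response mentions the topic; ground-truth file names would come from the known-content project.
    expect(response.text).toMatch(/Second Floor/i);

    // Links are properly formatted; this assumes the decorator emits Markdown-style page links.
    const links = response.text.match(/\[Page \d+\]\([^)]+\)/g) ?? [];
    expect(links.length).toBeGreaterThan(0);
  });
});
```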

TC-002: Code Violations Check - Context-Aware

Prompt: "Do we have any violations of the code on this page?"

Test Project: San Jose Sonora (multi-file)

Test Page: File 2, Page 3 (Compliance tab with existing reports)

Expected Behavior:

  • Bot should call GetAvailableAnalysis FIRST to check for existing reports
  • Bot should inform user about existing analysis (e.g., "CBC 2022: 3 violations found")
  • Bot should list the violations or summarize findings
  • Bot should ask if user wants analysis for other codes
  • Bot should NOT run expensive analysis without confirmation

Verification Points:

  • Calls GetAvailableAnalysis API first
  • Mentions the existing code book (e.g., "CBC 2022")
  • Reports violation count accurately
  • Provides details about violations found
  • Asks before running new expensive analysis
  • Does NOT call StartPageSectionComplianceReportTask without confirmation

Status: ⏳ Pending

Notes:

  • Tests cost-awareness and existing analysis detection
  • Critical for avoiding redundant expensive operations
  • Should reference screenshot showing existing compliance reports
  • Agent must be smart about not re-running already-completed analysis
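
TC-002 is easier to verify by asserting on the tool calls the agent makes than on prose alone. The sketch below assumes a hypothetical `runChatTurn` helper that returns the agent's reply together with its ordered tool invocations; the helper and its return shape are illustrative, while the tool names come from the expected behavior above.

```typescript
// Sketch of an automated TC-002 check. `runChatTurn` is a hypothetical helper returning
// the agent's response plus the tool calls it made, in order.
import { runChatTurn } from './chatTestHarness'; // assumed helper module, not an existing one

describe('TC-002: Code Violations Check - Context-Aware', () => {
  it('checks existing reports first and does not start a new analysis unprompted', async () => {
    const turn = await runChatTurn('Do we have any violations of the code on this page?', {
      project: 'San Jose Sonora',
      file: 2,
      page: 3,
    });

    const toolNames = turn.toolCalls.map((call) => call.name);

    // GetAvailableAnalysis must be the first tool called.
    expect(toolNames[0]).toBe('GetAvailableAnalysis');

    // The expensive compliance task must not run without user confirmation.
    expect(toolNames).not.toContain('StartPageSectionComplianceReportTask');

    // The reply should surface the existing code book and its findings.
    expect(turn.response.text).toMatch(/CBC 2022/);
  });
});
```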

Test Categories

1. Document Search & Reference

Tests that verify the bot can find and reference specific information in documents.

Examples:

  • TC-001: Page Reference Query - Second Floor

2. Multi-File Analysis

Tests that require synthesizing information across multiple files.

Examples:

  • (To be added)

3. Technical Detail Extraction

Tests that require extracting specific technical details (dimensions, materials, codes).

Examples:

  • (To be added)

4. Comparative Analysis

Tests that require comparing information across sections or documents.

Examples:

  • (To be added)

5. Code Compliance Questions

Tests related to building codes, regulations, and compliance.

Examples:

  • TC-002: Code Violations Check - Context-Aware

6. Clarification & Ambiguity

Tests with ambiguous queries that require clarification or intelligent interpretation.

Examples:

  • (To be added)

7. Edge Cases & Error Handling

Tests for unusual inputs, missing data, or error conditions.

Examples:

  • (To be added)

Adding New Test Cases

When adding a new test case:

  1. Assign a unique ID (TC-XXX format)
  2. Use the exact prompt you tested with
  3. Document the test project and its characteristics (number of files, size, etc.)
  4. Be specific about expected behavior - what should happen?
  5. List concrete verification points - how do you know it worked?
  6. Update the status as you test
  7. Add notes about interesting findings, bugs, or improvements needed
  8. Categorize the test case appropriately

Future Automation

This document will be used to generate:

  • Jasmine/Jest integration tests for frontend chat interactions
  • JUnit tests for backend RAG pipeline
  • Evaluation metrics (precision, recall, link accuracy)
  • Regression test suite for releases
  • Performance benchmarks (response time, token usage)
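
As one example of the evaluation metrics listed above, page-reference precision and recall can be computed against a ground-truth list of expected references. The sketch below uses standard set-based precision/recall; the `file:page` string format is an assumption for illustration.

```typescript
// Set-based precision/recall over page references, e.g. "file2:p3".
// The reference string format is an assumption for illustration.
function referenceMetrics(expected: string[], predicted: string[]) {
  const expectedSet = new Set(expected);
  const predictedSet = new Set(predicted);

  const truePositives = [...predictedSet].filter((ref) => expectedSet.has(ref)).length;
  const precision = predictedSet.size === 0 ? 0 : truePositives / predictedSet.size;
  const recall = expectedSet.size === 0 ? 0 : truePositives / expectedSet.size;

  return { precision, recall };
}

// Example: the bot cited two of the three expected pages plus one extra page,
// giving precision = 2/3 and recall = 2/3.
const metrics = referenceMetrics(
  ['file1:p2', 'file2:p5', 'file3:p1'],
  ['file1:p2', 'file2:p5', 'file2:p7'],
);
```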

Test Data Requirements

For automated testing, we'll need:

  • Sample projects with known content (ground truth)
  • Expected response templates or validation rules
  • Link validation test harness
  • Response quality scoring rubric
  • Performance baselines
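
One way to keep the ground truth and validation rules together is a per-test-case fixture checked in alongside the sample projects. The shape below is a sketch; the field names and placeholder values are assumptions, and real values depend on the known-content projects.

```typescript
// Hypothetical ground-truth fixture; values shown are placeholders, not verified data.
interface GroundTruthFixture {
  testCaseId: string;
  project: string;
  expectedReferences: string[]; // pages the response must cite, e.g. "file1:p2"
  requiredPhrases: RegExp[];    // validation rules for the response text
  maxResponseTimeMs?: number;   // optional performance baseline
}

const tc001Fixture: GroundTruthFixture = {
  testCaseId: 'TC-001',
  project: 'San Jose Sonora',
  expectedReferences: ['file1:p2'], // placeholder until the ground-truth project is defined
  requiredPhrases: [/Second Floor/i],
  maxResponseTimeMs: 15000,         // placeholder baseline
};
```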

Changelog

  • 2025-10-27: Initial document created with TC-001
  • 2025-10-27: Added TC-002 - Code violations check with context awareness