LLM Log Sanitization

This document describes the LLM log sanitization feature that automatically removes or redacts sensitive information from log traces.

Overview

The LLM log sanitization system provides configurable filtering of sensitive content from both:

  • BigQuery trace logs (via LlmLogTracer)
  • Standard application logs (via Logger)

Features

Supported Content Types

  1. Binary Content Sanitization:

    • PDF files (<pdf-binary-data-redacted>)
    • Images: JPEG, PNG, GIF, WebP (<image-binary-data-redacted>)
    • Videos: MP4, AVI (<video-binary-data-redacted>)
    • Audio: MP3, WAV (<audio-binary-data-redacted>)
  2. Personally Identifiable Information (PII) Sanitization:

    • Email addresses (<pii-data-redacted>)
    • Phone numbers (<pii-data-redacted>)
    • Social Security Numbers (<pii-data-redacted>)

Configuration

The sanitizer is configured through environment variables or system properties:

Environment Variables

# Master switch for sanitization
export LLM_LOG_SANITIZATION_ENABLED=true

# Individual content type controls
export LLM_LOG_SANITIZE_PDF_CONTENT=true
export LLM_LOG_SANITIZE_PII_CONTENT=false
export LLM_LOG_SANITIZE_IMAGE_CONTENT=true
export LLM_LOG_SANITIZE_VIDEO_CONTENT=true
export LLM_LOG_SANITIZE_AUDIO_CONTENT=true

System Properties

Alternatively, use JVM system properties. If both are set, environment variables take precedence:

-Dllm.log.sanitization.enabled=true
-Dllm.log.sanitize.pdf.content=true
-Dllm.log.sanitize.pii.content=false
-Dllm.log.sanitize.image.content=true
-Dllm.log.sanitize.video.content=true
-Dllm.log.sanitize.audio.content=true
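
The precedence rule above (environment variable first, then system property, then the built-in default) can be sketched as follows. This is a minimal illustration, not the actual code in LlmLogSanitizationConfig: the class SanitizationFlagResolver and its resolve method are hypothetical names, and the environment is passed in as a map so the logic can be exercised without changing the real process environment.

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper (an assumption for illustration, not the project's API). */
final class SanitizationFlagResolver {

    private SanitizationFlagResolver() {}

    /**
     * Resolves one boolean flag. A non-blank environment value wins,
     * then a non-blank system property, then the supplied default.
     */
    static boolean resolve(Map<String, String> env, Properties props,
                           String envName, String propName, boolean defaultValue) {
        String value = env.get(envName);
        if (value == null || value.trim().isEmpty()) {
            value = props.getProperty(propName);
        }
        if (value == null || value.trim().isEmpty()) {
            return defaultValue;
        }
        // Boolean.parseBoolean returns true only for "true" (case-insensitive).
        return Boolean.parseBoolean(value.trim());
    }

    /** Convenience overload that reads the real environment and JVM properties. */
    static boolean resolve(String envName, String propName, boolean defaultValue) {
        return resolve(System.getenv(), System.getProperties(), envName, propName, defaultValue);
    }
}
```

For example, resolve("LLM_LOG_SANITIZE_PII_CONTENT", "llm.log.sanitize.pii.content", false) would return false unless one of the two sources sets the flag to true.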

Default Configuration

  • Sanitization Enabled: true
  • PDF Content: true (sanitized)
  • PII Content: false (not sanitized by default)
  • Image Content: true (sanitized)
  • Video Content: true (sanitized)
  • Audio Content: true (sanitized)

Preset Configurations

Development Configuration

LlmLogSanitizationConfig.createDevelopmentConfig()
// Sanitizes only PDF content for minimal impact during development

Production Configuration

LlmLogSanitizationConfig.createProductionConfig()
// Sanitizes all content types including PII for maximum security

Disabled Configuration

LlmLogSanitizationConfig.createDisabledConfig()
// Disables all sanitization
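
Read as flag combinations, the three presets described above might look like the sketch below. The SanitizationFlags class and its field names are hypothetical and only mirror the preset descriptions; the real LlmLogSanitizationConfig may store its settings differently.

```java
/** Hypothetical value object (an assumption for illustration, not the project's API). */
final class SanitizationFlags {
    final boolean enabled;
    final boolean pdf;
    final boolean pii;
    final boolean image;
    final boolean video;
    final boolean audio;

    SanitizationFlags(boolean enabled, boolean pdf, boolean pii,
                      boolean image, boolean video, boolean audio) {
        this.enabled = enabled;
        this.pdf = pdf;
        this.pii = pii;
        this.image = image;
        this.video = video;
        this.audio = audio;
    }

    /** Development: only PDF content is sanitized. */
    static SanitizationFlags development() {
        return new SanitizationFlags(true, true, false, false, false, false);
    }

    /** Production: every content type, including PII, is sanitized. */
    static SanitizationFlags production() {
        return new SanitizationFlags(true, true, true, true, true, true);
    }

    /** Disabled: the master switch is off, so nothing is sanitized. */
    static SanitizationFlags disabled() {
        return new SanitizationFlags(false, false, false, false, false, false);
    }

    /** A content type is redacted only when the master switch is also on. */
    boolean shouldSanitizePii() {
        return enabled && pii;
    }

    boolean shouldSanitizePdf() {
        return enabled && pdf;
    }
}
```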

Usage Examples

Programmatic Usage

// Create a custom sanitizer
LlmLogTraceSanitizer sanitizer = new LlmLogTraceSanitizer.Builder()
    .sanitizePdfContent(true)
    .sanitizePiiContent(true)
    .sanitizeImageContent(false)
    .build();

// Sanitize JSON content
String sanitizedJson = sanitizer.sanitizeJson(originalJson);

// Sanitize plain text
String sanitizedText = sanitizer.sanitizeText(originalText);

Automatic Integration

The sanitizer is automatically integrated into:

  1. BigQuery Logging: All traces logged to BigQuery are automatically sanitized
  2. Standard Logging: Request/response JSON in application logs is sanitized

Configuration Loading

// Load configuration from environment/properties
LlmLogSanitizationConfig config = new LlmLogSanitizationConfig();
LlmLogTraceSanitizer sanitizer = config.createSanitizer();

// Check configuration
logger.info("Sanitization config: " + config.getSummary());

Security Considerations

What Gets Sanitized

  1. Base64 Encoded Binary Data: Detected by:

    • Pattern matching for base64 strings longer than 1000 characters
    • File signature verification (magic numbers)
    • Field name heuristics (data, content, bytes, blob)
  2. PII in Text: Detected by regex patterns for:

    • Email addresses
    • US phone numbers
    • Social Security Numbers
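
The first two binary checks listed above (length threshold and file signature) can be sketched as follows. This is an assumed illustration rather than the logic in LlmLogTraceSanitizer: the class BinaryPayloadDetector and the method placeholderFor are hypothetical, only the PDF and image signatures are covered, and the field-name heuristic is left out.

```java
import java.util.Base64;

/** Hypothetical detector (an assumption for illustration, not the project's API). */
final class BinaryPayloadDetector {

    /** Strings at or below this length are never treated as binary payloads. */
    static final int MIN_BASE64_LENGTH = 1000;

    private BinaryPayloadDetector() {}

    /** Returns the redaction placeholder for a recognized payload, or null otherwise. */
    static String placeholderFor(String value) {
        if (value == null || value.length() <= MIN_BASE64_LENGTH) {
            return null;
        }
        byte[] head;
        try {
            // 16 base64 characters decode to the first 12 bytes of the payload.
            head = Base64.getDecoder().decode(value.substring(0, 16));
        } catch (IllegalArgumentException e) {
            return null; // the prefix is not valid base64
        }
        if (startsWith(head, 0x25, 0x50, 0x44, 0x46)) {          // "%PDF"
            return "<pdf-binary-data-redacted>";
        }
        if (startsWith(head, 0x89, 0x50, 0x4E, 0x47)            // PNG
                || startsWith(head, 0xFF, 0xD8, 0xFF)           // JPEG
                || startsWith(head, 0x47, 0x49, 0x46, 0x38)) {  // "GIF8"
            return "<image-binary-data-redacted>";
        }
        return null; // long base64, but no known signature
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Decoding only a short prefix keeps the signature check cheap even for multi-megabyte payloads.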

What Doesn't Get Sanitized

  • Small base64 strings (< 1000 characters)
  • Binary data that doesn't match known file signatures
  • PII, including PII inside structured fields, when PII sanitization is disabled (the default)
  • Non-standard PII formats
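
To make the last point concrete, here is a minimal sketch of regex-based PII redaction. The patterns and the class PiiRedactionSketch are assumptions written for this example, not the patterns used by LlmLogTraceSanitizer. They catch common formats and deliberately show how a non-standard format passes through untouched.

```java
import java.util.regex.Pattern;

/** Hypothetical redactor (an assumption for illustration, not the project's API). */
final class PiiRedactionSketch {

    static final String PLACEHOLDER = "<pii-data-redacted>";

    // Simplified email pattern: local part, "@", domain with a top-level label.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // US Social Security Number in the hyphenated form 123-45-6789.
    private static final Pattern SSN =
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    // US phone number such as (555) 123-4567, 555-123-4567 or +1 555.123.4567.
    private static final Pattern US_PHONE = Pattern.compile(
            "(?<!\\d)(?:\\+1[ .-]?)?(?:\\(\\d{3}\\)[ .-]?|\\d{3}[ .-])\\d{3}[ .-]\\d{4}(?!\\d)");

    private PiiRedactionSketch() {}

    /** Replaces every match of the three patterns with the placeholder. */
    static String redact(String text) {
        String out = EMAIL.matcher(text).replaceAll(PLACEHOLDER);
        out = SSN.matcher(out).replaceAll(PLACEHOLDER);
        out = US_PHONE.matcher(out).replaceAll(PLACEHOLDER);
        return out;
    }
}
```

With these patterns, redact("Contact jane.doe@example.com") yields "Contact <pii-data-redacted>", while "jane dot doe at example dot com" and an unhyphenated "123456789" are returned unchanged.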

Limitations

  • Performance: Sanitizing large payloads adds processing time to each logged trace
  • False Positives: Legitimate base64 data that matches the detection rules may be redacted
  • False Negatives: Unusual or layered encodings may bypass detection
  • PII Detection: The basic regex patterns may miss complex or non-standard PII formats

Testing

Run the sanitizer test to verify functionality:

mvn compile exec:java -Dexec.mainClass="org.codetricks.construction.code.assistant.ai.model.LlmLogSanitizerTest"

This will test different sanitizer configurations with sample PDF and PII data.

Monitoring

The sanitizer logs its configuration at startup:

INFO: LLM Log Sanitization Configuration loaded:
INFO: Sanitization Enabled: true
INFO: PDF Content: true
INFO: PII Content: false
INFO: Image Content: true
INFO: Video Content: true
INFO: Audio Content: true

Extending the Sanitizer

To add new content types or PII patterns:

  1. Add new detection methods in LlmLogTraceSanitizer
  2. Update configuration in LlmLogSanitizationConfig
  3. Add new patterns to the appropriate detection methods
  4. Update tests to verify new functionality

Example:

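import java.util.regex.Pattern;

// Add credit card detection: 16 digits in groups of four,
// optionally separated by hyphens or spaces
private static final Pattern CREDIT_CARD_PATTERN = Pattern.compile(
    "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b"
);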
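
A pattern like the one above matches any 16-digit group, including order numbers and other identifiers. The sketch below shows one way a detection method might apply it, with a Luhn checksum added to reduce false positives. The Luhn step and the class CreditCardRedactionSketch are assumptions introduced for this example; they are not part of the documented sanitizer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical detection method (an assumption for illustration, not the project's API). */
final class CreditCardRedactionSketch {

    static final String PLACEHOLDER = "<pii-data-redacted>";

    private static final Pattern CREDIT_CARD_PATTERN = Pattern.compile(
            "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b");

    private CreditCardRedactionSketch() {}

    /** Redacts 16-digit matches that also pass the Luhn checksum. */
    static String redact(String text) {
        Matcher m = CREDIT_CARD_PATTERN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = passesLuhn(m.group()) ? PLACEHOLDER : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Luhn check: double every second digit from the right and sum the results. */
    static boolean passesLuhn(String candidate) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = candidate.length() - 1; i >= 0; i--) {
            char c = candidate.charAt(i);
            if (c < '0' || c > '9') {
                continue; // skip separators
            }
            int digit = c - '0';
            if (doubleIt) {
                digit *= 2;
                if (digit > 9) {
                    digit -= 9;
                }
            }
            sum += digit;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

The widely used test number 4111 1111 1111 1111 passes the checksum and is redacted, whereas 1234 5678 9012 3456 fails it and is left in place.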

Best Practices

  1. Production: Enable PII sanitization in production environments
  2. Development: Use minimal sanitization for faster development
  3. Testing: Verify sanitization doesn't break functionality
  4. Monitoring: Check logs to ensure configuration is applied correctly
  5. Performance: Monitor impact on large payloads

Troubleshooting

Issue: Sanitization not working

  • Check that the environment variables are set correctly
  • Verify the configuration logged at startup
  • Run the provided test class

Issue: False positives

  • Adjust detection thresholds
  • Customize patterns for your use case
  • Consider disabling specific sanitization types

Issue: Performance impact

  • Monitor processing time for large payloads
  • Consider disabling unnecessary sanitization types
  • Optimize detection patterns

Related Classes

  • LlmLogTraceSanitizer - Main sanitization logic
  • LlmLogSanitizationConfig - Configuration management
  • LlmLogTracer - BigQuery logging integration
  • GoogleGenAiClient - Standard logging integration