INVESTIGATION

Documentation for INVESTIGATION from the Ds ex repository.

Critical Bug Investigation: Mock API Key Contamination in Test Environment

Executive Summary

Date: June 10, 2025
Severity: HIGH - Logic Error with False Positives
Reporter: Development Team
Status: Active Investigation

A critical logical error has been identified in the test infrastructure where the system reports using “LIVE API” mode (🟢) but displays mock API keys (mock***), yet tests continue to pass. This represents a fundamental breakdown in the test environment’s state management and could mask real failures.

Problem Statement

The test output shows:

🟢 [TEST MODE] Using LIVE API for gemini (API key: mock***)

This output is self-contradictory: a key that starts with mock cannot belong to a live API session. It indicates a serious flaw in our testing infrastructure. Tests that should either:

Fail due to invalid API credentials, OR
Be correctly identified as mock tests

are instead passing while displaying contradictory state information.

Root Cause Analysis

Primary Issue: State Contamination in Mock Environment

The bug originates in test/support/mock_helpers.exs in the setup_adaptive_client/1 function:

def setup_adaptive_client(provider \\ :gemini) do
  if api_key_available?(provider) do
    # This branch runs when an API key is detected
    # ...gets the ACTUAL API key and shows first 4 chars
    IO.puts("\n🟢 [TEST MODE] Using LIVE API for #{provider} (API key: #{key_preview})")
  else
    # This branch runs when NO API key is detected  
    IO.puts("\n🟡 [TEST MODE] Using MOCK FALLBACK for #{provider} (no API key detected)")
    
    # BUT THEN IT SETS A MOCK API KEY!
    System.put_env(env_var, "mock-api-key-for-testing-persistent")
  end
end

The Contamination Sequence

Test 1 runs: No API key exists → Mock branch executes → Sets System.put_env(env_var, "mock-api-key-for-testing-persistent")
Test 2 runs: api_key_available?() NOW RETURNS TRUE because the mock key was set globally
Test 2 logic: Enters the “live API” branch, reads the mock key, shows mock*** but claims it’s live
Test 2 result: Creates a real ClientManager process with a fake key, but tests pass anyway
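
The sequence above can be reproduced in isolation. The following is a minimal sketch, not the repository's code: the module name ContaminationDemo and the variable name DEMO_PROVIDER_API_KEY are hypothetical, and the setup function is reduced to the branch logic described above.

```elixir
defmodule ContaminationDemo do
  # Hypothetical variable name, used only for this demonstration.
  @env_var "DEMO_PROVIDER_API_KEY"
  @mock_key "mock-api-key-for-testing-persistent"

  def env_var, do: @env_var

  # Mirrors the check described in this document: any non-empty value counts.
  def api_key_available? do
    System.get_env(@env_var) not in [nil, ""]
  end

  # Reduced version of the buggy setup: the mock branch writes to the
  # global environment, so the next call takes the "live" branch.
  def setup do
    if api_key_available?() do
      preview = String.slice(System.get_env(@env_var), 0, 4) <> "***"
      {:live, preview}
    else
      System.put_env(@env_var, @mock_key)
      :mock
    end
  end
end

# Start from a clean environment, then run "Test 1" and "Test 2".
System.delete_env(ContaminationDemo.env_var())
first = ContaminationDemo.setup()
second = ContaminationDemo.setup()
IO.inspect({first, second})
# → {:mock, {:live, "mock***"}}

# Leave the environment as we found it.
System.delete_env(ContaminationDemo.env_var())
```

The second call reports a live key with the preview mock***, which matches the contradictory log line in the Problem Statement.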

Why Tests Still Pass

The tests pass because:

Graceful Degradation: The ClientManager is designed to handle API failures gracefully
Error Masking: Invalid API keys result in network errors that are caught and handled as “expected test environment behavior”

Mock Fallback Logic: The unified request functions have fallback logic for failed API calls:

def unified_request({:mock, client}, messages, opts) do
  case DSPEx.ClientManager.request(client, messages, opts) do
    {:ok, response} -> {:ok, response}
    {:error, _reason} ->
      # Return mock response when API calls fail
      {:ok, %{answer: "Mock response for testing"}}
  end
end

Process Validation Only: Tests primarily validate that processes are alive and have valid stats, not that they’re actually making successful API calls
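
The masking effect can be shown without any network access. This sketch assumes a hypothetical request_fun parameter that stands in for DSPEx.ClientManager.request/3, so a rejected key can be simulated; the fallback clause has the same shape as the one quoted above.

```elixir
defmodule MaskingDemo do
  # Same shape as the fallback quoted above. `request_fun` is a hypothetical
  # parameter standing in for DSPEx.ClientManager.request/3.
  def unified_request({:mock, client}, messages, opts, request_fun) do
    case request_fun.(client, messages, opts) do
      {:ok, response} -> {:ok, response}
      # Every error, including a rejected API key, becomes a success.
      {:error, _reason} -> {:ok, %{answer: "Mock response for testing"}}
    end
  end
end

# Simulates an API that rejects the fake key.
rejecting_api = fn _client, _messages, _opts -> {:error, :invalid_api_key} end

messages = [%{role: "user", content: "ping"}]
result = MaskingDemo.unified_request({:mock, self()}, messages, [], rejecting_api)
IO.inspect(result)
# → {:ok, %{answer: "Mock response for testing"}}
```

A test that only matches on {:ok, _} cannot tell this result apart from a real API response, which is why the contaminated tests stay green.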

Impact Assessment

High Severity Issues

False Confidence: Tests appear to be testing live API integration when they’re actually testing error handling
Masked Real Issues: Actual API integration problems could be hidden by this fallback behavior
State Pollution: Earlier tests contaminate the environment for later tests
Debugging Confusion: Developers see contradictory logging making troubleshooting extremely difficult

Potential Hidden Failures

Real API integration issues being masked as “normal test behavior”
Configuration problems not being detected
Network/timeout issues being silently handled
Invalid response parsing potentially bypassed

Technical Details

Environment Variable Pollution

The root cause is global state mutation:

# This creates global contamination:
System.put_env(env_var, "mock-api-key-for-testing-persistent")

# Later tests see this and think it's a real key:
def api_key_available?(provider) do
  env_var = get_env_var_name(provider)
  env_var && System.get_env(env_var) not in [nil, ""]  # Returns TRUE for mock key!
end
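
To illustrate why this mutation leaks, the sketch below contrasts System.put_env/2, which changes the environment for the whole VM, with the process dictionary, which is local to one process. The variable name DEMO_GLOBAL_ENV_VAR and the key :demo_key are hypothetical; the process dictionary is used only as the simplest example of process-local state, not as a recommended storage mechanism.

```elixir
var = "DEMO_GLOBAL_ENV_VAR"
System.delete_env(var)

# A separate process (think: another test) sets the environment variable.
Task.async(fn -> System.put_env(var, "mock-api-key-for-testing-persistent") end)
|> Task.await()

# The calling process sees the value: the environment is global to the VM.
seen_in_env = System.get_env(var)

# The same experiment with process-local state.
Task.async(fn -> Process.put(:demo_key, "mock") end)
|> Task.await()

# The calling process does not see it: the process dictionary is per process.
seen_in_pdict = Process.get(:demo_key)

IO.inspect({seen_in_env, seen_in_pdict})
# → {"mock-api-key-for-testing-persistent", nil}

System.delete_env(var)
```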

Process Lifecycle Issues

The cleanup mechanism is insufficient:

on_exit(fn ->
  System.delete_env(env_var)  # This only runs AFTER the test
end)

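Because System.put_env/2 changes the environment for the entire VM, the mock key is visible to every other test until this callback fires. Any test that runs in that window, for example one in a concurrently running async module, has already been contaminated by the time cleanup happens.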

Evidence from Test Output

From the provided test run:

🟡 [TEST MODE] Using MOCK FALLBACK for gemini (no API key detected)  # Early tests
# ... many tests later ...
🟢 [TEST MODE] Using LIVE API for gemini (API key: mock***)           # Contaminated tests

This clearly shows the progression from clean mock state to contaminated “live” state.

Recommended Immediate Actions

1. Fix State Isolation (Critical)

Replace global environment mutation with process-local state:

def setup_adaptive_client(provider \\ :gemini) do
  if api_key_available?(provider) do
    # A real key is present: use the actual live API
    {:real, start_live_client(provider)}
  else
    # No key: use an isolated mock. Nothing is written to the
    # global environment, so later tests see the same clean state.
    {:mock, start_mock_client(provider)}
  end
end

2. Separate Mock and Live Client Paths (Critical)

Create distinct client types instead of relying on environment variables:

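defp start_mock_client(provider) do
  # Create a true mock client that doesn't touch environment variables.
  # Assumes start_link/2 follows the OTP convention of returning {:ok, pid}.
  {:ok, client} = MockClientManager.start_link(provider, mock_responses())
  client
end

defp start_live_client(provider) do
  # Only use when real API keys are confirmed available.
  # Assumes start_link/1 follows the OTP convention of returning {:ok, pid}.
  {:ok, client} = DSPEx.ClientManager.start_link(provider)
  client
end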
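
MockClientManager does not exist yet. One possible shape for it is sketched below as a small GenServer that serves canned responses and never reads or writes environment variables. Everything here is an assumption made for illustration: the request/3 function, the state layout, and the round-robin choice of responses are not taken from the repository.

```elixir
defmodule MockClientManager do
  use GenServer

  # `responses` must be a non-empty list of canned replies.
  def start_link(provider, responses) when is_list(responses) and responses != [] do
    GenServer.start_link(__MODULE__, {provider, responses})
  end

  # Hypothetical request function; the signature is an assumption.
  def request(client, messages, _opts \\ []) do
    GenServer.call(client, {:request, messages})
  end

  @impl true
  def init({provider, responses}) do
    {:ok, %{provider: provider, responses: responses, calls: 0}}
  end

  @impl true
  def handle_call({:request, _messages}, _from, state) do
    # Cycle through the canned responses in order.
    index = rem(state.calls, length(state.responses))
    response = Enum.at(state.responses, index)
    {:reply, {:ok, response}, %{state | calls: state.calls + 1}}
  end
end

{:ok, client} = MockClientManager.start_link(:gemini, [%{answer: "Mock response for testing"}])
{:ok, reply} = MockClientManager.request(client, [%{role: "user", content: "ping"}])
IO.inspect(reply)
# → %{answer: "Mock response for testing"}
```

Because the mock holds its responses in process state, two tests can each start their own instance without affecting one another.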

3. Add State Validation (High Priority)

def setup_adaptive_client(provider \\ :gemini) do
  # Validate clean state before starting
  validate_clean_test_environment(provider)
  
  # ... rest of setup
end

defp validate_clean_test_environment(provider) do
  env_var = get_env_var_name(provider)
  current_value = System.get_env(env_var)
  
  if current_value == "mock-api-key-for-testing-persistent" do
    raise "Test environment contaminated! Mock API key found in global state."
  end
end

Long-term Architectural Changes

Process-local Configuration: Move away from global environment variables for test configuration
Explicit Mock Objects: Create dedicated mock client implementations rather than relying on real clients with fake keys
Test Isolation: Ensure each test runs in a completely isolated environment
State Validation: Add comprehensive pre-test and post-test state validation

Verification Plan

Immediate: Add logging to track environment variable changes during test runs
Short-term: Implement the recommended fixes and verify clean state isolation
Long-term: Develop comprehensive integration test suite that validates both mock and live behavior explicitly
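
For the immediate step, one way to track environment variable changes is to route writes through a small wrapper that reports the old value, the new value, and the caller. This is a sketch under assumptions: the module name TrackedEnv is hypothetical, and it only helps if test helpers call it instead of System.put_env/2 directly. Values are truncated to a four-character preview so real keys do not end up in the output.

```elixir
defmodule TrackedEnv do
  # Hypothetical wrapper: reports each change to stderr, then delegates
  # to System.put_env/2.
  def put(var, value) do
    previous = System.get_env(var)
    {:current_stacktrace, stack} = Process.info(self(), :current_stacktrace)

    IO.puts(
      :stderr,
      "[env] #{var}: #{preview(previous)} -> #{preview(value)}\n" <>
        Exception.format_stacktrace(stack)
    )

    System.put_env(var, value)
  end

  # Show at most the first four characters, as the existing log lines do.
  defp preview(nil), do: "(unset)"
  defp preview(value), do: String.slice(value, 0, 4) <> "***"
end

System.delete_env("DEMO_TRACKED_VAR")
put_result = TrackedEnv.put("DEMO_TRACKED_VAR", "mock-api-key-for-testing-persistent")
stored = System.get_env("DEMO_TRACKED_VAR")
System.delete_env("DEMO_TRACKED_VAR")
```

The stack trace in each message points at the helper that wrote the variable, which should make the first contaminating test easy to find.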

Conclusion

This bug represents a critical failure in test reliability. While tests are passing, they’re not testing what they claim to test. The combination of global state pollution and graceful error handling has created a false sense of security where broken API integration is masked as normal behavior.

Priority: This must be fixed before any production deployment, as it fundamentally undermines confidence in the test suite’s ability to catch real integration issues.