← Back to Dev

INVESTIGATION

Documentation for INVESTIGATION from the Ds ex repository.

Critical Bug Investigation: Mock API Key Contamination in Test Environment

Executive Summary

Date: June 10, 2025
Severity: HIGH - Logic Error with False Positives
Reporter: Development Team
Status: Active Investigation

A critical logical error has been identified in the test infrastructure where the system reports using “LIVE API” mode (🟢) but displays mock API keys (mock***), yet tests continue to pass. This represents a fundamental breakdown in the test environment’s state management and could mask real failures.

Problem Statement

The test output shows:

🟢 [TEST MODE] Using LIVE API for gemini (API key: mock***)

This is logically impossible and indicates a serious flaw in our testing infrastructure. Tests that should either:

  1. Fail due to invalid API credentials, OR
  2. Be correctly identified as mock tests

are instead passing while displaying contradictory state information.

Root Cause Analysis

Primary Issue: State Contamination in Mock Environment

The bug originates in test/support/mock_helpers.exs in the setup_adaptive_client/1 function:

def setup_adaptive_client(provider \\ :gemini) do
  if api_key_available?(provider) do
    # This branch runs when an API key is detected
    # ...gets the ACTUAL API key and shows first 4 chars
    IO.puts("\n🟢 [TEST MODE] Using LIVE API for #{provider} (API key: #{key_preview})")
  else
    # This branch runs when NO API key is detected  
    IO.puts("\n🟡 [TEST MODE] Using MOCK FALLBACK for #{provider} (no API key detected)")
    
    # BUT THEN IT SETS A MOCK API KEY!
    System.put_env(env_var, "mock-api-key-for-testing-persistent")
  end
end

The Contamination Sequence

  1. Test 1 runs: No API key exists → Mock branch executes → Sets System.put_env(env_var, "mock-api-key-for-testing-persistent")
  2. Test 2 runs: api_key_available?() NOW RETURNS TRUE because the mock key was set globally
  3. Test 2 logic: Enters the “live API” branch, reads the mock key, shows mock*** but claims it’s live
  4. Test 2 result: Creates a real ClientManager process with a fake key, but tests pass anyway

Why Tests Still Pass

The tests pass because:

  1. Graceful Degradation: The ClientManager is designed to handle API failures gracefully

  2. Error Masking: Invalid API keys result in network errors that are caught and handled as “expected test environment behavior”

  3. Mock Fallback Logic: The unified request functions have fallback logic for failed API calls:

    def unified_request({:mock, client}, messages, opts) do
      case DSPEx.ClientManager.request(client, messages, opts) do
        {:ok, response} -> {:ok, response}
        {:error, _reason} ->
          # Return mock response when API calls fail
          {:ok, %{answer: "Mock response for testing"}}
      end
    end
    
  4. Process Validation Only: Tests primarily validate that processes are alive and have valid stats, not that they’re actually making successful API calls

Impact Assessment

High Severity Issues

  1. False Confidence: Tests appear to be testing live API integration when they’re actually testing error handling
  2. Masked Real Issues: Actual API integration problems could be hidden by this fallback behavior
  3. State Pollution: Earlier tests contaminate the environment for later tests
  4. Debugging Confusion: Developers see contradictory logging making troubleshooting extremely difficult

Potential Hidden Failures

  • Real API integration issues being masked as “normal test behavior”
  • Configuration problems not being detected
  • Network/timeout issues being silently handled
  • Invalid response parsing potentially bypassed

Technical Details

Environment Variable Pollution

The root cause is global state mutation:

# This creates global contamination:
System.put_env(env_var, "mock-api-key-for-testing-persistent")

# Later tests see this and think it's a real key:
def api_key_available?(provider) do
  env_var = get_env_var_name(provider)
  env_var && System.get_env(env_var) not in [nil, ""]  # Returns TRUE for mock key!
end

Process Lifecycle Issues

The cleanup mechanism is insufficient:

on_exit(fn ->
  System.delete_env(env_var)  # This only runs AFTER the test
end)

By the time cleanup runs, subsequent tests have already been contaminated.

Evidence from Test Output

From the provided test run:

🟡 [TEST MODE] Using MOCK FALLBACK for gemini (no API key detected)  # Early tests
# ... many tests later ...
🟢 [TEST MODE] Using LIVE API for gemini (API key: mock***)           # Contaminated tests

This clearly shows the progression from clean mock state to contaminated “live” state.

1. Fix State Isolation (Critical)

Replace global environment mutation with process-local state:

def setup_adaptive_client(provider \\ :gemini) do
  # Store original state
  env_var = get_env_var_name(provider)
  original_value = System.get_env(env_var)
  
  if api_key_available?(provider) do
    # Use actual live API
    {:real, start_live_client(provider)}
  else
    # Use isolated mock without contaminating global state
    {:mock, start_mock_client(provider)}
  end
end

2. Separate Mock and Live Client Paths (Critical)

Create distinct client types instead of relying on environment variables:

defp start_mock_client(provider) do
  # Create a true mock client that doesn't touch environment variables
  MockClientManager.start_link(provider, mock_responses())
end

defp start_live_client(provider) do
  # Only use when real API keys are confirmed available
  DSPEx.ClientManager.start_link(provider)
end

3. Add State Validation (High Priority)

def setup_adaptive_client(provider \\ :gemini) do
  # Validate clean state before starting
  validate_clean_test_environment(provider)
  
  # ... rest of setup
end

defp validate_clean_test_environment(provider) do
  env_var = get_env_var_name(provider)
  current_value = System.get_env(env_var)
  
  if current_value == "mock-api-key-for-testing-persistent" do
    raise "Test environment contaminated! Mock API key found in global state."
  end
end

Long-term Architectural Changes

  1. Process-local Configuration: Move away from global environment variables for test configuration
  2. Explicit Mock Objects: Create dedicated mock client implementations rather than relying on real clients with fake keys
  3. Test Isolation: Ensure each test runs in a completely isolated environment
  4. State Validation: Add comprehensive pre-test and post-test state validation

Verification Plan

  1. Immediate: Add logging to track environment variable changes during test runs
  2. Short-term: Implement the recommended fixes and verify clean state isolation
  3. Long-term: Develop comprehensive integration test suite that validates both mock and live behavior explicitly

Conclusion

This bug represents a critical failure in test reliability. While tests are passing, they’re not testing what they claim to test. The combination of global state pollution and graceful error handling has created a false sense of security where broken API integration is masked as normal behavior.

Priority: This must be fixed before any production deployment, as it fundamentally undermines confidence in the test suite’s ability to catch real integration issues.