Critical Bug Investigation: Mock API Key Contamination in Test Environment
Executive Summary
Date: June 10, 2025
Severity: HIGH - Logic Error with False Positives
Reporter: Development Team
Status: Active Investigation
A critical logical error has been identified in the test infrastructure where the system reports using “LIVE API” mode (🟢) but displays mock API keys (mock***
), yet tests continue to pass. This represents a fundamental breakdown in the test environment’s state management and could mask real failures.
Problem Statement
The test output shows:
🟢 [TEST MODE] Using LIVE API for gemini (API key: mock***)
This is logically impossible and indicates a serious flaw in our testing infrastructure. Tests that should either:
- Fail due to invalid API credentials, OR
- Be correctly identified as mock tests
are instead passing while displaying contradictory state information.
Root Cause Analysis
Primary Issue: State Contamination in Mock Environment
The bug originates in test/support/mock_helpers.exs
in the setup_adaptive_client/1
function:
def setup_adaptive_client(provider \\ :gemini) do
if api_key_available?(provider) do
# This branch runs when an API key is detected
# ...gets the ACTUAL API key and shows first 4 chars
IO.puts("\n🟢 [TEST MODE] Using LIVE API for #{provider} (API key: #{key_preview})")
else
# This branch runs when NO API key is detected
IO.puts("\n🟡 [TEST MODE] Using MOCK FALLBACK for #{provider} (no API key detected)")
# BUT THEN IT SETS A MOCK API KEY!
System.put_env(env_var, "mock-api-key-for-testing-persistent")
end
end
The Contamination Sequence
- Test 1 runs: No API key exists → Mock branch executes → Sets
System.put_env(env_var, "mock-api-key-for-testing-persistent")
- Test 2 runs:
api_key_available?()
NOW RETURNS TRUE because the mock key was set globally - Test 2 logic: Enters the “live API” branch, reads the mock key, shows
mock***
but claims it’s live - Test 2 result: Creates a real
ClientManager
process with a fake key, but tests pass anyway
Why Tests Still Pass
The tests pass because:
Graceful Degradation: The
ClientManager
is designed to handle API failures gracefullyError Masking: Invalid API keys result in network errors that are caught and handled as “expected test environment behavior”
Mock Fallback Logic: The unified request functions have fallback logic for failed API calls:
def unified_request({:mock, client}, messages, opts) do case DSPEx.ClientManager.request(client, messages, opts) do {:ok, response} -> {:ok, response} {:error, _reason} -> # Return mock response when API calls fail {:ok, %{answer: "Mock response for testing"}} end end
Process Validation Only: Tests primarily validate that processes are alive and have valid stats, not that they’re actually making successful API calls
Impact Assessment
High Severity Issues
- False Confidence: Tests appear to be testing live API integration when they’re actually testing error handling
- Masked Real Issues: Actual API integration problems could be hidden by this fallback behavior
- State Pollution: Earlier tests contaminate the environment for later tests
- Debugging Confusion: Developers see contradictory logging making troubleshooting extremely difficult
Potential Hidden Failures
- Real API integration issues being masked as “normal test behavior”
- Configuration problems not being detected
- Network/timeout issues being silently handled
- Invalid response parsing potentially bypassed
Technical Details
Environment Variable Pollution
The root cause is global state mutation:
# This creates global contamination:
System.put_env(env_var, "mock-api-key-for-testing-persistent")
# Later tests see this and think it's a real key:
def api_key_available?(provider) do
env_var = get_env_var_name(provider)
env_var && System.get_env(env_var) not in [nil, ""] # Returns TRUE for mock key!
end
Process Lifecycle Issues
The cleanup mechanism is insufficient:
on_exit(fn ->
System.delete_env(env_var) # This only runs AFTER the test
end)
By the time cleanup runs, subsequent tests have already been contaminated.
Evidence from Test Output
From the provided test run:
🟡 [TEST MODE] Using MOCK FALLBACK for gemini (no API key detected) # Early tests
# ... many tests later ...
🟢 [TEST MODE] Using LIVE API for gemini (API key: mock***) # Contaminated tests
This clearly shows the progression from clean mock state to contaminated “live” state.
Recommended Immediate Actions
1. Fix State Isolation (Critical)
Replace global environment mutation with process-local state:
def setup_adaptive_client(provider \\ :gemini) do
# Store original state
env_var = get_env_var_name(provider)
original_value = System.get_env(env_var)
if api_key_available?(provider) do
# Use actual live API
{:real, start_live_client(provider)}
else
# Use isolated mock without contaminating global state
{:mock, start_mock_client(provider)}
end
end
2. Separate Mock and Live Client Paths (Critical)
Create distinct client types instead of relying on environment variables:
defp start_mock_client(provider) do
# Create a true mock client that doesn't touch environment variables
MockClientManager.start_link(provider, mock_responses())
end
defp start_live_client(provider) do
# Only use when real API keys are confirmed available
DSPEx.ClientManager.start_link(provider)
end
3. Add State Validation (High Priority)
def setup_adaptive_client(provider \\ :gemini) do
# Validate clean state before starting
validate_clean_test_environment(provider)
# ... rest of setup
end
defp validate_clean_test_environment(provider) do
env_var = get_env_var_name(provider)
current_value = System.get_env(env_var)
if current_value == "mock-api-key-for-testing-persistent" do
raise "Test environment contaminated! Mock API key found in global state."
end
end
Long-term Architectural Changes
- Process-local Configuration: Move away from global environment variables for test configuration
- Explicit Mock Objects: Create dedicated mock client implementations rather than relying on real clients with fake keys
- Test Isolation: Ensure each test runs in a completely isolated environment
- State Validation: Add comprehensive pre-test and post-test state validation
Verification Plan
- Immediate: Add logging to track environment variable changes during test runs
- Short-term: Implement the recommended fixes and verify clean state isolation
- Long-term: Develop comprehensive integration test suite that validates both mock and live behavior explicitly
Conclusion
This bug represents a critical failure in test reliability. While tests are passing, they’re not testing what they claim to test. The combination of global state pollution and graceful error handling has created a false sense of security where broken API integration is masked as normal behavior.
Priority: This must be fixed before any production deployment, as it fundamentally undermines confidence in the test suite’s ability to catch real integration issues.