# V2 Pool Remaining Errors Analysis

## Executive Summary
After Phase 1 fixes, 17 tests still fail. These failures reveal deeper architectural issues beyond configuration:
- Python process not responding to init pings despite stderr capture
- Pool shutdown race conditions
- Adapter registry misconfiguration
- Test infrastructure expecting different behavior
- Missing program_id field in create_program calls
## Error Pattern Analysis

### Pattern 1: Port Communication Complete Failure (Critical)
Errors: #2, #9 - PortCommunicationTest
```
No response received within 5 seconds
Port info: [name: ~c"/home/home/.pyenv/shims/python3", ... os_pid: 1925679]
```
Root Cause: The Python process starts (we see the os_pid) but never sends a response. Even with stderr capture enabled, we see no Python errors, suggesting:
- Python is stuck in initialization
- Packet mode framing is broken
- Python stdout is being buffered
Theory: The test sends raw JSON (`Jason.encode!(%{"id" => 123, "command" => "ping"...})`), but the port is configured with `{:packet, 4}`. This means Python expects a 4-byte length header, while the test is sending raw JSON without the header.
Code Evidence (`test/port_communication_test.exs`):

```elixir
request = Jason.encode!(%{...})  # Just JSON, no packet header
Port.command(port, request)      # Sending raw JSON to a packet-mode port
```
Recommendation:

```elixir
# Use Protocol.encode_request, which adds the packet header
request = Protocol.encode_request(123, :ping, %{})
Port.command(port, request)
```
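For reference, a minimal sketch of the 4-byte length framing itself, assuming the protocol uses a big-endian length prefix as Erlang's `{:packet, 4}` mode does (verify the header layout against `Protocol` before relying on this):

```elixir
# Minimal framing sketch. A bare ::32 segment is big-endian unsigned by
# default, matching the {:packet, 4} convention; the header layout is an
# assumption, not confirmed from the codebase.
defmodule FramingSketch do
  # Prepend a 4-byte length header to an outgoing payload.
  def frame(payload) when is_binary(payload) do
    <<byte_size(payload)::32, payload::binary>>
  end

  # Split one framed message off the front of a buffer.
  def unframe(<<len::32, payload::binary-size(len), rest::binary>>),
    do: {:ok, payload, rest}

  def unframe(_incomplete), do: :incomplete
end
```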
### Pattern 2: Pool Shutdown During Checkout
Errors: #3, #4, #5, #7 - Various pool operations
```
{:error, {:shutdown, {NimblePool, :checkout, [:test_pool_8962]}}}
```
Root Cause: The pool GenServer is shutting down while clients are trying to check out. This happens when:
- A test ends and calls `stop_supervised` while operations are in flight
- The pool supervisor is crashing
- Pool initialization fails but the error is swallowed
Theory: Looking at line 318 of `session_pool_v2.ex`, the pool is STILL configured with `lazy: true` despite our config changes. The config isn’t being applied correctly.
Code Evidence (`lib/dspex/python_bridge/session_pool_v2.ex:318`):

```elixir
lazy: true,  # Hardcoded! Ignoring config
```
Recommendation:

```elixir
# Fix in session_pool_v2.ex init/1
lazy = Keyword.get(opts, :lazy, false)  # Read from opts

pool_config = [
  worker: {PoolWorkerV2, []},
  pool_size: pool_size,
  max_overflow: overflow,
  lazy: lazy,  # Use the config value
  name: pool_name
]
```
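Once the option is read from `opts`, tests can opt out of lazy initialization explicitly. A sketch, assuming the `start_test_pool` helper referenced later in this document forwards its options to `init/1`:

```elixir
# Sketch: force eager workers in a test; assumes start_test_pool forwards
# the :lazy option to SessionPoolV2.init/1 after the fix above.
pool_info =
  start_test_pool(
    pool_size: 2,
    overflow: 0,
    lazy: false
  )
```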
### Pattern 3: Checkout Timeout With Multiple Waiters
Error: #3 - Error handling test
Multiple warnings during init:

```
{:"$gen_call", {#PID<0.995.0>, ...}, {:checkout, {:session, "error_test_2"}, ...}}
```
Root Cause: 6 tasks try to check out simultaneously, but only 1 worker is initializing. The other 5 time out while waiting, and the init process receives their checkout requests as “unexpected messages”.
Theory: With pool_size=2 and 6 concurrent checkouts, we’re overwhelming the pool. The warnings show the worker init is receiving checkout requests meant for NimblePool.
Recommendation:
- Increase the pool size for this test
- OR reduce concurrent operations to match the pool size (see the sketch below)
- OR add overflow workers
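One way to implement the second option without touching the pool is to bound test concurrency. A sketch, where `execute_in_session/3` is an illustrative name standing in for whatever checkout entry point the test actually uses:

```elixir
# Sketch: cap concurrent checkouts at the pool size instead of resizing
# the pool. execute_in_session/3 is a hypothetical function name, not a
# confirmed API from the codebase.
pool_size = 2

results =
  1..6
  |> Task.async_stream(
    fn i -> SessionPoolV2.execute_in_session("error_test_#{i}", :ping, %{}) end,
    max_concurrency: pool_size,
    timeout: 30_000
  )
  |> Enum.to_list()
```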
### Pattern 4: Function Clause Error
Error: #6
```
FunctionClauseError no function clause matching in PoolV2Test."test V2 Pool Architecture pool starts successfully with lazy workers"/1
```
Root Cause: The test expects `pool_pid` in the context, but the setup is providing `pid`, so the pattern match fails.
Code Evidence (`test/pool_v2_test.exs:50`):

```elixir
test "pool starts successfully with lazy workers", %{
  pool_pid: pool_pid,  # Expects pool_pid
  genserver_name: genserver_name
} do
```

But setup returns:

```elixir
%{
  pid: pid,  # Returns pid, not pool_pid
  genserver_name: genserver_name,
  ...
}
```
Recommendation: Fix the pattern match to use `pid`, or update the helper to return `pool_pid`.
### Pattern 5: Program ID Required
Error: #10
{:error, "Program ID is required"}
Root Cause: The Python bridge expects a `program_id` field when creating programs. The test was updated to include it, but the error persists.
Theory: Looking at the error traceback, it’s coming from the Python side. The field name might be wrong or the args aren’t being passed correctly.
Recommendation: Check which field name the Python handler actually reads, and ensure it is passed through the protocol correctly; a hedged sketch of the request shape follows.
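A minimal sketch of what the request might need to look like; every field name here is an assumption to verify against `dspy_bridge.py`, not a confirmed schema:

```elixir
# Hypothetical create_program request; the "program_id" and "signature"
# field names are assumptions to check against the Python handler.
args = %{
  "program_id" => "prog_#{System.unique_integer([:positive])}",
  "signature" => %{"inputs" => ["question"], "outputs" => ["answer"]}
}

request = Protocol.encode_request(1, :create_program, args)
```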
### Pattern 6: Adapter Registry Misconfiguration
Errors: #12, #13-17
```
Python bridge not available
Expected PythonPort but got PythonPool
```
Root Cause:
- Tests expect the `PythonPort` adapter but the system returns `PythonPool`
- Layer 3 tests can’t find the Python bridge despite it being started
Theory: The adapter registry is misconfigured. Some tests expect the single bridge adapter but the system is configured for pooling.
Recommendation:
- Update test expectations to match the pooling configuration
- OR provide test-specific adapter configuration (sketched below)
- Ensure the Python bridge supervisor starts before the tests
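A sketch of a per-test adapter override; the `:adapter` config key and the `DSPex.Adapters.PythonPort` module name are inferred from the adapter names in this document and must be checked against the registry code:

```elixir
# Sketch: override the adapter for one test module and restore the
# previous value afterwards. The config key and module name are
# assumptions, not confirmed from the codebase.
setup do
  previous = Application.get_env(:dspex, :adapter)
  Application.put_env(:dspex, :adapter, DSPex.Adapters.PythonPort)

  on_exit(fn -> Application.put_env(:dspex, :adapter, previous) end)
  :ok
end
```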
### Pattern 7: BridgeMock Not Started
Error: #1
```
GenServer.call(DSPex.Adapters.BridgeMock, :reset, 5000)
** (EXIT) no process
```
Root Cause: The BridgeMock adapter is not started as a GenServer but the test tries to call it.
Recommendation: Start BridgeMock in the test setup (sketched below), or change the test so it does not require GenServer calls.
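A sketch of the setup fix, assuming `BridgeMock` is a GenServer with a standard `child_spec/1`:

```elixir
# Sketch: give GenServer.call a live process by starting the mock under
# the test supervisor. Assumes a standard child_spec/1 on the module.
setup do
  start_supervised!(DSPex.Adapters.BridgeMock)
  :ok
end
```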
## Comprehensive Fix Strategy

### Immediate Fixes (Do First)
- Fix the protocol in PortCommunicationTest - use `Protocol.encode_request`
- Fix hardcoded `lazy: true` - make it configurable
- Fix the pattern match in the pool test - use correct field names
- Start BridgeMock in test setup - add it to the supervision tree
### Architectural Fixes (Do Second)
- Pool Size Management - Ensure pool size >= concurrent operations
- Adapter Registry - Fix test expectations vs reality
- Python Bridge Startup - Ensure it starts before layer_3 tests
### Investigation Required
- Python Not Responding - Add Python-side logging to debug init
- Program ID Field - Verify exact field name Python expects
- Shutdown Race - Add proper cleanup coordination
## Test-Specific Recommendations

### PortCommunicationTest

```elixir
# Change from:
request = Jason.encode!(%{...})
# To:
request = Protocol.encode_request(123, :ping, %{})
```
### PoolV2Test

```elixir
# Fix the pattern match:
test "pool starts successfully", %{pid: pool_pid, genserver_name: genserver_name} do
# OR update the helper to return pool_pid
```
### SessionPoolV2

```elixir
# In init/1, fix the hardcoded lazy:
lazy = Keyword.get(opts, :lazy, Application.get_env(:dspex, :pool_lazy, false))
```
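The corresponding application config might then look like this (the `:pool_lazy` key is taken from the snippet above; whether it already exists in `config/*.exs` is an assumption):

```elixir
# In config/test.exs: make test pools eager by default.
import Config

config :dspex, pool_lazy: false
```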
### Error Handling Test

```elixir
# Increase the pool size for 6 concurrent operations:
pool_info =
  start_test_pool(
    pool_size: 6,  # Match concurrent operations
    overflow: 0,
    pre_warm: false
  )
```
## Critical Insight
The most concerning issue is that Python processes are not responding even with stderr capture enabled. This suggests:
- Python is hanging during import/initialization
- The packet mode framing is fundamentally broken
- Python output is being buffered and not flushed (running the bridge with `python3 -u`, or flushing stdout after every write, would rule this out)
Next Debugging Step: Add Python-side file logging to see if the bridge script even starts:

```python
# At the very top of dspy_bridge.py
import sys

with open('/tmp/dspy_bridge_debug.log', 'a') as f:
    f.write(f"Bridge starting: {sys.argv}\n")
    f.flush()
```
## Conclusion
These errors are not environmental - they reveal real bugs:
- Hardcoded configuration ignoring test settings
- Protocol mismatches between test and implementation
- Race conditions in pool lifecycle
- Incorrect test assumptions about adapter configuration
None of these will be fixed by Phase 2/3 architectural improvements. They need direct code fixes.