V2 Pool Remaining Errors Analysis

Executive Summary

After Phase 1 fixes, 17 tests still fail. These failures reveal deeper architectural issues beyond configuration:

  1. Python process not responding to init pings despite stderr capture
  2. Pool shutdown race conditions
  3. Adapter registry misconfiguration
  4. Test infrastructure expecting different behavior
  5. Missing program_id field in create_program calls

Error Pattern Analysis

Pattern 1: Port Communication Complete Failure (Critical)

Errors: #2, #9 - PortCommunicationTest

No response received within 5 seconds
Port info: [name: ~c"/home/home/.pyenv/shims/python3", ... os_pid: 1925679]

Root Cause: The Python process starts (we see the os_pid) but never sends a response. Even with stderr capture enabled, we see no Python errors, suggesting:

  1. Python is stuck in initialization
  2. Packet mode framing is broken
  3. Python stdout is being buffered

Theory: The test sends raw JSON via Jason.encode!(%{"id" => 123, "command" => "ping"...}), but the port is configured with {:packet, 4}. Python therefore expects every message to carry a 4-byte length header, while the test writes raw JSON with no header at all.

Code Evidence (test/port_communication_test.exs):

request = Jason.encode!(%{...})  # Just JSON, no packet header
Port.command(port, request)      # Sending raw JSON to packet mode port

Recommendation:

# Use Protocol.encode_request which adds the packet header
request = Protocol.encode_request(123, :ping, %{})
Port.command(port, request)
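
For reference, Protocol.encode_request is expected to produce a 4-byte big-endian length prefix followed by the JSON payload. A minimal sketch of that wire format, assuming the framing lives in the protocol layer (the function's actual internals may differ):

# Hedged sketch of the framing; ::32 defaults to big-endian unsigned
payload = Jason.encode!(%{"id" => 123, "command" => "ping", "args" => %{}})
framed = <<byte_size(payload)::32, payload::binary>>

Note that a port opened with {:packet, 4} adds and strips this header itself on Port.command and on receive, so the length prefix must come from exactly one place: the port option or the protocol layer, never both.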

Pattern 2: Pool Shutdown During Checkout

Errors: #3, #4, #5, #7 - Various pool operations

{:error, {:shutdown, {NimblePool, :checkout, [:test_pool_8962]}}}

Root Cause: The pool GenServer is shutting down while clients are trying to check out. This happens when:

  1. A test ends and calls stop_supervised while operations are in flight
  2. The pool supervisor is crashing
  3. Pool initialization fails but the error is swallowed

Theory: Looking at line 318: lazy: true in session_pool_v2.ex - the pool is STILL configured as lazy despite our config changes. The config isn’t being applied correctly.

Code Evidence (lib/dspex/python_bridge/session_pool_v2.ex:318):

lazy: true,  # Hardcoded! Ignoring config

Recommendation:

# Fix in session_pool_v2.ex init/1
lazy = Keyword.get(opts, :lazy, false)  # Read from opts
pool_config = [
  worker: {PoolWorkerV2, []},
  pool_size: pool_size,
  max_overflow: overflow,
  lazy: lazy,  # Use the config value
  name: pool_name
]
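
A usage sketch, assuming SessionPoolV2.start_link/1 forwards its opts to init/1 (the exact entry point is an assumption):

# Tests can now opt into eager workers explicitly
{:ok, _pid} = SessionPoolV2.start_link(name: :test_pool, pool_size: 2, lazy: false)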

Pattern 3: Checkout Timeout With Multiple Waiters

Error: #3 - Error handling test

Multiple warnings during init:
{:"$gen_call", {#PID<0.995.0>, ...}, {:checkout, {:session, "error_test_2"}, ...}}

Root Cause: 6 tasks try to check out simultaneously, but only 1 worker is initializing. The other 5 time out waiting. The worker's init process receives their checkout requests as "unexpected messages".

Theory: With pool_size=2 and 6 concurrent checkouts, we’re overwhelming the pool. The warnings show the worker init is receiving checkout requests meant for NimblePool.

Recommendation:

  1. Increase pool size for this test
  2. OR reduce concurrent operations to match pool size (see the sketch below)
  3. OR add overflow workers
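
A hedged sketch of option 2: cap client concurrency at the pool size so no checkout queues behind a still-initializing worker. Here operations and run_operation/1 are illustrative names:

operations
|> Task.async_stream(&run_operation/1, max_concurrency: pool_size, timeout: 30_000)
|> Enum.to_list()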

Pattern 4: Function Clause Error

Error: #6

FunctionClauseError no function clause matching in PoolV2Test."test V2 Pool Architecture pool starts successfully with lazy workers"/1

Root Cause: The test expects pool_pid in the context, but the setup provides pid, so the pattern match fails.

Code Evidence (test/pool_v2_test.exs:50):

test "pool starts successfully with lazy workers", %{
  pool_pid: pool_pid,     # Expects pool_pid
  genserver_name: genserver_name
} do

But setup returns:

%{
  pid: pid,  # Returns pid, not pool_pid
  genserver_name: genserver_name,
  ...
}

Recommendation: Fix the pattern match to use pid, or update the helper to return pool_pid.

Pattern 5: Program ID Required

Error: #10

{:error, "Program ID is required"}

Root Cause: The Python bridge expects a program_id field when creating programs. The test was updated to include it, but the error persists.

Theory: Looking at the error traceback, it’s coming from the Python side. The field name might be wrong or the args aren’t being passed correctly.

Recommendation: Verify the exact field name the Python handler expects and ensure it is passed correctly through the protocol layer, as sketched below.
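
A hedged sketch of a corrected call; the "program_id" key name is the assumption to verify against the create_program handler in dspy_bridge.py, and request_id, program_id, and signature are illustrative variables:

# Send the identifier under the exact key the Python side checks
args = %{"program_id" => program_id, "signature" => signature}
request = Protocol.encode_request(request_id, :create_program, args)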

Pattern 6: Adapter Registry Misconfiguration

Errors: #12, #13-17

Python bridge not available
Expected PythonPort but got PythonPool

Root Cause:

  1. Tests expect the PythonPort adapter but the system returns PythonPool
  2. Layer 3 tests can't find the Python bridge despite it being started

Theory: The adapter registry is misconfigured: some tests expect the single-bridge adapter, but the system is configured for pooling.

Recommendation:

  1. Update test expectations to match the pooling configuration
  2. OR provide test-specific adapter configuration (see the config sketch below)
  3. Ensure the Python bridge supervisor starts before tests
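
A hedged sketch of option 2; the :adapter key and full module name are assumptions about this codebase's configuration:

# config/test.exs - pin the adapter the tests should expect
config :dspex, :adapter, DSPex.Adapters.PythonPool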

Pattern 7: BridgeMock Not Started

Error: #1

GenServer.call(DSPex.Adapters.BridgeMock, :reset, 5000)
** (EXIT) no process

Root Cause: The BridgeMock adapter is not started as a GenServer, but the test tries to call it.

Recommendation: Start BridgeMock in the test setup (see the sketch below) or change the test so it does not require GenServer calls.
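
A minimal setup sketch, assuming BridgeMock defines a standard GenServer child_spec/1:

setup do
  # Start the mock under the test supervisor so GenServer.call/3 has a live process
  start_supervised!(DSPex.Adapters.BridgeMock)
  :ok
end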

Comprehensive Fix Strategy

Immediate Fixes (Do First)

  1. Fix Protocol in PortCommunicationTest - Use Protocol.encode_request
  2. Fix hardcoded lazy: true - Make it configurable
  3. Fix pattern match in pool test - Use correct field names
  4. Start BridgeMock in test setup - Add to supervision tree

Architectural Fixes (Do Second)

  1. Pool Size Management - Ensure pool size >= concurrent operations
  2. Adapter Registry - Fix test expectations vs reality
  3. Python Bridge Startup - Ensure it starts before layer_3 tests

Investigation Required

  1. Python Not Responding - Add Python-side logging to debug init
  2. Program ID Field - Verify exact field name Python expects
  3. Shutdown Race - Add proper cleanup coordination (see the sketch below)
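
For item 3, a hedged sketch of cleanup coordination; tasks, operations, run_operation/1, and the :test_pool id are illustrative:

# Await in-flight operations before tearing the pool down, so no
# checkout races the supervisor shutdown
tasks = Enum.map(operations, &Task.async(fn -> run_operation(&1) end))
Task.await_many(tasks, 30_000)
stop_supervised!(:test_pool)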

Test-Specific Recommendations

PortCommunicationTest

# Change from:
request = Jason.encode!(%{...})
# To:
request = Protocol.encode_request(123, :ping, %{})

PoolV2Test

# Fix the pattern match:
test "pool starts successfully", %{pid: pool_pid, genserver_name: genserver_name} do
  # ...
end

# OR update the helper to return pool_pid

SessionPoolV2

# In init/1, fix hardcoded lazy:
lazy = Keyword.get(opts, :lazy, Application.get_env(:dspex, :pool_lazy, false))
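
Because Keyword.get/3 checks opts first and only falls back to the application environment, a per-test value still overrides the global default without reintroducing the hardcoding.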

Error Handling Test

# Increase pool size for 6 concurrent operations:
pool_info = start_test_pool(
  pool_size: 6,  # Match concurrent operations
  overflow: 0,
  pre_warm: false
)

Critical Insight

The most concerning issue is that Python processes are not responding even with stderr capture enabled. This suggests:

  1. Python is hanging during import/initialization
  2. The packet mode framing is fundamentally broken
  3. Python output is being buffered and not flushed

Next Debugging Step: Add Python-side file logging to see if the bridge script even starts:

# At the very top of dspy_bridge.py
import sys
with open('/tmp/dspy_bridge_debug.log', 'a') as f:
    f.write(f"Bridge starting: {sys.argv}\n")
    f.flush()
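
If the log shows the script starting but no replies arriving, stdout buffering (theory 3) can be ruled out by launching Python unbuffered. A hedged sketch; the executable and script paths are illustrative:

# Python's -u flag disables stdout/stderr buffering entirely
port =
  Port.open({:spawn_executable, "/home/home/.pyenv/shims/python3"}, [
    :binary,
    {:packet, 4},
    {:args, ["-u", "priv/python/dspy_bridge.py"]}
  ])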

Conclusion

These errors are not environmental - they reveal real bugs:

  1. Hardcoded configuration ignoring test settings
  2. Protocol mismatches between test and implementation
  3. Race conditions in pool lifecycle
  4. Incorrect test assumptions about adapter configuration

None of these will be fixed by Phase 2/3 architectural improvements. They need direct code fixes.