V2 Pool Remaining Errors Analysis

Executive Summary

After Phase 1 fixes, 17 tests still fail. These failures reveal deeper architectural issues beyond configuration:

  1. Python process not responding to init pings despite stderr capture
  2. Pool shutdown race conditions
  3. Adapter registry misconfiguration
  4. Test infrastructure expecting different behavior
  5. Missing program_id field in create_program calls

Error Pattern Analysis

Pattern 1: Port Communication Complete Failure (Critical)

Errors: #2, #9 - PortCommunicationTest

No response received within 5 seconds
Port info: [name: ~c"/home/home/.pyenv/shims/python3", ... os_pid: 1925679]

Root Cause: The Python process starts (we see the os_pid) but never sends a response. Even with stderr capture enabled, we see no Python errors, suggesting:

  1. Python is stuck in initialization
  2. Packet mode framing is broken
  3. Python stdout is being buffered

Theory: The test sends raw JSON via Jason.encode!(%{"id" => 123, "command" => "ping"...}), but the port is configured with {:packet, 4}. Python therefore expects every message to carry a 4-byte length header, while the test writes raw JSON with no header at all.

Code Evidence (test/port_communication_test.exs):

request = Jason.encode!(%{...})  # Just JSON, no packet header
Port.command(port, request)      # Sending raw JSON to packet mode port

Recommendation:

# Use Protocol.encode_request which adds the packet header
request = Protocol.encode_request(123, :ping, %{})
Port.command(port, request)
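
For reference, Protocol.encode_request is expected to produce a 4-byte big-endian length prefix followed by the JSON payload. A minimal sketch of that wire format, assuming the framing lives in the protocol layer (the function's actual internals may differ):

# Hedged sketch of the framing; ::32 defaults to big-endian unsigned
payload = Jason.encode!(%{"id" => 123, "command" => "ping", "args" => %{}})
framed = <<byte_size(payload)::32, payload::binary>>

Note that a port opened with {:packet, 4} adds and strips this header itself on Port.command and on receive, so the length prefix must come from exactly one place: the port option or the protocol layer, never both.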

Pattern 2: Pool Shutdown During Checkout

Errors: #3, #4, #5, #7 - Various pool operations

{:error, {:shutdown, {NimblePool, :checkout, [:test_pool_8962]}}}

Root Cause: The pool GenServer is shutting down while clients are trying to check out. This happens when:

  1. A test ends and calls stop_supervised while operations are in flight
  2. The pool supervisor is crashing
  3. Pool initialization fails but the error is swallowed

Theory: Looking at line 318: lazy: true in session_pool_v2.ex - the pool is STILL configured as lazy despite our config changes. The config isn’t being applied correctly.

Code Evidence (lib/dspex/python_bridge/session_pool_v2.ex:318):

lazy: true,  # Hardcoded! Ignoring config

Recommendation:

# Fix in session_pool_v2.ex init/1
lazy = Keyword.get(opts, :lazy, false)  # Read from opts
pool_config = [
  worker: {PoolWorkerV2, []},
  pool_size: pool_size,
  max_overflow: overflow,
  lazy: lazy,  # Use the config value
  name: pool_name
]
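
A usage sketch, assuming SessionPoolV2.start_link/1 forwards its opts to init/1 (the exact entry point is an assumption):

# Tests can now opt into eager workers explicitly
{:ok, _pid} = SessionPoolV2.start_link(name: :test_pool, pool_size: 2, lazy: false)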

Pattern 3: Checkout Timeout With Multiple Waiters

Error: #3 - Error handling test

Multiple warnings during init:
{:"$gen_call", {#PID<0.995.0>, ...}, {:checkout, {:session, "error_test_2"}, ...}}

Root Cause: 6 tasks try to check out simultaneously, but only 1 worker is initializing. The other 5 time out waiting. The worker's init process receives their checkout requests as "unexpected messages".

Theory: With pool_size=2 and 6 concurrent checkouts, we’re overwhelming the pool. The warnings show the worker init is receiving checkout requests meant for NimblePool.

Recommendation:

  1. Increase pool size for this test
  2. OR reduce concurrent operations to match pool size (see the sketch below)
  3. OR add overflow workers
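
A hedged sketch of option 2: cap client concurrency at the pool size so no checkout queues behind a still-initializing worker. Here operations and run_operation/1 are illustrative names:

operations
|> Task.async_stream(&run_operation/1, max_concurrency: pool_size, timeout: 30_000)
|> Enum.to_list()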

Pattern 4: Function Clause Error

Error: #6

FunctionClauseError no function clause matching in PoolV2Test."test V2 Pool Architecture pool starts successfully with lazy workers"/1

Root Cause: The test expects pool_pid in the context, but the setup provides pid, so the pattern match fails.

Code Evidence (test/pool_v2_test.exs:50):

test "pool starts successfully with lazy workers", %{
  pool_pid: pool_pid,     # Expects pool_pid
  genserver_name: genserver_name
} do

But setup returns:

%{
  pid: pid,  # Returns pid, not pool_pid
  genserver_name: genserver_name,
  ...
}

Recommendation: Fix the pattern match to use pid, or update the helper to return pool_pid.

Pattern 5: Program ID Required

Error: #10

{:error, "Program ID is required"}

Root Cause: The Python bridge expects a program_id field when creating programs. The test was updated to include it, but the error persists.

Theory: Looking at the error traceback, it’s coming from the Python side. The field name might be wrong or the args aren’t being passed correctly.

Recommendation: Verify the exact field name the Python handler expects and ensure it is passed correctly through the protocol layer, as sketched below.
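
A hedged sketch of a corrected call; the "program_id" key name is the assumption to verify against the create_program handler in dspy_bridge.py, and request_id, program_id, and signature are illustrative variables:

# Send the identifier under the exact key the Python side checks
args = %{"program_id" => program_id, "signature" => signature}
request = Protocol.encode_request(request_id, :create_program, args)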

Pattern 6: Adapter Registry Misconfiguration

Errors: #12, #13-17

Python bridge not available
Expected PythonPort but got PythonPool

Root Cause:

  1. Tests expect the PythonPort adapter but the system returns PythonPool
  2. Layer 3 tests can't find the Python bridge despite it being started

Theory: The adapter registry is misconfigured: some tests expect the single-bridge adapter, but the system is configured for pooling.

Recommendation:

  1. Update test expectations to match the pooling configuration
  2. OR provide test-specific adapter configuration (see the config sketch below)
  3. Ensure the Python bridge supervisor starts before tests
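
A hedged sketch of option 2; the :adapter key and full module name are assumptions about this codebase's configuration:

# config/test.exs - pin the adapter the tests should expect
config :dspex, :adapter, DSPex.Adapters.PythonPool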

Pattern 7: BridgeMock Not Started

Error: #1

GenServer.call(DSPex.Adapters.BridgeMock, :reset, 5000)
** (EXIT) no process

Root Cause: The BridgeMock adapter is not started as a GenServer, but the test tries to call it.

Recommendation: Start BridgeMock in the test setup (see the sketch below) or change the test so it does not require GenServer calls.
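
A minimal setup sketch, assuming BridgeMock defines a standard GenServer child_spec/1:

setup do
  # Start the mock under the test supervisor so GenServer.call/3 has a live process
  start_supervised!(DSPex.Adapters.BridgeMock)
  :ok
end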

Comprehensive Fix Strategy

Immediate Fixes (Do First)

  1. Fix Protocol in PortCommunicationTest - Use Protocol.encode_request
  2. Fix hardcoded lazy: true - Make it configurable
  3. Fix pattern match in pool test - Use correct field names
  4. Start BridgeMock in test setup - Add to supervision tree

Architectural Fixes (Do Second)

  1. Pool Size Management - Ensure pool size >= concurrent operations
  2. Adapter Registry - Fix test expectations vs reality
  3. Python Bridge Startup - Ensure it starts before layer_3 tests

Investigation Required

  1. Python Not Responding - Add Python-side logging to debug init
  2. Program ID Field - Verify exact field name Python expects
  3. Shutdown Race - Add proper cleanup coordination (see the sketch below)
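
For item 3, a hedged sketch of cleanup coordination; tasks, operations, run_operation/1, and the :test_pool id are illustrative:

# Await in-flight operations before tearing the pool down, so no
# checkout races the supervisor shutdown
tasks = Enum.map(operations, &Task.async(fn -> run_operation(&1) end))
Task.await_many(tasks, 30_000)
stop_supervised!(:test_pool)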

Test-Specific Recommendations

PortCommunicationTest

# Change from:
request = Jason.encode!(%{...})
# To:
request = Protocol.encode_request(123, :ping, %{})

PoolV2Test

# Fix the pattern match:
test "pool starts successfully", %{pid: pool_pid, genserver_name: genserver_name} do
  # ...
end

# OR update the helper to return pool_pid

SessionPoolV2

# In init/1, fix hardcoded lazy:
lazy = Keyword.get(opts, :lazy, Application.get_env(:dspex, :pool_lazy, false))
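
Because Keyword.get/3 checks opts first and only falls back to the application environment, a per-test value still overrides the global default without reintroducing the hardcoding.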

Error Handling Test

# Increase pool size for 6 concurrent operations:
pool_info = start_test_pool(
  pool_size: 6,  # Match concurrent operations
  overflow: 0,
  pre_warm: false
)

Critical Insight

The most concerning issue is that Python processes are not responding even with stderr capture enabled. This suggests:

  1. Python is hanging during import/initialization
  2. The packet mode framing is fundamentally broken
  3. Python output is being buffered and not flushed

Next Debugging Step: Add Python-side file logging to see if the bridge script even starts:

# At the very top of dspy_bridge.py
import sys
with open('/tmp/dspy_bridge_debug.log', 'a') as f:
    f.write(f"Bridge starting: {sys.argv}\n")
    f.flush()
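
If the log shows the script starting but no replies arriving, stdout buffering (theory 3) can be ruled out by launching Python unbuffered. A hedged sketch; the executable and script paths are illustrative:

# Python's -u flag disables stdout/stderr buffering entirely
port =
  Port.open({:spawn_executable, "/home/home/.pyenv/shims/python3"}, [
    :binary,
    {:packet, 4},
    {:args, ["-u", "priv/python/dspy_bridge.py"]}
  ])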

Conclusion

These errors are not environmental - they reveal real bugs:

  1. Hardcoded configuration ignoring test settings
  2. Protocol mismatches between test and implementation
  3. Race conditions in pool lifecycle
  4. Incorrect test assumptions about adapter configuration

None of these will be fixed by Phase 2/3 architectural improvements. They need direct code fixes.