JULY 1 2025 PLAN

Documentation for JULY_1_2025_PLAN from the Foundation repository.

July 1, 2025 - Comprehensive Foundation/Jido Integration Plan

Executive Summary

We’re at a critical juncture with multiple intersecting concerns:

OTP Violations - Critical flaws that compromise system stability
State Persistence - Solved conceptually but not implemented
Jido Integration - Partially complete with architectural issues
Foundation Infrastructure - Missing production-critical services

This plan establishes the correct order of operations to build a production-grade system.

Current State Analysis

✅ What’s Working

Basic Jido-Foundation bridge functional
State persistence mechanisms identified (mount/shutdown/on_after_validate_state)
499 tests passing (with some intermittent failures)
Core protocols defined
Basic supervision structure in place

🔴 Critical Issues (from FLAWS analysis)

Unsupervised Task Operations - Memory leaks, orphaned processes
Blocking GenServer Operations - System-wide bottlenecks
Volatile Agent State - Guaranteed data loss on crashes
Memory Leaks - Process monitors not cleaned up
Misleading APIs - AtomicTransaction isn’t atomic
Resource Leaks - Unbounded message buffers
Test vs Production Divergence - Different supervisor strategies

🟡 Architectural Debt

Monolithic “God” agents (CoordinatorAgent)
Raw send/2 for critical communication
Race conditions in caching
Inefficient ETS operations
Missing production infrastructure services

Order of Operations

Phase 1: Critical OTP Fixes (Week 1) - MUST DO FIRST

Why First: These issues cause system instability and data loss. Until they are fixed, everything else is built on an unstable foundation.

Day 1-2: Memory & Process Leaks

Fix Unsupervised Task.async_stream
- Remove all fallback logic in batch_operations.ex and distributed_optimization.ex
- Require TaskSupervisor always
- Test: Verify no orphaned processes under load
Fix Monitor Leak in AgentRegistry
- Add Process.demonitor/2 for all :DOWN messages
- Test: Monitor count stays stable over time
Fix Supervisor Strategy Divergence
- Use production settings (3 restarts/5s) everywhere
- Remove test-specific configurations
- Test: Fault tolerance behavior consistent
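As a rough illustration of the "require TaskSupervisor always" rule, the sketch below routes batch work through Task.Supervisor.async_stream_nolink/4 and has no unsupervised fallback. The module name SupervisedBatchSketch and the default options are hypothetical; the real call sites in batch_operations.ex and distributed_optimization.ex may need different timeouts and concurrency limits.

```elixir
defmodule SupervisedBatchSketch do
  @moduledoc "Hypothetical helper: batch work always runs under a Task.Supervisor."

  # Assumed defaults; a task that exceeds the timeout is killed, not orphaned.
  @default_opts [timeout: 5_000, on_timeout: :kill_task]

  # `supervisor` must be the pid or name of a running Task.Supervisor.
  # There is deliberately no fallback to the unsupervised Task.async_stream/3.
  def map(supervisor, items, fun, opts \\ []) when is_function(fun, 1) do
    opts = Keyword.merge(@default_opts, opts)

    supervisor
    |> Task.Supervisor.async_stream_nolink(items, fun, opts)
    |> Enum.map(fn
      {:ok, result} -> {:ok, result}
      # A crashed or timed-out task surfaces as data instead of taking the caller down.
      {:exit, reason} -> {:error, reason}
    end)
  end
end
```

Because the tasks are not linked to the caller, one failing item yields an `{:error, reason}` entry in the result list while the remaining items still complete.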

Day 3-4: Blocking Operations & Resource Management

Fix Blocking CircuitBreaker
- Move user function execution to supervised tasks
- Keep GenServer responsive
- Test: Circuit breaker doesn’t block under load
Fix Resource Leaks
- Implement message buffer draining in CoordinationManager
- Add bounded queues where needed
- Test: Memory usage stable under sustained load
Rename AtomicTransaction
- Rename to SerialOperations
- Update all documentation
- Test: No breaking changes
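The "move user function execution to supervised tasks" step can be sketched as follows. This is not the Foundation CircuitBreaker; it only shows the offloading pattern, under the assumption that a Task.Supervisor is already running. NonBlockingServerSketch and its run/3 function are hypothetical names.

```elixir
defmodule NonBlockingServerSketch do
  use GenServer

  def start_link(task_sup), do: GenServer.start_link(__MODULE__, task_sup)

  # The caller still blocks on its own call, but the server does not.
  def run(server, fun, timeout \\ 5_000) when is_function(fun, 0),
    do: GenServer.call(server, {:run, fun}, timeout)

  @impl true
  def init(task_sup), do: {:ok, %{task_sup: task_sup, callers: %{}}}

  @impl true
  def handle_call({:run, fun}, from, state) do
    # Run the user function in a supervised, unlinked task and reply later.
    task = Task.Supervisor.async_nolink(state.task_sup, fun)
    {:noreply, %{state | callers: Map.put(state.callers, task.ref, from)}}
  end

  @impl true
  def handle_info({ref, result}, state) when is_reference(ref) do
    # The task finished normally; drop its monitor and any pending :DOWN message.
    Process.demonitor(ref, [:flush])
    reply(state, ref, {:ok, result})
  end

  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    # The task crashed before producing a result.
    reply(state, ref, {:error, reason})
  end

  def handle_info(_other, state), do: {:noreply, state}

  defp reply(state, ref, message) do
    {from, callers} = Map.pop(state.callers, ref)
    if from, do: GenServer.reply(from, message)
    {:noreply, %{state | callers: callers}}
  end
end
```

With this shape a slow user function only delays its own caller; other calls to the same server are answered while it runs.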

Phase 2: State Persistence Foundation (Week 2)

Why Second: Persistence needs stable processes underneath it. These are also quick wins that enable the larger improvements in later phases.

Day 5-6: PersistentFoundationAgent Pattern

Create Base PersistentFoundationAgent

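defmodule JidoSystem.Agents.PersistentFoundationAgent do
  # Design sketch only: the PersistenceStore calls are not implemented yet.
  use Jido.Agent

  @callback persistence_key(agent :: t()) :: String.t()
  @callback serialize_state(state :: map()) :: {:ok, binary()} | {:error, term()}
  @callback deserialize_state(binary()) :: {:ok, map()} | {:error, term()}

  # The return values below assume Jido's {:ok, state} callback convention.
  def mount(server_state, _opts) do
    # TODO: load from PersistenceStore and merge into server_state
    {:ok, server_state}
  end

  def shutdown(server_state, _reason) do
    # TODO: save to PersistenceStore before the process exits
    {:ok, server_state}
  end

  def on_after_validate_state(agent) do
    # TODO: incremental save after each validated state change
    {:ok, agent}
  end
end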

Implement PersistenceStore Supervisor
- ETS-backed for now (can swap to Mnesia/PostgreSQL later)
- Supervised to survive agent crashes
- Test: State survives agent crashes
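A minimal sketch of the ETS-backed store, assuming a single named table owned by a supervised GenServer. The module name PersistenceStoreSketch, the table name, and the save/load API are all hypothetical; the point is only that the table belongs to the store process, so it outlives any agent that writes to it.

```elixir
defmodule PersistenceStoreSketch do
  use GenServer

  # Hypothetical table name for this sketch.
  @table :persistence_store_sketch

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Writes are serialized through the owning process.
  def save(key, state), do: GenServer.call(__MODULE__, {:save, key, state})

  # Reads go straight to ETS and do not block on the store process.
  def load(key) do
    case :ets.lookup(@table, key) do
      [{^key, state}] -> {:ok, state}
      [] -> :error
    end
  end

  @impl true
  def init(:ok) do
    # :protected means only the owner writes while any process may read.
    table = :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, table}
  end

  @impl true
  def handle_call({:save, key, state}, _from, table) do
    true = :ets.insert(table, {key, state})
    {:reply, :ok, table}
  end
end
```

Note that the table is still lost if the store process itself crashes, which is one reason the plan leaves room for swapping in Mnesia or PostgreSQL later.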

Day 7: Migrate Critical Agents

Migrate TaskAgent to PersistentFoundationAgent
- Preserve task queue across restarts
- Test: No task loss on crash
Migrate CoordinatorAgent State
- Only persist critical workflow state
- Test: Workflows resume after crash

Phase 3: Complete God Agent Decomposition (Week 3)

Why Third: Need persistence before decomposing to avoid data loss during migration.

Day 8-9: WorkflowSupervisor Pattern

Complete WorkflowSupervisor Implementation
- One process per workflow
- State isolated per workflow
- Test: Individual workflow crashes don’t affect others
Migrate CoordinatorAgent to Delegation Pattern
- CoordinatorAgent becomes thin orchestration layer
- WorkflowSupervisor handles actual work
- Test: Same API, better fault isolation
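The "one process per workflow" idea maps naturally onto a DynamicSupervisor. The sketch below uses hypothetical modules WorkflowSketch and WorkflowSupervisorSketch with a toy step counter as state; the real WorkflowSupervisor may be structured differently.

```elixir
defmodule WorkflowSketch do
  # :transient restarts a workflow only after an abnormal exit.
  use GenServer, restart: :transient

  def start_link(id), do: GenServer.start_link(__MODULE__, id)
  def step(pid), do: GenServer.call(pid, :step)

  @impl true
  def init(id), do: {:ok, %{id: id, steps: 0}}

  @impl true
  def handle_call(:step, _from, state) do
    state = %{state | steps: state.steps + 1}
    {:reply, state.steps, state}
  end
end

defmodule WorkflowSupervisorSketch do
  use DynamicSupervisor

  def start_link(opts \\ []),
    do: DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)

  # Each workflow gets its own process and therefore its own isolated state.
  def start_workflow(id),
    do: DynamicSupervisor.start_child(__MODULE__, {WorkflowSketch, id})

  @impl true
  def init(_opts), do: DynamicSupervisor.init(strategy: :one_for_one)
end
```

A restarted workflow begins with fresh state in this sketch, which is why the plan schedules state persistence before the decomposition.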

Day 10: Communication Pattern Fixes

Replace Raw send/2 with GenServer Calls
- Guaranteed delivery for critical messages
- Backpressure support
- Test: No message loss under load
Fix Cache Race Conditions
- Single atomic operations
- Proper telemetry
- Test: Concurrent access safe
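For the cache fix, "single atomic operations" usually means replacing a lookup-then-insert sequence with one ETS call. The sketch below is an assumption about what that could look like, using :ets.insert_new/2 for put-if-absent and :ets.update_counter/4 for counters. AtomicCacheSketch is a hypothetical module and does not reflect the existing cache code.

```elixir
defmodule AtomicCacheSketch do
  # Public table so any process may write; atomicity comes from the ETS calls themselves.
  def new do
    :ets.new(:atomic_cache_sketch, [:set, :public, write_concurrency: true])
  end

  # Put-if-absent in one atomic step: exactly one concurrent caller gets :inserted.
  def put_new(table, key, value) do
    if :ets.insert_new(table, {key, value}) do
      {:inserted, value}
    else
      case :ets.lookup(table, key) do
        [{^key, existing}] -> {:existing, existing}
        # The entry was deleted between the two calls; try again.
        [] -> put_new(table, key, value)
      end
    end
  end

  # Atomic increment; the {key, 0} default creates the counter on first use.
  def increment(table, key, by \\ 1) do
    :ets.update_counter(table, key, by, {key, 0})
  end
end
```

Neither function reads a value and writes it back in separate steps, so concurrent callers cannot overwrite each other's updates.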

Phase 4: Production Infrastructure (Week 4)

Why Fourth: With stable foundation, can add production services.

Day 11-12: Core Services

Enhanced CircuitBreaker Service
- Half-open states
- Gradual recovery
- Per-service configuration
ConnectionManager Service
- HTTP connection pooling (Finch)
- Health checks
- Automatic recovery
RateLimiter Service
- API protection (Hammer)
- Per-agent limits
- Graceful degradation
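The half-open behaviour planned for the enhanced CircuitBreaker can be sketched as a pure state machine, kept separate from any process. Everything here is hypothetical (the BreakerStateSketch name, the threshold and cooldown_ms fields, the single-probe policy); gradual recovery and per-service configuration are not covered.

```elixir
defmodule BreakerStateSketch do
  defstruct state: :closed, failures: 0, opened_at: nil, threshold: 3, cooldown_ms: 1_000

  def new(opts \\ []), do: struct!(__MODULE__, opts)

  # Decide whether a call may proceed at time `now` (milliseconds).
  def check(%__MODULE__{state: :closed} = b, _now), do: {:allow, b}

  # A probe is already in flight; reject everything else until it reports back.
  def check(%__MODULE__{state: :half_open} = b, _now), do: {:reject, b}

  def check(%__MODULE__{state: :open} = b, now) do
    if now - b.opened_at >= b.cooldown_ms do
      # Cooldown elapsed: let exactly one probe call through.
      {:allow, %{b | state: :half_open}}
    else
      {:reject, b}
    end
  end

  # Any success closes the breaker and clears the failure count.
  def record(%__MODULE__{} = b, :success, _now),
    do: %{b | state: :closed, failures: 0, opened_at: nil}

  # A failed probe reopens the breaker immediately.
  def record(%__MODULE__{state: :half_open} = b, :failure, now), do: trip(b, now)

  def record(%__MODULE__{} = b, :failure, now) do
    failures = b.failures + 1

    if failures >= b.threshold do
      trip(%{b | failures: failures}, now)
    else
      %{b | failures: failures}
    end
  end

  defp trip(b, now), do: %{b | state: :open, opened_at: now}
end
```

Passing the clock in as an argument keeps the transitions deterministic, which makes them easy to cover with unit and property-based tests.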

Day 13-14: Monitoring & Discovery

ServiceDiscovery Service
- Dynamic registration
- Capability matching
- Health monitoring
Complete Telemetry Integration
- All services emit telemetry
- Performance monitoring
- Alert thresholds

Phase 5: Final Integration (Week 5)

Why Last: Integrate all improvements into cohesive system.

Update JidoFoundation.Bridge
- Use all new services
- Proper error propagation
- Complete telemetry
Performance Optimization
- ETS query optimization
- Resource pooling
- Load testing
Documentation & Examples
- Update all docs
- Working examples
- Deployment guide

Success Criteria

Phase 1 Complete When:

Zero unsupervised processes
No memory leaks under 24hr load test
All GenServers responsive under load
Consistent supervisor strategies
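One way to check the monitor part of the leak criterion is to read the monitor count directly from the runtime with Process.info/2. The sketch below pairs that check with a toy registry that releases its monitors; MonitoredRegistrySketch is a hypothetical stand-in, not the actual AgentRegistry.

```elixir
defmodule MonitoredRegistrySketch do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok)
  def register(server, key, pid), do: GenServer.call(server, {:register, key, pid})
  def unregister(server, key), do: GenServer.call(server, {:unregister, key})

  # Number of monitors currently held by the registry process (pass a pid).
  def monitor_count(server) do
    {:monitors, monitors} = Process.info(server, :monitors)
    length(monitors)
  end

  @impl true
  def init(:ok), do: {:ok, %{by_key: %{}, by_ref: %{}}}

  @impl true
  def handle_call({:register, key, pid}, _from, state) do
    if Map.has_key?(state.by_key, key) do
      {:reply, {:error, :already_registered}, state}
    else
      ref = Process.monitor(pid)
      by_key = Map.put(state.by_key, key, {pid, ref})
      {:reply, :ok, %{by_key: by_key, by_ref: Map.put(state.by_ref, ref, key)}}
    end
  end

  def handle_call({:unregister, key}, _from, state) do
    case Map.pop(state.by_key, key) do
      {nil, _} ->
        {:reply, :ok, state}

      {{_pid, ref}, by_key} ->
        # Without this call the monitor stays alive after the entry is gone.
        Process.demonitor(ref, [:flush])
        {:reply, :ok, %{by_key: by_key, by_ref: Map.delete(state.by_ref, ref)}}
    end
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    # The runtime already dropped this monitor; the flush is a harmless safeguard.
    Process.demonitor(ref, [:flush])

    case Map.pop(state.by_ref, ref) do
      {nil, _} -> {:noreply, state}
      {key, by_ref} -> {:noreply, %{by_key: Map.delete(state.by_key, key), by_ref: by_ref}}
    end
  end
end
```

Sampling monitor_count/1 periodically during the load test and asserting that it tracks the number of live registrations gives a concrete signal for "monitor count stays stable over time".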

Phase 2 Complete When:

Critical agents survive crashes without data loss
State persistence adds <5ms latency
Recovery tested under various failure modes

Phase 3 Complete When:

No single point of failure
Individual workflow isolation
Message delivery guarantees
No race conditions

Phase 4 Complete When:

All production services operational
Circuit breakers protecting all external calls
Rate limiting active
Service discovery working

Phase 5 Complete When:

500+ tests passing consistently
24hr load test stable
Complete documentation
Zero architectural debt

Risk Mitigation

Incremental Approach - Each phase builds on previous
Test Everything - No untested code paths
Feature Flags - New features can be disabled if issues arise
Rollback Plan - Each phase can be reverted independently
Performance Benchmarks - Measure impact of each change

Implementation Notes

Quick Wins First

Start each phase with quick wins (<1hr tasks) to build momentum and reduce risk surface area.

Parallel Work

Within each phase, independent tasks can be parallelized, but phases must be sequential.

Testing Strategy

Unit tests for each component
Integration tests for service interactions
Property-based tests for concurrent operations
Load tests for performance validation

Documentation Requirements

Update docs as we go, not at end
Include architecture decisions
Document gotchas and workarounds
Maintain compatibility notes

Conclusion

This plan addresses all critical issues in the correct order:

Stabilize - Fix OTP violations
Persist - Add state management
Decompose - Break up monoliths
Productionize - Add infrastructure
Integrate - Tie it all together

The system will evolve from “barely working” to “production-grade” through systematic improvements that build upon each other. Each phase delivers tangible value while setting up the next phase for success.

Total Timeline: 5 weeks of focused development
Current Date: July 1, 2025
Target Completion: Early August 2025

Let’s begin with Phase 1, Day 1: Fixing unsupervised Task operations.