← Back to Foundation

JULY 1 2025 PLAN

Documentation for JULY_1_2025_PLAN from the Foundation repository.

July 1, 2025 - Comprehensive Foundation/Jido Integration Plan

Executive Summary

We’re at a critical juncture with multiple intersecting concerns:

  1. OTP Violations - Critical flaws that compromise system stability
  2. State Persistence - Solved conceptually but not implemented
  3. Jido Integration - Partially complete with architectural issues
  4. Foundation Infrastructure - Missing production-critical services

This plan establishes the correct order of operations to build a production-grade system.

Current State Analysis

✅ What’s Working

  • Basic Jido-Foundation bridge functional
  • State persistence mechanisms identified (mount/shutdown/on_after_validate_state)
  • 499 tests passing (with some intermittent failures)
  • Core protocols defined
  • Basic supervision structure in place

🔴 Critical Issues (from FLAWS analysis)

  1. Unsupervised Task Operations - Memory leaks, orphaned processes
  2. Blocking GenServer Operations - System-wide bottlenecks
  3. Volatile Agent State - Guaranteed data loss on crashes
  4. Memory Leaks - Process monitors not cleaned up
  5. Misleading APIs - AtomicTransaction isn’t atomic
  6. Resource Leaks - Unbounded message buffers
  7. Test vs Production Divergence - Different supervisor strategies

🟡 Architectural Debt

  • Monolithic “God” agents (CoordinatorAgent)
  • Raw send/2 for critical communication
  • Race conditions in caching
  • Inefficient ETS operations
  • Missing production infrastructure services

Order of Operations

Phase 1: Critical OTP Fixes (Week 1) - MUST DO FIRST

Why First: These issues cause system instability and data loss. Everything else builds on unstable foundation if not fixed.

Day 1-2: Memory & Process Leaks

  1. Fix Unsupervised Task.async_stream

    • Remove all fallback logic in batch_operations.ex and distributed_optimization.ex
    • Require TaskSupervisor always
    • Test: Verify no orphaned processes under load
  2. Fix Monitor Leak in AgentRegistry

    • Add Process.demonitor/2 for all :DOWN messages
    • Test: Monitor count stays stable over time
  3. Fix Supervisor Strategy Divergence

    • Use production settings (3 restarts/5s) everywhere
    • Remove test-specific configurations
    • Test: Fault tolerance behavior consistent

Day 3-4: Blocking Operations & Resource Management

  1. Fix Blocking CircuitBreaker

    • Move user function execution to supervised tasks
    • Keep GenServer responsive
    • Test: Circuit breaker doesn’t block under load
  2. Fix Resource Leaks

    • Implement message buffer draining in CoordinationManager
    • Add bounded queues where needed
    • Test: Memory usage stable under sustained load
  3. Rename AtomicTransaction

    • Rename to SerialOperations
    • Update all documentation
    • Test: No breaking changes

Phase 2: State Persistence Foundation (Week 2)

Why Second: Need stable processes before adding persistence. Quick wins that enable major improvements.

Day 5-6: PersistentFoundationAgent Pattern

  1. Create Base PersistentFoundationAgent

    defmodule JidoSystem.Agents.PersistentFoundationAgent do
      use Jido.Agent
    
      @callback persistence_key(agent :: t()) :: String.t()
      @callback serialize_state(state :: map()) :: {:ok, binary()} | {:error, term()}
      @callback deserialize_state(binary()) :: {:ok, map()} | {:error, term()}
    
      def mount(server_state, opts) do
        # Load from PersistenceStore
      end
    
      def shutdown(server_state, reason) do
        # Save to PersistenceStore
      end
    
      def on_after_validate_state(agent) do
        # Incremental save
      end
    end
    
  2. Implement PersistenceStore Supervisor

    • ETS-backed for now (can swap to Mnesia/PostgreSQL later)
    • Supervised to survive agent crashes
    • Test: State survives agent crashes

Day 7: Migrate Critical Agents

  1. Migrate TaskAgent to PersistentFoundationAgent

    • Preserve task queue across restarts
    • Test: No task loss on crash
  2. Migrate CoordinatorAgent State

    • Only persist critical workflow state
    • Test: Workflows resume after crash

Phase 3: Complete God Agent Decomposition (Week 3)

Why Third: Need persistence before decomposing to avoid data loss during migration.

Day 8-9: WorkflowSupervisor Pattern

  1. Complete WorkflowSupervisor Implementation

    • One process per workflow
    • State isolated per workflow
    • Test: Individual workflow crashes don’t affect others
  2. Migrate CoordinatorAgent to Delegation Pattern

    • CoordinatorAgent becomes thin orchestration layer
    • WorkflowSupervisor handles actual work
    • Test: Same API, better fault isolation

Day 10: Communication Pattern Fixes

  1. Replace Raw send/2 with GenServer Calls

    • Guaranteed delivery for critical messages
    • Backpressure support
    • Test: No message loss under load
  2. Fix Cache Race Conditions

    • Single atomic operations
    • Proper telemetry
    • Test: Concurrent access safe

Phase 4: Production Infrastructure (Week 4)

Why Fourth: With stable foundation, can add production services.

Day 11-12: Core Services

  1. Enhanced CircuitBreaker Service

    • Half-open states
    • Gradual recovery
    • Per-service configuration
  2. ConnectionManager Service

    • HTTP connection pooling (Finch)
    • Health checks
    • Automatic recovery
  3. RateLimiter Service

    • API protection (Hammer)
    • Per-agent limits
    • Graceful degradation

Day 13-14: Monitoring & Discovery

  1. ServiceDiscovery Service

    • Dynamic registration
    • Capability matching
    • Health monitoring
  2. Complete Telemetry Integration

    • All services emit telemetry
    • Performance monitoring
    • Alert thresholds

Phase 5: Final Integration (Week 5)

Why Last: Integrate all improvements into cohesive system.

  1. Update JidoFoundation.Bridge

    • Use all new services
    • Proper error propagation
    • Complete telemetry
  2. Performance Optimization

    • ETS query optimization
    • Resource pooling
    • Load testing
  3. Documentation & Examples

    • Update all docs
    • Working examples
    • Deployment guide

Success Criteria

Phase 1 Complete When:

  • Zero unsupervised processes
  • No memory leaks under 24hr load test
  • All GenServers responsive under load
  • Consistent supervisor strategies

Phase 2 Complete When:

  • Critical agents survive crashes without data loss
  • State persistence adds <5ms latency
  • Recovery tested under various failure modes

Phase 3 Complete When:

  • No single point of failure
  • Individual workflow isolation
  • Message delivery guarantees
  • No race conditions

Phase 4 Complete When:

  • All production services operational
  • Circuit breakers protecting all external calls
  • Rate limiting active
  • Service discovery working

Phase 5 Complete When:

  • 500+ tests passing consistently
  • 24hr load test stable
  • Complete documentation
  • Zero architectural debt

Risk Mitigation

  1. Incremental Approach - Each phase builds on previous
  2. Test Everything - No untested code paths
  3. Feature Flags - Can disable new features if issues
  4. Rollback Plan - Each phase can be reverted independently
  5. Performance Benchmarks - Measure impact of each change

Implementation Notes

Quick Wins First

Start each phase with quick wins (<1hr tasks) to build momentum and reduce risk surface area.

Parallel Work

Within each phase, independent tasks can be parallelized, but phases must be sequential.

Testing Strategy

  • Unit tests for each component
  • Integration tests for service interactions
  • Property-based tests for concurrent operations
  • Load tests for performance validation

Documentation Requirements

  • Update docs as we go, not at end
  • Include architecture decisions
  • Document gotchas and workarounds
  • Maintain compatibility notes

Conclusion

This plan addresses all critical issues in the correct order:

  1. Stabilize - Fix OTP violations
  2. Persist - Add state management
  3. Decompose - Break up monoliths
  4. Productionize - Add infrastructure
  5. Integrate - Tie it all together

The system will evolve from “barely working” to “production-grade” through systematic improvements that build upon each other. Each phase delivers tangible value while setting up the next phase for success.

Total Timeline: 5 weeks of focused development Current Date: July 1, 2025 Target Completion: Early August 2025

Let’s begin with Phase 1, Day 1: Fixing unsupervised Task operations.