← Back to Foundation

JULY 1 2025 PRE PHASE 2 OTP report gem 02d replaceTelemetryWithOTP

Documentation for JULY_1_2025_PRE_PHASE_2_OTP_report_gem_02d_replaceTelemetryWithOTP from the Foundation repository.

Excellent. This is the perfect mindset for mastering OTP. The journey from “this seems clever and it works” to “this is robust and idiomatic” often involves exactly this kind of feedback loop. Your telemetry-based approach is a very common and intelligent-seeming pattern, but it falls into a classic trap. Let’s shred it apart constructively and build a far more robust testing architecture.

Here is a series of documents that critiques the proposed telemetry strategy, outlines a superior OTP-aligned testing architecture, provides the code for the necessary helpers, and gives you a clear migration plan.


REVIEW_OF_TELEMETRY_STRATEGY.md

A Critical Review of the Proposed Event-Driven Testing Architecture

High-Level Verdict: The proposed architecture’s core philosophy—“Event-Driven Everything”—is powerful for production observability but fundamentally flawed when used for test synchronization and control flow. It mistakes an observability side-effect (a telemetry event) for a completion guarantee (a change in state). This leads to tests that are indirect, brittle, and can mask real race conditions.

The Core Flaw: Testing the Fire Alarm, Not the Fire

Imagine you want to test if a toaster works.

  • The Incorrect (Telemetry-Based) Way: Put bread in the toaster, push the button, and then put your ear next to the fire alarm. When you hear the fire alarm go off (the telemetry event), you assume the toast is burnt (the state has changed) and you run to check the toaster.
  • The Correct (Direct API) Way: Put bread in the toaster, push the button, and then look at the toaster. When the toast pops up (the operation completes), you check if it’s toasted correctly.

Your proposed testing architecture is listening for the fire alarm. It’s an indirect signal that is not contractually bound to the operation it’s observing.

Why This Is a Critical Anti-Pattern:

  1. Race Conditions are Guaranteed: Telemetry events are often emitted asynchronously. There is no guarantee that your test’s telemetry handler will be attached before the application code emits the event. If the event fires first, your test will hang forever. This is a classic race condition that makes tests flaky.

  2. It Tests the Implementation, Not the Behavior: Your test becomes tightly coupled to the fact that a specific telemetry event is fired. If a developer refactors the code and changes the event name—even if the application’s behavior is still correct—the test will break. You should be testing the public API and its observable state changes, not its internal telemetry emissions.

  3. It Hides the Real Problem: Untestable APIs: The only reason you need to listen for telemetry events to synchronize tests is because the underlying application code does not provide a synchronous API for testing. Instead of fixing the root cause (the untestable code), this approach builds an elaborate and fragile workaround. The application’s GenServers are using cast for fire-and-forget operations, leaving you with no way to know when they are done.

  4. Control Flow via Side-Effects: Using telemetry for control flow is like using logging to pass data. It’s a side-channel that is not designed for this purpose. It makes the test’s logic non-linear and hard to reason about.

Conclusion: The TELEMETRY.md document outlines an excellent observability strategy but a flawed testing strategy. The test helpers it proposes would codify an anti-pattern. We must reject the premise of using telemetry for synchronization and instead focus on making the application code itself directly testable.


TESTING_ARCHITECTURE.md

A Robust Testing Architecture for OTP Systems

Core Principle: A test should interact with the System Under Test (SUT) through its public API. If an API is asynchronous (cast), a corresponding synchronous version (call) must be provided for testing.

This architecture is built on three pillars:

  1. Testable Application APIs: The application code must be designed to be testable.
  2. Synchronous Test Helpers: Create a Test.Helpers module that provides synchronous wrappers for common asynchronous operations.
  3. Telemetry for Verification, Not Synchronization: Use telemetry only to assert that an event was emitted, not to wait for it.

Pillar 1: Testable Application APIs

The root cause of Process.sleep or telemetry-based waiting is a “fire-and-forget” API. To fix this, we must provide a way to know when an operation is complete.

The Pattern: Add Synchronous handle_call Handlers for Tests

For any handle_cast that triggers an asynchronous workflow, add a corresponding handle_call that performs the same work but replies when finished.

Example: CircuitBreaker State Change

  • Application Code (circuit_breaker.ex): The code might use a cast internally to update state to avoid blocking.
    # The asynchronous path used in production
    def handle_cast({:record_failure, service_id}, state) do
      # ... logic to potentially open the circuit ...
      new_state = ...
      # Emits telemetry as a side-effect
      :telemetry.execute([:state_change], ...)
      {:noreply, new_state}
    end
    
    # The NEW synchronous path for testing
    @doc "For testing purposes only."
    def handle_call({:sync_record_failure, service_id}, _from, state) do
      # ... same logic as handle_cast ...
      new_state = ...
      {:reply, :ok, new_state}
    end
    

Pillar 2: Synchronous Test Helpers

Create a dedicated module, Foundation.Test.Helpers, to encapsulate the logic of calling the synchronous test APIs and polling for state changes.

The Pattern: Wrap Async Operations in Synchronous Helpers

Old, Flaky Way (in test file):

# Anti-pattern: test listens for a side-effect
assert_telemetry_event [:state_change], %{to: :open} do
  CircuitBreaker.fail(breaker)
end
assert CircuitBreaker.status(breaker) == :open

New, Robust Way (in test file):

# Test calls a synchronous helper that hides the complexity
TestHelpers.trip_circuit_breaker(breaker)

# The state is guaranteed to be correct now.
assert CircuitBreaker.status(breaker) == :open

The helper handles the direct interaction and synchronization:

# In foundation/test/helpers.ex
def trip_circuit_breaker(breaker_pid) do
  # Directly call the synchronous test API
  :ok = GenServer.call(breaker_pid, {:sync_record_failure})
end

For cases where a synchronous API is not feasible, the helper can implement a robust, controlled poll:

def wait_for_state(pid, checker_fun, timeout \\ 1000) do
  # ... robust polling logic that calls a public API ...
end

Pillar 3: The Correct Role of Telemetry in Tests

Telemetry should be used to verify that the system is emitting the correct observational events, not to synchronize the test’s execution flow.

The Pattern: assert_emitted

The assert_emitted macro captures telemetry events during a synchronous block of code and lets you make assertions on them after the code has completed.

Correct Usage:

# Test that a successful call emits the right event
assert_emitted [:call_success] do
  # This block is synchronous and completes fully
  {:ok, :success} = CircuitBreaker.call(breaker, fn -> :success end)
end

# Test that tripping the breaker emits the right event
assert_emitted [:state_change], fn metadata -> metadata.to == :open end do
  # This helper is synchronous
  TestHelpers.trip_circuit_breaker(breaker)
end

This pattern is robust because the code block is guaranteed to have finished before the telemetry assertions are made. It correctly tests telemetry as an outcome of an action, not as a signal for an action.


TEST_HELPERS.md

A Guide to Foundation.Test.Helpers

This module provides robust, synchronous helpers for testing asynchronous OTP systems, eliminating the need for Process.sleep and fragile telemetry-based synchronization.

Core Helper: sync_operation/3

This is the foundation of our testing strategy. It sends an async message (cast) to a GenServer and then polls a public API on that server until a desired state is reached.

defmodule Foundation.Test.Helpers do
  @moduledoc "Synchronous helpers for testing asynchronous OTP systems."

  @doc """
  Casts a message to a GenServer and waits for a condition to be met.

  This is the primary tool for turning an asynchronous operation into a
  synchronous one for testing purposes.
  """
  @spec sync_operation(pid, term, (-> boolean), non_neg_integer) :: :ok | {:error, :timeout}
  def sync_operation(pid, message, checker_fun, timeout \\ 1000) do
    :ok = GenServer.cast(pid, message)

    start_time = System.monotonic_time(:millisecond)
    wait_interval = 10 # ms

    Stream.repeatedly(fn ->
      if checker_fun.() do
        :ok
      else
        if System.monotonic_time(:millisecond) - start_time > timeout do
          :timeout
        else
          :timer.sleep(wait_interval)
          :pending
        end
      end
    end)
    |> Enum.find(&(&1 != :pending))
    |> case do
      :ok -> :ok
      :timeout -> {:error, :timeout}
    end
  end
end

Specific Use-Case Helpers

We build specific, intention-revealing helpers on top of sync_operation or direct GenServer.calls to the test-only APIs.

defmodule Foundation.Test.Helpers do
  # ... (sync_operation from above) ...

  @doc "Synchronously trips a circuit breaker and waits for it to open."
  def trip_circuit_breaker(pid, fail_count \\ 5) do
    # Assuming the CB has a synchronous test API `:sync_fail`
    for _ <- 1..fail_count do
      :ok = GenServer.call(pid, :sync_fail)
    end
  end

  @doc "Waits for a supervised process to be restarted."
  def wait_for_restart(supervisor, old_pid, timeout \\ 1000) do
    sync_operation(
      supervisor,
      {:kill_child, old_pid}, # A message the test supervisor can handle
      fn ->
        new_pid = Process.whereis(Supervisor.child_spec(old_pid, []).id)
        is_pid(new_pid) and new_pid != old_pid
      end,
      timeout
    )
  end
end

Telemetry Assertion Helpers

These helpers correctly use telemetry for verification, not synchronization.

defmodule Foundation.Test.TelemetryHelpers do
  import ExUnit.Assertions

  @doc """
  Asserts that a specific telemetry event was emitted during the block's execution.
  """
  defmacro assert_emitted(event_pattern, filter_fun \\ fn _ -> true end, do: block) do
    quote do
      capture_ref = make_ref()
      handler_id = {:telemetry_capture, capture_ref}

      # Attach a handler to capture events
      :telemetry.attach(handler_id, unquote(event_pattern), &__MODULE__.capture_handler/4, %{ref: capture_ref, test_pid: self()})

      # Execute the code block
      result = unquote(block)

      # Detach the handler immediately
      :telemetry.detach(handler_id)

      # Check received messages for the event
      assert_receive {:telemetry_captured, ^capture_ref, captured_meta}, 500,
        "Expected telemetry event matching #{inspect(unquote(event_pattern))} was not emitted."

      assert unquote(filter_fun).(captured_meta),
        "The captured event metadata did not match the filter function."

      result
    end
  end

  # The handler function that sends captured events to the test process
  def capture_handler(_event, _measurements, metadata, config) do
    send(config.test_pid, {:telemetry_captured, config.ref, metadata})
  end
end

MIGRATION_STRATEGY.md

Migrating from Sleep/Telemetry-Based Tests to a Robust Architecture

This is a step-by-step guide to refactor the codebase and test suite towards a more robust, OTP-aligned architecture.

Phase 1: Build the New Foundation (No Test Changes Yet)

  1. Create Foundation.Test.Helpers: Implement the sync_operation/4 and the Foundation.Test.TelemetryHelpers with assert_emitted/3 as described in TEST_HELPERS.md.
  2. Add a Test Supervisor: Create a Foundation.Test.Supervisor that can be started in test_helper.exs. This supervisor will be the parent for any components needed during testing, ensuring proper supervision even in isolated tests.
  3. CI Integration: Add a credo check to forbid Process.sleep/1 in the /test directory to prevent new instances of the anti-pattern from being introduced.

Phase 2: Make Application Code Testable (No Test Changes Yet)

  1. Identify Asynchronous APIs: Go through key GenServers (CircuitBreaker, RateLimiter, agent modules). Identify handle_cast or send(self(), ...) patterns that trigger complex state changes.
  2. Add Synchronous Test APIs: For each identified async operation, add a corresponding handle_call that performs the exact same logic but replies upon completion. Mark these with @doc "For testing purposes only.".
    • Example: If handle_cast(:record_failure, ...) exists, add handle_call(:sync_record_failure, ...) that does the same work and then {:reply, :ok, new_state}.

Phase 3: Incrementally Refactor Tests (File by File)

This is the core migration work. Tackle one test file at a time to keep the scope manageable.

  1. Pick a Test File: Start with a simple one, like circuit_breaker_test.exs.
  2. Remove Process.sleep/1: Find every Process.sleep call. Analyze what it’s waiting for.
  3. Introduce Test Helpers:
    • If waiting for a state change, create a specific helper in Foundation.Test.Helpers (e.g., trip_circuit_breaker/1) that uses the new synchronous :sync_fail API.
    • Replace the sleep and subsequent assertions with a single call to the new helper.
  4. Refactor Telemetry Assertions:
    • Find every assert_telemetry_event that is being used for synchronization.
    • Replace it with a call to a synchronous test helper.
    • Wrap the helper call in the new assert_emitted macro to verify that the telemetry event was still fired as a side-effect.

Example Transformation:

Before:

test "circuit opens after 3 failures" do
  {:ok, breaker} = CircuitBreaker.start_link(...)
  assert_telemetry_event [:state_change], %{to: :open} do
    CircuitBreaker.fail(breaker)
    CircuitBreaker.fail(breaker)
    CircuitBreaker.fail(breaker)
  end
  assert CircuitBreaker.status(breaker) == :open
end

After:

test "circuit opens after 3 failures" do
  {:ok, breaker} = CircuitBreaker.start_link(...)

  # Assert the event is emitted *while* synchronously tripping the breaker
  assert_emitted [:state_change], &(&1.to == :open) do
    TestHelpers.trip_circuit_breaker(breaker, 3)
  end

  # The state is now guaranteed to be correct
  assert CircuitBreaker.status(breaker) == :open
end

Phase 4: Final Cleanup

  1. Once all tests are migrated, search the codebase for any remaining uses of the old, fragile patterns and remove them.
  2. Review the Foundation.TelemetryTestHelpers from the original document and delete them, as they have been replaced by the more robust patterns.
  3. Celebrate having a fast, reliable, and deterministic test suite.