Excellent. This is the perfect mindset for mastering OTP. The journey from “this seems clever and it works” to “this is robust and idiomatic” often involves exactly this kind of feedback loop. Your telemetry-based approach is a very common and intelligent-seeming pattern, but it falls into a classic trap. Let’s shred it apart constructively and build a far more robust testing architecture.
Here is a series of documents that critiques the proposed telemetry strategy, outlines a superior OTP-aligned testing architecture, provides the code for the necessary helpers, and gives you a clear migration plan.
REVIEW_OF_TELEMETRY_STRATEGY.md
A Critical Review of the Proposed Event-Driven Testing Architecture
High-Level Verdict: The proposed architecture’s core philosophy—“Event-Driven Everything”—is powerful for production observability but fundamentally flawed when used for test synchronization and control flow. It mistakes an observability side-effect (a telemetry event) for a completion guarantee (a change in state). This leads to tests that are indirect, brittle, and can mask real race conditions.
The Core Flaw: Testing the Fire Alarm, Not the Fire
Imagine you want to test if a toaster works.
- The Incorrect (Telemetry-Based) Way: Put bread in the toaster, push the button, and then put your ear next to the fire alarm. When you hear the fire alarm go off (the telemetry event), you assume the toast is burnt (the state has changed) and you run to check the toaster.
- The Correct (Direct API) Way: Put bread in the toaster, push the button, and then look at the toaster. When the toast pops up (the operation completes), you check if it’s toasted correctly.
Your proposed testing architecture is listening for the fire alarm. It’s an indirect signal that is not contractually bound to the operation it’s observing.
Why This Is a Critical Anti-Pattern:
Race Conditions are Guaranteed: Telemetry events are often emitted asynchronously. There is no guarantee that your test’s telemetry handler will be attached before the application code emits the event. If the event fires first, your test will hang forever. This is a classic race condition that makes tests flaky.
It Tests the Implementation, Not the Behavior: Your test becomes tightly coupled to the fact that a specific telemetry event is fired. If a developer refactors the code and changes the event name—even if the application’s behavior is still correct—the test will break. You should be testing the public API and its observable state changes, not its internal telemetry emissions.
It Hides the Real Problem: Untestable APIs: The only reason you need to listen for telemetry events to synchronize tests is because the underlying application code does not provide a synchronous API for testing. Instead of fixing the root cause (the untestable code), this approach builds an elaborate and fragile workaround. The application’s
GenServer
s are usingcast
for fire-and-forget operations, leaving you with no way to know when they are done.Control Flow via Side-Effects: Using telemetry for control flow is like using logging to pass data. It’s a side-channel that is not designed for this purpose. It makes the test’s logic non-linear and hard to reason about.
Conclusion:
The TELEMETRY.md
document outlines an excellent observability strategy but a flawed testing strategy. The test helpers it proposes would codify an anti-pattern. We must reject the premise of using telemetry for synchronization and instead focus on making the application code itself directly testable.
TESTING_ARCHITECTURE.md
A Robust Testing Architecture for OTP Systems
Core Principle: A test should interact with the System Under Test (SUT) through its public API. If an API is asynchronous (cast
), a corresponding synchronous version (call
) must be provided for testing.
This architecture is built on three pillars:
- Testable Application APIs: The application code must be designed to be testable.
- Synchronous Test Helpers: Create a
Test.Helpers
module that provides synchronous wrappers for common asynchronous operations. - Telemetry for Verification, Not Synchronization: Use telemetry only to assert that an event was emitted, not to wait for it.
Pillar 1: Testable Application APIs
The root cause of Process.sleep
or telemetry-based waiting is a “fire-and-forget” API. To fix this, we must provide a way to know when an operation is complete.
The Pattern: Add Synchronous handle_call
Handlers for Tests
For any handle_cast
that triggers an asynchronous workflow, add a corresponding handle_call
that performs the same work but replies when finished.
Example: CircuitBreaker
State Change
- Application Code (
circuit_breaker.ex
): The code might use acast
internally to update state to avoid blocking.# The asynchronous path used in production def handle_cast({:record_failure, service_id}, state) do # ... logic to potentially open the circuit ... new_state = ... # Emits telemetry as a side-effect :telemetry.execute([:state_change], ...) {:noreply, new_state} end # The NEW synchronous path for testing @doc "For testing purposes only." def handle_call({:sync_record_failure, service_id}, _from, state) do # ... same logic as handle_cast ... new_state = ... {:reply, :ok, new_state} end
Pillar 2: Synchronous Test Helpers
Create a dedicated module, Foundation.Test.Helpers
, to encapsulate the logic of calling the synchronous test APIs and polling for state changes.
The Pattern: Wrap Async Operations in Synchronous Helpers
Old, Flaky Way (in test file):
# Anti-pattern: test listens for a side-effect
assert_telemetry_event [:state_change], %{to: :open} do
CircuitBreaker.fail(breaker)
end
assert CircuitBreaker.status(breaker) == :open
New, Robust Way (in test file):
# Test calls a synchronous helper that hides the complexity
TestHelpers.trip_circuit_breaker(breaker)
# The state is guaranteed to be correct now.
assert CircuitBreaker.status(breaker) == :open
The helper handles the direct interaction and synchronization:
# In foundation/test/helpers.ex
def trip_circuit_breaker(breaker_pid) do
# Directly call the synchronous test API
:ok = GenServer.call(breaker_pid, {:sync_record_failure})
end
For cases where a synchronous API is not feasible, the helper can implement a robust, controlled poll:
def wait_for_state(pid, checker_fun, timeout \\ 1000) do
# ... robust polling logic that calls a public API ...
end
Pillar 3: The Correct Role of Telemetry in Tests
Telemetry should be used to verify that the system is emitting the correct observational events, not to synchronize the test’s execution flow.
The Pattern: assert_emitted
The assert_emitted
macro captures telemetry events during a synchronous block of code and lets you make assertions on them after the code has completed.
Correct Usage:
# Test that a successful call emits the right event
assert_emitted [:call_success] do
# This block is synchronous and completes fully
{:ok, :success} = CircuitBreaker.call(breaker, fn -> :success end)
end
# Test that tripping the breaker emits the right event
assert_emitted [:state_change], fn metadata -> metadata.to == :open end do
# This helper is synchronous
TestHelpers.trip_circuit_breaker(breaker)
end
This pattern is robust because the code block is guaranteed to have finished before the telemetry assertions are made. It correctly tests telemetry as an outcome of an action, not as a signal for an action.
TEST_HELPERS.md
A Guide to Foundation.Test.Helpers
This module provides robust, synchronous helpers for testing asynchronous OTP systems, eliminating the need for Process.sleep
and fragile telemetry-based synchronization.
Core Helper: sync_operation/3
This is the foundation of our testing strategy. It sends an async message (cast
) to a GenServer and then polls a public API on that server until a desired state is reached.
defmodule Foundation.Test.Helpers do
@moduledoc "Synchronous helpers for testing asynchronous OTP systems."
@doc """
Casts a message to a GenServer and waits for a condition to be met.
This is the primary tool for turning an asynchronous operation into a
synchronous one for testing purposes.
"""
@spec sync_operation(pid, term, (-> boolean), non_neg_integer) :: :ok | {:error, :timeout}
def sync_operation(pid, message, checker_fun, timeout \\ 1000) do
:ok = GenServer.cast(pid, message)
start_time = System.monotonic_time(:millisecond)
wait_interval = 10 # ms
Stream.repeatedly(fn ->
if checker_fun.() do
:ok
else
if System.monotonic_time(:millisecond) - start_time > timeout do
:timeout
else
:timer.sleep(wait_interval)
:pending
end
end
end)
|> Enum.find(&(&1 != :pending))
|> case do
:ok -> :ok
:timeout -> {:error, :timeout}
end
end
end
Specific Use-Case Helpers
We build specific, intention-revealing helpers on top of sync_operation
or direct GenServer.call
s to the test-only APIs.
defmodule Foundation.Test.Helpers do
# ... (sync_operation from above) ...
@doc "Synchronously trips a circuit breaker and waits for it to open."
def trip_circuit_breaker(pid, fail_count \\ 5) do
# Assuming the CB has a synchronous test API `:sync_fail`
for _ <- 1..fail_count do
:ok = GenServer.call(pid, :sync_fail)
end
end
@doc "Waits for a supervised process to be restarted."
def wait_for_restart(supervisor, old_pid, timeout \\ 1000) do
sync_operation(
supervisor,
{:kill_child, old_pid}, # A message the test supervisor can handle
fn ->
new_pid = Process.whereis(Supervisor.child_spec(old_pid, []).id)
is_pid(new_pid) and new_pid != old_pid
end,
timeout
)
end
end
Telemetry Assertion Helpers
These helpers correctly use telemetry for verification, not synchronization.
defmodule Foundation.Test.TelemetryHelpers do
import ExUnit.Assertions
@doc """
Asserts that a specific telemetry event was emitted during the block's execution.
"""
defmacro assert_emitted(event_pattern, filter_fun \\ fn _ -> true end, do: block) do
quote do
capture_ref = make_ref()
handler_id = {:telemetry_capture, capture_ref}
# Attach a handler to capture events
:telemetry.attach(handler_id, unquote(event_pattern), &__MODULE__.capture_handler/4, %{ref: capture_ref, test_pid: self()})
# Execute the code block
result = unquote(block)
# Detach the handler immediately
:telemetry.detach(handler_id)
# Check received messages for the event
assert_receive {:telemetry_captured, ^capture_ref, captured_meta}, 500,
"Expected telemetry event matching #{inspect(unquote(event_pattern))} was not emitted."
assert unquote(filter_fun).(captured_meta),
"The captured event metadata did not match the filter function."
result
end
end
# The handler function that sends captured events to the test process
def capture_handler(_event, _measurements, metadata, config) do
send(config.test_pid, {:telemetry_captured, config.ref, metadata})
end
end
MIGRATION_STRATEGY.md
Migrating from Sleep/Telemetry-Based Tests to a Robust Architecture
This is a step-by-step guide to refactor the codebase and test suite towards a more robust, OTP-aligned architecture.
Phase 1: Build the New Foundation (No Test Changes Yet)
- Create
Foundation.Test.Helpers
: Implement thesync_operation/4
and theFoundation.Test.TelemetryHelpers
withassert_emitted/3
as described inTEST_HELPERS.md
. - Add a Test Supervisor: Create a
Foundation.Test.Supervisor
that can be started intest_helper.exs
. This supervisor will be the parent for any components needed during testing, ensuring proper supervision even in isolated tests. - CI Integration: Add a
credo
check to forbidProcess.sleep/1
in the/test
directory to prevent new instances of the anti-pattern from being introduced.
Phase 2: Make Application Code Testable (No Test Changes Yet)
- Identify Asynchronous APIs: Go through key
GenServer
s (CircuitBreaker
,RateLimiter
, agent modules). Identifyhandle_cast
orsend(self(), ...)
patterns that trigger complex state changes. - Add Synchronous Test APIs: For each identified async operation, add a corresponding
handle_call
that performs the exact same logic but replies upon completion. Mark these with@doc "For testing purposes only."
.- Example: If
handle_cast(:record_failure, ...)
exists, addhandle_call(:sync_record_failure, ...)
that does the same work and then{:reply, :ok, new_state}
.
- Example: If
Phase 3: Incrementally Refactor Tests (File by File)
This is the core migration work. Tackle one test file at a time to keep the scope manageable.
- Pick a Test File: Start with a simple one, like
circuit_breaker_test.exs
. - Remove
Process.sleep/1
: Find everyProcess.sleep
call. Analyze what it’s waiting for. - Introduce Test Helpers:
- If waiting for a state change, create a specific helper in
Foundation.Test.Helpers
(e.g.,trip_circuit_breaker/1
) that uses the new synchronous:sync_fail
API. - Replace the
sleep
and subsequent assertions with a single call to the new helper.
- If waiting for a state change, create a specific helper in
- Refactor Telemetry Assertions:
- Find every
assert_telemetry_event
that is being used for synchronization. - Replace it with a call to a synchronous test helper.
- Wrap the helper call in the new
assert_emitted
macro to verify that the telemetry event was still fired as a side-effect.
- Find every
Example Transformation:
Before:
test "circuit opens after 3 failures" do
{:ok, breaker} = CircuitBreaker.start_link(...)
assert_telemetry_event [:state_change], %{to: :open} do
CircuitBreaker.fail(breaker)
CircuitBreaker.fail(breaker)
CircuitBreaker.fail(breaker)
end
assert CircuitBreaker.status(breaker) == :open
end
After:
test "circuit opens after 3 failures" do
{:ok, breaker} = CircuitBreaker.start_link(...)
# Assert the event is emitted *while* synchronously tripping the breaker
assert_emitted [:state_change], &(&1.to == :open) do
TestHelpers.trip_circuit_breaker(breaker, 3)
end
# The state is now guaranteed to be correct
assert CircuitBreaker.status(breaker) == :open
end
Phase 4: Final Cleanup
- Once all tests are migrated, search the codebase for any remaining uses of the old, fragile patterns and remove them.
- Review the
Foundation.TelemetryTestHelpers
from the original document and delete them, as they have been replaced by the more robust patterns. - Celebrate having a fast, reliable, and deterministic test suite.