# Foundation Telemetry Dashboard Setup Guide
This guide explains how to set up monitoring dashboards for Foundation telemetry using Prometheus and Grafana.
## Prerequisites
- Docker and Docker Compose (recommended)
- Or manually installed Prometheus and Grafana
## Quick Start with Docker Compose

- Create a `docker-compose.yml` file:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: foundation-prometheus
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: foundation-grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./config/grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: foundation-alertmanager
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
```
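The Compose file above mounts `./config/prometheus.yml` and `./config/alerts`, which this guide does not otherwise define. Below is a minimal sketch of `config/prometheus.yml`; the job name and scrape target are assumptions (the Elixir application configured later in this guide exposes metrics on port 9568, and `host.docker.internal` only resolves on Docker Desktop, so adjust the target to however Prometheus can reach your app):

```yaml
# config/prometheus.yml -- minimal sketch, adjust targets to your environment
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alert rules mounted from ./config/alerts (see Alerting Setup below)
rule_files:
  - /etc/prometheus/alerts/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'foundation'
    static_configs:
      # Assumed address of the Elixir app's /metrics endpoint on the Docker host
      - targets: ['host.docker.internal:9568']
```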
- Start the monitoring stack:

```bash
docker-compose up -d
```
- Access the services:
  - Prometheus: http://localhost:9090
  - Grafana: http://localhost:3000 (admin/admin)
  - Alertmanager: http://localhost:9093
## Elixir Application Setup

### 1. Add Dependencies

Add to your `mix.exs`:
```elixir
defp deps do
  [
    {:telemetry, "~> 1.2"},
    {:telemetry_metrics, "~> 0.6"},
    {:telemetry_metrics_prometheus, "~> 1.1"},
    {:telemetry_poller, "~> 1.0"}
  ]
end
```
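Then fetch the new dependencies:

```bash
mix deps.get
```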
### 2. Configure Telemetry Metrics
Add to your application supervision tree:
```elixir
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # ... other children ...

      # Telemetry metrics reporter
      {TelemetryMetricsPrometheus,
       [
         metrics: Foundation.Telemetry.Metrics.metrics(),
         port: 9568,
         path: "/metrics",
         name: :foundation_metrics
       ]},

      # Telemetry poller for VM metrics
      {:telemetry_poller,
       measurements: [
         {Foundation.Telemetry, :dispatch_vm_metrics, []}
       ],
       period: 10_000,
       name: :foundation_poller}
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```
### 3. Add VM Metrics Collection
Create a module to collect VM metrics:
```elixir
defmodule Foundation.Telemetry do
  def dispatch_vm_metrics do
    memory = :erlang.memory()

    :telemetry.execute(
      [:vm, :memory],
      %{
        total: memory[:total],
        processes: memory[:processes],
        system: memory[:system],
        atom: memory[:atom],
        binary: memory[:binary],
        ets: memory[:ets]
      },
      %{}
    )

    :telemetry.execute(
      [:vm, :system_info],
      %{
        process_count: :erlang.system_info(:process_count),
        port_count: :erlang.system_info(:port_count),
        ets_count: :erlang.system_info(:ets_count)
      },
      %{}
    )
  end
end
```
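The supervision tree in step 2 references `Foundation.Telemetry.Metrics.metrics()`, which is not shown above. A minimal sketch of that module follows; the VM metric definitions match the events emitted by `dispatch_vm_metrics/0`, while the cache metrics are illustrative placeholders to replace with whatever your application actually emits:

```elixir
defmodule Foundation.Telemetry.Metrics do
  import Telemetry.Metrics

  def metrics do
    [
      # VM metrics emitted by Foundation.Telemetry.dispatch_vm_metrics/0
      last_value("vm.memory.total", unit: :byte),
      last_value("vm.memory.processes", unit: :byte),
      last_value("vm.memory.binary", unit: :byte),
      last_value("vm.memory.ets", unit: :byte),
      last_value("vm.system_info.process_count"),
      last_value("vm.system_info.port_count"),
      last_value("vm.system_info.ets_count"),

      # Placeholder application metrics -- replace with your own events
      counter("foundation.cache.hit.total",
        event_name: [:foundation, :cache, :hit]
      ),
      distribution("foundation.cache.duration.microseconds",
        event_name: [:foundation, :cache, :duration],
        measurement: :duration,
        reporter_options: [buckets: [100, 500, 1_000, 5_000, 10_000]]
      )
    ]
  end
end
```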
## Grafana Dashboard Import

1. Log into Grafana (http://localhost:3000).
2. Add the Prometheus data source:
   - Go to Configuration → Data Sources
   - Add data source → Prometheus
   - URL: http://prometheus:9090
   - Save & Test
3. Import the Foundation dashboard:
   - Go to Create → Import
   - Upload `config/telemetry_dashboard.json`
   - Select the Prometheus data source
   - Import
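Because the Compose file mounts `./config/grafana/provisioning`, the data source can also be provisioned automatically instead of being added by hand. A minimal sketch (the file name is arbitrary; the path under `provisioning/datasources/` is what matters):

```yaml
# config/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```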
## Customizing Metrics

### Adding Custom Metrics
```elixir
# In your application code
:telemetry.execute(
  [:my_app, :custom, :metric],
  %{value: 42, duration: 1000},
  %{tag: "important"}
)

# Add to Foundation.Telemetry.Metrics
Metrics.counter("my_app.custom.metric.total",
  event_name: [:my_app, :custom, :metric],
  tags: [:tag]
)
```
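Counters are not the only option; the same event can feed other metric types depending on which measurement you care about. A sketch using the hypothetical event above (bucket boundaries are illustrative):

```elixir
# Track the distribution of the :duration measurement
Metrics.distribution("my_app.custom.metric.duration",
  event_name: [:my_app, :custom, :metric],
  measurement: :duration,
  tags: [:tag],
  reporter_options: [buckets: [100, 500, 1_000, 5_000]]
)

# Expose the most recent :value measurement as a gauge
Metrics.last_value("my_app.custom.metric.value",
  event_name: [:my_app, :custom, :metric],
  measurement: :value,
  tags: [:tag]
)
```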
### Creating Custom Dashboards
Use Grafana’s query builder with these common patterns:
```promql
# Rate of events
rate(foundation_cache_hit_total[5m])

# Histogram percentiles
histogram_quantile(0.95, rate(foundation_cache_duration_microseconds_bucket[5m]))

# Gauge values
foundation_resource_manager_active_tokens

# Aggregations
sum(rate(jido_system_task_completed_total[5m])) by (task_type)
```
## Alerting Setup

### Configure Alertmanager

Create `config/alertmanager.yml`:
```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-url'
        send_resolved: true
```
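Alertmanager only routes alerts; the alert conditions themselves are Prometheus rules, loaded from the `./config/alerts` directory that the Compose file mounts. A minimal sketch, with the job label and threshold as assumptions:

```yaml
# config/alerts/foundation.yml
groups:
  - name: foundation
    rules:
      - alert: FoundationMetricsEndpointDown
        expr: up{job="foundation"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Foundation metrics endpoint has been unreachable for 5 minutes"
```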
### Testing Alerts

- Check active alerts in Prometheus: http://localhost:9090/alerts
- View alert history in Alertmanager: http://localhost:9093
- Test webhook delivery with a test receiver (see the example below)
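One way to exercise routing and webhook delivery without waiting for a real alert is to post a synthetic alert to Alertmanager's v2 API, for example:

```bash
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"}}]'
```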
## Performance Considerations

Metric Cardinality: Avoid high-cardinality labels (e.g., user IDs).

Sampling: For high-volume events, consider sampling:

```elixir
# 10% sampling
if :rand.uniform() < 0.1 do
  :telemetry.execute([:high, :volume, :event], %{}, %{})
end
```

Metric Retention: Configure Prometheus retention:

```yaml
# In prometheus.yml
global:
  external_labels:
    monitor: 'foundation'

# Start Prometheus with --storage.tsdb.retention.time=30d
```
## Troubleshooting

### No Metrics Appearing
Check if the metrics endpoint is accessible:

```bash
curl http://localhost:9568/metrics
```

Verify that Prometheus is scraping:

- Go to http://localhost:9090/targets
- Check if your target is UP

Check the application logs for telemetry errors.
### Missing Metrics

Ensure events are being emitted:

```elixir
:telemetry.attach("debug", [:foundation, :cache, :hit], fn event, measures, meta, _ ->
  IO.inspect({event, measures, meta})
end, nil)
```

Verify that metric definitions match event names.
### High Memory Usage
- Reduce metric cardinality
- Increase Prometheus scrape interval
- Implement metric aggregation in application
## Best Practices

- Use Consistent Naming: Follow the `app.component.action.unit` pattern
- Add Descriptions: Always include metric descriptions
- Choose Appropriate Types:
  - Counter: For counts that only increase
  - Gauge: For values that can go up and down
  - Histogram: For measuring distributions (latencies, sizes)
  - Summary: For pre-calculated percentiles
- Label Guidelines:
  - Keep label cardinality low
  - Use static labels for grouping
  - Avoid user-specific or request-specific labels
- Dashboard Organization:
  - Group related metrics
  - Use consistent time ranges
  - Add helpful annotations
  - Include alert thresholds on graphs