For most of my career, I was the “code-first” engineer. I lived in the IDE, wrote features, debugged issues with logs, and occasionally patched together CI/CD pipelines when a team didn’t have a dedicated DevOps person.
That worked…until distributed systems entered my world. When I talk about distributed systems here, I mean managing the entire infrastructure setup and monitoring it end-to-end.
Suddenly, a failed request could hop across four different microservices, hit Kafka, touch Postgres or Redis, and vanish without a trace. Debugging with logs alone was like trying to find a needle in five different haystacks.
That was my turning point. I needed observability — not just logs, but also metrics and traces.
For my setup, I needed a way to monitor and observe the system while also optimising for reusability and easy debugging.
The initial idea, and probably what most people do, is to have every service export logs directly to Loki, expose a /metrics endpoint for Prometheus to scrape, and add OpenTelemetry for traces. Given the number of services in our backend stack, this did not quite sit well with me. I can be lazy most of the time, and that in itself is a good thing: it forces me to look for the most efficient and resilient way to solve a problem so that I do not have to come back and fix the same issue again. And if something does need fixing, I would love for the problem not to be scattered all over the place, which would mean a lot of context-switching.
I went with a better idea: the OpenTelemetry Collector.
Why OpenTelemetry Collector at the Center
Two months before this journey, I tried to wire Loki, Prometheus, and Tempo directly into the services. The result?
- Loki worked.
- Prometheus worked partially.
- Tempo? Absolutely refused.
As a perfectionist (sometimes), I wasn’t going to settle for half measures. I shelved the effort.
But this time, I approached it differently. Instead of coupling each service to three different observability tools, I made OpenTelemetry Collector (OTEL Collector) the central gateway.
Why this design?
- Decoupling → services export once (OTLP), Collector fans out.
- Flexibility → swap or add backends without touching app code.
- Consistency → all services speak the same language.
In other words: do observability right.
Getting Started
Before we get started, you need to have these tools installed: Docker and Docker Compose.
If you intend to follow along in this guide, here is the directory structure that I am currently using:
├── config
│ ├── loki-config.yaml
│ ├── otel-collector-config.yaml
│ ├── prometheus.yml
│ └── tempo.yaml
├── docker
│ └── docker-compose.yml
└── provisioning
└── datasources
└── datasources.yaml
Before we get into each observability signal, you have to download and run Loki, Prometheus, and Tempo (for traces), then make them available to Grafana for data visualisation.
You also need to set up the most important component, the one that lets observability data flow from the individual services and become available for visualisation: our OTEL Collector.
I used Docker Compose to configure each of these components; here is what my docker-compose.yml looks like:
version: "3.9"
services:
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: admin
volumes:
- ../provisioning:/etc/grafana/provisioning
depends_on:
- loki
networks:
- observability-network
loki:
image: grafana/loki:3.0.0
container_name: loki
command: ["-config.file=/etc/loki/config.yaml"]
volumes:
- ../config/loki-config.yaml:/etc/loki/config.yaml:ro
- loki-data:/loki
ports:
- "3100:3100"
networks:
- observability-network
tempo:
image: grafana/tempo:latest
container_name: tempo
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ../config/tempo.yaml:/etc/tempo.yaml
- ./tempo/data:/var/tempo
ports:
- "3200:3200" # query
# - "4318:4318" # OTLP http ingest
networks:
- observability-network
otel-collector:
container_name: otel-collector
image: otel/opentelemetry-collector-contrib:latest
command: ["--config=/etc/otelcol/config.yaml"]
environment:
# Ensure the Collector's own SDK doesn't try to export via OTLP
OTEL_TRACES_EXPORTER: none
OTEL_METRICS_EXPORTER: none
OTEL_LOGS_EXPORTER: none
# (Optional) belt-and-suspenders: clear any inherited endpoint
OTEL_EXPORTER_OTLP_ENDPOINT: ""
volumes:
- ../config/otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
ports:
- "4317:4317" # OTLP gRPC in
- "4318:4318" # OTLP HTTP in
- "9464:9464" # Prometheus exposition (from received metrics)
- "8888:8888" # Collector's own metrics
restart: unless-stopped
depends_on:
- loki
networks:
- observability-network
prometheus:
image: prom/prometheus:latest
container_name: prometheus
volumes:
- ../config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
command:
- "--config.file=/etc/prometheus/prometheus.yml"
ports:
- "9090:9090"
depends_on:
- otel-collector
networks:
- observability-network
volumes:
loki-data:
networks:
observability-network:
name: observability-network
driver: bridge
This Docker Compose file sets up an observability stack with Grafana on port 3000, Loki on port 3100, Tempo on port 3200, and the OpenTelemetry Collector. The OTEL Collector is central here:
- 4317 (OTLP gRPC in) → receives telemetry (traces, metrics, logs) in gRPC format.
- 4318 (OTLP HTTP in) → receives telemetry over HTTP for clients that don’t use gRPC.
- 9464 (Prometheus exposition) → exposes metrics scraped by Prometheus from the Collector itself or translated from other sources.
- 8888 (Collector’s own metrics/health) → provides internal diagnostics about the Collector’s performance and health.
All services are linked via the observability-network, with Loki data persisted in a named volume. The Collector essentially acts as the traffic hub, receiving telemetry from applications and routing it to Loki, Tempo, and Prometheus, which Grafana then queries for unified observability.
Run docker compose up -d and let’s get started.
OpenTelemetry Collector Config
# config/otel-collector-config.yaml
receivers:
# Accept OTLP from your apps for all three signals
otlp:
protocols:
http:
endpoint: 0.0.0.0:4318
grpc:
endpoint: 0.0.0.0:4317
processors:
# Prevents OOMs; the percentage limits are relative to the memory available to the container/pod
memory_limiter:
check_interval: 1s
limit_percentage: 75
spike_limit_percentage: 25
batch:
timeout: 1s
send_batch_size: 10
exporters:
# For quick human inspection - you can always remove later in prod.
debug:
verbosity: detailed
# Expose Prometheus-scrape endpoint with OpenMetrics
prometheus:
endpoint: 0.0.0.0:9464
enable_open_metrics: true
send_timestamps: true
metric_expiration: 5m
resource_to_telemetry_conversion:
enabled: true # ← this makes resource attrs become Prom labels
# If we later prefer push instead of scrape:
# prometheusremotewrite:
# endpoint: http://mimir:9009/api/v1/push
# Traces → Tempo (start with HTTP; switch to gRPC as needed)
otlphttp/tempo:
endpoint: http://tempo:4318
tls:
insecure: true
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
# Logs → Loki (native OTLP ingestion at /otlp)
otlphttp/loki:
endpoint: http://loki:3100/otlp
tls:
insecure: true
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
sending_queue:
enabled: true
num_consumers: 4
queue_size: 100
timeout: 30s
read_buffer_size: 512
write_buffer_size: 512
service:
telemetry: # Collector's own telemetry (handy for debugging)
logs:
level: debug
encoding: json
metrics:
level: detailed
pipelines:
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [debug, prometheus] # or [prometheusremotewrite]
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [debug, otlphttp/tempo]
logs:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [debug, otlphttp/loki]
This OTEL Collector config defines how telemetry flows through the stack:
- Receivers: Accept OTLP data on 4317 (gRPC) and 4318 (HTTP) for metrics, traces, and logs from apps.
- Processors: Use memory_limiter to prevent overloads and batch to optimize throughput before exporting.
- Exporters:
- debug → prints data locally for inspection.
- prometheus (9464) → exposes metrics in Prometheus/OpenMetrics format.
- otlphttp/tempo (→ Tempo:4318) → sends traces to Tempo.
- otlphttp/loki (→ Loki:3100/otlp) → sends logs to Loki, with retries, queuing, and buffer tuning for reliability.
- Service: Defines pipelines:
- Metrics → OTLP → processed → sent to debug + Prometheus.
- Traces → OTLP → processed → sent to debug + Tempo.
- Logs → OTLP → processed → sent to debug + Loki.
In short: this config turns the OTEL Collector into a central router, receiving telemetry, stabilizing it with processors, then exporting metrics to Prometheus, traces to Tempo, logs to Loki, while also exposing its own telemetry for observability.
Before we move fully into the different observability signals, a note: I built a library that uses the OpenTelemetry SDKs to export all of this observability data to the OpenTelemetry Collector. It is not much; all it does is expose three functions that look for three different environment variables, each a URL pointing to the OTEL Collector. With that URL, the OTEL SDKs export the corresponding signal to the Collector.
Here are the function signatures:
// Logs: creates transports; by default a console transport via the nest-winston
// npm library. If process.env.OTEL_EXPORTER_OTLP_LOGS_ENDPOINT is set, it also
// adds an OTLP transport (a custom class built on one of nest-winston's built-in classes).
function setupLogger(serviceName: string): LoggerService; // the nest-winston-backed logger

// Metrics: uses process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
function setupMetrics(serviceName: string): Meter; // Meter from @opentelemetry/api

// Traces: sets up instrumentations. I enabled only the ones our services need,
// using getNodeAutoInstrumentations from @opentelemetry/auto-instrumentations-node.
// It looks for process.env.OTEL_EXPORTER_OTLP_TRACE_ENDPOINT.
function setupOpenTelemetry(serviceName: string): void;
Apart from logs, calling either of the other two functions without its environment variable set breaks your application. That is deliberate: if you want to observe your services, just add those environment variables.
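For context, here is roughly what a function like setupMetrics could look like inside such a library. This is a hypothetical sketch, not the library’s actual code; I’m assuming 1.x-era @opentelemetry packages, and constructor options differ between SDK versions:
// setup-metrics.ts: hypothetical sketch of setupMetrics (not the library's real code)
import { metrics, Meter } from '@opentelemetry/api';
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { Resource } from '@opentelemetry/resources';

export function setupMetrics(serviceName: string): Meter {
  const endpoint = process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
  if (!endpoint) {
    // matches the behaviour described above: no endpoint, loud failure
    throw new Error('OTEL_EXPORTER_OTLP_METRICS_ENDPOINT is not set');
  }

  const provider = new MeterProvider({
    // service.name becomes a Prometheus label thanks to resource_to_telemetry_conversion
    resource: new Resource({ 'service.name': serviceName }),
    readers: [
      new PeriodicExportingMetricReader({
        exporter: new OTLPMetricExporter({ url: endpoint }),
        exportIntervalMillis: 10_000, // roughly matches the 10s Prometheus scrape interval
      }),
    ],
  });

  metrics.setGlobalMeterProvider(provider);
  return metrics.getMeter(serviceName);
}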
If every container in your Docker Compose stack started successfully, those environment variables should be:
# If your app is NOT part of the `observability-network`
OTEL_EXPORTER_OTLP_TRACE_ENDPOINT=http://host.docker.internal:4318/v1/traces
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://host.docker.internal:4318/v1/metrics
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://host.docker.internal:4318/v1/logs
# If your app is part of the `observability-network`
OTEL_EXPORTER_OTLP_TRACE_ENDPOINT=http://otel-collector:4318/v1/traces
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4318/v1/metrics
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://otel-collector:4318/v1/logs
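Wiring the library into a service then looks something like this. It’s a usage sketch: the package name @acme/observability, the service name, the port, and the file layout are all placeholders of mine:
// tracing.ts: kept in its own file so it runs before anything else is loaded,
// otherwise auto-instrumentation cannot patch http, pg, kafkajs, etc.
import { setupOpenTelemetry } from '@acme/observability';
setupOpenTelemetry('gateway-service');

// main.ts
import './tracing'; // must stay the first import
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { setupLogger, setupMetrics } from '@acme/observability';

async function bootstrap() {
  const logger = setupLogger('gateway-service'); // console + OTLP transport when the logs endpoint is set
  const meter = setupMetrics('gateway-service'); // hand this Meter to whatever records your metrics

  const app = await NestFactory.create(AppModule, { logger });
  await app.listen(3001);
}

bootstrap();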
Part 1: Logs with Loki
I started with logs because they’re the most immediate. If it works, you see results quickly.
To ensure that Loki started successfully, run curl -sS http://localhost:3100/ready; it should return the text ready.
Here’s the Loki config for logs:
# config/loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
common:
instance_addr: 127.0.0.1
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2020-05-15
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://localhost:9093
This Loki config sets up Loki as a log storage and query engine listening on port 3100.
- Authentication: disabled (auth_enabled: false) for local/dev use.
- Server: serves the HTTP API for log ingestion and queries.
- Storage: uses the local filesystem (/loki/chunks for log data, /loki/rules for recording/alerting rules).
- Replication & Ring: single-instance (replication_factor: 1) with in-memory ring state, so no clustering here.
- Schema: stores logs in TSDB format (v13) with daily index periods for efficient queries.
- Ruler: integrates with an Alertmanager (on http://localhost:9093) to trigger alerts from log queries.
In short: this config makes Loki a standalone log aggregator, storing logs on disk, exposing them over 3100, and ready to send alerts to Alertmanager.
Caveats
- Loki won’t accept OTLP logs at its root URL out of the box. You must point the Collector’s exporter at Loki’s /otlp ingestion path in the exporters section of the Collector config, like so:
exporters:
otlphttp/loki:
endpoint: http://loki:3100/otlp
tls:
insecure: true
Otherwise, you’ll push logs into a void.
- Sometimes attributes like service.name don’t show up as labels, and you need to map them (or query results will look empty). In the processors section of your OTEL Collector config, you may need to enable explicit mapping like so:
processors:
  # existing processor configs
  resource:
    attributes:
      - key: service.name
        value: unknown-service # fallback value; "insert" only adds the key when it is missing
        action: insert
  # remember to add "resource" to the logs pipeline's processors list
Part 2: Metrics with Prometheus
After logs, the next step was metrics. Here’s why this mattered: logs told me when a request happened, but not how many, how fast, or how resource-hungry.
Prometheus config
# config/prometheus.yml
global:
scrape_interval: 10s # how often to scrape
evaluation_interval: 10s
scrape_configs:
- job_name: "otel-collector"
static_configs:
- targets: ["otel-collector:9464"]
This Prometheus config sets up metrics scraping for the stack:
- Global settings: scrape and evaluate rules every 10 seconds, ensuring fresh metric collection.
- Scrape job: defines a single job named otel-collector, targeting otel-collector:9464.
  - Port 9464 is where the OTEL Collector exposes metrics in Prometheus/OpenMetrics format (transformed from incoming telemetry).
In short: this config makes Prometheus continuously scrape the OTEL Collector’s 9464 endpoint, pulling in metrics from applications (via the Collector) for storage, querying, and visualization in Grafana.
Caveat
- Scrape vs Push confusion
- Prometheus doesn’t “receive” metrics like Loki or Tempo. It scrapes them.
- Exposing :9464 in the Collector is critical because that’s the scrape target.
- If you forget to point Prometheus at this target, metrics vanish silently.
Service metrics to start with
- requests_total (count of requests by method + path)
- request_duration_seconds (histogram of latencies)
- Runtime metrics (CPU, heap usage, GC pauses, event loop lag)
When I was setting this up, runtime metrics like CPU and memory usage showed up instantly. HTTP request metrics took longer, partly because my proxy middleware short-circuited Express internals. If you use proxies, ensure your middleware runs before the request ends, otherwise spans and counters won’t fire.
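To make the first two metrics concrete, here is a rough Express-style middleware built on the Meter returned by setupMetrics. It’s a sketch under naming assumptions of mine (metric names, label keys, the httpMetricsMiddleware helper), not code from the library:
// metrics.middleware.ts: hypothetical middleware recording the two HTTP metrics above
import { Request, Response, NextFunction } from 'express';
import { Meter } from '@opentelemetry/api';

export function httpMetricsMiddleware(meter: Meter) {
  const requestsTotal = meter.createCounter('requests_total', {
    description: 'Count of requests by method and path',
  });
  const requestDuration = meter.createHistogram('request_duration_seconds', {
    description: 'Request latency in seconds',
    unit: 's',
  });

  return (req: Request, res: Response, next: NextFunction) => {
    const start = process.hrtime.bigint();
    // record when the response has been sent so the duration covers the full request
    res.on('finish', () => {
      const seconds = Number(process.hrtime.bigint() - start) / 1e9;
      // prefer the route pattern over the raw URL to keep label cardinality low
      const labels = { method: req.method, path: req.route?.path ?? req.path };
      requestsTotal.add(1, labels);
      requestDuration.record(seconds, labels);
    });
    next();
  };
}
In a NestJS app you would register it with app.use(httpMetricsMiddleware(meter)) before your routes and, per the note above, before any proxy middleware that ends the request.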
Part 3: Traces with Tempo
This was the hardest part. And the most rewarding.
Traces are what tie everything together — the single request flowing across services, through Kafka, into Postgres, and back to the client.
Tempo Config
# config/tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
http:
endpoint: 0.0.0.0:4318
storage:
trace:
backend: local
local:
path: /var/tempo/traces
compactor:
compaction:
block_retention: 24h
Caveat
- Port conflicts: both the Collector and Tempo want to bind to 4318. The solution is to expose it only on the Collector; Tempo’s 4318 stays internal to the Docker network, and the Collector forwards traces there.
- Instrumentations: auto-instrumentations (NestJS, HTTP, Pg, KafkaJS) save a ton of work, but they’re noisy. Add filters (ignore /health, ignore static assets), as in the sketch below.
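Here’s a minimal sketch of that filtering, assuming the instrumentations are configured through getNodeAutoInstrumentations; the exact paths and file extensions to ignore are assumptions of mine:
// instrumentations.ts: drop health checks and static assets before they become spans
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

export const instrumentations = getNodeAutoInstrumentations({
  '@opentelemetry/instrumentation-fs': { enabled: false }, // extremely noisy
  '@opentelemetry/instrumentation-http': {
    // returning true tells the instrumentation to skip the incoming request entirely
    ignoreIncomingRequestHook: (req) =>
      req.url === '/health' || /\.(css|js|map|png|ico|svg)$/.test(req.url ?? ''),
  },
});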
Grafana – the Single Pane of Glass
After wiring up Loki for logs, Prometheus for metrics, and Tempo for traces, I still needed a single place to see everything. That’s where Grafana came in.
Grafana isn’t about collecting data; it’s about making sense of it. It pulls from Loki, Prometheus, and Tempo, then lets you query, correlate, and visualize.
Grafana provisioning config
# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
This tells Grafana about the three data sources in the stack:
- Loki → for querying logs ({service_name="gateway-service"}).
- Prometheus → for metrics (request rates, latencies, runtime stats).
- Tempo → for distributed traces (spans across microservices).
Caveat
- Datasource URLs must match your Docker/K8s network.
  - If Grafana can’t resolve http://loki:3100, you’ll see silent failures.
  - Always use service names (like loki, prometheus, tempo) instead of localhost. Inside Docker/K8s, localhost refers to the container itself, not the service.
- Tempo port confusion: Grafana queries Tempo on :3200, not :4318. :4318 is the OTLP ingest port, :3200 is the query port.
Why Grafana made it real
The moment I wired Grafana, the observability puzzle clicked:
- Logs: I could filter by service, path, or even custom attributes.
- Metrics: dashboards showed CPU/memory spikes alongside traffic.
- Traces: I could follow a request across gateway → user-service → database.
Before, each tool worked in isolation. With Grafana, you have a unified observability cockpit.
Two months ago, I gave up because Tempo didn’t show anything. The fix wasn’t in the code — it was in understanding the OTLP pipeline. Once the Collector sent spans with the right resource attributes, Grafana lit up beautifully.
Here is how everything looks when it all comes together:
┌───────────────────────────┐
│ Applications │
│ (send metrics, logs, traces)│
└─────────────┬─────────────┘
│ (OTLP gRPC/HTTP → 4317/4318)
▼
┌─────────────────────────┐
│ OpenTelemetry Collector│
│ - Receives telemetry │
│ - Processes (batch, mem) │
│ - Exposes metrics (9464) │
│ - Internal metrics (8888)│
└──────┬─────────┬────────┘
│ │
┌──────────┘ └──────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Prometheus │◄───────────│ OTEL Export │
│ (scrapes 9464│ │ Metrics │
│ every 10s) │ └──────────────┘
│ (port 9090) │
└─────┬────────┘
│
▼
┌──────────────┐
│ Grafana │
│ (port 3000) │
│ Dashboards │
└──────────────┘
Logs Path:
Apps → OTEL Collector → Loki (3100) → Grafana
Traces Path:
Apps → OTEL Collector → Tempo (4318 ingest / 3200 query) → Grafana
Metrics Path:
Apps → OTEL Collector → Prometheus (9464 scrape) → Grafana
Closing Thoughts
With Loki, Prometheus, and Tempo wired through the OTEL Collector, I now have:
- Logs → searchable, labeled by service.
- Metrics → counts, latencies, resource usage.
- Traces → end-to-end visibility across microservices.
The result is a holy grail of observability.
Not perfect, but close:
- Collector is a single point of failure (you’ll want HA in prod).
- Dashboards take effort to tune.
- Traces will flood you if you don’t sample wisely (see the sampling sketch below).
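On that last point: if setupOpenTelemetry is built on @opentelemetry/sdk-node (an assumption on my part), head-based sampling is a small change. The sketch below keeps roughly 10% of new traces while downstream services follow their parent’s decision:
// sampling sketch: keep ~10% of new traces, respect the caller's sampling decision
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  serviceName: 'gateway-service',
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
  // ...traceExporter and instrumentations as before
});
sdk.start();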
But compared to where I started — debugging one service at a time with logs — this feels like turning the lights on in a dark room.
If you’re stepping into Cloud + Distributed Systems, start here. Build the Collector pipeline. Send everything through it. Then connect the three musketeers.
And if you’ve already done this — what were your biggest pitfalls or “aha” moments? Tell me here