Multi-Service Observability Setup with OpenTelemetry, Grafana, Loki, Tempo and Prometheus

Observability and Monitoring done right

For most of my career, I was the “code-first” engineer. I lived in the IDE, wrote features, debugged issues with logs, and occasionally patched together CI/CD pipelines when a team didn’t have a dedicated DevOps person.

That worked…until distributed systems entered my world. When I talk about distributed systems in this case, I mean managing the entire infrastructure setup and monitoring it end-to-end.

Suddenly, a failed request could hop across four different microservices, hit Kafka, touch Postgres or Redis, and vanish without a trace. Debugging with logs alone was like trying to find a needle in five different haystacks.

That was my turning point. I needed observability — not just logs, but also metrics and traces.

For my setup, I needed a way to monitor and observe the system while also optimising for reusability and easy debugging.

The initial idea, and probably what most people do, is to have every service export logs directly to Loki, expose a /metrics endpoint for Prometheus to scrape, and then add OpenTelemetry for traces. Given the number of services in our backend stack, this did not quite sit well with me. I can be lazy most times, and that in itself is a good thing because it forces me to look for the most efficient and resilient way to solve problems, so that I do not have to come back and fix the same issue again. And if something does need fixing, I would rather the problem not be scattered all over the place, which would mean a lot of context-switching.

I went with a better idea…the OpenTelemetry Collector.

Why OpenTelemetry Collector at the Center

Two months before this journey, I tried to wire Loki, Prometheus, and Tempo directly into the services. The result?

  • Loki worked.
  • Prometheus worked partially.
  • Tempo? Absolutely refused.

As a perfectionist (sometimes), I wasn’t going to settle for half measures. I shelved the effort.

But this time, I approached it differently. Instead of coupling each service to three different observability tools, I made OpenTelemetry Collector (OTEL Collector) the central gateway.

Why this design?

  • Decoupling → services export once (OTLP), Collector fans out.
  • Flexibility → swap or add backends without touching app code.
  • Consistency → all services speak the same language.

In other words: do observability right.

Getting Started

Before we get started, you need to have Docker and Docker Compose installed; everything in this guide runs as containers.

If you intend to follow along in this guide, here is the directory structure that I am currently using:

.
├── config
│   ├── loki-config.yaml
│   ├── otel-collector-config.yaml
│   ├── prometheus.yml
│   └── tempo.yaml
├── docker
│   └── docker-compose.yml
└── provisioning
    └── datasources
        └── datasources.yaml

Before we get into each type of observability data, you have to download and run Loki, Prometheus and Tempo, then make them available to Grafana for data visualisation. You also need to set up the most important component, the one that lets data flow from the individual services and become available for visualisation: our OTEL Collector.

I used Docker Compose to configure each of these components, and here is what my docker-compose.yml looks like:

version: "3.9"
services:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - ../provisioning:/etc/grafana/provisioning
    depends_on:
      - loki
    networks:
      - observability-network
  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    command: ["-config.file=/etc/loki/config.yaml"]
    volumes:
      - ../config/loki-config.yaml:/etc/loki/config.yaml:ro
      - loki-data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability-network
  tempo:
    image: grafana/tempo:latest
    container_name: tempo
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ../config/tempo.yaml:/etc/tempo.yaml
      - ./tempo/data:/var/tempo
    ports:
      - "3200:3200" # query
      # - "4318:4318" # OTLP http ingest
    networks:
      - observability-network
  otel-collector:
    container_name: otel-collector
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    environment:
      # Ensure the Collector's own SDK doesn't try to export via OTLP
      OTEL_TRACES_EXPORTER: none
      OTEL_METRICS_EXPORTER: none
      OTEL_LOGS_EXPORTER: none
      # (Optional) belt-and-suspenders: clear any inherited endpoint
      OTEL_EXPORTER_OTLP_ENDPOINT: ""
    volumes:
      - ../config/otel-collector-config.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317" # OTLP gRPC in
      - "4318:4318" # OTLP HTTP in
      - "9464:9464" # Prometheus exposition (from received metrics)
      - "8888:8888" # Collector's own metrics
    restart: unless-stopped
    depends_on:
      - loki
    networks:
      - observability-network
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ../config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    ports:
      - "9090:9090"
    depends_on:
      - otel-collector
    networks:
      - observability-network

volumes:
  loki-data:

networks:
  observability-network:
    name: observability-network
    driver: bridge

This Docker Compose file sets up an observability stack with Grafana on port 3000, Loki on port 3100, Tempo on port 3200, and the OpenTelemetry Collector. The OTEL Collector is central here:

  • 4317 (OTLP gRPC in) → receives telemetry (traces, metrics, logs) in gRPC format.
  • 4318 (OTLP HTTP in) → receives telemetry over HTTP for clients that don’t use gRPC.
  • 9464 (Prometheus exposition) → exposes metrics scraped by Prometheus from the Collector itself or translated from other sources.
  • 8888 (Collector’s own metrics/health) → provides internal diagnostics about the Collector’s performance and health.

All services are linked via the observability-network, with Loki data persisted in volumes. The Collector essentially acts as the traffic hub, receiving telemetry from applications and routing it to Loki, Tempo, Prometheus, and Grafana for unified observability.

Run docker compose up -d from the docker directory and let’s get started.

Open Telemetry Collector Config

# config/otel-collector-config.yaml
receivers:
  # Accept OTLP from your apps for all three signals
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Prevents OOMs; tune these limits based on the memory available to the Collector
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 25
  batch:
    timeout: 1s
    send_batch_size: 10

exporters:
  # For quick human inspection - you can always remove later in prod.
  debug:
    verbosity: detailed
  # Expose Prometheus-scrape endpoint with OpenMetrics
  prometheus:
    endpoint: 0.0.0.0:9464
    enable_open_metrics: true
    send_timestamps: true
    metric_expiration: 5m
    resource_to_telemetry_conversion:
      enabled: true # ← this makes resource attrs become Prom labels

    # If we later prefer push instead of scrape:
  # prometheusremotewrite:
  #   endpoint: http://mimir:9009/api/v1/push

  # Traces → Tempo (start with HTTP; switch to gRPC as needed)
  otlphttp/tempo:
    endpoint: http://tempo:4318
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

  # Logs → Loki (native OTLP ingestion at /otlp)
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 100
    timeout: 30s
    read_buffer_size: 512
    write_buffer_size: 512

service:
  telemetry: # Collector's own telemetry (handy for debugging)
    logs:
      level: debug
      encoding: json
    metrics:
      level: detailed
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, prometheus] # or [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, otlphttp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, otlphttp/loki]

This OTEL Collector config defines how telemetry flows through the stack:

  • Receivers: Accept OTLP data on 4317 (gRPC) and 4318 (HTTP) for metrics, traces, and logs from apps.
  • Processors: Use memory_limiter to prevent overloads and batch to optimize throughput before exporting.
  • Exporters:
    • debug → prints data locally for inspection.
    • prometheus (9464) → exposes metrics in Prometheus/OpenMetrics format.
    • otlphttp/tempo (→ Tempo:4318) → sends traces to Tempo.
    • otlphttp/loki (→ Loki:3100/otlp) → sends logs to Loki, with retries, queuing, and buffer tuning for reliability.
  • Service: Defines pipelines:
    • Metrics → OTLP → processed → sent to debug + Prometheus.
    • Traces → OTLP → processed → sent to debug + Tempo.
    • Logs → OTLP → processed → sent to debug + Loki.

In short: this config turns the OTEL Collector into a central router, receiving telemetry, stabilizing it with processors, then exporting metrics to Prometheus, traces to Tempo, logs to Loki, while also exposing its own telemetry for observability.

Before we move fully into the different observability data, I built a library that uses the OpenTelemetry SDKs to export all of this observability data to the OpenTelemetry Collector. It is not much: all it does is expose three functions that look for three different environment variables, which are basically URLs pointing to the OTEL Collector. With the URL, we use the OTEL SDKs to export the observability data that we need to the OTEL Collector.

Here are the function signatures:

// Logs - creates transports; by default it creates a console transport using
// the nest-winston npm library. If process.env.OTEL_EXPORTER_OTLP_LOGS_ENDPOINT
// is set, it also creates an OTLP transport (a custom class that leverages an
// inbuilt class of the library I mentioned earlier).
function setupLogger(serviceName: string): import('@nestjs/common').LoggerService;

// Metrics - uses process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
function setupMetrics(serviceName: string): import('@opentelemetry/api').Meter;

// Traces - sets up instrumentations. I enabled only the ones we need in our
// services, using getNodeAutoInstrumentations from
// @opentelemetry/auto-instrumentations-node. It looks for the env variable
// process.env.OTEL_EXPORTER_OTLP_TRACE_ENDPOINT.
function setupOpenTelemetry(serviceName: string): void;

Apart from logs, if you call either of the other two functions without setting those environment variables, your application breaks. I mean, if you want to observe your services, why don’t you just add those environment variables?

If every container in your docker compose started successfully, those environment variables should be:

# If your app is NOT part of the `observability-network`
OTEL_EXPORTER_OTLP_TRACE_ENDPOINT=http://host.docker.internal:4318/v1/traces
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://host.docker.internal:4318/v1/metrics
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://host.docker.internal:4318/v1/logs

# If your app is part of the `observability-network`
OTEL_EXPORTER_OTLP_TRACE_ENDPOINT=http://otel-collector:4318/v1/traces
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4318/v1/metrics
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://otel-collector:4318/v1/logs
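
With those variables in place, here is roughly what the tracing half of the library does under the hood. This is a minimal sketch, not the exact implementation, and it assumes the 1.x-era OpenTelemetry JS packages (@opentelemetry/sdk-node, @opentelemetry/exporter-trace-otlp-http, @opentelemetry/resources, @opentelemetry/auto-instrumentations-node):

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';

export function setupOpenTelemetry(serviceName: string): void {
  const endpoint = process.env.OTEL_EXPORTER_OTLP_TRACE_ENDPOINT;
  if (!endpoint) {
    // This is the "your application breaks" behaviour mentioned above.
    throw new Error('OTEL_EXPORTER_OTLP_TRACE_ENDPOINT is not set');
  }

  const sdk = new NodeSDK({
    // service.name is what Loki, Tempo and Prometheus will group telemetry by.
    resource: new Resource({ 'service.name': serviceName }),
    traceExporter: new OTLPTraceExporter({ url: endpoint }),
    // Auto-instrument HTTP, Express/Nest, pg, KafkaJS, etc.
    instrumentations: [getNodeAutoInstrumentations()],
  });

  sdk.start();
}

Call it once, as early as possible in the process (ideally before importing the rest of the application), so the auto-instrumentations can patch the modules they need to wrap.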

Part 1: Logs with Loki

I started with logs because they’re the most immediate: if it works, you see results quickly. To confirm that Loki started successfully, run curl -sS http://localhost:3100/ready; it should return the text ready.
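
On the service side, turning logging on is a one-liner in the NestJS bootstrap. A sketch, assuming the setupLogger helper described earlier lives in a hypothetical internal package:

// main.ts - wiring the (hypothetical) setupLogger helper into NestJS
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { setupLogger } from '@internal/observability'; // hypothetical package name

async function bootstrap() {
  const app = await NestFactory.create(AppModule, {
    // Console transport always; the OTLP transport is added when
    // OTEL_EXPORTER_OTLP_LOGS_ENDPOINT is set, so logs also reach the Collector.
    logger: setupLogger('gateway-service'),
  });
  await app.listen(8080);
}
bootstrap();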

Here’s the Loki config for logs:

# config/loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-05-15
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

This Loki config sets up Loki as a log storage and query engine listening on port 3100.

  • Authentication: disabled (auth_enabled: false) for local/dev use.
  • Server: serves HTTP API for log ingestion and queries.
  • Storage: uses the local filesystem (/loki/chunks for log data, /loki/rules for recording/alerting rules).
  • Replication & Ring: single-instance (replication_factor: 1) with in-memory ring state, so no clustering here.
  • Schema: stores logs in TSDB format (v13) with daily index periods for efficient queries.
  • Ruler: integrates with an Alertmanager (on http://localhost:9093) to trigger alerts from log queries.

In short: this config makes Loki a standalone log aggregator, storing logs on disk, exposing them over 3100, and ready to send alerts to Alertmanager.

Caveats

  • Loki doesn’t accept OTLP logs on its regular push API out of the box. You must point the Collector at Loki’s dedicated /otlp endpoint in the exporters section of the Collector config, like so:
exporters:
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
    tls:
      insecure: true

Otherwise, you’ll push logs into a void.

  • Sometimes attributes like service.name don’t show up as labels. You need to make sure they exist on every record (or query results will look empty). In the processors section of your OTEL Collector config, you may need to add an explicit fallback like so:
processors:
  # existing processor configs
  resource:
    attributes:
      - key: service.name
        value: unknown-service
        action: insert # only adds service.name when it is missing

Remember to add resource to the processors list of the logs pipeline, otherwise it never runs.

Part 2: Metrics with Prometheus

After logs, the next step was metrics. Here’s why this mattered: logs told me when a request happened, but not how many, how fast, or how resource-hungry.
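
On the application side, the library’s setupMetrics function boils down to wiring an OTLP metric exporter into a MeterProvider. A rough sketch, assuming the 1.x-era @opentelemetry/sdk-metrics and @opentelemetry/exporter-metrics-otlp-http packages:

import { metrics, Meter } from '@opentelemetry/api';
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { Resource } from '@opentelemetry/resources';

export function setupMetrics(serviceName: string): Meter {
  const endpoint = process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
  if (!endpoint) {
    throw new Error('OTEL_EXPORTER_OTLP_METRICS_ENDPOINT is not set');
  }

  const provider = new MeterProvider({
    resource: new Resource({ 'service.name': serviceName }),
  });

  // Push metrics to the Collector every 10s over OTLP/HTTP; the Collector
  // then re-exposes them on :9464 for Prometheus to scrape.
  provider.addMetricReader(
    new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: endpoint }),
      exportIntervalMillis: 10_000,
    }),
  );

  metrics.setGlobalMeterProvider(provider);
  return provider.getMeter(serviceName);
}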

Prometheus config

# config/prometheus.yml
global:
  scrape_interval: 10s # how often to scrape
  evaluation_interval: 10s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:9464"]

This Prometheus config sets up metrics scraping for the stack:

  • Global settings: scrape and evaluate rules every 10 seconds, ensuring fresh metric collection.
  • Scrape job: defines a single job named otel-collector, targeting otel-collector:9464.
    • Port 9464 is where the OTEL Collector exposes metrics in Prometheus/OpenMetrics format (transformed from incoming telemetry).

In short: this config makes Prometheus continuously scrape the OTEL Collector’s 9464 endpoint, pulling in metrics from applications (via the Collector) for storage, querying, and visualization in Grafana.

Caveat

  • Scrape vs Push confusion
    • Prometheus doesn’t “receive” metrics like Loki or Tempo. It scrapes them.
    • Exposing :9464 in the Collector is critical because that’s the scrape target.
    • If you forget to point Prometheus at this target, metrics vanish silently.

Service metrics to start with

  • requests_total (count of requests by method + path)
  • request_duration_seconds (histogram of latencies)
  • Runtime metrics (CPU, heap usage, GC pauses, event loop lag)

When I was setting this up, runtime metrics like CPU and memory usage showed up instantly. HTTP request metrics took longer, partly because my proxy middleware short-circuited Express internals. If you use proxies, ensure your metrics middleware runs before the request ends, otherwise spans and counters won’t fire.
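
To make requests_total and request_duration_seconds concrete, here is a rough sketch of an Express-style middleware that records them with the Meter returned by setupMetrics. The function and attribute names are my own, not part of any library:

import type { Meter } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';

// Hypothetical helper: registers requests_total and request_duration_seconds
// on the given Meter and records them once per finished request.
export function httpMetricsMiddleware(meter: Meter) {
  const requestsTotal = meter.createCounter('requests_total', {
    description: 'Count of HTTP requests by method and path',
  });
  const requestDuration = meter.createHistogram('request_duration_seconds', {
    description: 'HTTP request latency',
    unit: 's',
  });

  return (req: Request, res: Response, next: NextFunction) => {
    const start = process.hrtime.bigint();
    // Record on 'finish' so the measurement survives proxied requests.
    res.on('finish', () => {
      const seconds = Number(process.hrtime.bigint() - start) / 1e9;
      const attributes = { method: req.method, path: req.route?.path ?? req.path };
      requestsTotal.add(1, attributes);
      requestDuration.record(seconds, attributes);
    });
    next();
  };
}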

Part 3: Traces with Tempo

This was the hardest part. And the most rewarding.

Traces are what tie everything together — the single request flowing across services, through Kafka, into Postgres, and back to the client.

Tempo Config

# config/tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces

compactor:
  compaction:
    block_retention: 24h

Caveat

  • Port conflicts - Both the Collector and Tempo listen on 4318. The solution is to publish that port only on the Collector; Tempo’s 4318 stays internal to the Docker network, and the Collector forwards traces there.
  • Instrumentations - Auto-instrumentations (NestJS, HTTP, Pg, KafkaJS) save a ton of work, but they’re noisy. Add filters (ignore /health, ignore static assets), as in the sketch below.
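
For the filtering, getNodeAutoInstrumentations accepts per-instrumentation config. A sketch of how you might silence health checks and static assets, assuming the hook below from @opentelemetry/instrumentation-http:

import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import type { IncomingMessage } from 'http';

// Passed to the NodeSDK `instrumentations` option (see the earlier sketch).
export const instrumentations = [
  getNodeAutoInstrumentations({
    // Skip spans for health checks and static assets on incoming requests.
    '@opentelemetry/instrumentation-http': {
      ignoreIncomingRequestHook: (req: IncomingMessage) => {
        const url = req.url ?? '';
        return url === '/health' || url.startsWith('/static/');
      },
    },
    // fs spans tend to be pure noise in a web service.
    '@opentelemetry/instrumentation-fs': { enabled: false },
  }),
];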

Grafana – the Single Pane of Glass

After wiring up Loki for logs, Prometheus for metrics, and Tempo for traces, I still needed a single place to see everything. That’s where Grafana came in.

Grafana isn’t about collecting data; it’s about making sense of it. It pulls from Loki, Prometheus, and Tempo, then lets you query, correlate, and visualize.

Grafana provisioning config

# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200

This tells Grafana about the three data sources in the stack:

  • Loki → for querying logs ({service_name="gateway-service"}).
  • Prometheus → for metrics (request rates, latencies, runtime stats).
  • Tempo → for distributed traces (spans across microservices).

Caveat

  • Datasource URLs must match your Docker/K8s network.
    • If Grafana can’t resolve http://loki:3100, you’ll see silent failures.
    • Always use service names (like loki, prometheus, tempo) instead of localhost. Inside Docker/K8s, localhost refers to the container itself, not the service.
  • Tempo port confusion: Grafana queries Tempo on :3200, not :4318. :4318 is the OTLP ingest port, :3200 is the query port.

Why Grafana made it real

The moment I wired Grafana, the observability puzzle clicked:

  • Logs: I could filter by service, path, or even custom attributes.
  • Metrics: dashboards showed CPU/memory spikes alongside traffic.
  • Traces: I could follow a request across gateway → user-service → database.

Before, each tool worked in isolation. With Grafana, you have a unified observability cockpit.


Two months ago, I gave up because Tempo didn’t show anything. The fix wasn’t in the code — it was in understanding the OTLP pipeline. Once the Collector sent spans with the right resource attributes, Grafana lit up beautifully.

Here is how everything looks when it all comes together:

┌──────────────────────────────┐
│         Applications         │
│ (send metrics, logs, traces) │
└──────────────┬───────────────┘
               │ OTLP gRPC/HTTP → 4317/4318
               ▼
┌──────────────────────────────┐
│   OpenTelemetry Collector    │
│ - receives telemetry         │
│ - processes (batch, memory)  │
│ - exposes metrics on :9464   │
│ - internal metrics on :8888  │
└──────────────┬───────────────┘
               │
    ┌──────────┼────────────┐
    │ logs     │ traces     │ metrics (Prometheus scrapes :9464 every 10s)
    ▼          ▼            ▼
┌───────┐ ┌─────────┐ ┌────────────┐
│ Loki  │ │  Tempo  │ │ Prometheus │
│ :3100 │ │  :3200  │ │   :9090    │
└───┬───┘ └────┬────┘ └─────┬──────┘
    │          │            │
    └──────────┼────────────┘
               ▼
        ┌─────────────┐
        │   Grafana   │
        │    :3000    │
        │  Dashboards │
        └─────────────┘

Logs Path:
Apps → OTEL Collector → Loki (3100) → Grafana

Traces Path:
Apps → OTEL Collector → Tempo (4318 ingest / 3200 query) → Grafana

Metrics Path:
Apps → OTEL Collector → Prometheus (9464 scrape) → Grafana

Closing Thoughts

With Loki, Prometheus, and Tempo wired through the OTEL Collector, I now have:

  • Logs → searchable, labeled by service.
  • Metrics → counts, latencies, resource usage.
  • Traces → end-to-end visibility across microservices.

The result is a holy grail of observability.

Not perfect, but close:

  • Collector is a single point of failure (you’ll want HA in prod).
  • Dashboards take effort to tune.
  • Traces will flood you if you don’t sample wisely.

But compared to where I started — debugging one service at a time with logs — this feels like turning the lights on in a dark room.

If you’re stepping into Cloud + Distributed Systems, start here. Build the Collector pipeline. Send everything through it. Then connect the three musketeers.

And if you’ve already done this — what were your biggest pitfalls or “aha” moments? Tell me here

Resources