Observability is what lets you answer questions you did not know you would need to ask (why did one user's checkout take 14 seconds at 3:42 a.m.?) without shipping new code. OpenTelemetry is the vendor-neutral way to instrument once and send the resulting traces, metrics, and logs to whichever backend you (or your CFO) prefer. This guide walks a real Node.js service through the whole setup, including the production gotchas you tend to hit first.

TL;DR

Monitoring tells you that something broke; observability tells you why. The join key is the trace ID.

The three pillars of OTel (traces, metrics, logs) only pay off when they are correlated; never let the trace ID drop at a boundary.

Auto-instrumentation gets you 80% of the value for 5% of the effort; add manual spans only for business operations.

Always run a Collector between your services and the backend so you can change vendors, sampling, or PII rules without redeploying.

High-cardinality IDs (user, order, request) belong on spans, not on metric labels; putting them on labels is how you bankrupt your metrics backend.

Use tail-based sampling that keeps 100% of errors and slow requests; sample normal traffic at only 1–5%.

The telemetry bootstrap must be the first import, otherwise auto-instrumentation silently does nothing.

What Observability Actually Means

Monitoring is knowing that a system is broken. Observability is being able to ask why, without shipping new code. The distinction matters because the tools you reach for are different.

A monitoring stack answers questions you knew to ask in advance: CPU, memory, 5xx rate, queue depth. An observability stack lets you answer questions you did not anticipate: why did this one user's checkout take 14 seconds at 3:42 a.m. yesterday? The first is dashboards. The second is high-cardinality event data that you can slice any way you like.

OpenTelemetry (OTel) is the vendor-neutral SDK and wire format that makes the second kind of system possible without tying you to a single backend. You instrument once and send data to Tempo, Jaeger, Honeycomb, Datadog, New Relic, or three of them at the same time. That decoupling is the whole point.

This post is the concrete follow-up to an earlier article that covered the monitoring stack at a high level. Here we write real code: a Node.js service instrumented end to end, plus the production gotchas I wish I had known the first time.

The Three Pillars, and Why Correlation Beats Any Single Pillar

Every observability talk repeats the same three pillars:

Traces: the path of a single request through every service, with timing at each hop
Metrics: aggregated numerical time series (counters, gauges, histograms)
Logs: timestamped events, usually text or structured JSON

What the talks tend to undersell is that the three pillars are only useful when they link together. A trace on its own tells you that a request was slow. A log on its own tells you that an error happened. The join (“give me the logs and the DB span from the same request as this slow trace”) is what turns three data sources into a debugging superpower.

The join key is the trace ID. Every log line, every metric exemplar, and every span carries the same 128-bit trace ID for a given request. If your instrumentation drops that ID anywhere (at a queue boundary, across an async job, in a structured logger), you have broken the one thing that makes observability observability.

Keep the trace ID flowing. The rest is detail.
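
To make the join concrete, here is a minimal sketch of what a backend conceptually does with the trace ID. The SpanRecord and LogRecord shapes and the correlate function are hypothetical stand-ins invented for illustration; they are not OTel types, and real backends index far more than this.

```typescript
// Hypothetical record shapes for illustration only.
interface SpanRecord { traceId: string; name: string; durationMs: number }
interface LogRecord { traceId?: string; message: string }

interface CorrelatedTrace { spans: SpanRecord[]; logs: LogRecord[] }

// Join spans and logs that share a trace ID.
// Logs without a trace ID cannot be joined, so they are returned separately.
function correlate(spans: SpanRecord[], logs: LogRecord[]) {
  const byTrace = new Map<string, CorrelatedTrace>();
  const orphans: LogRecord[] = [];

  const bucket = (traceId: string): CorrelatedTrace => {
    let entry = byTrace.get(traceId);
    if (!entry) {
      entry = { spans: [], logs: [] };
      byTrace.set(traceId, entry);
    }
    return entry;
  };

  for (const span of spans) bucket(span.traceId).spans.push(span);
  for (const log of logs) {
    if (log.traceId) bucket(log.traceId).logs.push(log);
    else orphans.push(log);
  }
  return { byTrace, orphans };
}
```

The orphans list is the point of the sketch: any log line that lost its trace ID falls out of the join and has to be found by timestamp and guesswork instead.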

OTel Architecture: SDK, Collector, Exporters

The OTel data path has three layers. The SDK lives inside your application and produces the signals. The Collector is an optional (but strongly recommended) sidecar or daemon that receives, batches, and forwards them. The backend is where you actually store and query the data.

flowchart LR
    A[Your Service<br/>OTel SDK] -->|OTLP/gRPC| B[OTel Collector]
    C[Another Service<br/>OTel SDK] -->|OTLP/gRPC| B
    D[Kafka Consumer<br/>OTel SDK] -->|OTLP/gRPC| B
    B --> E[Tempo<br/>Traces]
    B --> F[Prometheus<br/>Metrics]
    B --> G[Loki<br/>Logs]
    B --> H[Honeycomb / Datadog<br/>SaaS]
    E & F & G --> I[Grafana]

Why a Collector: it decouples your app from the backend. Switching vendors, adding sampling rules, stripping PII, or fanning out to two backends during a migration all happen in Collector config, with no service redeploy. Ship without a Collector and every one of those changes becomes an application release.

The wire format between the SDK and the Collector is OTLP, the OpenTelemetry Protocol, which typically runs over gRPC. It is the one piece of the stack the whole ecosystem agrees on.

Auto-Instrumentation vs. Manual Spans

OTel ships auto-instrumentation packages that monkey-patch popular libraries: Express, Fastify, HTTP, pg, Redis, ioredis, Kafka, the AWS SDK, and more. Install, require, done: you get HTTP server spans, outbound HTTP client spans, and DB query spans, all correlated, with almost no code.

Auto-instrumentation is the right default. It gets you 80% of the value for 5% of the effort, and it keeps up as the ecosystem changes.

Manual spans are for the 20% that auto-instrumentation cannot see:

Business operations: “process checkout”, “reconcile invoice”, “generate report”. These are logical units, not library calls.
Non-trivial in-process work: resource-hungry loops, CPU-bound transforms, cache warmups
Internal subsystems with no library behind them: your own job queue, your own RPC wrapper

The anti-pattern is wrapping every function in a span. Each span has a cost, both on the wire and in the backend. Instrument what a human would care about on a trace waterfall, not every function in the call stack.

Step-by-Step: Instrumenting a Node.js Service

Let's instrument a real service. The example uses Express, but the setup is the same for Fastify, Koa, NestJS, or anything else, because OTel hooks the underlying HTTP module.

Install

npm install \
  @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-metrics-otlp-grpc \
  @opentelemetry/exporter-logs-otlp-grpc \
  @opentelemetry/sdk-metrics \
  @opentelemetry/sdk-logs \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Bootstrap file

Create src/telemetry.ts. It must be imported before every other module: auto-instrumentation works by patching require/import, so anything loaded before OTel is invisible to it.

// src/telemetry.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-grpc";
import { OTLPLogExporter } from "@opentelemetry/exporter-logs-otlp-grpc";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { BatchLogRecordProcessor } from "@opentelemetry/sdk-logs";
import { resourceFromAttributes } from "@opentelemetry/resources";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
  ATTR_DEPLOYMENT_ENVIRONMENT_NAME,
} from "@opentelemetry/semantic-conventions/incubating";

const endpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4317";

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME ?? "checkout-api",
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION ?? "0.0.0",
    [ATTR_DEPLOYMENT_ENVIRONMENT_NAME]: process.env.NODE_ENV ?? "development",
    "service.instance.id": process.env.HOSTNAME ?? "local",
  }),
  traceExporter: new OTLPTraceExporter({ url: endpoint }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: endpoint }),
    exportIntervalMillis: 15_000,
  }),
  logRecordProcessors: [
    new BatchLogRecordProcessor(new OTLPLogExporter({ url: endpoint })),
  ],
  instrumentations: [getNodeAutoInstrumentations({
    // HTTP instrumentation is noisy by default — filter health checks.
    "@opentelemetry/instrumentation-http": {
      ignoreIncomingRequestHook: (req) =>
        req.url === "/health" || req.url === "/metrics",
    },
    // fs is almost always noise.
    "@opentelemetry/instrumentation-fs": { enabled: false },
  })],
});

sdk.start();

process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});

# telemetry.py — FastAPI equivalent
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
import os

def init_telemetry(app):
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    resource = Resource.create({
        "service.name": os.environ.get("OTEL_SERVICE_NAME", "checkout-api"),
        "service.version": os.environ.get("APP_VERSION", "0.0.0"),
        "deployment.environment.name": os.environ.get("ENV", "development"),
    })

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
    trace.set_tracer_provider(provider)

    reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=endpoint), export_interval_millis=15_000)
    metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

    FastAPIInstrumentor.instrument_app(app)
    RequestsInstrumentor().instrument()
    Psycopg2Instrumentor().instrument()

The model is the same in both languages: set a resource that identifies the service, configure the exporters, and enable the auto-instrumentations you actually use.

Wire It into the Entry Point

The entry point must import telemetry first. Anything loaded before it will not be patched:

// src/index.ts
import "./telemetry";        // MUST be first import
import express from "express";
import { checkoutRouter } from "./routes/checkout";

const app = express();
app.use(express.json());
app.use("/api/checkout", checkoutRouter);
app.get("/health", (_req, res) => res.json({ ok: true }));

app.listen(3000, () => console.log("listening on 3000"));

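If you use a bundler or transpiler, check the emitted code: some toolchains reorder or hoist imports (an import-sorting step that orders them alphabetically is a common culprit), which silently disables auto-instrumentation. The safest fix is node --require ./dist/telemetry.js dist/index.js: loading through --require guarantees the telemetry module runs before the entry file, whatever order its imports end up in. Native ESM builds typically also need an ESM loader hook before patching takes effect, so check the OTel JS documentation for your Node version.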

Traceparent Propagation (It Already Works)

OTel configures the W3C traceparent header propagator by default. Every HTTP call you make through fetch, axios, or http automatically carries a header like this:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The downstream service's auto-instrumentation reads that header and continues the same trace. End-to-end correlation across services happens with no extra code, provided every service in the chain runs OTel and you do not strip the header at a gateway. Check your reverse proxy config.
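
For debugging a proxy that mangles the header, it helps to know what the four dash-separated fields mean: version, trace ID, parent span ID, and trace flags. The parser below is a minimal sketch of the version 00 layout from the W3C Trace Context format. The parseTraceparent name and the TraceParent type are made up for this example; in a real service the OTel propagator does this for you.

```typescript
// Parsed form of a W3C traceparent header (version 00 layout).
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars (128 bits)
  parentId: string; // 16 lowercase hex chars (64 bits)
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything malformed, so a caller can start a fresh trace.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const version = match[1] ?? "";
  const traceId = match[2] ?? "";
  const parentId = match[3] ?? "";
  const flags = match[4] ?? "";
  // Version ff is reserved, and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Run against the header shown above, this yields trace ID 4bf92f3577b34da6a3ce929d0e0e4736, parent span ID 00f067aa0ba902b7, and sampled set to true (flags 01).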

Custom Spans and Attributes

Auto-instrumentation gives you HTTP spans and DB query spans. What it cannot give you is what your code was trying to do. That is where manual spans earn their keep.

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout");

export async function processCheckout(order: Order) {
  return tracer.startActiveSpan("checkout.process", async (span) => {
    span.setAttributes({
      "checkout.order_id": order.id,
      "checkout.customer_id": order.customerId,
      "checkout.item_count": order.items.length,
      "checkout.amount_cents": order.totalCents,
      "checkout.currency": order.currency,
    });

    try {
      const inventory = await reserveInventory(order);
      const payment = await chargeCustomer(order);
      await persistOrder(order, payment.id);
      span.setAttribute("checkout.payment_id", payment.id);
      return { ok: true, paymentId: payment.id };
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (err as Error).message,
      });
      throw err;
    } finally {
      span.end();
    }
  });
}

Three things this code does that matter:

Business context as attributes. order_id and customer_id let you pull every span for a given user or order. That is exactly the high-cardinality query observability exists for.
Record the exception. span.recordException attaches the stack as a span event; setStatus(ERROR) turns the span red in the trace viewer. Never throw out of an instrumented function without doing both.
end() in finally. A span that is never ended is never exported and leaks, sitting in memory until the process dies. Always close it in finally.
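
Because the record-exception, set-status, end-in-finally sequence repeats in every manual span, it can be factored into a small helper. The sketch below is dependency-free: SpanLike is a local stand-in for the three Span methods used here (the assumption being that the real Span from @opentelemetry/api is structurally compatible), and withSpan is a hypothetical helper name, not an OTel API.

```typescript
// Local stand-in for the subset of the OTel Span API used here (an assumption,
// not the real type).
interface SpanLike {
  recordException(err: Error): void;
  setStatus(status: { code: number; message?: string }): void;
  end(): void;
}

// Mirrors the numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

// Runs fn, records any failure on the span, and always ends the span.
async function withSpan<T>(span: SpanLike, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    span.recordException(error);
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    throw err; // rethrow so callers still see the failure
  } finally {
    span.end(); // runs on success and on failure
  }
}
```

With a helper like this, the body of a startActiveSpan callback could shrink to a single withSpan(span, () => doTheWork()) call, which makes it harder to forget the finally.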

Enriching DB Queries

Auto-instrumentation for pg gives you one span per query with the SQL text. In production you usually want more: the logical operation, cache hit/miss, row count.

async function findActiveOrdersForCustomer(customerId: string) {
  return tracer.startActiveSpan("db.orders.find_active", async (span) => {
    span.setAttribute("db.operation.name", "find_active");
    span.setAttribute("customer.id", customerId);

    try {
      // Cache first: a hit skips the database entirely.
      const cached = await redis.get(`orders:${customerId}`);
      if (cached) {
        span.setAttribute("cache.hit", true);
        return JSON.parse(cached);
      }
      span.setAttribute("cache.hit", false);

      const rows = await pg.query(
        "SELECT * FROM orders WHERE customer_id = $1 AND status = 'active'",
        [customerId],
      );
      span.setAttribute("db.rows_returned", rows.rowCount ?? 0);
      await redis.setex(`orders:${customerId}`, 60, JSON.stringify(rows.rows));
      return rows.rows;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      throw err;
    } finally {
      // End the span on every path, including when redis or pg throws.
      span.end();
    }
  });
}

Now the trace waterfall shows not just “this query took 42ms” but “cache miss, 142 rows, customer abc-123”. You can slice the error rate by cache.hit=false, or find every query that returned 0 rows.

Attribute Naming: Follow Semconv

OTel has semantic conventions (semconv): standard attribute names such as http.request.method, db.system.name, and messaging.destination.name. Use them. If you invent your own httpMethod, yours will be the one team whose data does not work with upstream Grafana panels, and the built-in analyses in your backend will not work either.

For your own business attributes, namespace by domain: checkout.*, billing.*, user.*. Keep the schema in a document your team can find.
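
One way to keep the namespace consistent is to build business attribute keys through a single helper instead of typing them by hand. The following is a small sketch under that assumption; the namespaced function is hypothetical and the snake_case rule is a house-style choice, not something OTel mandates.

```typescript
type AttributeValue = string | number | boolean;

// Prefix business attributes with a domain namespace and snake_case the keys,
// so ("checkout", { orderId: "o-1" }) yields a "checkout.order_id" key.
function namespaced(
  domain: string,
  attrs: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const out: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(attrs)) {
    // Insert "_" at each lower-to-upper boundary, then lowercase everything.
    const snake = key.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
    out[`${domain}.${snake}`] = value;
  }
  return out;
}
```

The result can be passed straight to span.setAttributes, and the helper gives you one place to enforce the schema document.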

Metrics: Counters, Histograms, Cardinality

Metrics are for questions that must aggregate cheaply across billions of events: “what is the p99 latency of /checkout?” You do not want to scan every trace to answer that; you want a pre-aggregated histogram.

import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("checkout");

const checkoutCounter = meter.createCounter("checkout.requests", {
  description: "Total checkout requests",
  unit: "1",
});

const checkoutDuration = meter.createHistogram("checkout.duration", {
  description: "Checkout processing duration",
  unit: "ms",
});

const activeCarts = meter.createUpDownCounter("checkout.active_carts", {
  description: "Currently active cart sessions",
  unit: "1",
});

export async function instrumentedCheckout(order: Order) {
  const start = performance.now();
  checkoutCounter.add(1, { currency: order.currency });
  activeCarts.add(1);
  try {
    const result = await processCheckout(order);
    checkoutDuration.record(performance.now() - start, {
      currency: order.currency,
      outcome: "success",
    });
    return result;
  } catch (err) {
    checkoutDuration.record(performance.now() - start, {
      currency: order.currency,
      outcome: "error",
    });
    throw err;
  } finally {
    activeCarts.add(-1);
  }
}
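
To see why a pre-aggregated histogram answers the p99 question cheaply, here is a minimal sketch of the idea behind one. BucketHistogram is a toy written for this post, not the OTel SDK implementation, and the bucket bounds in the usage note are arbitrary assumptions. The point is that memory stays proportional to the number of buckets no matter how many values are recorded.

```typescript
// A fixed-bucket histogram: it stores one counter per bucket, not the raw values.
class BucketHistogram {
  private readonly bounds: number[];
  private readonly counts: number[];
  private total = 0;

  // bounds are inclusive upper bounds in ascending order;
  // one extra overflow bucket catches everything above the last bound.
  constructor(bounds: number[]) {
    this.bounds = [...bounds];
    this.counts = new Array<number>(bounds.length + 1).fill(0);
  }

  record(value: number): void {
    let i = this.bounds.findIndex((b) => value <= b);
    if (i === -1) i = this.bounds.length; // overflow bucket
    this.counts[i] = (this.counts[i] ?? 0) + 1;
    this.total += 1;
  }

  // Upper bound of the bucket that contains the q-th quantile
  // (Infinity when it falls in the overflow bucket, NaN when empty).
  quantileUpperBound(q: number): number {
    if (this.total === 0) return NaN;
    const rank = Math.ceil(q * this.total);
    let seen = 0;
    for (let i = 0; i < this.counts.length; i++) {
      seen += this.counts[i] ?? 0;
      if (seen >= rank) return this.bounds[i] ?? Infinity;
    }
    return Infinity;
  }
}
```

With bounds of [10, 50, 100, 500] milliseconds, the answer is only as precise as the buckets: you learn that p99 is at most 100 ms, not that it is 80 ms. That loss of precision is the price of not scanning every event.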

The Cardinality Trap

Every unique combination of attribute values on a metric is one time series. currency has roughly 10 values and outcome has 2, for 20 time series in total. Fine.

Adding customer_id to a histogram is how you destroy your metrics backend. Ten million customers times the 20 other combinations is 200 million series. Prometheus dies; SaaS vendors charge you thousands of dollars a day.

The rule: if an attribute can take more than a few hundred values, it does not belong on a metric. High-cardinality dimensions belong on spans, which can be queried without pre-aggregation. Customer ID, order ID, request ID: these are trace attributes, not metric labels.
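
The arithmetic is easy to automate as a review aid. This is a minimal sketch: estimateSeries and overBudgetLabels are hypothetical helpers, and the default budget of 300 values per label is just one reading of “a few hundred”.

```typescript
// Worst-case number of time series for one metric: the product of the
// number of distinct values each label can take.
function estimateSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Labels whose cardinality exceeds the per-label budget (assumed default: 300).
function overBudgetLabels(
  labelCardinalities: Record<string, number>,
  maxValuesPerLabel = 300,
): string[] {
  return Object.entries(labelCardinalities)
    .filter(([, n]) => n > maxValuesPerLabel)
    .map(([label]) => label);
}
```

For the example above, { currency: 10, outcome: 2 } gives 20 series, and adding customer_id at 10,000,000 values gives 200,000,000, with customer_id flagged as the only over-budget label. For a histogram the real count is higher still, since each bucket is typically stored as its own series.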

Good metric labels: route, method, status_code, outcome, region, tenant_tier. Bad metric labels: user_id, order_id, session_id, trace_id, request_path (if it contains IDs).
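
If the only thing available is the raw request path, one option is to collapse ID-like segments before using it as a label. The sketch below assumes IDs are either all digits or UUIDs; normalizeRoute and its patterns are illustrative, and a framework's own route template (where it exposes one) is usually the better source.

```typescript
// Patterns for ID-like path segments. These are assumptions for illustration;
// adjust them to the ID formats your service actually uses.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_RE = /^\d+$/;

// Replace ID-like segments with placeholders so the label stays low-cardinality.
function normalizeRoute(path: string): string {
  const withoutQuery = path.split("?")[0] ?? "";
  return withoutQuery
    .split("/")
    .map((segment) => {
      if (NUMERIC_RE.test(segment)) return ":id";
      if (UUID_RE.test(segment)) return ":uuid";
      return segment;
    })
    .join("/");
}
```

With this, /api/orders/12345/items?page=2 becomes /api/orders/:id/items, so every order shares one series instead of creating its own.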

Structured Logs with Trace Correlation

A log line that does not carry a trace ID is a log line you cannot join to anything. The goal is to ship structured JSON logs whose trace_id and span_id fields match what is in the traces backend.

Using pino:

import pino from "pino";
import { trace, context } from "@opentelemetry/api";

export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  formatters: {
    log(obj) {
      const span = trace.getSpan(context.active());
      if (span) {
        const ctx = span.spanContext();
        return { ...obj, trace_id: ctx.traceId, span_id: ctx.spanId };
      }
      return obj;
    },
  },
});

Now every log call made inside an active span carries the trace ID automatically. In Grafana or your log aggregator, the “view logs for this trace” button just works.

If you want to ship logs over OTLP (the OTel logs API) instead of stdout, you can wire pino to the OTel logs exporter through the @opentelemetry/instrumentation-pino package. Unless your infrastructure forces it, though, stdout plus a log shipper (Fluent Bit, Vector, Promtail) is simpler and easier to debug.

What Should Be a Log and What Should Be a Span

Rule of thumb:

Span attributes: structured data about the operation (order_id, row count, cache hit)
Span events: point-in-time markers inside a span (“retry attempted”, “rate limit hit”)
Logs: narrative you want to read as text, warnings, errors, operational events (“leader election complete”, “connection pool exhausted”)

When you find yourselves debating “should this be a log or a span attribute?”: if it describes the operation, it is an attribute. If it describes something that happened during the operation, it is a span event. If it describes something about the process as a whole, it is a log.

Context Propagation: HTTP, Kafka, Async Jobs

HTTP propagation is automatic. Everything else takes work.

Across HTTP: Already Done

Every outbound fetch/axios/http call inside an active span gets traceparent injected. Every inbound request gets it extracted. Zero code.

Across Kafka

The auto-instrumentation package for kafkajs injects traceparent into message headers on produce and extracts it on consume. Install it and it works. For a custom messaging layer, the pattern is:

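import { propagation, context, trace } from "@opentelemetry/api";

// Producer
const tracer = trace.getTracer("orders");
await tracer.startActiveSpan("kafka.send orders", async (span) => {
  try {
    const carrier: Record<string, string> = {};
    propagation.inject(context.active(), carrier);
    await producer.send({
      topic: "orders",
      messages: [{
        value: JSON.stringify(order),
        headers: carrier, // includes traceparent
      }],
    });
  } finally {
    span.end(); // end the span even if send() throws
  }
});

// Consumer
// Header values often arrive as Buffers (kafkajs does this). The default
// propagator only reads string values, so decode them before extracting.
const incoming: Record<string, string> = {};
for (const [key, value] of Object.entries(message.headers ?? {})) {
  if (value !== undefined) incoming[key] = value.toString();
}
const parentCtx = propagation.extract(context.active(), incoming);
await context.with(parentCtx, async () => {
  await tracer.startActiveSpan("kafka.process orders", async (span) => {
    try {
      await handleOrder(JSON.parse(message.value!.toString()));
    } finally {
      span.end();
    }
  });
});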

propagation.inject serializes the current trace context into any carrier object. propagation.extract does the reverse. Use this pattern for any transport that no library covers.

Across Async Jobs and Cron

A job kicked off by a cron scheduler starts a new trace by default, since it has no inbound HTTP context. That is usually correct: a nightly reconciliation is not a continuation of a user request. But if the job is kicked off by a user action (say, “generate my report and email me when it is done”), you want the trace to cover both the original request and the job that follows.

There are two approaches:

Links. Store the originating span context (trace ID and span ID) with the job. When the job runs, it creates a new root span but adds a Link pointing at the original request's span. The backend renders the link as “view related trace”.
Carried context. Serialize traceparent into the job payload, extract it when the job starts, and continue the same trace. This works best when the job runs soon after enqueue; beyond a few minutes, most backends treat the trace as stale.

Choose based on how long the delay is and whether you want one trace or two linked traces.
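
That choice can be made at the moment the job starts, once the actual delay is known. Below is a minimal sketch of such a decision. The JobPayload shape, the chooseCorrelation function, and the five-minute threshold are all assumptions for illustration; the payload simply carries the traceparent string captured at enqueue time, which is enough for either approach.

```typescript
// Hypothetical job payload: traceparent is captured when the job is enqueued.
interface JobPayload {
  kind: string;
  traceparent?: string;
  enqueuedAtMs: number;
}

type Correlation =
  | { mode: "continue"; traceparent: string } // same trace as the request
  | { mode: "link"; traceparent: string }     // new root span plus a Link
  | { mode: "new-root" };                     // nothing to correlate with

// Assumed threshold: past a few minutes, prefer a link over one very long trace.
const MAX_CONTINUE_DELAY_MS = 5 * 60 * 1000;

function chooseCorrelation(job: JobPayload, nowMs: number): Correlation {
  if (!job.traceparent) return { mode: "new-root" };
  const delayMs = nowMs - job.enqueuedAtMs;
  return delayMs <= MAX_CONTINUE_DELAY_MS
    ? { mode: "continue", traceparent: job.traceparent }
    : { mode: "link", traceparent: job.traceparent };
}
```

In the "continue" case the worker would extract the traceparent and run inside that context; in the "link" case it would start a fresh root span and attach a Link built from the same value.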

Exporters: OTLP to the Collector, and Choosing a Backend

The SDK needs to know exactly one endpoint: the Collector. Everything else is Collector configuration, which is where you choose vendors.

sequenceDiagram
    participant App as Service (SDK)
    participant Col as OTel Collector
    participant Tempo as Tempo
    participant HC as Honeycomb
    participant DD as Datadog

    App->>Col: OTLP/gRPC (traces, metrics, logs)
    Col->>Col: batch, sample, scrub PII
    par fanout
        Col->>Tempo: OTLP
    and
        Col->>HC: OTLP
    and
        Col->>DD: Datadog exporter
    end

A backend shortlist, with tradeoffs:

Tempo + Grafana + Loki + Prometheus. Self-hosted, open source, unified in one UI. Cheap at any scale once you have accepted the ops cost. Best for teams that already run Grafana and have a platform engineer or two.
Jaeger. Traces only, battle-tested, simple. Pair it with Prometheus and Loki if you want the other pillars. A good choice for small teams that want to operate as little as possible.
Honeycomb. The clearest product model for high-cardinality trace analysis. BubbleUp and the query engine are genuinely different. You pay for it.
Datadog. A one-stop shop for teams that want traces, metrics, logs, RUM, and synthetic checks on one bill. Pricing climbs fast with cardinality, so watch custom metrics.
Grafana Cloud. Tempo/Loki/Mimir as a managed service. The middle ground between self-hosted and Honeycomb/Datadog.

Because you instrumented with OTel, changing your mind is a Collector config edit. Do not let backend choice block getting started: pick one, ship, and switch if it does not work out.

What Breaks in Production

Five failure modes I have run into, ordered by how often they bite:

1. Sampling too aggressive (or not aggressive enough). At 1 req/sec you keep everything. At 10,000 req/sec you cannot, because the backend bill balloons. The right pattern is tail-based sampling in the Collector: keep 100% of errors and slow requests, and sample the rest at 1–5%. Head-based (SDK-side) sampling is cheaper, but it cannot make “this is an error” a keep decision, because the error has not happened yet when the decision is made.

2. Noisy spans drowning the signal. Health checks, metrics scrapes, fs reads, and internal gossip traffic generate many times more spans than user traffic. Filter them in the instrumentation config (the ignoreIncomingRequestHook above) or in the Collector. On one team I worked with, 94% of span volume came from /health; fixing that cut the bill almost proportionally.

3. PII leaking into spans. Query parameters, request bodies, headers: auto-instrumentation is careful but not paranoid. Configure a redaction processor in the Collector to strip Authorization, Cookie, email, phone, and whatever else your compliance team cares about. Do it in the Collector, not the SDK, so policy changes do not require an app redeploy.

4. The Collector as a single point of failure. If it goes down, signals are dropped. Run it as a daemonset or sidecar, not a single replica. Use the batch and memory_limiter processors. Configure sending_queue with persistent storage if you cannot afford to lose spans on restart.

5. Cardinality explosion in metrics. Someone adds user_id as a label and Prometheus remote-write starts OOMing. Alert on per-metric series count in your metrics backend, and review new metric labels in code review. Cardinality is a cultural problem, not a technical one.
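
The tail-based sampling rule from the first failure mode comes down to a small decision made after a trace has finished. In the Collector this is expressed as configuration rather than code, so treat the following as a sketch of the logic only. The TraceSummary and SamplingPolicy shapes and the shouldKeep function are hypothetical, and the thresholds in the usage note are example values.

```typescript
// Summary of a finished trace, as a tail sampler might see it (hypothetical shape).
interface TraceSummary {
  traceId: string; // 32 hex chars
  hasError: boolean;
  durationMs: number;
}

interface SamplingPolicy {
  slowThresholdMs: number; // keep everything at or above this duration
  baselineRate: number;    // fraction of ordinary traces to keep, e.g. 0.05
}

// Deterministic keep/drop: the same trace ID always yields the same decision,
// so every Collector instance agrees without coordination.
function shouldKeep(trace: TraceSummary, policy: SamplingPolicy): boolean {
  if (trace.hasError) return true;
  if (trace.durationMs >= policy.slowThresholdMs) return true;
  // Map the last 8 hex chars of the trace ID onto [0, 1).
  // A malformed ID yields NaN, which compares false and is dropped.
  const bucket = parseInt(trace.traceId.slice(-8), 16) / 0x100000000;
  return bucket < policy.baselineRate;
}
```

With a policy such as { slowThresholdMs: 2000, baselineRate: 0.05 }, every error and every request slower than two seconds is kept, and roughly one in twenty of the remaining traces survives. A head-based sampler cannot evaluate the first two conditions, because they are unknown when the request starts.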

A Minimal Local Stack

For local development or a demo, here is a working docker-compose.yml with the Collector, Tempo, Prometheus, Loki, and Grafana:

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--web.enable-remote-write-receiver"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/local-config.yaml"]

  grafana:
    image: grafana/grafana:latest
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin
    ports:
      - "3001:3000"
    depends_on: [tempo, prometheus, loki]

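With this otel-collector.yaml (the compose file above also mounts a tempo.yaml and a prometheus.yml, which are not shown here):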

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch: {}
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]

Point your service's OTEL_EXPORTER_OTLP_ENDPOINT at http://localhost:4317, open Grafana at localhost:3001, add Tempo, Prometheus, and Loki as data sources, and you have a fully correlated three-pillar stack on your laptop.

Checklist Summary

Before calling a service “observable”:

Observability is not a one-time project; it is a practice. The instrumentation you ship today answers tomorrow's outage, but only if the trace ID makes it all the way through. Keep it flowing, and the rest is detail.

Further Reading

Observability Engineering, by Majors, Fong-Jones, Miranda (2022). The canonical book on high-cardinality, event-based observability.
Distributed Tracing in Practice, by Parker, Spoonhower, Mace, Sigelman (2020). The theory behind the protocol.
The OpenTelemetry specification itself. Dense, but the source of truth for semantic conventions and SDK behavior.

Instrument once, query anywhere. That is the deal OTel offers, and in a world where every team eventually has to switch backends, it is the best deal in the observability market.

Observability with OpenTelemetry: A Practical End-to-End Guide