Monitoring vs Observability

These terms are often used interchangeably, but the difference matters. Monitoring means watching predefined metrics and alerting when they cross thresholds: CPU usage above 90%, error rate above 1%, response time above 500ms. It answers “Is something wrong?”

Observability is the ability to understand the internal state of your system by examining its outputs. It answers “Why is this specific request slow?” and “What changed between yesterday and today?” You need both, but observability is what lets you debug novel problems that your monitoring dashboards were not designed to catch.

Three Pillars

Metrics

Metrics are numeric measurements over time: request count, error rate, latency percentiles, queue depth, CPU usage. They are cheap to store, fast to query, and essential for alerting and dashboards.

I use the RED method for service metrics: Rate (requests per second), Errors (error rate), and Duration (latency distribution). For infrastructure, I use the USE method: Utilization, Saturation, and Errors. Between these 2 frameworks, you cover the critical health indicators for any system.

Instrument your application with 4 metric types: counters (always increasing - total requests, total errors), gauges (point-in-time values - active connections, queue size), histograms (distribution of values - request latency, payload size), and summaries (pre-calculated percentiles).
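
To make the metric types concrete, here is a minimal in-memory sketch of a counter, a gauge, and a histogram using only the Python standard library. The class names and the bucket bounds are illustrative assumptions, not the API of any metrics library; in a real service you would use a client library rather than roll your own. Summaries are left out to keep the sketch short.

```python
import bisect
import threading


class Counter:
    """Monotonically increasing value, e.g. total requests or total errors."""

    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


class Gauge:
    """Point-in-time value that can go up or down, e.g. queue size."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Sorts observations into buckets, e.g. request latency in seconds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra bucket at the end catches everything above the last bound.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # bisect_left picks the first bound >= value (upper-inclusive buckets).
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value

    def cumulative(self):
        """Running totals per bucket: how many observations were <= each bound."""
        running, out = 0, []
        for bucket in self.bucket_counts:
            running += bucket
            out.append(running)
        return out


# Hypothetical usage with made-up bucket bounds (in seconds).
requests_total = Counter()
queue_size = Gauge()
latency = Histogram([0.1, 0.5, 1.0])

requests_total.inc()
queue_size.set(7)
latency.observe(0.3)
```

The histogram keeps only bucket counts, a running sum, and a total count, which is why it stays cheap to store while still allowing percentile estimates at query time.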

Logs

Logs are the narrative of your system: timestamped events describing what happened. The key to useful logs at scale is structure: JSON-formatted logs with consistent fields (timestamp, level, service, trace_id, message, and contextual metadata).

Avoid string concatenation in log messages. Instead of “User 123 placed order 456 for $78.90”, emit a structured log with fields: user_id=123, order_id=456, amount=78.90, event=order_placed. This lets you query logs by any field without parsing text.
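
As a rough illustration of the order example above, the following sketch emits JSON logs with Python's standard logging module. The JsonFormatter class, the build_logger helper, the "fields" key passed through extra, and the "orders" service name are all assumptions made for this example, not a standard convention.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured context arrives via extra={"fields": {...}} (our convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Return a logger that writes JSON lines to the stream (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger("orders")
logger.info(
    "order placed",
    extra={
        "fields": {
            "event": "order_placed",
            "user_id": 123,
            "order_id": 456,
            "amount": 78.90,
            "trace_id": "abc123",  # placeholder value for illustration
        }
    },
)
```

Because every value is its own JSON key, a query such as "all order_placed events for user_id 123" becomes a field filter rather than a regular expression over message text.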

Log levels matter at scale. DEBUG is for development only - do not enable it in production unless you are actively investigating an issue. INFO captures normal business events. WARN captures recoverable anomalies. ERROR captures failures that need attention. Be disciplined about levels; a system that logs everything at INFO is as useless as one that logs nothing.

Traces

Distributed traces follow a request through your entire system - from the API gateway through microservices, message queues, and databases. Each service adds a span with timing, metadata, and status. The result is a complete timeline showing exactly where the time went.

Traces are essential for debugging latency in distributed systems. When a request takes 3 seconds, the trace shows you that 50ms was spent in service A, 100ms in service B, 2800ms waiting on a database query in service C, and 50ms on serialization overhead. Without traces, you are guessing.
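
The 3-second example can be worked through in a few lines. This sketch assumes the four spans run one after another without overlapping, which real traces often do not; the Span type and the span names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float


def breakdown(spans):
    """Return (name, duration_ms, share_of_total) tuples, slowest first.

    Assumes sequential, non-overlapping spans so that durations add up
    to the total request time.
    """
    total = sum(s.duration_ms for s in spans)
    if total == 0:
        return []
    rows = [(s.name, s.duration_ms, s.duration_ms / total) for s in spans]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# The spans from the example above.
trace = [
    Span("service A", 50),
    Span("service B", 100),
    Span("service C: database query", 2800),
    Span("serialization", 50),
]

for name, duration_ms, share in breakdown(trace):
    print(f"{name:<28} {duration_ms:>6.0f} ms  {share:6.1%}")
```

Sorted this way, the database query in service C stands out immediately at roughly 93% of the request time.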

Implement trace propagation by passing the trace ID through all service calls via headers. OpenTelemetry provides standardized libraries for this across languages.
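
To show the mechanism rather than any particular SDK, here is a simplified sketch of header-based propagation using the W3C Trace Context traceparent header format (version, 32-hex trace ID, 16-hex span ID, flags). The function names are hypothetical, header keys are assumed to be lowercase already, and validation is deliberately minimal. In practice you would let the OpenTelemetry libraries handle this.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id():
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def extract(headers):
    """Return (trace_id, parent_span_id) from incoming headers, or (None, None)."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None, None
    return match.group(1), match.group(2)


def inject(headers, trace_id, span_id, sampled=True):
    """Return a copy of headers carrying the trace context for a downstream call."""
    out = dict(headers)
    flags = "01" if sampled else "00"
    out["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return out


def outgoing_headers(incoming):
    """Continue the caller's trace if present; otherwise start a new one."""
    trace_id, _parent_span_id = extract(incoming)
    if trace_id is None:
        trace_id = new_trace_id()
    return inject({}, trace_id, new_span_id())


# A gateway starts the trace, and a downstream service continues it.
gateway_call = outgoing_headers({})
service_call = outgoing_headers(gateway_call)
```

Every hop keeps the same trace ID but mints a new span ID, which is what lets a tracing backend stitch the spans into one timeline.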

Stack Recommendations

Metrics: Prometheus + Grafana

Prometheus is the de facto industry standard for metrics collection. It scrapes endpoints, stores time-series data, and provides a powerful query language (PromQL). Grafana visualizes those metrics in dashboards.

For Kubernetes environments, kube-prometheus-stack gives you a complete monitoring setup: Prometheus, Grafana, Alertmanager, and pre-built dashboards for cluster health, node metrics, and common workloads.

Logs: Loki or Elasticsearch

Grafana Loki is my current recommendation for most teams. It is cheaper than Elasticsearch, integrates natively with Grafana, and uses label-based indexing that is efficient for the query patterns teams actually use.

Elasticsearch (via the ELK stack) is relevant for teams that need full-text search across logs or have complex aggregation requirements. But for most use cases, Loki’s simpler operational model wins.

Traces: Jaeger or Tempo

Grafana Tempo integrates with the Grafana stack and supports the OpenTelemetry protocol natively. Jaeger is a mature alternative with a strong UI for trace exploration.

Choose based on your existing stack: if you are already running Grafana and Loki, Tempo is a natural fit. If you are on a different stack, Jaeger’s standalone deployment is straightforward.

Effective Alerting

Alert Fatigue Is Real

The fastest way to make on-call miserable is to alert on everything. If your team gets 50 alerts per day, they will ignore all of them - including the critical ones. I have seen teams where “mute all alerts” was the unofficial policy because the signal-to-noise ratio was so poor.

Alert Design Principles

Alert on symptoms, not causes. Alert on “error rate above 1% for 5 minutes”, not “CPU above 80%”. High CPU may be perfectly fine during a batch job; a high error rate is always a problem.

Every alert needs 3 things: a clear description of what is wrong, a runbook link explaining how to investigate and remediate, and an appropriate severity (page for customer-facing issues, ticket for everything else).

Use alert windows and thresholds that prevent flapping. A brief spike to 2% errors for 30 seconds may be a normal transient - alerting on it wastes attention. Use multi-window burn rate alerting: alert when both the short-term burn rate (last 5 minutes) AND the long-term burn rate (last hour) exceed their thresholds.
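
A minimal sketch of the multi-window check follows. Burn rate here means the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The 99.9% target and the 14.4 threshold are assumptions for illustration (14.4 is the rate that would consume 2% of a 30-day budget in one hour); tune both for your own services.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / requests) / allowed_error_rate


def should_alert(short_window, long_window, slo_target=0.999, threshold=14.4):
    """Fire only when BOTH windows burn faster than the threshold.

    Each window is an (errors, requests) tuple, e.g. the last 5 minutes
    and the last hour. Requiring both suppresses brief spikes (the long
    window stays low) and stale alerts after recovery (the short window
    drops first).
    """
    short_burn = burn_rate(*short_window, slo_target)
    long_burn = burn_rate(*long_window, slo_target)
    return short_burn >= threshold and long_burn >= threshold


# A brief spike: 2% errors in the last 5 minutes, but the hour looks healthy.
print(should_alert(short_window=(20, 1000), long_window=(30, 100000)))

# A sustained problem: 2% errors in both windows.
print(should_alert(short_window=(20, 1000), long_window=(2000, 100000)))
```

The first call stays quiet and the second fires, which is the anti-flapping behavior described above.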

Severity Levels

I use 4 severity levels: P1 (pages the on-call engineer immediately - service is down or data loss is occurring), P2 (pages during business hours - significant degradation affecting users), P3 (creates a ticket - non-urgent issue that needs attention within a few days), and P4 (logged and reviewed weekly - optimization opportunities or cosmetic issues).

P1 alerts should fire less than once per week on average. If you are getting P1 alerts daily, either the system is genuinely unreliable or the alert thresholds are wrong.

SLOs and Error Budgets

Service Level Objectives (SLOs) formalize your reliability targets. Instead of vague goals such as “the service should be fast”, you define specific, measurable objectives: “99.9% of requests complete within 500ms over a 30-day window”.

Error budgets make SLOs actionable. If your SLO is 99.9% availability, you have a budget of 0.1% downtime per month (about 43 minutes). When the budget is healthy, you ship features aggressively. When it is burning fast, you prioritize reliability work.
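
The 43-minute figure is simple arithmetic: 0.1% of a 30-day month. The sketch below makes that calculation explicit; the function names and the 26 minutes of downtime are made up for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime in the window for a given availability SLO."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo_target) * window_minutes


def budget_consumed(downtime_minutes, slo_target, window_days=30):
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


budget = error_budget_minutes(0.999)
print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")

# Hypothetical month so far: 26 minutes of downtime.
print(f"budget consumed: {budget_consumed(26, 0.999):.0%}")
```

With 26 minutes of downtime the budget is about 60% spent, the point at which holding back risky deployments becomes an easy call.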

This framework turns reliability from a matter of opinion into a data-driven decision: “We’ve used 60% of our error budget this month, so let’s hold off on risky deployments until next month” is a concrete, defensible decision.

Dashboard Design

Four Dashboard Layers

Layer 1: Executive dashboard - one screen showing overall system health, SLO status, and key business metrics.
Layer 2: Service dashboards - one per service, showing RED metrics, resource usage, and deployment markers.
Layer 3: Investigation dashboards - detailed views for debugging specific issues (database performance, cache hit rates, queue depths).
Layer 4: Infrastructure dashboards - node health, disk usage, network metrics.

Most engineers should live on Layer 2 and only drill into Layer 3 when investigating issues. If your team spends most of its time on infrastructure dashboards, something is wrong with your platform.

Getting Started

If you have no monitoring today, start with these 3 things: instrument your application with RED metrics (request rate, error rate, duration), set up structured logging with correlation IDs, and create 1 alert per service for an error rate exceeding 1% for 5 minutes. This minimal setup catches the majority of production issues and gives you a foundation to build on.
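
The “1% for 5 minutes” rule can be sketched as a plain function over per-minute samples. This is only an illustration of the logic, assuming one (requests, errors) sample per minute; in a Prometheus setup the same idea would normally live in an alerting rule with a hold duration rather than in application code.

```python
def error_rate_alert(samples, threshold=0.01, for_minutes=5):
    """Fire when the error rate exceeded the threshold in each recent minute.

    samples: list of (requests, errors) tuples, one per minute, oldest first.
    A minute with no traffic counts as healthy in this sketch.
    """
    if len(samples) < for_minutes:
        return False  # not enough history to say the condition held
    recent = samples[-for_minutes:]
    return all(
        requests > 0 and errors / requests > threshold
        for requests, errors in recent
    )


# Five straight minutes at 2% errors: the alert fires.
sustained = [(1000, 20)] * 5
print(error_rate_alert(sustained))

# One healthy minute inside the window resets the condition.
interrupted = [(1000, 20), (1000, 20), (1000, 5), (1000, 20), (1000, 20)]
print(error_rate_alert(interrupted))
```

Requiring the condition to hold for the full window is what keeps a single bad minute from paging anyone.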

Building a Production Monitoring and Observability Stack