Building a Kubernetes Observability Stack: A Complete Guide to Monitoring and Troubleshooting
2026/5/24 23:06:20


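1. Observability Overview

Observability is the ability to understand a system's internal state from the data it emits. In Kubernetes, an observability stack rests on three core dimensions: metrics, logs, and tracing.

1.1 The Three Pillars of Observability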

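Pillar  | Purpose                                            | Tools
Metrics | Real-time monitoring, alerting, trend analysis     | Prometheus, Grafana
Logs    | Troubleshooting, auditing, compliance              | ELK Stack, Loki
Tracing | Distributed request tracing, performance analysis  | Jaeger, Zipkin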

1.2 Observability Architecture

    Application layer
      ↓ Metrics → Prometheus → Grafana
      ↓ Logs → Fluentd/Loki → Grafana
      ↓ Tracing → Jaeger → Grafana
      ↓ Alerts → Alertmanager → PagerDuty/Email

2. Metrics Monitoring

2.1 Deploying Prometheus

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
      - apiGroups: [""]
        resources: ["nodes", "services", "endpoints", "pods"]
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources: ["nodes/metrics"]
        verbs: ["get"]
    ---
    # Binds the ClusterRole to the ServiceAccount; without a binding
    # the rules above grant nothing to Prometheus.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
      - kind: ServiceAccount
        name: prometheus
        namespace: monitoring
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          serviceAccountName: prometheus
          containers:
            - name: prometheus
              image: prom/prometheus:v2.45.0
              ports:
                - containerPort: 9090
              volumeMounts:
                - name: config
                  mountPath: /etc/prometheus
                - name: data
                  mountPath: /prometheus
          volumes:
            - name: config
              configMap:
                name: prometheus-config
            - name: data
              persistentVolumeClaim:
                claimName: prometheus-pvc

2.2 Prometheus Configuration

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      namespace: monitoring
    data:
      prometheus.yml: |
        global:
          scrape_interval: 15s
          evaluation_interval: 15s

        scrape_configs:
          # API servers: keep only the default/kubernetes service on its https port
          - job_name: 'kubernetes-apiservers'
            kubernetes_sd_configs:
              - role: endpoints
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            relabel_configs:
              - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                action: keep
                regex: default;kubernetes;https

          # Nodes: copy every Kubernetes node label onto the scraped series
          - job_name: 'kubernetes-nodes'
            kubernetes_sd_configs:
              - role: node
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            relabel_configs:
              - action: labelmap
                regex: __meta_kubernetes_node_label_(.+)

          # Pods: scrape only Pods annotated with prometheus.io/scrape: "true",
          # honoring an optional prometheus.io/path annotation
          - job_name: 'kubernetes-pods'
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                action: keep
                regex: true
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                action: replace
                target_label: __metrics_path__
                regex: (.+)
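The `keep` rules in this configuration are easier to reason about with a small model. The sketch below imitates, in plain Python, how a `keep` action decides whether a target survives: the source label values are joined with `;` and the regex has to match the whole joined string. The function name `keep_target` and the sample label sets are hypothetical, and Python's `re` module is only standing in for the RE2 syntax Prometheus uses, so treat this as an illustration rather than Prometheus code.

```python
import re


def keep_target(labels, source_labels, regex, separator=";"):
    """Return True if a target would survive a relabel `keep` action.

    Source label values are joined with the separator (missing labels
    count as empty strings) and the regex must match the entire result.
    """
    joined = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


API_SERVER_SOURCES = [
    "__meta_kubernetes_namespace",
    "__meta_kubernetes_service_name",
    "__meta_kubernetes_endpoint_port_name",
]

# Hypothetical discovered targets
apiserver = {
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_service_name": "kubernetes",
    "__meta_kubernetes_endpoint_port_name": "https",
}
other_service = {
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_service_name": "my-app",
    "__meta_kubernetes_endpoint_port_name": "http",
}

print(keep_target(apiserver, API_SERVER_SOURCES, "default;kubernetes;https"))
print(keep_target(other_service, API_SERVER_SOURCES, "default;kubernetes;https"))
```

The same function models the Pod job: a Pod without the `prometheus.io/scrape` annotation yields an empty string, which does not match `true`, so the target is dropped.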

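2.3 Exposing Custom Metrics

    from flask import Flask
    from prometheus_client import Counter, Histogram, generate_latest

    app = Flask(__name__)

    # Application metrics: a request counter and a latency histogram
    REQUEST_COUNT = Counter('app_requests_total', 'Total requests')
    REQUEST_LATENCY = Histogram('app_request_duration_seconds', 'Request duration')


    @app.route('/')
    def hello():
        REQUEST_COUNT.inc()
        # Record how long the handler body takes
        with REQUEST_LATENCY.time():
            return 'Hello World'


    @app.route('/metrics')
    def metrics():
        # Serve all registered metrics in the Prometheus text format
        return generate_latest(), 200, {'Content-Type': 'text/plain; version=0.0.4; charset=utf-8'}


    if __name__ == '__main__':
        # With the kubernetes-pods job from 2.2, the Pod also needs the
        # prometheus.io/scrape: "true" annotation to be scraped.
        app.run(host='0.0.0.0', port=8080)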
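A Histogram such as `app_request_duration_seconds` is exposed as cumulative buckets together with a running sum and count. The sketch below reproduces that bookkeeping in plain Python to show why an observation increments every bucket whose upper bound is at least the observed value. The class name `SimpleHistogram` and the bucket boundaries are illustrative assumptions; this is not how `prometheus_client` is implemented, and these are not its default buckets.

```python
import math


class SimpleHistogram:
    """Toy model of cumulative histogram buckets."""

    def __init__(self, upper_bounds):
        # An implicit +Inf bucket always holds every observation
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Cumulative: every bucket with bound >= value is incremented
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def snapshot(self):
        """Map each upper bound (the `le` label value) to its count."""
        return {
            ("+Inf" if math.isinf(bound) else str(bound)): count
            for bound, count in zip(self.upper_bounds, self.bucket_counts)
        }


hist = SimpleHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 2.0):
    hist.observe(seconds)
print(hist.snapshot())
```

Because the buckets are cumulative, the `+Inf` bucket always equals the total observation count, which is what lets quantiles be estimated from bucket counts at query time.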

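3. Log Management

3.1 Deploying Loki

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: loki
      namespace: monitoring
    spec:
      serviceName: loki
      replicas: 1
      selector:
        matchLabels:
          app: loki
      template:
        metadata:
          labels:
            app: loki
        spec:
          containers:
            - name: loki
              image: grafana/loki:2.8.0
              args:
                - -config.file=/etc/loki/config.yaml
              ports:
                - containerPort: 3100
              volumeMounts:
                - name: data
                  mountPath: /loki
                # Mounted so that the -config.file path above exists
                - name: config
                  mountPath: /etc/loki
          volumes:
            # Assumption: a ConfigMap named loki-config (not defined in
            # this article) provides config.yaml
            - name: config
              configMap:
                name: loki-config
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi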

3.2 Fluentd Configuration

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: fluentd
      template:
        metadata:
          labels:
            app: fluentd
        spec:
          containers:
            - name: fluentd
              # Note: the image must bundle a Loki output plugin, and
              # LOKI_URL only takes effect if the Fluentd config reads it.
              image: fluent/fluentd-kubernetes-daemonset:v1.15.3
              env:
                - name: LOKI_URL
                  value: http://loki:3100
              volumeMounts:
                - name: varlog
                  mountPath: /var/log
                # Docker runtime log location; containerd nodes keep
                # container logs under /var/log/pods instead.
                - name: varlibdockercontainers
                  mountPath: /var/lib/docker/containers
                  readOnly: true
          volumes:
            - name: varlog
              hostPath:
                path: /var/log
            - name: varlibdockercontainers
              hostPath:
                path: /var/lib/docker/containers

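3.3 Log Query Examples

    # Logs of a specific Pod
    kubectl logs <pod-name>

    # Logs of a Pod in a specific namespace
    kubectl logs -n <namespace> <pod-name>

    # Stream logs
    kubectl logs -f <pod-name>

    # Query Loki with LogQL through logcli, returning at most 10 lines
    # (LogQL has no "tail" stage; the limit is set on the query itself)
    logcli query --limit=10 '{namespace="default", app="my-app"} |= "error"'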
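The same LogQL selector can also be sent to Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint, so it can be checked without a running cluster. The helper name `build_loki_query_url` is hypothetical, and the base URL `http://loki:3100` is taken from the Fluentd configuration in this article and assumed to be reachable from wherever the request is made.

```python
from urllib.parse import urlencode


def build_loki_query_url(base_url, logql, limit=10, direction="backward"):
    """Build a query_range URL for a LogQL expression.

    `limit` caps the number of returned lines and `direction` controls
    whether the newest ("backward") or oldest ("forward") lines come first.
    """
    params = {"query": logql, "limit": limit, "direction": direction}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


url = build_loki_query_url(
    "http://loki:3100",
    '{namespace="default", app="my-app"} |= "error"',
)
print(url)
```

Passing the resulting URL to any HTTP client returns a JSON document with the matching log streams; the line limit is a request parameter, not part of the LogQL expression.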

4. Distributed Tracing

4.1 Deploying Jaeger

    # Jaeger custom resource; requires the Jaeger Operator in the cluster
    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: jaeger
      namespace: monitoring
    spec:
      strategy: production
      storage:
        type: elasticsearch
        options:
          es:
            server-urls: http://elasticsearch:9200

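4.2 Instrumenting Code for Tracing

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    # Note: the Jaeger Thrift exporter is deprecated in newer OpenTelemetry
    # releases in favor of OTLP; it is kept here to match the article.
    from opentelemetry.exporter.jaeger.thrift import JaegerExporter

    # Register a global tracer provider and obtain a tracer
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)

    # Send finished spans to the Jaeger collector in batches
    jaeger_exporter = JaegerExporter(
        collector_endpoint='http://jaeger-collector:14268/api/traces'
    )
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(jaeger_exporter)
    )


    @tracer.start_as_current_span("my-operation")
    def my_function():
        # Child span nested under "my-operation"
        with tracer.start_as_current_span("inner-operation"):
            print("Inside inner operation")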

4.3 Querying Traces

    # Open the Jaeger UI
    kubectl port-forward -n monitoring svc/jaeger-query 16686:16686

    # Look up a trace by its trace ID
    curl http://localhost:16686/api/traces/<trace-id>

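5. Alerting and Notification

5.1 Alertmanager Configuration

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: alertmanager-config
      namespace: monitoring
    data:
      config.yml: |
        global:
          resolve_timeout: 5m
          # Placeholder SMTP settings (assumed values); email_configs
          # needs a smarthost and a sender address to be valid
          smtp_smarthost: 'smtp.example.com:587'
          smtp_from: 'alertmanager@example.com'

        route:
          group_by: ['alertname']
          group_wait: 10s
          group_interval: 10s
          repeat_interval: 1h
          receiver: 'email'

        receivers:
          - name: 'email'
            email_configs:
              - to: 'admin@example.com'
                send_resolved: true

        # Mute warning alerts while a critical alert with the same
        # alertname, dev, and instance labels is firing
        inhibit_rules:
          - source_match:
              severity: 'critical'
            target_match:
              severity: 'warning'
            equal: ['alertname', 'dev', 'instance']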
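The inhibit rule is the least obvious part of this configuration. The sketch below models its decision in plain Python: a target alert is muted when some other firing alert matches the source matchers and agrees with the target on every label listed in `equal`, where a label missing on both sides counts as equal. The function name `is_inhibited` and the sample alerts are hypothetical; this is a simplified model, not Alertmanager's implementation.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` would be muted by an inhibit rule."""

    def matches(alert, matchers):
        return all(alert.get(key) == value for key, value in matchers.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # Missing labels are treated as empty, so absent-on-both is equal
        if all(source.get(label, "") == target.get(label, "") for label in equal):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

critical = {"alertname": "HighCPUUsage", "severity": "critical", "instance": "node-1"}
same_node = {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node-1"}
other_node = {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node-2"}
firing = [critical, same_node, other_node]

print(is_inhibited(same_node, firing, **RULE))
print(is_inhibited(other_node, firing, **RULE))
```

Note that `alertname` is part of `equal`, so a critical `HighCPUUsage` alert does not mute a warning `HighMemoryUsage` alert on the same instance.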

5.2 Alert Rule Configuration

    # PrometheusRule is a Prometheus Operator resource. With the plain
    # Deployment from 2.1, the same groups would go into a rule file
    # referenced by rule_files in prometheus.yml.
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: example-rules
      namespace: monitoring
    spec:
      groups:
        - name: node.rules
          rules:
            - alert: HighCPUUsage
              expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 90
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "High CPU usage on {{ $labels.instance }}"
                description: "CPU usage is above 90% for 5 minutes"
            - alert: HighMemoryUsage
              expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High memory usage on {{ $labels.instance }}"
                description: "Memory usage is above 85% for 5 minutes"
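Two details of these rules are worth working through: the arithmetic of the memory expression and the effect of `for: 5m`. The sketch below computes the memory usage percentage the same way the `HighMemoryUsage` expression does, then walks a series of evaluations to show that an alert stays pending until the condition has held for the full `for` duration, and resets as soon as it stops holding. The function names and sample values are hypothetical, and the state machine is a simplified model of rule evaluation.

```python
def memory_usage_percent(mem_total_bytes, mem_available_bytes):
    """Same arithmetic as the HighMemoryUsage expression."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes * 100


def alert_states(samples, threshold, for_seconds, interval_seconds):
    """Return the alert state after each evaluation of `value > threshold`.

    `samples` holds one value per evaluation, spaced `interval_seconds` apart.
    """
    states = []
    active_since = None
    for index, value in enumerate(samples):
        now = index * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Condition no longer holds: the pending timer starts over
            active_since = None
            states.append("inactive")
    return states


GIB = 1024 ** 3
print(memory_usage_percent(16 * GIB, 2 * GIB))

# One evaluation per minute, threshold 85%, for: 5m
steady = alert_states([90] * 7, threshold=85, for_seconds=300, interval_seconds=60)
dipped = alert_states([90, 90, 80, 90], threshold=85, for_seconds=300, interval_seconds=60)
print(steady)
print(dipped)
```

The `for` clause therefore trades detection delay for fewer false alarms: a short spike above the threshold never leaves the pending state.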

6. Visualization with Grafana

6.1 Deploying Grafana

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          labels:
            app: grafana
        spec:
          containers:
            - name: grafana
              image: grafana/grafana:10.1.0
              ports:
                - containerPort: 3000
              env:
                # Admin password comes from a Secret rather than plain text
                - name: GF_SECURITY_ADMIN_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: grafana-secret
                      key: admin-password
              volumeMounts:
                - name: data
                  mountPath: /var/lib/grafana
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: grafana-pvc

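6.2 Dashboard Configuration

    # GrafanaDashboard is a Grafana Operator resource. Depending on the
    # operator version, additional fields such as an instance selector
    # or a ConfigMap key may be required.
    apiVersion: grafana.integreatly.org/v1beta1
    kind: GrafanaDashboard
    metadata:
      name: kubernetes-dashboard
      namespace: monitoring
    spec:
      configMapRef:
        name: kubernetes-dashboard-config
      datasources:
        - inputName: "DS_PROMETHEUS"
          datasourceName: "Prometheus"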

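7. Observability Best Practices

7.1 Metric Naming Conventions

    # Metric name format
    # <namespace>_<component>_<metric>_<unit>

    # Examples
    api_http_requests_total          # total number of requests
    api_request_duration_seconds     # request duration
    database_connection_pool_size    # database connection pool size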
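A naming convention is easier to keep when it is checked automatically, for example in a unit test or CI step. The sketch below is one minimal way to do that: it checks that a name is lowercase snake_case with at least two parts and that counters carry the `_total` suffix. The function name `naming_problems` and the exact rules enforced are assumptions made for illustration; teams usually extend such a check with their own unit suffixes.

```python
import re

# Lowercase snake_case with at least two underscore-separated parts
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)+")


def naming_problems(name, metric_type):
    """Return a list of convention violations for a metric name."""
    problems = []
    if not SNAKE_CASE.fullmatch(name):
        problems.append("name should be lowercase snake_case with at least two parts")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    return problems


for metric_name, metric_type in [
    ("api_http_requests_total", "counter"),
    ("api_request_duration_seconds", "histogram"),
    ("database_connection_pool_size", "gauge"),
    ("ApiRequests", "counter"),
]:
    print(metric_name, naming_problems(metric_name, metric_type))
```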

7.2 Structured Logging

    {
      "timestamp": "2024-01-15T10:30:00Z",
      "level": "INFO",
      "service": "order-service",
      "trace_id": "abc-123",
      "span_id": "def-456",
      "message": "Order created successfully",
      "data": {
        "order_id": "ORD-001",
        "customer_id": "CUS-123"
      }
    }
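Log entries in this shape can be produced with the Python standard library alone. The sketch below defines a `logging.Formatter` subclass that emits one JSON object per record with the same field names as the example entry. The class name `JsonFormatter` is hypothetical, and passing `trace_id`, `span_id`, and `data` through the `extra` argument is one possible convention, not something the article prescribes.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            # Optional fields supplied through logging's `extra` argument
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
            "data": getattr(record, "data", {}),
        }
        return json.dumps(entry)


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Order created successfully",
    extra={
        "trace_id": "abc-123",
        "span_id": "def-456",
        "data": {"order_id": "ORD-001", "customer_id": "CUS-123"},
    },
)
```

Carrying `trace_id` and `span_id` in every entry is what allows a log line found in Loki to be matched to the corresponding trace in Jaeger.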

7.3 Sampling Strategy

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tracing-config
    data:
      tracing.yaml: |
        sampling:
          rate: 0.1  # 10% sampling rate
          max_samples_per_second: 100
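The two settings describe a probabilistic sampler with a rate cap. The sketch below shows one way such a policy could be applied: the decision is derived from the trace ID, so every service that sees the same trace makes the same choice, and a per-second counter enforces the cap. The class name `RateLimitedSampler` and the injectable `clock` parameter are assumptions made for illustration; how `tracing.yaml` is actually consumed depends on the tracing library in use.

```python
import time


class RateLimitedSampler:
    """Trace-ID based probabilistic sampling with a per-second cap."""

    def __init__(self, rate, max_samples_per_second, clock=time.monotonic):
        # Trace IDs whose low 64 bits fall below this bound are sampled
        self.threshold = int(rate * (1 << 64))
        self.max_per_second = max_samples_per_second
        self.clock = clock
        self.window_start = clock()
        self.sampled_in_window = 0

    def should_sample(self, trace_id_hex):
        low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
        if low_bits >= self.threshold:
            return False
        now = self.clock()
        if now - self.window_start >= 1.0:
            # Start a new one-second window
            self.window_start = now
            self.sampled_in_window = 0
        if self.sampled_in_window >= self.max_per_second:
            return False
        self.sampled_in_window += 1
        return True


sampler = RateLimitedSampler(rate=0.1, max_samples_per_second=100)
print(sampler.should_sample("0" * 32))
print(sampler.should_sample("f" * 32))
```

Deriving the decision from the trace ID rather than from a random number keeps traces complete: a request is either sampled in every service it passes through or in none.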

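8. Monitoring the Observability Stack

8.1 Component Status

    # Check Prometheus status
    kubectl get pods -n monitoring -l app=prometheus

    # Check alert state through the Prometheus HTTP API
    # (kubectl has no "alerts" resource; this assumes a Service named
    # prometheus, as used in 8.2; run the port-forward in a separate terminal)
    kubectl port-forward -n monitoring svc/prometheus 9090:9090
    curl http://localhost:9090/api/v1/alerts

    # Check Grafana status
    kubectl get pods -n monitoring -l app=grafana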

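8.2 Health Checks

    apiVersion: v1
    kind: Pod
    metadata:
      name: observability-check
      # Same namespace as the services so the short names resolve
      namespace: monitoring
    spec:
      # One-shot check: do not restart after it exits
      restartPolicy: Never
      containers:
        - name: health-check
          image: busybox:1.28
          command:
            - /bin/sh
            - -c
            - |
              # busybox ships wget rather than curl
              wget -q -O /dev/null http://prometheus:9090/-/ready || exit 1
              wget -q -O /dev/null http://loki:3100/ready || exit 1
              wget -q -O /dev/null http://jaeger-query:16686/ || exit 1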
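The same readiness probes can be run from a script, which makes it easier to report which component failed instead of stopping at the first error. The sketch below uses only the standard library and probes the three endpoints from the health-check Pod. The names `ENDPOINTS`, `http_ok`, and `check_endpoints` are hypothetical, and the short service hostnames only resolve when the script runs inside the cluster in the same namespace as the services.

```python
import urllib.error
import urllib.request

# Readiness endpoints taken from the health-check Pod in this section
ENDPOINTS = {
    "prometheus": "http://prometheus:9090/-/ready",
    "loki": "http://loki:3100/ready",
    "jaeger-query": "http://jaeger-query:16686/",
}


def http_ok(url, timeout=5):
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections, timeouts
        return False


def check_endpoints(endpoints, probe=http_ok):
    """Probe every endpoint and return a name -> healthy mapping."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_endpoints(ENDPOINTS)` inside the cluster returns one boolean per component; the `probe` parameter exists so the function can be exercised with a stub when no cluster is available.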

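9. Summary

A complete Kubernetes observability stack needs:

  1. Metrics monitoring: collect key metrics with Prometheus
  2. Log management: aggregate logs with Loki and Fluentd
  3. Distributed tracing: trace request paths with Jaeger
  4. Visualization: present the data in Grafana
  5. Alerting: send notifications through Alertmanager

Choose the combination of tools that fits your workload, and keep tuning the monitoring strategy as the system evolves.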


