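K8s Scheduling Strategies for AI Services: From Resource Contention to Intelligent Orchestration, Governing Inference Services in Cloud-Native Environments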

1. The Scheduling Dilemma of AI Inference Services: GPU Contention versus Elastic Scaling

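Scheduling AI inference services on a Kubernetes cluster poses distinctive challenges. Unlike ordinary web services, inference services depend heavily on GPUs, which are the scarcest and most expensive resource in the cluster. When several inference services share a GPU node, contention causes inference latency to jitter; when each service gets a GPU node to itself, utilization is very low and costs stay high.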

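Traffic characteristics complicate things further. Large-model inference requests are long-running (hundreds of milliseconds to several seconds) and memory-hungry (model weights often run to tens of GB), and resource needs vary widely between models. At FP16 (2 bytes per parameter), a 7B-parameter model needs about 14 GB of GPU memory for its weights alone and a 70B model about 140 GB, before counting KV cache and activations. The default K8s scheduler was designed around CPU and memory; it counts GPUs only as whole-device extended resources and has no built-in view of GPU memory or sharing, so on its own it cannot allocate GPUs effectively.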
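The 14 GB and 140 GB figures follow from simple arithmetic, which the sketch below makes explicit. The function names are hypothetical, and the estimate covers weights only: KV cache, activations, and runtime overhead come on top, so treat the GPU count as a lower bound.

```go
package main

import "math"

// WeightMemoryGB estimates the GPU memory needed for model weights alone,
// in decimal gigabytes. paramsBillions is the parameter count in billions;
// bytesPerParam is 2 for FP16/BF16, 1 for INT8, 4 for FP32.
// KV cache, activations, and runtime overhead are deliberately ignored.
func WeightMemoryGB(paramsBillions, bytesPerParam float64) float64 {
	return paramsBillions * bytesPerParam
}

// MinGPUsForWeights returns the smallest number of GPUs whose combined memory
// can hold the weights. It is a lower bound, not a sizing recommendation.
func MinGPUsForWeights(weightGB, perGPUMemoryGB float64) int {
	if weightGB <= 0 || perGPUMemoryGB <= 0 {
		return 0
	}
	return int(math.Ceil(weightGB / perGPUMemoryGB))
}
```

For example, `WeightMemoryGB(70, 2)` gives 140, and with an assumed 80 GB per GPU `MinGPUsForWeights(140, 80)` gives 2, which is why a 70B model at FP16 cannot be scheduled as a single-GPU Pod.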

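Elastic scaling is just as tricky. Cold starts take far longer than for ordinary services: loading model weights takes tens of seconds to several minutes. When a traffic spike triggers HPA scale-out, a new Pod cannot serve requests until its model has finished loading, so requests queue up and time out.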
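To get a feel for how much queueing a cold start causes, a rough back-of-the-envelope model helps. The sketch below assumes constant arrival and service rates during the load window, which real traffic will not follow exactly; the function name and parameters are illustrative only.

```go
package main

// BacklogDuringColdStart estimates how many requests pile up while a new
// replica is still loading its model.
//   arrivalRPS:  incoming requests per second
//   capacityRPS: combined throughput of the replicas that are already ready
//   loadSeconds: time the new replica needs before it can serve
// Rates are assumed constant over the window (a simplification).
func BacklogDuringColdStart(arrivalRPS, capacityRPS, loadSeconds float64) float64 {
	excess := arrivalRPS - capacityRPS
	if excess <= 0 || loadSeconds <= 0 {
		return 0 // existing replicas keep up; nothing accumulates
	}
	return excess * loadSeconds
}
```

With 12 requests per second arriving, 10 per second of ready capacity, and a 60-second model load, roughly 120 requests are waiting by the time the new Pod becomes ready.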

2. Scheduling Architecture and Resource Model for AI Inference Services

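Scheduling for AI inference services needs optimization at three levels: GPU allocation at the Pod level, GPU sharing policy at the node level, and elastic scaling and scheduling at the cluster level.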

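```mermaid
flowchart TD
    A[Inference service deployment request] --> B[Scheduling decision layer]
    B --> B1[GPU resource requirement assessment]
    B --> B2[Model load time estimate]
    B --> B3[Traffic forecast and warmup]
    B1 --> C{Choose scheduling strategy}
    B2 --> C
    B3 --> C
    C -->|Exclusive mode| D1[GPU exclusive: high-priority services]
    C -->|Shared mode| D2[GPU time-slicing: low-priority services]
    C -->|MPS mode| D3[Multi-process sharing: medium priority]
    D1 --> E[Pod scheduling and binding]
    D2 --> E
    D3 --> E
    E --> F[Model warmup and readiness check]
    F --> G[Admit traffic]
    G --> H[Runtime monitoring]
    H --> H1[GPU utilization]
    H --> H2[Inference latency P99]
    H --> H3[Queue depth]
    H1 --> I{Scale out or in?}
    H2 --> I
    H3 --> I
    I -->|Scale out| J[Warm up new Pod, then admit traffic]
    I -->|Scale in| K[Drain traffic, then terminate Pod]
    style B fill:#e1f5fe
    style I fill:#fff3e0
```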
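The strategy-selection branch in the flowchart maps service priority to a GPU sharing mode. A minimal sketch of that mapping is shown here; the type and constant names are hypothetical, and a real policy would likely weigh more inputs than priority alone.

```go
package main

import "fmt"

// Priority and SharingMode mirror the values used in the flowchart.
type Priority string
type SharingMode string

const (
	PriorityHigh   Priority = "high"
	PriorityMedium Priority = "medium"
	PriorityLow    Priority = "low"

	ModeExclusive   SharingMode = "exclusive"
	ModeMPS         SharingMode = "mps"
	ModeTimeSlicing SharingMode = "time-slicing"
)

// SelectSharingMode picks a GPU sharing mode from the service priority:
// high -> exclusive, medium -> MPS, low -> time-slicing.
func SelectSharingMode(p Priority) (SharingMode, error) {
	switch p {
	case PriorityHigh:
		return ModeExclusive, nil
	case PriorityMedium:
		return ModeMPS, nil
	case PriorityLow:
		return ModeTimeSlicing, nil
	default:
		return "", fmt.Errorf("unknown priority %q", p)
	}
}
```

Returning an error for unknown priorities, rather than silently defaulting to a shared mode, keeps a mistyped priority from placing a latency-sensitive service on a contended GPU.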

2.1 GPU Resource Model and Scheduling Strategy

```yaml
# gpu-resource-model.yaml: GPU resource definition for an AI inference service
# Design intent: describe GPU requirements precisely through a custom resource,
# supporting exclusive, time-slicing, and MPS modes, so the scheduler has a basis for its decisions
apiVersion: scheduling.ai/v1alpha1
kind: InferenceService
metadata:
  name: llm-7b-service
spec:
  model:
    name: "llama-7b"
    storageUri: "s3://models/llama-7b"
    framework: "vllm"
  resources:
    gpu:
      type: "nvidia.com/gpu"
      count: 1            # number of GPUs requested
      memory: "14Gi"      # GPU memory required
      sharing:
        mode: "exclusive" # exclusive | time-slicing | mps
        # time-slicing configuration: share the GPU in time slices
        # timeSlicing:
        #   replicas: 4   # split 1 GPU into 4 time slices
        # mps configuration: Multi-Process Service
        # mps:
        #   activeThreadPercentage: 50
  scheduling:
    priority: "high"      # high | medium | low
    affinity:
      # Prefer nodes that already cache the model, to shorten cold starts
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
                - key: "model-cache.ai/llama-7b"
                  operator: In
                  values: ["cached"]
            weight: 100
  scaling:
    minReplicas: 1
    maxReplicas: 10
    # Custom metric: scale on inference queue depth
    metrics:
      - type: Pods
        pods:
          metric:
            name: inference_queue_depth
          target:
            type: AverageValue
            averageValue: "5"
    # Warmup policy: on scale-out, load the model before admitting traffic
    warmup:
      enabled: true
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
```
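The `AverageValue` target of 5 in the scaling section translates into a replica count through the ratio rule documented for the Kubernetes HPA: desired = ceil(current × currentMetric / targetMetric). The sketch below applies only that core formula plus the min/max clamp; the real HPA also applies a tolerance band, stabilization windows, and special handling for unready Pods, none of which are modeled here. The function name is hypothetical.

```go
package main

import "math"

// DesiredReplicas applies the core HPA ratio rule to an average-per-Pod metric
// such as inference_queue_depth, then clamps the result to [minReplicas, maxReplicas].
// Tolerance, stabilization, and unready-Pod handling are intentionally omitted.
func DesiredReplicas(current int32, avgQueueDepth, targetAvg float64, minReplicas, maxReplicas int32) int32 {
	desired := current
	if current > 0 && targetAvg > 0 {
		desired = int32(math.Ceil(float64(current) * avgQueueDepth / targetAvg))
	}
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}
```

With 2 replicas averaging 12 queued requests each against a target of 5, the rule asks for ceil(2 × 12 / 5) = 5 replicas; with 8 replicas averaging 20, the raw answer of 32 is capped at `maxReplicas: 10`.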

2.2 Custom Scheduler Implementation

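```go
// scheduler.go: custom scheduler plugin for AI inference services
// Design intent: make scheduling decisions from GPU availability and model-cache state,
// preferring nodes that already cache the model in order to shorten cold starts
package scheduler

import (
	"context"
	"fmt"
	"strconv"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "AIInferenceScheduler"

const gpuResource v1.ResourceName = "nvidia.com/gpu"

type AIScheduler struct {
	handle framework.Handle
}

// New builds the plugin. The factory and Score signatures differ between
// scheduler framework versions (for example, some versions pass a node name
// to Score instead of a NodeInfo); adjust them to the version you build against.
func New(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &AIScheduler{handle: h}, nil
}

func (s *AIScheduler) Name() string {
	return Name
}

// Score rates each node; a higher score means the node is preferred.
// The three factors add up to at most 100.
func (s *AIScheduler) Score(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeInfo *framework.NodeInfo,
) (int64, *framework.Status) {
	node := nodeInfo.Node()
	if node == nil {
		return 0, framework.NewStatus(framework.Error, "node not found")
	}

	var score int64

	// Factor 1: model cache hit (weight 50).
	// A node that already caches the model the Pod needs gets a large bonus.
	modelName := pod.Labels["model.ai/name"]
	if modelName != "" {
		cacheKey := fmt.Sprintf("model-cache.ai/%s", modelName)
		if val, ok := node.Labels[cacheKey]; ok && val == "cached" {
			score += 50
		}
	}

	// Factor 2: GPU availability (weight 30).
	// Prefer nodes that still have unallocated GPUs.
	gpuAllocatable := node.Status.Allocatable[gpuResource]
	gpuRequested := nodeInfo.Requested.ScalarResources[gpuResource]
	if gpuAllocatable.Value()-gpuRequested > 0 {
		score += 30
	}

	// Factor 3: node load (weight 20).
	// Prefer lightly loaded nodes to avoid contention. GPU utilization is read
	// from a node annotation; missing or malformed values contribute nothing.
	if gpuUtilStr, ok := node.Annotations["gpu.ai/utilization"]; ok {
		if util, err := strconv.Atoi(gpuUtilStr); err == nil {
			if util < 0 {
				util = 0
			}
			if util > 100 {
				util = 100
			}
			// Lower utilization yields a higher score.
			score += int64((100 - util) / 5)
		}
	}

	return score, nil
}

func (s *AIScheduler) ScoreExtensions() framework.ScoreExtensions {
	return nil
}

// Filter excludes nodes that cannot satisfy the Pod's GPU requirements.
func (s *AIScheduler) Filter(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeInfo *framework.NodeInfo,
) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}

	// Check that the node exposes GPUs at all.
	gpuAllocatable, ok := node.Status.Allocatable[gpuResource]
	if !ok || gpuAllocatable.IsZero() {
		return framework.NewStatus(framework.Unschedulable, "node has no GPU")
	}

	// Check that GPU memory is sufficient. The values are parsed as quantities:
	// comparing the raw strings would order "8Gi" after "16Gi".
	if memReq, ok := pod.Annotations["gpu.ai/memory-required"]; ok {
		need, err := resource.ParseQuantity(memReq)
		if err != nil {
			return framework.NewStatus(
				framework.UnschedulableAndUnresolvable,
				fmt.Sprintf("invalid gpu.ai/memory-required %q: %v", memReq, err),
			)
		}
		nodeGPUMem := node.Annotations["gpu.ai/memory-total"]
		have, err := resource.ParseQuantity(nodeGPUMem)
		if err != nil || have.Cmp(need) < 0 {
			return framework.NewStatus(
				framework.Unschedulable,
				fmt.Sprintf("GPU memory insufficient: need %s, have %q", memReq, nodeGPUMem),
			)
		}
	}

	return nil
}
```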
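To see how the three weights interact, it can help to evaluate the scoring rule outside the scheduler framework. The sketch below reproduces the same arithmetic (50 for a cache hit, 30 for a free GPU, up to 20 for low utilization) on a plain struct; `NodeView` and `ScoreNode` are hypothetical names introduced only for this illustration.

```go
package main

// NodeView is a framework-free snapshot of the inputs the Score step reads.
type NodeView struct {
	ModelCached    bool  // node label model-cache.ai/<model> == "cached"
	FreeGPUs       int64 // allocatable GPUs minus requested GPUs
	UtilKnown      bool  // whether a utilization annotation is present
	GPUUtilPercent int   // GPU utilization, 0..100
}

// ScoreNode mirrors the plugin's weighting: cache hit 50, free GPU 30,
// and (100 - utilization) / 5 for load, for a maximum of 100.
func ScoreNode(n NodeView) int64 {
	var score int64
	if n.ModelCached {
		score += 50
	}
	if n.FreeGPUs > 0 {
		score += 30
	}
	if n.UtilKnown {
		util := n.GPUUtilPercent
		if util < 0 {
			util = 0
		}
		if util > 100 {
			util = 100
		}
		score += int64((100 - util) / 5)
	}
	return score
}
```

A node that caches the model at 40% utilization scores 50 + 30 + 12 = 92, while an uncached node at 10% utilization scores 30 + 18 = 48. The cache weight therefore dominates load, which matches the stated goal of avoiding cold starts but also means cached nodes fill up first.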

3. Elastic Scaling and Model Warmup Strategy

3.1 Queue-Depth-Based HPA

```go
// inference_hpa.go: custom HPA controller for inference services
// Design intent: scale on inference queue depth and GPU utilization;
// on scale-out, warm up the model before admitting traffic so that
// cold starts do not cause request timeouts
package controller

import (
	"context"
	"fmt"
	"time"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// MetricsClient and KubeClient are not defined in the original listing.
// The minimal interfaces below are assumptions inferred from how they are used.
type MetricsClient interface {
	GetExternalMetric(ctx context.Context, name, namespace string, selector labels.Selector) (float64, error)
}

type KubeClient interface {
	GetPod(ctx context.Context, namespace, name string) (*v1.Pod, error)
}

type InferenceHPAController struct {
	metricsClient MetricsClient
	kubeClient    KubeClient
}

type ScaleDecision struct {
	TargetReplicas int32
	Reason         string
}

func (c *InferenceHPAController) CalculateScale(
	ctx context.Context,
	hpa *autoscalingv2.HorizontalPodAutoscaler,
) (*ScaleDecision, error) {
	// Current replica count.
	currentReplicas := hpa.Status.CurrentReplicas

	// Collect the inference queue depth metric.
	queueDepth, err := c.getQueueDepthMetric(ctx, hpa)
	if err != nil {
		return nil, fmt.Errorf("failed to get queue depth: %w", err)
	}

	// Collect the GPU utilization metric.
	gpuUtilization, err := c.getGPUUtilizationMetric(ctx, hpa)
	if err != nil {
		return nil, fmt.Errorf("failed to get GPU utilization: %w", err)
	}

	// Scaling decision.
	targetReplicas := currentReplicas
	var reason string

	queueThreshold := int32(5)   // queue depth threshold
	gpuScaleUpThreshold := 0.8   // GPU utilization threshold for scale-out
	gpuScaleDownThreshold := 0.3 // GPU utilization threshold for scale-in

	// MinReplicas is optional in the HPA spec; it defaults to 1 when unset.
	minReplicas := int32(1)
	if hpa.Spec.MinReplicas != nil {
		minReplicas = *hpa.Spec.MinReplicas
	}

	if queueDepth > queueThreshold || gpuUtilization > gpuScaleUpThreshold {
		// Scale out: queue depth or GPU utilization is above its threshold.
		increment := queueDepth / queueThreshold
		if increment < 1 {
			increment = 1
		}
		targetReplicas = currentReplicas + increment
		reason = fmt.Sprintf(
			"queue depth %d (threshold %d), GPU utilization %.1f%% (threshold %.1f%%)",
			queueDepth, queueThreshold, gpuUtilization*100, gpuScaleUpThreshold*100,
		)
	} else if queueDepth == 0 && gpuUtilization < gpuScaleDownThreshold {
		// Scale in: the queue is empty and GPU utilization is low.
		if currentReplicas > minReplicas {
			targetReplicas = currentReplicas - 1
			reason = fmt.Sprintf(
				"queue empty, GPU utilization %.1f%% below threshold %.1f%%",
				gpuUtilization*100, gpuScaleDownThreshold*100,
			)
		}
	}

	// Cap at the maximum replica count.
	if targetReplicas > hpa.Spec.MaxReplicas {
		targetReplicas = hpa.Spec.MaxReplicas
	}

	return &ScaleDecision{
		TargetReplicas: targetReplicas,
		Reason:         reason,
	}, nil
}

// targetSelector assumes the target's Pods carry the label app=<target name>;
// replace it with the workload's real selector.
func targetSelector(hpa *autoscalingv2.HorizontalPodAutoscaler) labels.Selector {
	return labels.SelectorFromSet(labels.Set{"app": hpa.Spec.ScaleTargetRef.Name})
}

func (c *InferenceHPAController) getQueueDepthMetric(
	ctx context.Context,
	hpa *autoscalingv2.HorizontalPodAutoscaler,
) (int32, error) {
	// Query the inference queue depth from Prometheus, in the HPA's own namespace.
	metricValue, err := c.metricsClient.GetExternalMetric(
		ctx, "inference_queue_depth", hpa.Namespace, targetSelector(hpa),
	)
	if err != nil {
		return 0, err
	}
	return int32(metricValue), nil
}

func (c *InferenceHPAController) getGPUUtilizationMetric(
	ctx context.Context,
	hpa *autoscalingv2.HorizontalPodAutoscaler,
) (float64, error) {
	metricValue, err := c.metricsClient.GetExternalMetric(
		ctx, "gpu_utilization", hpa.Namespace, targetSelector(hpa),
	)
	if err != nil {
		return 0, err
	}
	// The metric is reported as a percentage; convert it to a 0..1 fraction.
	return metricValue / 100.0, nil
}

// WarmupAndWait blocks during scale-out until the new Pod has finished loading its model.
func (c *InferenceHPAController) WarmupAndWait(
	ctx context.Context,
	pod *v1.Pod,
	timeout time.Duration,
) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// Re-fetch the Pod on every iteration; the object passed in is a
		// snapshot whose status never changes.
		latest, err := c.kubeClient.GetPod(ctx, pod.Namespace, pod.Name)
		if err == nil && c.isPodReady(ctx, latest) {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
			continue
		}
	}
	return fmt.Errorf("model warmup timed out for Pod %s", pod.Name)
}

// isPodReady reports whether the Pod's Ready condition is true,
// that is, whether its readiness probe is passing.
func (c *InferenceHPAController) isPodReady(ctx context.Context, pod *v1.Pod) bool {
	for _, condition := range pod.Status.Conditions {
		if condition.Type == v1.PodReady && condition.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}
```

4. Boundary Analysis and Architectural Trade-offs

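Isolation under GPU sharing: In time-slicing mode, several inference services take turns executing on the same GPU. Each service's latency is affected by the others, so a stable SLA cannot be guaranteed. MPS mode allows concurrent execution but lacks strong GPU memory isolation: a memory leak in one service can affect every service on the same GPU. Latency-sensitive core services should use exclusive mode; latency-tolerant batch services can use a shared mode.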

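Storage cost of model caching: Caching models on node-local disk speeds up cold starts, but each node's cache capacity is limited. The weights of a 70B model are about 140 GB, so caching that one model on every node of a cluster with 10 GPU nodes takes 1.4 TB in total. Cache selectively according to how often each model is used, and load low-frequency models on demand.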
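Selective caching by usage frequency can be approximated with a simple greedy pass: sort models by request frequency and admit each one that still fits in the node's cache budget. This is a sketch under stated assumptions, not an optimal packing (it is not a knapsack solver), and the `Model` fields and the sample sizes are illustrative.

```go
package main

import "sort"

// Model describes one cacheable model. RequestsPerDay is an assumed usage signal.
type Model struct {
	Name           string
	SizeGB         float64
	RequestsPerDay float64
}

// SelectModelsToCache returns the names of the models to keep on local disk,
// most frequently used first, without exceeding capacityGB. Models that do
// not fit are skipped and would be loaded on demand.
func SelectModelsToCache(models []Model, capacityGB float64) []string {
	sorted := append([]Model(nil), models...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].RequestsPerDay > sorted[j].RequestsPerDay
	})

	var picked []string
	used := 0.0
	for _, m := range sorted {
		if used+m.SizeGB <= capacityGB {
			picked = append(picked, m.Name)
			used += m.SizeGB
		}
	}
	return picked
}
```

Given a 160 GB budget and three models (14 GB at 2000 requests/day, 140 GB at 500, 26 GB at 50), the first two are cached and the third is left for on-demand loading because it would push usage to 180 GB.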

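Traffic loss during scale-out warmup: A new Pod must warm up its model during scale-out, and in that interval the added traffic is still carried by the existing Pods. If traffic grows too fast, queueing timeouts during warmup are unavoidable. The remedy is predictive scaling: scale out ahead of time based on historical traffic patterns, rather than reacting only after queue depth crosses its threshold.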
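One way to express predictive scaling is to size the fleet for the peak demand expected within the warmup lead time, so that new replicas start loading before the traffic arrives. The sketch below assumes a per-step request-rate forecast already exists (how it is produced is outside this example) and that every replica sustains the same throughput; all names are hypothetical.

```go
package main

import "math"

// ReplicasToStart returns how many replicas should be running or loading at
// step `now` so that enough capacity is ready when forecast demand arrives.
//   forecastRPS:   predicted requests per second for each time step
//   leadSteps:     model warmup time, expressed in forecast steps
//   perReplicaRPS: throughput one ready replica can sustain (assumed uniform)
func ReplicasToStart(forecastRPS []float64, now, leadSteps int, perReplicaRPS float64) int {
	if perReplicaRPS <= 0 || now < 0 || now >= len(forecastRPS) {
		return 0
	}
	if leadSteps < 0 {
		leadSteps = 0
	}
	end := now + leadSteps
	if end >= len(forecastRPS) {
		end = len(forecastRPS) - 1
	}

	// Size for the peak between now and the end of the warmup window, so that
	// current demand is covered as well as the demand arriving after warmup.
	peak := forecastRPS[now]
	for _, rps := range forecastRPS[now : end+1] {
		if rps > peak {
			peak = rps
		}
	}
	return int(math.Ceil(peak / perReplicaRPS))
}
```

With a forecast of 10, 10, 30, 50, 50 requests per second, a two-step warmup, and 10 requests per second per replica, the function asks for 3 replicas at step 0 even though current demand needs only 1, which is the head start reactive scaling lacks.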

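Wasted model cache on scale-in: After scale-in terminates a Pod, the model cache on its node may be cleaned up. If the service scales out again shortly afterwards, the model has to be loaded all over again. Delaying cache cleanup and adding a scale-in cooldown period both help, at the cost of holding resources for longer.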

5. Summary

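Scheduling AI inference services on K8s means balancing resource utilization against service stability. The core strategies are: cache-aware scheduling to cut cold-start time; three GPU modes (exclusive, time-slicing, MPS) matched to services of different priority; and queue-depth-based elastic scaling combined with model warmup to avoid losing traffic. Recommendations for adoption: run core inference services in GPU exclusive mode to protect their SLA, and run batch services in a shared mode to raise utilization; cache models on node-local disk to speed up cold starts, while managing cache capacity; implement predictive scaling to reduce the lag of reactive scale-out; and set a cooldown period on scale-in so that frequent scaling does not cause models to be reloaded repeatedly.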
