Using Spot Instances

This section shows how Karpenter provisions Spot instances and how it handles Spot interruption events.

First, delete the resources created earlier:

kubectl delete deployment inflate
kubectl delete provisioners.karpenter.sh default
kubectl delete awsnodetemplates.karpenter.k8s.aws default

Deploy a Provisioner whose karpenter.sh/capacity-type requirement is set to spot:

mkdir -p ~/environment/karpenter
cd ~/environment/karpenter
cat <<EoF > spot.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default

  labels:
    eks-immersion-team: my-team

  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["c", "m", "r"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot"]
  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi
    
  consolidation:
    enabled: true

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    alpha.eksctl.io/cluster-name: ${CLUSTER_NAME}
  securityGroupSelector:
    aws:eks:cluster-name: ${CLUSTER_NAME}
  tags:
    managed-by: "karpenter"
    intent: "apps"
EoF

kubectl apply -f spot.yaml
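
To confirm that the Provisioner was stored with the intended capacity type, the requirement can be read back from the cluster. This is a minimal sketch: the helper name `provisioner_capacity_types` is hypothetical, and it assumes the `karpenter.sh/v1alpha5` Provisioner schema used in spot.yaml.

```shell
# Hypothetical helper: print the capacity types allowed by a Provisioner.
# Assumes the v1alpha5 schema, where requirements is a list of
# {key, operator, values} entries.
provisioner_capacity_types() {
  kubectl get provisioners.karpenter.sh "${1:-default}" \
    -o jsonpath='{.spec.requirements[?(@.key=="karpenter.sh/capacity-type")].values[*]}'
}
```

Running `provisioner_capacity_types default` against the Provisioner above should list only the capacity types given in its requirements.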

Deploy an application with two replicas:

cd ~/environment/karpenter
cat <<EOF > spot-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
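    spec:
      # Use a single nodeSelector: a YAML mapping must not repeat the same key.
      # eks-immersion-team is the node label set by the Provisioner above;
      # "intent: apps" is only an EC2 tag in the AWSNodeTemplate, not a node label.
      nodeSelector:
        eks-immersion-team: my-team
        karpenter.sh/capacity-type: spot
      containers:
      - image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        name: inflate
        resources:
          requests:
            cpu: "1"
            memory: 256M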
EOF
kubectl apply -f spot-deploy.yaml

Karpenter launches a Spot instance to run the pods:

image-20231029080421356
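
The same check can be made from the command line by printing the node labels that Karpenter sets. A small sketch, with a hypothetical helper name; it relies on the `eks-immersion-team: my-team` label from the Provisioner and on the well-known Kubernetes instance-type and zone labels.

```shell
# Hypothetical helper: list the nodes created by the Provisioner above,
# showing capacity type, instance type, and zone as extra columns.
show_spot_nodes() {
  kubectl get nodes -l eks-immersion-team=my-team \
    -L karpenter.sh/capacity-type \
    -L node.kubernetes.io/instance-type \
    -L topology.kubernetes.io/zone
}
```

The capacity-type column should read `spot` for nodes launched by this Provisioner.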

Handling Spot interruption notices

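Karpenter natively supports Spot interruption events. It watches for upcoming interruption notices and, when it detects one, automatically drains the affected node and terminates it.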

ec2-spot-interrupter can generate Spot interruption events, so we will use it for testing.

Install ec2-spot-interrupter:

wget https://github.com/aws/amazon-ec2-spot-interrupter/releases/download/v0.0.10/ec2-spot-interrupter_0.0.10_Linux_amd64.tar.gz
tar -xzvf ec2-spot-interrupter_0.0.10_Linux_amd64.tar.gz 

First, get the instance ID of the Spot instance created above:

export NODE_NAME=$(kubectl get nodes -l "eks-immersion-team" -o name | cut -d/ -f2) 
echo $NODE_NAME
export NODE_ID=$(aws ec2 describe-instances --query "Reservations[].Instances[?PrivateDnsName == '${NODE_NAME}'].InstanceId" --output text)
echo $NODE_ID
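
As an alternative to querying EC2, the instance ID can also be derived from the node object itself, because on AWS the node's `spec.providerID` ends with the instance ID. A minimal sketch; the helper name `node_instance_id` is hypothetical, and it expects a single node name as its argument.

```shell
# Hypothetical helper: derive the EC2 instance ID from a node's providerID,
# which on AWS has the form aws:///<availability-zone>/<instance-id>.
node_instance_id() {
  provider_id="$(kubectl get node "$1" -o jsonpath='{.spec.providerID}')"
  # Strip everything up to and including the last "/".
  echo "${provider_id##*/}"
}
```

For example, `node_instance_id "$NODE_NAME"` can be compared with `$NODE_ID` as a sanity check. If the label selector matched more than one node, `$NODE_NAME` holds several names and each has to be passed separately.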

Run ec2-spot-interrupter to send an interruption notice. A Spot instance is reclaimed two minutes after it receives the notice:

./ec2-spot-interrupter --instance-ids $NODE_ID

image-20231029081759982

How Karpenter handles the interruption event

After receiving the interruption notice, Karpenter first drains the Spot node:

image-20231029081706130

At the same time, it launches a new Spot instance and the pods are scheduled onto it:

image-20231029081740383
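
The sequence can also be followed in the Karpenter controller logs. The sketch below is an assumption-laden convenience, not part of the original walkthrough: it assumes the Helm chart's default pod label `app.kubernetes.io/name=karpenter` and a container named `controller`, and the grep pattern is only a guess at useful keywords, since the exact log wording differs between Karpenter versions.

```shell
# Hypothetical helper: show recent controller log lines related to
# interruption handling. The first argument is the look-back window.
karpenter_interruption_logs() {
  # --tail=-1 is needed because kubectl limits output to 10 lines per pod
  # when a label selector is used.
  kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -c controller \
    --since="${1:-10m}" --tail=-1 | grep -i -E 'interrupt|cordon|drain'
}
```

Calling `karpenter_interruption_logs 5m` shortly after sending the notice narrows the output to the relevant window.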

How it works

How does Karpenter receive Spot interruption events? When we first set up Karpenter, several EventBridge rules and an SQS queue were created along with it:

image-20231029083432490

Looking at the last rule, we can see that it matches Spot interruption notices:

image-20231029084044613

Matching events are forwarded to the SQS queue:

image-20231029084130276
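
The rule-to-queue wiring shown in the console can also be inspected with the AWS CLI. A minimal sketch: the helper name `rules_for_queue` is hypothetical, and it assumes the queue is named after the cluster, as in the install command used in this workshop.

```shell
# Hypothetical helper: list the EventBridge rules that target an SQS queue.
rules_for_queue() {
  # Resolve the queue URL from its name, then the ARN from the URL.
  queue_url="$(aws sqs get-queue-url --queue-name "$1" \
    --query QueueUrl --output text)"
  queue_arn="$(aws sqs get-queue-attributes --queue-url "$queue_url" \
    --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)"
  # Ask EventBridge which rules deliver to that ARN.
  aws events list-rule-names-by-target --target-arn "$queue_arn" \
    --query 'RuleNames' --output text
}
```

Calling `rules_for_queue "$CLUSTER_NAME"` should print the names of the rules seen in the console; each one can then be examined with `aws events describe-rule --name <rule-name>` to view its event pattern.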

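Karpenter listens on this SQS queue and reads and processes each new message as it arrives. When we set up Karpenter earlier, its IAM role was already granted access to SQS: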

image-20231029101418364

The command we used to install Karpenter was:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} --namespace karpenter --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set settings.aws.clusterName=${CLUSTER_NAME} \
  --set settings.aws.clusterEndpoint=${CLUSTER_ENDPOINT} \
  --set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --set settings.aws.interruptionQueueName=${CLUSTER_NAME} \
  --wait

Setting settings.aws.interruptionQueueName (the last --set option above) is what turns on Karpenter's Spot interruption handling.
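
To check which queue a running installation is configured with, the setting can be read back from the cluster. This sketch assumes a v0.31-era chart, where the `settings.aws.*` Helm values are stored in the `karpenter-global-settings` ConfigMap in the `karpenter` namespace; other versions may keep the setting elsewhere. The helper name is hypothetical.

```shell
# Hypothetical helper: print the configured interruption queue name.
# Assumes the settings live in the karpenter-global-settings ConfigMap
# under the key "aws.interruptionQueueName" (dots escaped for jsonpath).
interruption_queue_name() {
  kubectl get configmap karpenter-global-settings -n karpenter \
    -o jsonpath='{.data.aws\.interruptionQueueName}'
}
```

An empty result would suggest that interruption handling is not enabled for that installation.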


Interruption handling covers several event types besides Spot reclamation. Their EventBridge rules are created together with the Spot rule:

  1. EC2 instance state changes (Terminating / Stopping)

image-20231029100521435

  2. Scheduled events (for example, planned maintenance of the underlying EC2 host)

image-20231029100555838

  3. EC2 Rebalance Recommendation (as of Karpenter v0.31, this event is received but not acted on)

image-20231029100539308

When Karpenter detects these events, it automatically drains and replaces the affected node.

The handling logic can be found in the source code: https://github.com/aws/karpenter/blob/main/pkg/controllers/interruption/controller.go

image-20231030092856985

Apart from EC2 Rebalance Recommendation, the other three event types all result in a CordonAndDrain action.
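
The decision described above can be summarized as a small lookup from event type to action. This is an illustrative shell sketch only, not Karpenter's code (the real logic is in the Go controller linked above): the function name is hypothetical, the keys are the EventBridge `detail-type` strings for each event, and `NoAction` is simply the label used here for "received but not acted on".

```shell
# Illustrative mapping from EventBridge detail-type to the action taken.
# Simplification: the real controller also inspects the event body, for
# example acting on state changes only for stopping/terminating states.
action_for_event() {
  case "$1" in
    "EC2 Spot Instance Interruption Warning") echo "CordonAndDrain" ;;
    "EC2 Instance State-change Notification") echo "CordonAndDrain" ;;
    "AWS Health Event")                       echo "CordonAndDrain" ;;  # scheduled events
    "EC2 Instance Rebalance Recommendation")  echo "NoAction" ;;
    *)                                        echo "NoAction" ;;
  esac
}
```

For example, `action_for_event "EC2 Instance Rebalance Recommendation"` prints `NoAction`, matching the v0.31 behavior noted in the list above.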