Drift

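If the AMI used by a node does not match the AMI ID set on its AWSNodeTemplate, Karpenter automatically marks the node as drifted by adding the annotation karpenter.sh/voluntary-disruption: "drifted". Once a node is marked as drifted, Karpenter automatically evicts its pods and terminates the node, unless a PodDisruptionBudget (PDB) blocks the eviction or a pod carries the annotation karpenter.sh/do-not-evict: "true".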

In this section we enable Drift and test it by updating the AMI ID.


First, delete the resources created earlier:

kubectl delete deployment inflate
kubectl delete provisioners.karpenter.sh default
kubectl delete awsnodetemplates.karpenter.k8s.aws default

Enabling Drift detection

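Drift detection is disabled by default in Karpenter. You can confirm this in the karpenter-global-settings ConfigMap, where featureGates.driftEnabled is set to "false".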
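
To read the value without opening the ConfigMap, a small helper along these lines could be used. This is a minimal sketch: the function name drift_setting is made up for illustration, and it assumes the feature gate is stored under the key featureGates.driftEnabled in the karpenter-global-settings ConfigMap in the karpenter namespace.

```shell
# Hypothetical helper: print the current value of the drift feature gate.
# Assumes the key featureGates.driftEnabled exists in the ConfigMap's data.
drift_setting() {
  # The dot inside the key name has to be escaped in a JSONPath expression.
  kubectl get configmap karpenter-global-settings -n karpenter \
    -o jsonpath='{.data.featureGates\.driftEnabled}'
}
```

Calling drift_setting should print false as long as Drift has not been enabled.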

Edit the ConfigMap and set featureGates.driftEnabled to "true":

kubectl edit configmap -n karpenter karpenter-global-settings
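
If an interactive editor is inconvenient, the same change can probably be made with kubectl patch. The sketch below assumes the key is featureGates.driftEnabled; the helper name set_drift_enabled is hypothetical and not part of Karpenter.

```shell
# Hypothetical helper: set featureGates.driftEnabled to "true" or "false"
# in the karpenter-global-settings ConfigMap using a JSON merge patch.
set_drift_enabled() {
  case "${1:-}" in
    true|false) ;;
    *) echo "usage: set_drift_enabled true|false" >&2; return 1 ;;
  esac
  # ConfigMap values are strings, so the boolean is quoted inside the patch.
  kubectl patch configmap karpenter-global-settings -n karpenter \
    --type merge \
    -p "{\"data\":{\"featureGates.driftEnabled\":\"$1\"}}"
}
```

Running set_drift_enabled true turns the feature gate on, and set_drift_enabled false turns it off again. Either way, the Karpenter pods still have to be restarted afterwards.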

After the change, restart the Karpenter pods for it to take effect:

kubectl rollout restart deploy karpenter -n karpenter
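
Since the setting only applies once the new pods are running, it can help to wait for the rollout to finish before continuing. A minimal sketch follows; the helper name restart_karpenter and the 120s default timeout are arbitrary choices for illustration.

```shell
# Hypothetical helper: restart Karpenter and block until the rollout completes.
# The optional first argument overrides the timeout (default: 120s).
restart_karpenter() {
  kubectl rollout restart deploy karpenter -n karpenter &&
    kubectl rollout status deploy karpenter -n karpenter --timeout="${1:-120s}"
}
```

restart_karpenter returns a non-zero status if the rollout does not complete within the timeout.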

Testing Drift

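The EKS cluster currently runs version 1.25. The test proceeds as follows:

  1. Configure an AWSNodeTemplate with the 1.24 AMI and reference it from the Provisioner.
  2. Deploy a pod so that Karpenter launches a node, which uses the 1.24 AMI.
  3. Create a new AWSNodeTemplate with the 1.25 AMI and update the Provisioner to reference it.
  4. Karpenter detects that the existing node's AMI no longer matches the current Node Template and replaces the node.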

Creating a Node Template with the old AMI

First, store the ID of the 1.24 AMI in an environment variable:

export AMI_OLD=$(aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.24/amazon-linux-2/recommended/image_id --region $AWS_REGION --query "Parameter.Value" --output text)
echo 1.24=$AMI_OLD

Create a Node Template based on this AMI and name it oldnode:

cd ~/environment/karpenter
cat << EOF > oldnode_template.yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: oldnode
spec:
  amiSelector:
    aws::ids: $AMI_OLD
  subnetSelector:
    alpha.eksctl.io/cluster-name: ${CLUSTER_NAME}
  securityGroupSelector:
    kubernetes.io/cluster/${CLUSTER_NAME}: owned
  tags:
    managed-by: "karpenter"
    intent: "apps"
EOF

kubectl create -f oldnode_template.yaml

Create a Provisioner based on this Node Template:

mkdir -p ~/environment/karpenter
cd ~/environment/karpenter
cat <<EoF> provisioner.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:  # references the oldnode template
    name: oldnode

  labels:
    eks-immersion-team: my-team

  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["c", "m", "r"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi
    
  consolidation:
    enabled: true

EoF

kubectl apply -f provisioner.yaml

Deploy the application. Karpenter launches a node based on the 1.24 AMI:

cd ~/environment/karpenter
cat <<EoF> drift-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
      nodeSelector:
        eks-immersion-team: my-team
EoF

kubectl apply -f drift-deploy.yaml

Run the following command to confirm that the launched node has the expected version:

kubectl get nodes -l eks-immersion-team=my-team 

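
Beyond the Kubernetes version, the AMI itself can be checked against $AMI_OLD by looking up each node's EC2 instance. The sketch below makes a few assumptions: the node's spec.providerID ends in the instance ID (for example aws:///us-west-2a/i-0123456789abcdef0), AWS_REGION is set as in the earlier commands, and the helper names are invented for illustration.

```shell
# Hypothetical helper: extract the EC2 instance ID from a providerID
# such as aws:///us-west-2a/i-0123456789abcdef0 (keeps the last path segment).
instance_id_from_provider_id() {
  printf '%s\n' "${1##*/}"
}

# Hypothetical helper: print "<instance-id> <ami-id>" for every node
# carrying the eks-immersion-team=my-team label.
node_amis() {
  for provider_id in $(kubectl get nodes -l eks-immersion-team=my-team \
      -o jsonpath='{.items[*].spec.providerID}'); do
    aws ec2 describe-instances \
      --instance-ids "$(instance_id_from_provider_id "$provider_id")" \
      --region "$AWS_REGION" \
      --query 'Reservations[].Instances[].[InstanceId,ImageId]' \
      --output text
  done
}
```

At this point in the walkthrough, the AMI ID printed by node_amis is expected to equal the value of $AMI_OLD.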

Deploying the new Node Template

First, retrieve the ID of the new AMI:

export AMI_NEW=$(aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.25/amazon-linux-2/recommended/image_id --region $AWS_REGION --query "Parameter.Value" --output text)
echo 1.25=$AMI_NEW

Create a new Node Template based on this version:

cd ~/environment/karpenter
cat << EOF > newnode_template.yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: newnode
spec:
  amiSelector:
    aws::ids: $AMI_NEW
  subnetSelector:
    alpha.eksctl.io/cluster-name: ${CLUSTER_NAME}
  securityGroupSelector:
    kubernetes.io/cluster/${CLUSTER_NAME}: owned
  tags:
    managed-by: "karpenter"
    intent: "apps"
EOF
kubectl create -f newnode_template.yaml

Update the Node Template reference in the Provisioner from oldnode to newnode:

kubectl edit provisioner default

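
The same reference change can likely be applied without an editor by patching spec.providerRef.name. This is a sketch under that assumption; the helper name switch_node_template is hypothetical, and a merge patch is used because strategic merge patches are not available for custom resources.

```shell
# Hypothetical helper: point a Provisioner at a different AWSNodeTemplate.
# $1 = Node Template name, $2 = Provisioner name (defaults to "default").
switch_node_template() {
  if [ -z "${1:-}" ]; then
    echo "usage: switch_node_template <template> [provisioner]" >&2
    return 1
  fi
  kubectl patch provisioner "${2:-default}" --type merge \
    -p "{\"spec\":{\"providerRef\":{\"name\":\"$1\"}}}"
}
```

For this walkthrough the call would be switch_node_template newnode.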

Run the following command to check the node status:

kubectl get nodes -l eks-immersion-team=my-team 

The original node is first marked Ready,SchedulingDisabled. Karpenter then launches a new node running version 1.25 and finally removes the old node. This shows that Karpenter's Drift detection is working.
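
While the old node still exists, the drift annotation described at the start of this section can be inspected directly. The following is a minimal sketch that assumes the annotation is set on the Node object; depending on the Karpenter version it may be recorded elsewhere, in which case the second column stays empty. The helper name drift_status is made up for illustration.

```shell
# Hypothetical helper: list each labelled node together with the value of its
# karpenter.sh/voluntary-disruption annotation (empty if the annotation is absent).
drift_status() {
  kubectl get nodes -l eks-immersion-team=my-team \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.karpenter\.sh/voluntary-disruption}{"\n"}{end}'
}
```

A node that Karpenter considers drifted should show drifted next to its name.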

Cleanup

When you are done, turn Drift detection off again:

kubectl edit configmap -n karpenter karpenter-global-settings  # set featureGates.driftEnabled: "false"

Restart the Karpenter pods:

kubectl rollout restart deploy karpenter -n karpenter

Delete both Node Templates and the Provisioner:

kubectl delete awsnodetemplate oldnode
kubectl delete awsnodetemplate newnode
kubectl delete provisioner default

Delete the application:

kubectl delete deployment inflate