This section introduces how Karpenter creates Spot instances and how it handles Spot interruption events.
First, delete the resources created earlier:
kubectl delete deployment inflate
kubectl delete nodepool.karpenter.sh default
kubectl delete ec2nodeclass.karpenter.k8s.aws default
Deploy a NodePool whose karpenter.sh/capacity-type is set to spot:
mkdir -p ~/environment/karpenter
cd ~/environment/karpenter
cat <<EoF > spot.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  limits:
    cpu: 1000
    memory: 1000Gi
  template:
    metadata:
      labels:
        eks-immersion-team: my-team
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - c
            - m
            - r
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  subnetSelectorTerms:
    - tags:
        alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  tags:
    intent: apps
    managed-by: karpenter
    eks-immersion-team: my-team
EoF
kubectl apply -f spot.yaml
Deploy an application with two replicas:
cd ~/environment/karpenter
cat <<EOF > spot-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot
        eks-immersion-team: my-team
      containers:
        - image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          name: inflate
          resources:
            requests:
              cpu: "1"
              memory: 256M
EOF
kubectl apply -f spot-deploy.yaml
Karpenter will launch a spot instance to run the pods:
Karpenter natively supports Spot interruption events: it watches for incoming reclaim notifications and, once one is detected, automatically drains and terminates the affected node.
ec2-spot-interrupter can be used to generate spot reclaim events; we will use it for testing. Install ec2-spot-interrupter:
wget https://github.com/aws/amazon-ec2-spot-interrupter/releases/download/v0.0.10/ec2-spot-interrupter_0.0.10_Linux_amd64.tar.gz
tar -xzvf ec2-spot-interrupter_0.0.10_Linux_amd64.tar.gz
First, get the instance ID of the spot instance created above:
export NODE_NAME=$(kubectl get nodes -l "eks-immersion-team" -o name | cut -d/ -f2)
echo $NODE_NAME
export NODE_ID=$(aws ec2 describe-instances --query "Reservations[].Instances[?PrivateDnsName == '${NODE_NAME}'].InstanceId" --output text)
echo $NODE_ID
Run ec2-spot-interrupter; it sends a reclaim notification. A spot instance has 2 minutes between receiving the reclaim notification and actually being reclaimed:
./ec2-spot-interrupter --instance-ids $NODE_ID
After receiving the reclaim notification, Karpenter first cordons and drains the spot node:
At the same time, it launches a new spot instance and schedules the pods onto it:
The corresponding event, "initiating delete from interruption message", can also be found in the Karpenter logs:
kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter
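To pick out just that message, the logs can be piped through grep. The real pipeline requires a cluster, so in the sketch below a fabricated sample log line (field names are illustrative, not exact Karpenter output) stands in for the kubectl output:

```shell
# Real pipeline (requires a cluster):
#   kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter | grep interruption
# Fabricated sample log line standing in for the kubectl output:
SAMPLE='{"level":"INFO","message":"initiating delete from interruption message","Node":{"name":"ip-192-168-1-10.ec2.internal"}}'
echo "$SAMPLE" | grep -o "initiating delete from interruption message"
```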
How does Karpenter receive Spot reclaim events? When we first set up Karpenter, several EventBridge rules and an SQS queue were created:
Looking at the last rule, the event it matches is the Spot reclaim notification:
When the event is received, it is delivered to the SQS queue:
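For reference, the Spot reclaim notification delivered this way is a JSON event with detail-type "EC2 Spot Instance Interruption Warning". The sketch below uses a made-up instance-id and extracts the detail-type with sed (jq would work equally well):

```shell
# Hypothetical sample event (shape follows the EventBridge rule; instance-id is made up)
EVENT='{"source":"aws.ec2","detail-type":"EC2 Spot Instance Interruption Warning","detail":{"instance-id":"i-0123456789abcdef0","instance-action":"terminate"}}'

# Pull out the detail-type field
echo "$EVENT" | sed -n 's/.*"detail-type":"\([^"]*\)".*/\1/p'
```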
Karpenter listens on this SQS queue, reading and processing new messages as they arrive. When Karpenter was created in Chapter 1, its role was already granted permission to access SQS:
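For the delivery to work, the interruption queue itself also needs a resource policy that allows EventBridge to send messages into it. A typical policy statement looks roughly like this (region, account ID, and queue name below are placeholders, not values from this cluster):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": ["events.amazonaws.com", "sqs.amazonaws.com"]
      },
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:<region>:<account-id>:<cluster-name>"
    }
  ]
}
```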
The command we used to install Karpenter was:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION_STR} --namespace karpenter --create-namespace \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
--set settings.clusterName=${CLUSTER_NAME} \
--set settings.clusterEndpoint=${CLUSTER_ENDPOINT} \
--set settings.interruptionQueue=${CLUSTER_NAME} \
--set settings.featureGates.drift=true \
--set settings.featureGates.SpotToSpotConsolidation=true \
--wait
As long as interruptionQueue is specified, Karpenter's Spot interruption handling is enabled.
Interruption handling covers several event types besides spot reclamation. EventBridge rules for them are created at the same time:
EC2 Rebalance Recommendation (as of Karpenter v0.39, this event is only monitored, not acted on). When Karpenter detects these events, it automatically drains and replaces the node.
The handling logic can be found in the code: https://github.com/aws/karpenter/blob/main/pkg/controllers/interruption/controller.go
Apart from EC2 Rebalance Recommendation, the other three event types all trigger CordonAndDrain.
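The mapping described above can be sketched as a simple lookup. This is illustrative shell only, not Karpenter's actual Go code; the detail-type strings are the ones the corresponding EventBridge rules match:

```shell
# Action taken per EventBridge detail-type, following the behavior described above
action_for() {
  case "$1" in
    "EC2 Spot Instance Interruption Warning")  echo CordonAndDrain ;;
    "EC2 Instance State-change Notification")  echo CordonAndDrain ;;
    "AWS Health Event")                        echo CordonAndDrain ;;
    "EC2 Instance Rebalance Recommendation")   echo NoAction ;;  # only monitored
    *)                                         echo NoAction ;;
  esac
}

action_for "EC2 Spot Instance Interruption Warning"   # prints CordonAndDrain
```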