This section shows how to observe Karpenter metrics with Prometheus and Grafana.
First, delete the resources created earlier:
kubectl delete deployment inflate
kubectl delete provisioners.karpenter.sh default
kubectl delete awsnodetemplates.karpenter.k8s.aws default
Deploy the Provisioner and AWSNodeTemplate:
mkdir -p ~/environment/karpenter
cd ~/environment/karpenter
cat << EOF > observability_karpenter_provisioner_node_template.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  labels:
    eks-immersion-team: default
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["c5.large", "m5.large", "m5.xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "50"
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    alpha.eksctl.io/cluster-name: ${CLUSTER_NAME}
  securityGroupSelector:
    aws:eks:cluster-name: ${CLUSTER_NAME}
  tags:
    managed-by: "karpenter"
    intent: "apps"
EOF
kubectl create -f observability_karpenter_provisioner_node_template.yaml
Check the Karpenter logs and confirm there are no errors:
kubectl logs deployment/karpenter -c controller -n karpenter
Run the following command to check that Karpenter's metrics can be retrieved:
kubectl run -i --tty --rm debug --image=alpine/curl --restart=Never -- wget -O - http://karpenter.karpenter.svc.cluster.local:8000/metrics
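The endpoint answers in the Prometheus text exposition format, which can run to hundreds of lines. As a minimal sketch for skimming it, the helper below lists only the metric names and types. The function name list_karpenter_metrics is hypothetical, and the sketch assumes that Karpenter's own metrics start with the karpenter_ prefix and that each one is announced by a "# TYPE <name> <type>" line.

```shell
# Hypothetical helper: reads Prometheus text exposition format on stdin and
# prints "<metric name> <metric type>" for every metric whose name starts
# with "karpenter_". Carriage returns (added when kubectl runs with --tty)
# are stripped first so they do not end up in the type column.
list_karpenter_metrics() {
  tr -d '\r' |
    awk '$1 == "#" && $2 == "TYPE" && $3 ~ /^karpenter_/ { print $3, $4 }' |
    sort -u
}
```

Appending "| list_karpenter_metrics" to the kubectl run command above should reduce the output to one line per Karpenter metric; lines that are not TYPE declarations, such as kubectl's own status messages, are ignored.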
Deploy the inflate application with replicas set to 0; it will be scaled up later:
cd ~/environment/karpenter
cat << EOF > appDeploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              memory: 1Gi
              cpu: 1
EOF
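kubectl create -f appDeploy.yaml
Grafana needs a PersistentVolumeClaim (PVC), so the EBS CSI driver must be installed beforehand. For the installation steps, see: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
Run the following script to install Prometheus and Grafana. Grafana will be exposed through an ELB address: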
cd ~/environment/karpenter
cat << EOS > installMonitoring.sh
#!/bin/bash
# https://karpenter.sh/docs/getting-started/getting-started-with-eksctl/
helm repo add grafana-charts https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/prometheus-values.yaml | tee prometheus-values.yaml
helm install --namespace monitoring prometheus prometheus-community/prometheus --values prometheus-values.yaml
curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/grafana-values.yaml | tee grafana-values.yaml
cat << EOF >> grafana-values.yaml
service:
  enabled: true
  type: LoadBalancer
  port: 80
  targetPort: 3000
  annotations: {}
  labels: {}
  portName: service
  appProtocol: ""
EOF
helm install --namespace monitoring grafana grafana-charts/grafana --values grafana-values.yaml
EOS
chmod +x installMonitoring.sh
./installMonitoring.sh >/dev/null 2>&1
Get the login password for the Grafana admin user:
kubectl -n monitoring get secret grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Get the ELB address of Grafana:
kubectl -n monitoring get svc grafana -o json | jq -r '.status.loadBalancer.ingress[0].hostname'
Provisioning the ELB takes 2-3 minutes.
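Because the hostname is not available until the ELB has been provisioned, the lookup may come back empty or as null at first. The following is a minimal polling sketch rather than part of the original workshop: the helper names wait_for_value and grafana_hostname are hypothetical, and the lookup uses kubectl's jsonpath output in place of jq.

```shell
# Hypothetical helper: re-runs a command until it prints a value that is
# neither empty nor "null", then prints that value. Returns 1 if no usable
# value appeared after the given number of attempts.
# Usage: wait_for_value <attempts> <sleep_seconds> <command> [args...]
wait_for_value() {
  local attempts delay value i
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    value=$("$@" 2>/dev/null)
    if [ -n "$value" ] && [ "$value" != "null" ]; then
      printf '%s\n' "$value"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# The hostname lookup from this section, wrapped so it can be retried.
grafana_hostname() {
  kubectl -n monitoring get svc grafana \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}
```

Running "wait_for_value 30 10 grafana_hostname" would retry every 10 seconds for up to 5 minutes, which leaves some margin over the 2-3 minutes mentioned above.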
Open the URL in a browser and sign in to Grafana as admin with the password retrieved above.
Scale the inflate application to 30 replicas:
kubectl scale deployment inflate --replicas=30
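While Karpenter provisions nodes, it can help to see how many inflate pods are still Pending and how many are Running. The sketch below is one way to summarise that; the helper name count_pod_status is hypothetical, and it assumes the default column layout of kubectl get pods, where STATUS is the third column.

```shell
# Hypothetical helper: summarises `kubectl get pods --no-headers` output
# read from stdin as "<status> <count>" lines, sorted by status name.
# It relies on STATUS being the third whitespace-separated column.
count_pod_status() {
  awk 'NF >= 3 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}
```

For example, "kubectl get pods -l app=inflate --no-headers | count_pod_status" prints one line per status; repeating it during the scale-up should show the Pending count falling as new nodes join.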
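In Grafana, go to Dashboards -> Karpenter Performance.
The Pod start latency panel shows the time from pod creation to running. In this run, the 50th percentile was within 13 s, meaning half of the pods were running within 13 s of being created.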
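To make the percentile reading concrete, here is a small sketch that computes a nearest-rank percentile from a list of latencies. It only illustrates what a 50th percentile means; it is not how Prometheus or the Grafana panel computes the value, the helper name percentile is hypothetical, and the sample numbers in the usage note are made up.

```shell
# Hypothetical helper: nearest-rank percentile of the numbers read from
# stdin (one per line). Exits non-zero if no input was given.
# Usage: percentile <p>   (p is an integer from 1 to 100)
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

With eight made-up start latencies in seconds, "printf '%s\n' 13 9 25 11 14 12 18 13 | percentile 50" prints 13: four of the eight pods started within 13 s, even though the slowest one took 25 s.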
Open the Karpenter capacity dashboard.
The Node Summary panel at the bottom shows the CPU and memory utilization of each node.
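These metrics are scraped from http://karpenter.karpenter.svc.cluster.local:8000/metrics, the endpoint queried earlier in this section (some Karpenter releases serve metrics on port 8080 instead).
For a detailed description of each metric, see: https://karpenter.sh/preview/reference/metrics/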