This section shows how to observe Karpenter metrics with Prometheus and Grafana.
First, delete the resources created earlier:
kubectl delete deployment inflate
kubectl delete nodepools.karpenter.sh default
kubectl delete ec2nodeclasses.karpenter.k8s.aws default
Deploy the NodeClass and NodePool:
mkdir -p ~/environment/karpenter
cd ~/environment/karpenter
cat << EoF > observability_karpenter_nodepool_node_class.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        eks-immersion-team: default
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
          operator: In
          values: ["on-demand"]
      kubelet:
        cpuCFSQuota: true
  disruption:
    consolidateAfter: 30s
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  limits:
    cpu: "100"
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  subnetSelectorTerms:
    - tags:
        alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  tags:
    intent: apps
    managed-by: karpenter
    eks-immersion-team: my-team
EoF
kubectl -f observability_karpenter_nodepool_node_class.yaml create
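Because the heredoc delimiter above is unquoted, the shell substitutes ${CLUSTER_NAME} at the moment the file is written. If the variable is unset, the role name and the selector tags end up empty. A minimal sketch of a guard that could be run first; require_env is a hypothetical helper, not part of kubectl or Karpenter:

```shell
# Hypothetical helper: fail if any of the named environment variables is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Example: stop before generating the manifest if CLUSTER_NAME is not set.
# require_env CLUSTER_NAME || echo "export CLUSTER_NAME first"
```

The same check can be applied to KARPENTER_VERSION, which the monitoring install script in this section interpolates into its download URLs.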
Check the Karpenter logs and confirm there are no errors:
kubectl logs deployment/karpenter -c controller -n karpenter
Run the following command to check that the Karpenter metrics can be fetched:
kubectl run -i --tty --rm debug --image=alpine/curl --restart=Never -- wget -O - http://karpenter.karpenter.svc.cluster.local:8000/metrics
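The raw endpoint output is long and also includes Go runtime and controller metrics. As a rough sketch, a small filter can reduce it to the distinct Karpenter metric names; karpenter_metric_names is a hypothetical helper, and it assumes the output is in the standard Prometheus text format:

```shell
# Hypothetical helper: print the distinct karpenter_* metric names found in
# Prometheus text-format input on stdin. Comment lines (# HELP / # TYPE) are skipped.
karpenter_metric_names() {
  grep -v '^#' | grep -o '^karpenter_[A-Za-z0-9_:]*' | sort -u
}

# Example (assumption: same debug pod as above, without --tty so the output can be piped):
# kubectl run -i --rm debug --image=alpine/curl --restart=Never -- \
#   wget -qO - http://karpenter.karpenter.svc.cluster.local:8000/metrics | karpenter_metric_names
```

An empty result suggests the endpoint was reached but exposed no Karpenter series, or that the request itself failed.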
Deploy the inflate application with replicas set to 0; it is scaled out later:
cd ~/environment/karpenter
cat << EOF > appDeploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              memory: 1Gi
              cpu: 1
EOF
kubectl -f appDeploy.yaml create
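Grafana needs a PVC, so the ebs-csi-driver must be installed beforehand. For the installation steps, see: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
Run the following script to install Prometheus and Grafana. Grafana is exposed through an ELB address: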
cd ~/environment/karpenter
export KARPENTER_NAMESPACE=karpenter
cat << EoF > installMonitoring.sh
#!/bin/bash
# https://karpenter.sh/docs/getting-started/getting-started-with-eksctl/
helm repo add grafana-charts https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/prometheus-values.yaml | envsubst | tee prometheus-values.yaml
helm install --namespace monitoring prometheus prometheus-community/prometheus --values prometheus-values.yaml
curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/grafana-values.yaml | tee grafana-values.yaml
cat << EOF_GRAFANA_VALUES >> grafana-values.yaml
service:
  enabled: true
  type: LoadBalancer
  port: 80
  targetPort: 3000
  annotations: {}
  labels: {}
  portName: service
  appProtocol: ""
EOF_GRAFANA_VALUES
helm install --namespace monitoring grafana grafana-charts/grafana --values grafana-values.yaml
EoF
chmod +x installMonitoring.sh
./installMonitoring.sh >/dev/null 2>&1
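The command above discards all output, so a failed helm install would go unnoticed. A minimal sketch of an alternative that keeps the output in a log file and shows its tail on failure; run_logged and the log file name are hypothetical, chosen only for illustration:

```shell
# Hypothetical helper: run a command, capture stdout and stderr in a log file,
# and print the last lines of the log if the command fails.
# Usage: run_logged <log_file> <command...>
run_logged() {
  log_file=$1
  shift
  if "$@" >"$log_file" 2>&1; then
    echo "ok: $*"
  else
    status=$?
    echo "failed (exit $status): $*" >&2
    tail -n 20 "$log_file" >&2
    return "$status"
  fi
}

# Example (instead of redirecting to /dev/null):
# run_logged installMonitoring.log ./installMonitoring.sh
```

Note that the script should be run only once; a second helm install with the same release name fails because the release already exists.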
Get the login password for Grafana's admin user:
kubectl -n monitoring get secret grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Get Grafana's ELB address:
kubectl -n monitoring get svc grafana -o json|jq '.status.loadBalancer.ingress[0].hostname'|tr -d '"'
The ELB takes 2-3 minutes to deploy.
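Until the load balancer is provisioned, the hostname lookup can return nothing. A minimal polling sketch; wait_for_output is a hypothetical helper, and the attempt count and delay in the example are arbitrary:

```shell
# Hypothetical helper: retry a command until it prints non-empty output,
# or give up after a fixed number of attempts.
# Usage: wait_for_output <attempts> <sleep_seconds> <command...>
wait_for_output() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out=$("$@" 2>/dev/null)
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll every 10 seconds for up to 5 minutes.
# wait_for_output 30 10 kubectl -n monitoring get svc grafana \
#   -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```

Even after the hostname is assigned, it may take a little longer before the address resolves and Grafana responds.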
Open the URL in a browser to enter Grafana (the username is admin):
Scale the inflate application to 30 replicas:
kubectl scale deployment inflate --replicas=30
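As a rough sanity check on what this scale-out asks for, the pod requests can be totalled from values already used in this section: 30 replicas, 1 CPU and 1Gi per pod, and a NodePool limit of cpu: "100". This is only a sketch. The NodePool limit is evaluated against provisioned node capacity, so actual consumption of the limit will be higher than the sum of the pod requests.

```shell
replicas=30        # from the kubectl scale command
cpu_per_pod=1      # resources.requests.cpu in appDeploy.yaml
mem_gi_per_pod=1   # resources.requests.memory in appDeploy.yaml (1Gi)
cpu_limit=100      # spec.limits.cpu in the NodePool

total_cpu=$((replicas * cpu_per_pod))
total_mem_gi=$((replicas * mem_gi_per_pod))
echo "requested: ${total_cpu} vCPU, ${total_mem_gi} Gi"

if [ "$total_cpu" -le "$cpu_limit" ]; then
  echo "pod requests fit under the NodePool cpu limit (${cpu_limit})"
else
  echo "pod requests exceed the NodePool cpu limit (${cpu_limit})"
fi
```

With these values the requests total 30 vCPU and 30 Gi, well under the limit, so Karpenter should be able to provision capacity for all replicas.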
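In kube-ops-view you can watch the nodes scale out and the pods get scheduled onto them:
In Grafana, go to Dashboards -> Karpenter Performance:
The Pod start latency panel shows the time from pod creation to running (karpenter_pods_startup_time_seconds). Here, 50% of the pods were running within 41s:
Open the Karpenter capacity dashboard:
The Node Summary panel at the bottom shows the CPU and memory utilization of each node: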
These metrics are scraped from http://karpenter.karpenter.svc.cluster.local:8000/metrics. For a detailed description of each metric, see: https://karpenter.sh/preview/reference/metrics/
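To look at the raw samples behind a dashboard panel, one option is to filter the endpoint's text output for a single metric. A minimal sketch, assuming the standard Prometheus text format; metric_samples is a hypothetical helper:

```shell
# Hypothetical helper: print every sample line for one metric name from
# Prometheus text-format input on stdin. Matches the bare name or the name
# followed by a {label} set, but not longer names that only share the prefix.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      metric = $1
      brace = index(metric, "{")       # strip the label set, if any
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) print
    }'
}

# Example (assumption: metrics text piped in from the debug pod used earlier):
# ... | metric_samples karpenter_pods_startup_time_seconds
```

Depending on the metric type, its series may carry suffixes such as _bucket, _sum, or _count, which need to be requested by their full name.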
Finally, clean up the resources created in this section:
helm uninstall --namespace monitoring grafana
helm uninstall --namespace monitoring prometheus
kubectl delete namespace monitoring
kubectl delete deployment inflate
kubectl delete nodepools.karpenter.sh default
kubectl delete ec2nodeclasses.karpenter.k8s.aws default