Observability (Grafana + Prometheus)

This section shows how to use Prometheus and Grafana to observe Karpenter metrics.

First, delete the resources created in the previous sections:

kubectl delete deployment inflate
kubectl delete nodepools.karpenter.sh default
kubectl delete ec2nodeclasses.karpenter.k8s.aws default

Install the resources Karpenter requires

Deploy the NodeClass and NodePool:

mkdir -p ~/environment/karpenter
cd ~/environment/karpenter
cat << EoF > observability_karpenter_nodepool_node_class.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        eks-immersion-team: default
    spec:
      nodeClassRef:
        name: default
      requirements: 
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
          operator: In
          values: ["on-demand"]      
      kubelet:
        cpuCFSQuota: true                         
  disruption:
    consolidateAfter: 30s
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  limits:
      cpu: "100"
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  securityGroupSelectorTerms:
  - tags:
      alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  subnetSelectorTerms:
  - tags:
      alpha.eksctl.io/cluster-name: $CLUSTER_NAME
  tags:
    intent: apps
    managed-by: karpenter
    eks-immersion-team: my-team
EoF
kubectl create -f observability_karpenter_nodepool_node_class.yaml
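
The heredoc above is unquoted, so the shell substitutes $CLUSTER_NAME while the file is written. If the variable is unset, the role name and the selector tags end up empty without any warning. A minimal sketch of a guard that could be run before the cat command; require_env is a hypothetical helper name, not part of kubectl or Karpenter:

```shell
# require_env NAME...: fail if any named environment variable is unset or empty.
# (Hypothetical helper for this walkthrough.)
require_env() {
  for name in "$@"; do
    # eval gives a POSIX-portable indirect lookup of the variable called $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Usage, before generating the manifest:
#   require_env CLUSTER_NAME || exit 1
```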

Check the Karpenter logs and confirm there are no errors:

kubectl logs deployment/karpenter -c controller -n karpenter
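
Instead of reading the log by eye, the error lines can be counted. This sketch assumes the controller emits JSON log lines with an upper-case "level" field (for example "level":"ERROR"); if your version logs in a different format, adjust the pattern. count_error_lines is a hypothetical helper name:

```shell
# count_error_lines: read log lines on stdin and print how many of them
# contain "level":"ERROR".
# Assumption: the controller logs JSON with an upper-case "level" field.
count_error_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
  # keeps the helper's exit status at 0 in that case.
  grep -c '"level":"ERROR"' || true
}

# Usage:
#   kubectl logs deployment/karpenter -c controller -n karpenter | count_error_lines
```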

Run the following command to verify that the Karpenter metrics endpoint is reachable:

kubectl run -i --tty --rm debug --image=alpine/curl --restart=Never -- wget -O - http://karpenter.karpenter.svc.cluster.local:8000/metrics
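
The endpoint returns metrics in the Prometheus text format, mixed with Go runtime and controller-runtime metrics. As a rough sketch, the output can be piped through a small filter to list only the Karpenter metric names; karpenter_metric_names is a hypothetical helper name:

```shell
# karpenter_metric_names: read Prometheus text-format metrics on stdin and
# print the distinct metric names that start with "karpenter_".
karpenter_metric_names() {
  # Drop "# HELP" / "# TYPE" comment lines, cut each sample line at the
  # first "{" or space to keep only the metric name, then de-duplicate.
  grep -v '^#' | sed -e 's/[{ ].*$//' | grep '^karpenter_' | sort -u
}

# Usage: save the output of the wget command above to metrics.txt, then
#   karpenter_metric_names < metrics.txt
```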


Deploy the inflate application

Deploy the inflate application with replicas set to 0; it will be scaled up later:

cd ~/environment/karpenter
cat << EOF > appDeploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              memory: 1Gi
              cpu: 1
EOF
kubectl create -f appDeploy.yaml

Deploy Prometheus and Grafana

Grafana needs a PVC, so the EBS CSI driver must be installed beforehand. For the installation steps, see: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html

Run the following script to install Prometheus and Grafana. Grafana will be exposed through an ELB address:

cd ~/environment/karpenter
export KARPENTER_NAMESPACE=karpenter
cat << EoF > installMonitoring.sh
#!/bin/bash
# https://karpenter.sh/docs/getting-started/getting-started-with-eksctl/

helm repo add grafana-charts https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring

curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/prometheus-values.yaml | envsubst | tee prometheus-values.yaml

helm install --namespace monitoring prometheus prometheus-community/prometheus --values prometheus-values.yaml


curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/grafana-values.yaml | tee grafana-values.yaml

cat << EOF_GRAFANA_VALUES >> grafana-values.yaml
service:
  enabled: true
  type: LoadBalancer
  port: 80
  targetPort: 3000
  annotations: {}
  labels: {}
  portName: service
  appProtocol: ""
EOF_GRAFANA_VALUES
helm install --namespace monitoring grafana grafana-charts/grafana --values grafana-values.yaml
EoF
chmod +x installMonitoring.sh
./installMonitoring.sh >/dev/null 2>&1
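
Because the script's output is sent to /dev/null, a failed helm install goes unnoticed. A minimal sketch of an alternative that keeps the output in a log file and reports the exit status; run_logged and install.log are hypothetical names:

```shell
# run_logged LOGFILE CMD [ARGS...]: run CMD with stdout and stderr written
# to LOGFILE, print a one-line verdict, and return CMD's exit status.
run_logged() {
  logfile=$1
  shift
  "$@" >"$logfile" 2>&1
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "ok: $1"
  else
    echo "failed ($rc): $1, see $logfile" >&2
  fi
  return "$rc"
}

# Usage:
#   run_logged install.log ./installMonitoring.sh
```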

Get the login password of the Grafana admin user:

kubectl -n monitoring get secret grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Get the ELB address of Grafana:

kubectl -n monitoring get svc grafana -o json|jq '.status.loadBalancer.ingress[0].hostname'|tr -d '"'

The ELB takes 2-3 minutes to deploy.
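
Rather than re-running the command by hand until the hostname appears, the lookup can be polled. This is a sketch under the assumption that the command prints nothing until the load balancer hostname has been assigned; wait_for_output is a hypothetical helper name:

```shell
# wait_for_output ATTEMPTS DELAY CMD [ARGS...]: re-run CMD until it prints
# something, then echo that output. Returns 1 after ATTEMPTS empty tries.
wait_for_output() {
  attempts=$1
  delay=$2
  shift 2
  try=1
  while [ "$try" -le "$attempts" ]; do
    # Treat a failing command the same as empty output and keep polling.
    result=$("$@" 2>/dev/null) || result=""
    if [ -n "$result" ]; then
      printf '%s\n' "$result"
      return 0
    fi
    try=$((try + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (polls every 10 seconds, for up to 5 minutes):
#   wait_for_output 30 10 kubectl -n monitoring get svc grafana \
#     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```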

Open the URL in a browser and log in to Grafana (the username is admin).


Scale the inflate application to 30 replicas:

kubectl scale deployment inflate --replicas=30
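
As a back-of-the-envelope check, the total pod requests can be compared with the NodePool CPU limit. The numbers below come from appDeploy.yaml and the NodePool defined earlier. This is only a rough lower bound: the limit applies to the capacity of the nodes Karpenter provisions, which is larger than the sum of pod requests once system overhead is included.

```shell
# Values taken from appDeploy.yaml and the NodePool defined earlier.
replicas=30
cpu_per_pod=1        # resources.requests.cpu
mem_gi_per_pod=1     # resources.requests.memory (1Gi)
pool_cpu_limit=100   # NodePool spec.limits.cpu

total_cpu=$((replicas * cpu_per_pod))
total_mem_gi=$((replicas * mem_gi_per_pod))
echo "requested: ${total_cpu} vCPU, ${total_mem_gi} Gi"

if [ "$total_cpu" -le "$pool_cpu_limit" ]; then
  echo "pod requests fit within the NodePool CPU limit (${pool_cpu_limit})"
else
  echo "pod requests exceed the NodePool CPU limit (${pool_cpu_limit})"
fi
```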

In kube-ops-view you can see the nodes scaling out and the pods being scheduled onto them.


In Grafana, go to Dashboards -> Karpenter Performance.


The Pod start latency panel shows the time from pod creation until the pod is running (karpenter_pods_startup_time_seconds). In this run, 50% of the pods were running within 41 seconds.
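
The same value can be read from the raw metrics text without Grafana. This sketch assumes the metric is exposed as a Prometheus summary, so each sample line carries a quantile label such as quantile="0.5". That assumption should be checked against the /metrics output of your Karpenter version. quantile_value is a hypothetical helper name:

```shell
# quantile_value METRIC Q: read Prometheus text-format metrics on stdin and
# print the value of every METRIC sample labelled quantile="Q".
# Assumption: METRIC is a summary that exposes quantile labels.
quantile_value() {
  awk -v m="$1" -v q="$2" '
    # Match lines that start with the metric name followed by "{" and that
    # contain the requested quantile label; the value is the last field.
    index($0, m "{") == 1 && index($0, "quantile=\"" q "\"") > 0 { print $NF }
  '
}

# Usage: with the metrics output saved to metrics.txt,
#   quantile_value karpenter_pods_startup_time_seconds 0.5 < metrics.txt
```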


Open the Karpenter Capacity dashboard.


The Node Summary panel at the bottom shows the CPU and memory utilization of each node.


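These metrics are scraped from the same endpoint that was queried earlier, http://karpenter.karpenter.svc.cluster.local:8000/metrics. For a detailed description of each metric, see: https://karpenter.sh/preview/reference/metrics/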

Clean up

helm uninstall --namespace monitoring grafana
helm uninstall --namespace monitoring prometheus
kubectl delete namespace monitoring

kubectl delete deployment inflate
kubectl delete nodepools.karpenter.sh default
kubectl delete ec2nodeclasses.karpenter.k8s.aws default