Observability (Grafana + Prometheus)

This section shows how to observe Karpenter metrics with Prometheus and Grafana.

First, delete the resources created earlier:

kubectl delete deployment inflate
kubectl delete provisioners.karpenter.sh default
kubectl delete awsnodetemplates.karpenter.k8s.aws default

image-20231027183040289

Install the resources Karpenter needs

Deploy the Provisioner and AWSNodeTemplate:

mkdir -p ~/environment/karpenter
cd ~/environment/karpenter

cat << EOF > observability_karpenter_provisioner_node_template.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  labels:
    eks-immersion-team: default

  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["c5.large", "m5.large", "m5.xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "50"
---

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    alpha.eksctl.io/cluster-name: ${CLUSTER_NAME}
  securityGroupSelector:
    aws:eks:cluster-name: ${CLUSTER_NAME}
  tags:
    managed-by: "karpenter"
    intent: "apps"
EOF

kubectl create -f observability_karpenter_provisioner_node_template.yaml

Check the Karpenter logs and confirm there are no errors:

kubectl logs deployment/karpenter -c controller -n karpenter
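
As a rough way to make this check repeatable, the sketch below filters log text for lines that mention an error level. The function name check_log_errors and the keyword list (error, fatal, panic) are assumptions chosen for illustration, not something Karpenter defines; adjust the pattern to the log format of your version.

```shell
#!/bin/bash
# Read log text on stdin and print only the lines that contain one of the
# assumed error keywords as a whole word (case-insensitive).
# Returns 0 when no such line is found and 1 when at least one is printed.
check_log_errors() {
  if grep -iwE 'error|fatal|panic'; then
    return 1
  fi
  return 0
}

# Usage against the cluster:
#   kubectl logs deployment/karpenter -c controller -n karpenter | check_log_errors
```

A non-zero exit status means at least one suspicious line was printed and is worth reading in context.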

Run the following command to check that Karpenter's metrics can be retrieved:

kubectl run -i --tty --rm debug --image=alpine/curl --restart=Never -- wget -O - http://karpenter.karpenter.svc.cluster.local:8000/metrics
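
The endpoint returns metrics in the Prometheus text format, which mixes Karpenter's own metrics with Go runtime and controller metrics. A minimal sketch for listing only the metric names with a given prefix is shown below. The helper name metric_names is hypothetical, and the default prefix karpenter_ is an assumption about how the metrics of interest are named.

```shell
#!/bin/bash
# Read Prometheus text-format output on stdin and print the distinct metric
# names that start with the given prefix (default: karpenter_).
metric_names() {
  local prefix="${1:-karpenter_}"
  # Drop "# HELP" / "# TYPE" comment lines, cut everything from the first
  # "{" or space (labels and sample value), then keep unique matching names.
  grep -v '^#' | sed -E 's/[{ ].*$//' | grep "^${prefix}" | sort -u
}

# Usage: pipe the output of the metrics command shown above into the helper:
#   <metrics command> | metric_names karpenter_
```

Lines that kubectl itself prints, such as the pod deletion notice, do not start with the prefix and are filtered out.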

image-20231027160940604

Deploy the inflate application

Deploy the inflate application with replicas set to 0; it will be scaled up later:

cd ~/environment/karpenter
cat << EOF > appDeploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              memory: 1Gi
              cpu: 1
EOF
kubectl create -f appDeploy.yaml

Deploy Prometheus and Grafana

Grafana needs a PVC, so the EBS CSI driver (ebs-csi-driver) must be installed beforehand. For installation steps, see: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html

Run the following script to install Prometheus and Grafana. Grafana will be exposed through an ELB address:

cd ~/environment/karpenter
cat << EOS > installMonitoring.sh
#!/bin/bash
# https://karpenter.sh/docs/getting-started/getting-started-with-eksctl/

helm repo add grafana-charts https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring

curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/prometheus-values.yaml | tee prometheus-values.yaml

helm install --namespace monitoring prometheus prometheus-community/prometheus --values prometheus-values.yaml


curl -fsSL https://raw.githubusercontent.com/aws/karpenter/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/grafana-values.yaml | tee grafana-values.yaml

cat << EOF >> grafana-values.yaml
service:
  enabled: true
  type: LoadBalancer
  port: 80
  targetPort: 3000
  annotations: {}
  labels: {}
  portName: service
  appProtocol: ""
EOF
helm install --namespace monitoring grafana grafana-charts/grafana --values grafana-values.yaml
EOS
chmod +x installMonitoring.sh
./installMonitoring.sh >/dev/null 2>&1
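
Because the script's output is redirected to /dev/null, a failed helm install goes unnoticed. One hedged way to confirm the result is to list the pods in the monitoring namespace and flag any that are not running. The helper name pods_not_ready is hypothetical; it only parses the standard column layout of kubectl get pods --no-headers (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/bash
# Read `kubectl get pods --no-headers` output on stdin and print the names of
# pods whose STATUS column (third field) is neither Running nor Completed.
pods_not_ready() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage against the cluster:
#   kubectl -n monitoring get pods --no-headers | pods_not_ready
```

Empty output means every pod reported Running or Completed. Pods may legitimately show Pending or ContainerCreating for a short while after installation, so rerun the check before investigating.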

Get the login password for Grafana's admin user:

kubectl -n monitoring get secret grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Get Grafana's ELB address:

kubectl -n monitoring get svc grafana -o json | jq -r '.status.loadBalancer.ingress[0].hostname'

The ELB takes 2-3 minutes to deploy.
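
Rather than rerunning the lookup by hand during those minutes, a small polling loop can wait until the hostname appears. This is a sketch under stated assumptions: the names wait_for_value and grafana_hostname are hypothetical helpers, and the retry count and delay in the usage comment are arbitrary choices.

```shell
#!/bin/bash
# Retry a command until it prints a non-empty value or the attempts run out.
# Arguments: max attempts, delay in seconds between attempts, then the command.
wait_for_value() {
  local attempts="$1"
  local delay="$2"
  local i=1
  local value
  shift 2
  while [ "$i" -le "$attempts" ]; do
    value="$("$@" 2>/dev/null)"
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical wrapper around the lookup shown above. jq -r prints the raw
# string, and "// empty" produces no output while the hostname is still unset.
grafana_hostname() {
  kubectl -n monitoring get svc grafana -o json |
    jq -r '.status.loadBalancer.ingress[0].hostname // empty'
}

# Usage: poll every 10 seconds for up to 5 minutes.
#   wait_for_value 30 10 grafana_hostname
```

Even after the hostname is assigned, DNS for the new load balancer can take a little longer to resolve, so the first browser attempt may still fail.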

Open the URL in a browser to reach Grafana:

image-20231027210322453

Scale the inflate application to 30 replicas:

kubectl scale deployment inflate --replicas=30
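
To get a feel for what this scale-out asks of Karpenter, here is a back-of-the-envelope estimate. Each replica requests 1 CPU, so 30 replicas request 30 CPU, against the Provisioner limit of cpu: "50". c5.large and m5.large have 2 vCPU each and m5.xlarge has 4 vCPU. The pods-per-node figures below are assumptions, not measurements: part of each node's CPU is reserved for the system and daemonsets, so a 4-vCPU node is assumed to fit 3 inflate pods and a 2-vCPU node 1. The function name nodes_needed is hypothetical.

```shell
#!/bin/bash
# Estimate how many nodes a deployment needs: ceil(replicas / pods_per_node).
nodes_needed() {
  local replicas="$1"
  local pods_per_node="$2"
  echo $(( (replicas + pods_per_node - 1) / pods_per_node ))
}

# Assumed packing (not measured): 3 inflate pods per 4-vCPU node,
# 1 inflate pod per 2-vCPU node.
xlarge_nodes="$(nodes_needed 30 3)"
large_nodes="$(nodes_needed 30 1)"
echo "m5.xlarge only: ${xlarge_nodes} nodes, $(( xlarge_nodes * 4 )) vCPU"
echo "2-vCPU only:    ${large_nodes} nodes, $(( large_nodes * 2 )) vCPU"
```

Under these assumptions, ten m5.xlarge nodes (40 vCPU) fit within the 50 CPU limit, whereas thirty 2-vCPU nodes (60 vCPU) would exceed it. What Karpenter actually launches depends on its own bin-packing and on instance availability, so treat the numbers as a sanity check against what the dashboards show.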

In Grafana, go to Dashboards -> Karpenter Performance:

image-20231027210632483

The Pod start latency panel shows the time from pod creation to running. Here, 50% of the pods were running within 13s:

image-20231027211524817

Open the Karpenter Capacity dashboard:

image-20231027211708593

The Node Summary panel at the bottom shows the CPU and memory utilization of each node:

image-20231027211835790

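These metrics are scraped from http://karpenter.karpenter.svc.cluster.local:8000/metrics, the same endpoint queried earlier in this section. For a detailed description of each metric, see: https://karpenter.sh/preview/reference/metrics/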