How to Monitor a Kubernetes Cluster with Prometheus and Grafana

Author | Kubernetes Advocate

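Prometheus is a free, open-source tool for event monitoring and alerting. It records real-time metrics in a time-series database, collects them over an HTTP pull model, and supports flexible queries and real-time alerting. We can use Prometheus to monitor an entire Kubernetes cluster.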

The Prometheus stack includes:

Prometheus
Alertmanager
kube-state-metrics
node-exporter
Grafana

The stack can also include dashboards and alerts.

Dashboards

Capacity planning
Cluster health
Deployments
k8s cluster rsrc use
k8s node rsrc use
k8s resources cluster
k8s resources namespace
k8s resources pod
kube DNS
kubelet
Nodes
Pods
Statefulset
Kubernetes all-nodes
Kubernetes cluster-all
Kubernetes pods-cluster
Kubernetes resources-requests

Alerts

Component Down (API Server, Kubelet, Node exporter, Alertmanager, Prometheus, and so on)
Pod alerts (CrashLoopBackOff, Pending, not ready)
Workload controller alerts (Replicas Mismatch, DaemonSet NotScheduled, DaemonSet MisScheduled, Job Failed, and Long-running Jobs)
Resources alerts (CPU overcommit, Memory overcommit, Quota exceeded)
Persistent Volume alerts
Kube API error and Client alerts
Prometheus configuration error alerts

Installation

Step 1: Clone the Prometheus-grafana repository from GitHub:

git clone URL to GIT REPO

Step 2: Combine the manifests into a single manifest file:

cd Prometheus-grafana
awk 'FNR==1 {print "---"}{print}' manifests/* > "prometheus_grafana_manifest.yaml"
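To see what the awk command in this step does, here is a minimal sketch that runs it against a throwaway directory. The two sample file names and their contents are made up for illustration; they are not files from the repository.

```shell
# Build a throwaway manifests/ directory with two tiny sample files
# (hypothetical names and contents, for illustration only).
workdir="$(mktemp -d)"
mkdir "$workdir/manifests"
printf 'kind: Namespace\n' > "$workdir/manifests/0-namespace.yaml"
printf 'kind: Secret\n'    > "$workdir/manifests/1-secret.yaml"

# FNR==1 is true on the first line of each input file, so a "---"
# YAML document separator is printed before every file's contents.
combined="$(cd "$workdir" && awk 'FNR==1 {print "---"}{print}' manifests/*)"
printf '%s\n' "$combined"

# Count the separators: there should be one per input file.
separators="$(printf '%s\n' "$combined" | grep -c '^---$')"
echo "separators: $separators"

rm -rf "$workdir"
```

Each file ends up as its own YAML document in the combined output, which is what lets a single kubectl apply create every resource.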

Step 3: Install the Prometheus-Grafana stack:

kubectl apply -f prometheus_grafana_manifest.yaml
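After applying the manifest, it can help to confirm that the pods actually start. The helper below is a rough sketch, not part of the original repository: the function name wait_for_stack is hypothetical, and it assumes the stack runs in the prometheus namespace (the namespace used by the Grafana Secret later in this article).

```shell
# Hypothetical helper: list the stack's pods, then block until they are Ready.
# Assumption: everything is deployed into the "prometheus" namespace.
wait_for_stack() {
  namespace="${1:-prometheus}"
  timeout="${2:-300s}"
  # Show the current pod status first.
  kubectl get pods --namespace "$namespace"
  # Wait until every pod in the namespace reports the Ready condition.
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$namespace" --timeout="$timeout"
}
```

Calling wait_for_stack with no arguments uses the defaults; wait_for_stack monitoring 120s would check a different namespace with a shorter timeout. If the namespace contains pods that never become Ready by design, such as completed Job pods, the wait will time out.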

Step 4: Create an ingress for Grafana:

If the cluster has an ingress controller, update the domain and ingress class in the grafana-ingress.yaml file, then create the Ingress resource.

kubectl apply -f grafana-ingress.yaml

If there is no ingress controller, you can still access Grafana through a LoadBalancer Service, a NodePort Service, or kubectl proxy.
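One quick way to reach Grafana without an ingress is to forward a local port with kubectl port-forward. This is a minimal sketch under stated assumptions: the function name grafana_port_forward is hypothetical, and the Service name grafana, the namespace prometheus, and the Service port 3000 (Grafana's default HTTP port) are guesses that should be checked against the manifests.

```shell
# Hypothetical helper: forward a local port to the Grafana Service.
# Assumptions: Service "grafana" in namespace "prometheus", listening on
# port 3000. Adjust all three to match your manifests.
grafana_port_forward() {
  namespace="${1:-prometheus}"
  service="${2:-grafana}"
  local_port="${3:-3000}"
  kubectl port-forward --namespace "$namespace" \
    "service/$service" "$local_port:3000"
}
```

Running grafana_port_forward keeps the tunnel open in the foreground; Grafana should then answer on http://localhost:3000 until the command is interrupted.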

Grafana Credentials

The default Grafana credentials are:

Username: Cloud
Password: Cloud

[Figure: Grafana login page]

[Figure: Grafana Nodes dashboard]

You can set your own username and password.

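Before updating the values in the credentials Secret file, you must base64-encode the username and password. Use echo -n so that the trailing newline is not encoded along with the value.

echo -n "myuser" | base64
bXl1c2Vy
echo -n "HgTf0n9L@wrd" | base64
SGdUZjBuOUxAd3Jk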
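A trailing newline is easy to encode by accident, because plain echo appends one and it becomes part of the stored credential. The sketch below compares plain echo with printf '%s' (a portable equivalent of echo -n) for the sample username myuser, then decodes the result to confirm the round trip.

```shell
# Plain echo appends a newline, which becomes part of the encoded value.
with_newline="$(echo "myuser" | base64)"
# printf '%s' writes the string only, with no trailing newline.
without_newline="$(printf '%s' "myuser" | base64)"

echo "echo:   $with_newline"      # bXl1c2VyCg==
echo "printf: $without_newline"   # bXl1c2Vy

# Round trip: decoding the clean value returns the original string.
decoded="$(printf '%s' "$without_newline" | base64 --decode)"
echo "decoded: $decoded"
```

The extra Cg== at the end of the first value is the encoded newline. A credential stored that way would only match a login that also includes the newline, which is a common source of failed logins.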

Now update the values of admin-user and admin-password in 2-grafana-credentials-secret.yaml under the manifests directory with the base64-encoded username and password.

apiVersion: v1
kind: Secret
metadata:
  name: grafana
  namespace: prometheus
  labels:
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/component: grafana
type: Opaque
data:
  admin-user: bXl1c2Vy
  admin-password: SGdUZjBuOUxAd3Jk

Run the command:

kubectl apply -f 2-grafana-credentials-secret.yaml

If Grafana is already installed and running, you must delete the existing Pod. A new Pod will then be created that picks up the updated configuration.
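Deleting the Pod can be done by label rather than by its generated name. The helper below is a sketch with assumptions: the function name restart_grafana_pod is hypothetical, and it assumes the Grafana Pod carries the same app.kubernetes.io/component=grafana label that appears on the Secret above, which should be verified with kubectl get pods --show-labels.

```shell
# Hypothetical helper: delete the Grafana Pod so its controller recreates it.
# Assumption: the Pod is labelled app.kubernetes.io/component=grafana and
# lives in the "prometheus" namespace, like the Secret shown above.
restart_grafana_pod() {
  namespace="${1:-prometheus}"
  selector="${2:-app.kubernetes.io/component=grafana}"
  kubectl delete pod --namespace "$namespace" --selector "$selector"
}
```

This only works as a restart when the Pod is managed by a controller such as a Deployment; a standalone Pod would be deleted and not come back.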

Retrieving the Grafana Credentials

You can obtain the credentials by decoding the values stored in the Secret:

echo "Username: $(kubectl get secret grafana --namespace prometheus \
  --output=jsonpath='{.data.admin-user}' | base64 --decode)"
echo "Password: $(kubectl get secret grafana --namespace prometheus \
  --output=jsonpath='{.data.admin-password}' | base64 --decode)"

Note that the Prometheus web interface can be accessed without any authentication.

[Figure: Prometheus web interface]

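Configuring Alertmanager

When installing the stack, you must provide the alert receiver details.

Otherwise, you will never receive notifications about cluster state changes or resource utilization.

The configuration can be changed as needed.

Alertmanager is configured through a YAML file that defines inhibition rules, notification routing, and receivers.

Below are sample configurations for Email, Slack, and Webhook receivers:

Email: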

global:
  resolve_timeout: 5m
receivers:
  - name: email_config
    email_configs:
      - to: "< to_address >"
        from: "< from_address >"
        smarthost: "< smtp_host:port >"
        auth_username: "< smtp_username >"
        auth_password: "< smtp_password >"
route:
  group_by:
    - job
  receiver: email_config
  group_interval: 5m
  group_wait: 30s
  repeat_interval: 30m

Slack:

global:
  resolve_timeout: 5m
  slack_api_url: "< slack_webhook_url >"
receivers:
  - name: "slack-notifications"
    slack_configs:
      - channel: "#alerts"
route:
  group_by:
    - job
  receiver: "slack-notifications"
  group_interval: 5m
  group_wait: 30s
  repeat_interval: 30m

Webhook:

global:
  resolve_timeout: 5m
receivers:
  - name: webhook
    webhook_configs:
      - url: "< webhook_url >"
route:
  group_by:
    - job
  repeat_interval: 30m
  group_interval: 5m
  group_wait: 30s
  receiver: webhook
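In each of these configurations, the receiver named under route must match a name defined under receivers; a route that points at an undefined receiver makes Alertmanager reject the file. The sketch below is a rough text-level sanity check, not a real YAML parser or an Alertmanager tool: it writes a sample config shaped like the webhook example to a temporary file and compares the two sets of names. The URL in it is a placeholder.

```shell
# Write a sample config (same shape as the webhook example) to a temp file.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
global:
  resolve_timeout: 5m
receivers:
  - name: webhook
    webhook_configs:
      - url: "http://example.invalid/hook"
route:
  group_by:
    - job
  receiver: webhook
EOF

# Names defined under "receivers:" (lines of the form "- name: X").
defined="$(sed -n 's/^ *- name: *//p' "$cfg" | tr -d '"')"
# Names referenced by routes (lines of the form "receiver: X").
referenced="$(sed -n 's/^[ -]*receiver: *//p' "$cfg" | tr -d '"')"

# Report every referenced receiver that has no matching definition.
missing=""
for r in $referenced; do
  printf '%s\n' "$defined" | grep -qxF -- "$r" || missing="$missing $r"
done

if [ -z "$missing" ]; then
  echo "all route receivers are defined"
else
  echo "undefined receivers:$missing"
fi
rm -f "$cfg"
```

Because it matches lines by pattern, this check can be fooled by unusual formatting. It is meant to catch a mistyped receiver name before the ConfigMap is applied, not to replace proper validation.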

As described above, update the configuration in the 1-alertmanager-configmap.yaml file under the manifests directory, then apply it.

kubectl apply -f 1-alertmanager-configmap.yaml

After updating the ConfigMap, restart the running Alertmanager pod. A new pod will be created with the updated configuration.

Further reading:

https://medium.com/faun/how-to-monitor-kubernetes-cluster-with-prometheus-and-grafana-8ec7e060896f