Prometheus Troubleshooting Notes 1: ArgocdClusterConnectionDown False Positive

Background

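Recently the integrated zabbix/prometheus monitoring kept reporting an Argocd cluster connection problem, even though the clusters looked healthy on the Argocd page.
Original expression: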

- alert: "Argocd Cluster Connection Down"
  expr: argocd_cluster_connection_status{status="down"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Argocd Cluster Connection Unhealthy (instance {{ $labels.instance }})
    description: "Service {{ $labels.name }} Argocd Cluster Connection is down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

Solution

Reference PR: https://github.com/argoproj/argo-cd/pull/7419
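Explanation: argocd_cluster_connection_status used to return 0 when the connection was fine. The semantics were later inverted, so a value of 1 now means the connection is healthy. After the Argocd upgrade, healthy clusters therefore report 1, but the Prometheus rule was not updated to match, so the old "> 0" condition kept firing and produced false positives.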

After the change

name: Argocd Cluster Connection Down
expr: argocd_cluster_connection_status < 1
for: 5m
labels:
  severity: warning
annotations:
  description: |-
    Service {{ $labels.name }} Argocd Cluster Connection is down.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  summary: Argocd Cluster Connection Unhealthy (instance {{ $labels.instance }})
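
To check the corrected expression before rolling it out, one option is a promtool unit test. The following is a minimal sketch, not the author's setup: the file name argocd_expr_test.yml, the server label, and both example.com values are made up for illustration, and it only exercises the expression itself, not the for: 5m duration or the annotations.

```yaml
# argocd_expr_test.yml (hypothetical file name)
# Run with: promtool test rules argocd_expr_test.yml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Healthy cluster: reports 1 at every scrape (illustrative label values).
      - series: 'argocd_cluster_connection_status{server="https://healthy.example.com"}'
        values: '1+0x10'
      # Broken cluster: reports 0 at every scrape (illustrative label values).
      - series: 'argocd_cluster_connection_status{server="https://broken.example.com"}'
        values: '0+0x10'

    promql_expr_test:
      # The corrected expression should select only the broken cluster.
      - expr: argocd_cluster_connection_status < 1
        eval_time: 5m
        exp_samples:
          - labels: 'argocd_cluster_connection_status{server="https://broken.example.com"}'
            value: 0
```

If the test passes, the healthy series is filtered out and only the series reporting 0 would trigger the alert, which is the behaviour the updated rule relies on.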

Problem solved.