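Prometheus Troubleshooting Notes 1: ArgocdClusterConnectionDown False Alarm
Background
Recently the integrated zabbix/prometheus setup kept reporting Argo CD cluster connection problems, even though everything looked normal on the Argo CD page.
Original expression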
```yaml
- alert: "Argocd Cluster Connection Down"
  expr: argocd_cluster_connection_status{status="down"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Argocd Cluster Connection Unhealthy (instance {{ $labels.instance }})
    description: "Service {{ $labels.name }} Argocd Cluster Connection is down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
Solution
Reference pull request: https://github.com/argoproj/argo-cd/pull/7419
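Explanation: argocd_cluster_connection_status used to return 0 when the connection was fine. The semantics were later inverted so that 1 now means the connection is fine. After the Argo CD upgrade a healthy cluster therefore reports 1 instead of 0, but the Prometheus rule was not updated to match and still fired on any value above 0, which caused the false alarms.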
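To make the inversion concrete, here is a minimal sketch that mirrors only the threshold comparisons of the two alert expressions as plain Python predicates. The helper names are hypothetical, the `status="down"` label matcher and the `for: 5m` hold time are deliberately ignored, and treating 0 as the value reported for a disconnected cluster is an assumption based on the explanation above.

```python
# Hypothetical helpers: they mirror only the numeric comparison in each
# alert expression, not the label matcher or the `for: 5m` hold time.

def old_rule_fires(value: float) -> bool:
    """Comparison used by the original rule: `... > 0`."""
    return value > 0


def new_rule_fires(value: float) -> bool:
    """Comparison used by the updated rule: `argocd_cluster_connection_status < 1`."""
    return value < 1


# After the upgrade, 1 means the connection is fine.
# Assumption: a disconnected cluster reports 0.
HEALTHY, DOWN = 1.0, 0.0

for label, value in (("healthy", HEALTHY), ("down", DOWN)):
    print(
        f"{label}: old rule fires={old_rule_fires(value)}, "
        f"new rule fires={new_rule_fires(value)}"
    )
```

Under these assumptions the old comparison fires for a healthy cluster and stays silent for a disconnected one, which is the false alarm described here, while the new comparison behaves the other way around.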
After the change
```yaml
name: Argocd Cluster Connection Down
expr: argocd_cluster_connection_status < 1
for: 5m
labels:
  severity: warning
annotations:
  description: Service {{ $labels.name }} Argocd Cluster Connection is down.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  summary: Argocd Cluster Connection Unhealthy (instance {{ $labels.instance }})
```
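One way to sanity-check the new threshold before relying on the alert is to look at the raw metric values and see which clusters would match `< 1`. The sketch below parses a response shaped like the result of a Prometheus instant query (`/api/v1/query`) for argocd_cluster_connection_status. The payload is made-up sample data, the cluster names are hypothetical, and the function name is only illustrative. It assumes the metric carries a `name` label, as the rule's annotations suggest.

```python
import json
from typing import List

# Made-up sample in the shape of a Prometheus instant-query response.
# Cluster names and timestamps are hypothetical.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "argocd_cluster_connection_status", "name": "cluster-a"},
       "value": [1700000000, "1"]},
      {"metric": {"__name__": "argocd_cluster_connection_status", "name": "cluster-b"},
       "value": [1700000000, "0"]}
    ]
  }
}
"""


def clusters_below_one(payload: str) -> List[str]:
    """Return the `name` label of every sample whose value is below 1."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("query did not succeed")
    matching = []
    for sample in body["data"]["result"]:
        # Each instant-vector sample is [timestamp, "value-as-string"].
        _timestamp, raw_value = sample["value"]
        if float(raw_value) < 1:
            matching.append(sample["metric"].get("name", "<unknown>"))
    return matching


print(clusters_below_one(SAMPLE_RESPONSE))
```

With the sample data only cluster-b is listed, so only that cluster would satisfy the updated expression.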
Problem solved.
Former Android/Vue developer, now working in infrastructure, focusing on monitoring and AWS.