VictoriaMetrics on K8s

2022-06-07


Install VictoriaMetrics on K8S

  • In the previous section we deployed a VictoriaMetrics cluster on bare metal. In this part we will deploy a VictoriaMetrics cluster on Kubernetes, along with a short vmalert demo.

  • Requirement:
    • Helm 3
    • kubectl
  • I'm using Kubernetes Engine by BizFlyCloud for fast provisioning of a managed Kubernetes cluster.

  • Add the Helm repo:
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
  • Verify that the repo was added:
➜ helm search repo vm/
NAME                         	CHART VERSION	APP VERSION	DESCRIPTION
vm/victoria-metrics-agent    	0.8.10       	v1.78.0    	Victoria Metrics Agent - collects metrics from ...
vm/victoria-metrics-alert    	0.4.33       	v1.78.0    	Victoria Metrics Alert - executes a list of giv...
vm/victoria-metrics-auth     	0.2.51       	1.78.0     	Victoria Metrics Auth - is a simple auth proxy ...
vm/victoria-metrics-cluster  	0.9.30       	1.78.0     	Victoria Metrics Cluster version - high-perform...
vm/victoria-metrics-gateway  	0.1.8        	1.78.0     	Victoria Metrics Gateway - Auth & Rate-Limittin...
vm/victoria-metrics-k8s-stack	0.9.5        	1.78.0     	Kubernetes monitoring on VictoriaMetrics stack....
vm/victoria-metrics-operator 	0.10.3       	0.25.1     	Victoria Metrics Operator
vm/victoria-metrics-single   	0.8.31       	1.78.0     	Victoria Metrics Single version - high-performa...
  • Set the Helm values and install the cluster:
cat <<EOF | helm install vmcluster vm/victoria-metrics-cluster -f -
vmselect:
  podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8481"

vminsert:
  podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8480"

vmstorage:
  persistentVolume:
      enabled: "true"
      storageClass: "premium-ssd"
  podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8482"
EOF
  • Running helm install vmcluster vm/victoria-metrics-cluster installs the VictoriaMetrics cluster into the default namespace of the Kubernetes cluster.

  • The podAnnotations entry prometheus.io/scrape: "true" enables scraping of metrics from the vmselect, vminsert and vmstorage pods.

  • The podAnnotations entry prometheus.io/port: "some_port" tells the scraper which port on those pods to collect metrics from.

  • With storageClass: "premium-ssd", vmstorage uses SSD volumes (the default in BizFlyCloud is HDD). A quick way to verify the annotations follows below.
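
  • To double-check that the annotations actually landed on the pods, you can dump them with kubectl (a quick sanity check; the app labels match the ones used in the chart's NOTES output):

kubectl get pods -l app=vmselect -o jsonpath='{.items[0].metadata.annotations}' ; echo
kubectl get pods -l app=vminsert -o jsonpath='{.items[0].metadata.annotations}' ; echo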

  • Here is the output:

W0707 11:30:11.773345   30007 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0707 11:30:12.896258   30007 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: vmcluster
LAST DEPLOYED: Thu Jul  7 11:30:10 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Write API:

The Victoria Metrics write api can be accessed via port 8480 with the following DNS name from within your cluster:
vmcluster-victoria-metrics-cluster-vminsert.default.svc.cluster.local

Get the Victoria Metrics insert service URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=vminsert" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace default port-forward $POD_NAME 8480

You need to update your Prometheus configuration file and add the following lines to it:

prometheus.yml

    remote_write:
      - url: "http://<insert-service>/insert/0/prometheus/"



for example -  inside the Kubernetes cluster:

    remote_write:
      - url: "http://vmcluster-victoria-metrics-cluster-vminsert.default.svc.cluster.local:8480/insert/0/prometheus/"
Read API:

The VictoriaMetrics read api can be accessed via port 8481 with the following DNS name from within your cluster:
vmcluster-victoria-metrics-cluster-vmselect.default.svc.cluster.local

Get the VictoriaMetrics select service URL by running these commands in the same shell:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=vmselect" -o jsonpath="{.items[0].metadata.name}")
  kubectl --namespace default port-forward $POD_NAME 8481

You need to specify select service URL into your Grafana:
 NOTE: you need to use the Prometheus Data Source

Input this URL field into Grafana

    http://<select-service>/select/0/prometheus/


for example - inside the Kubernetes cluster:

    http://vmcluster-victoria-metrics-cluster-vmselect.default.svc.cluster.local:8481/select/0/prometheus/
  • We can verify the deployment:
➜ kubectl get pods -o wide
NAME                                                           READY   STATUS    RESTARTS   AGE     IP            NODE                                           NOMINATED NODE   READINESS GATES
vmcluster-victoria-metrics-cluster-vminsert-9f9f844cc-fn99z    1/1     Running   0          3m57s   10.200.0.8    pool-io8z1y3o-2tmpa93tbaza1mma-node-frxbftnn   <none>           <none>
vmcluster-victoria-metrics-cluster-vminsert-9f9f844cc-sjgb5    1/1     Running   0          3m57s   10.200.0.5    pool-io8z1y3o-2tmpa93tbaza1mma-node-frxbftnn   <none>           <none>
vmcluster-victoria-metrics-cluster-vmselect-75b77ffd66-g42qb   1/1     Running   0          3m57s   10.200.0.6    pool-io8z1y3o-2tmpa93tbaza1mma-node-frxbftnn   <none>           <none>
vmcluster-victoria-metrics-cluster-vmselect-75b77ffd66-lszvk   1/1     Running   0          3m57s   10.200.0.7    pool-io8z1y3o-2tmpa93tbaza1mma-node-frxbftnn   <none>           <none>
vmcluster-victoria-metrics-cluster-vmstorage-0                 1/1     Running   0          3m57s   10.200.0.9    pool-io8z1y3o-2tmpa93tbaza1mma-node-frxbftnn   <none>           <none>
vmcluster-victoria-metrics-cluster-vmstorage-1                 1/1     Running   0          2m58s   10.200.0.10   pool-io8z1y3o-2tmpa93tbaza1mma-node-frxbftnn   <none>           <none>

➜ kubectl get svc -o wide
NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
kubernetes                                     ClusterIP   10.93.0.1       <none>        443/TCP                      118m    <none>
vmcluster-victoria-metrics-cluster-vminsert    ClusterIP   10.93.56.130    <none>        8480/TCP                     5m13s   app.kubernetes.io/instance=vmcluster,app.kubernetes.io/name=victoria-metrics-cluster,app=vminsert
vmcluster-victoria-metrics-cluster-vmselect    ClusterIP   10.93.238.187   <none>        8481/TCP                     5m13s   app.kubernetes.io/instance=vmcluster,app.kubernetes.io/name=victoria-metrics-cluster,app=vmselect
vmcluster-victoria-metrics-cluster-vmstorage   ClusterIP   None            <none>        8482/TCP,8401/TCP,8400/TCP   5m13s   app.kubernetes.io/instance=vmcluster,app.kubernetes.io/name=victoria-metrics-cluster,app=vmstorage

➜ kubectl get pvc -o wide
NAME                                                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE     VOLUMEMODE
vmstorage-volume-vmcluster-victoria-metrics-cluster-vmstorage-0   Bound    pvc-67a13278-3b46-4bef-bb46-9628c4b762f2   8Gi        RWO            premium-ssd    6m18s   Filesystem
vmstorage-volume-vmcluster-victoria-metrics-cluster-vmstorage-1   Bound    pvc-c180d36f-c3e9-4e32-833a-5b1edf3f22d9   8Gi        RWO            premium-ssd    5m19s   Filesystem
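
  • As a quick smoke test of the write and read paths (a sketch using the service names from the NOTES above; the port-forwards run in the background), push one sample through vminsert in Prometheus text format and query it back through vmselect:

kubectl port-forward svc/vmcluster-victoria-metrics-cluster-vminsert 8480 &
kubectl port-forward svc/vmcluster-victoria-metrics-cluster-vmselect 8481 &

# write a single sample via the Prometheus text-format import API
curl -d 'demo_metric{env="test"} 42' \
  http://localhost:8480/insert/0/prometheus/api/v1/import/prometheus

# read it back via the PromQL query API (it can take a few seconds to show up)
curl 'http://localhost:8481/select/0/prometheus/api/v1/query?query=demo_metric'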
  • Note: vmselect serves the built-in vmui web UI at <URL>/select/0/prometheus/vmui/ (see the port-forward sketch below).
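
  • For example, to reach vmui from a workstation:

kubectl port-forward svc/vmcluster-victoria-metrics-cluster-vmselect 8481
# then open http://localhost:8481/select/0/prometheus/vmui/ in a browser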

  • Next we will install vmagent to scrape metrics from the kube-apiserver, the kubelets, and node-exporter.

  • Install node-exporter:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: default
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  template:
    metadata:
      labels:
        k8s-app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.1.2
        ports:
        - name: metrics
          containerPort: 9100
        args:
        - "--path.procfs=/host/proc"
        - "--path.sysfs=/host/sys"
        - "--path.rootfs=/host"
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /host
      volumes:
        - name: dev
          hostPath:
            path: /dev
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
      hostPID: true      # share the host PID namespace so process metrics are visible
      hostNetwork: true  # serve metrics on each node's own IP at :9100
      tolerations:
      - operator: "Exists"  # run on every node, including tainted ones
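
  • Assuming the manifest above was saved as node-exporter.yaml (the filename is just an example), apply it and confirm one pod is running per node; because node-exporter uses hostNetwork, each pod serves metrics on its node's IP at :9100:

kubectl apply -f node-exporter.yaml
kubectl get daemonset node-exporter
kubectl get pods -l k8s-app=node-exporter -o wide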
  • Deploy kube-state-metrics, which watches the Kubernetes API server and generates metrics about the state of the objects inside the cluster:
git clone https://github.com/kubernetes/kube-state-metrics.git -b release-2.5
cd kube-state-metrics/examples/autosharding
kubectl apply -f ./
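
  • The autosharding examples install kube-state-metrics into the kube-system namespace; verify it is up before pointing vmagent at it (the labels below follow the upstream manifests, and the service name is the one we reference later in the scrape config):

kubectl -n kube-system get pods -l app.kubernetes.io/name=kube-state-metrics
kubectl -n kube-system get svc kube-state-metrics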
  • Create a file named guide-vmcluster-vmagent-values.yaml:
remoteWriteUrls:
   - http://vmcluster-victoria-metrics-cluster-vminsert.default.svc.cluster.local:8480/insert/0/prometheus/

config:
  global:
    scrape_interval: 10s

  scrape_configs:
    - job_name: 'vmalert'
      static_configs:
        - targets: ['vmalert-victoria-metrics-alert-server.default.svc.cluster.local:8880']
    - job_name: 'kube-state-metrics'
      static_configs:
        - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
    - job_name: vmagent
      static_configs:
        - targets: ["localhost:8429"]
    - job_name: "kubernetes-apiservers"
      kubernetes_sd_configs:
        - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels:
            [
              __meta_kubernetes_namespace,
              __meta_kubernetes_service_name,
              __meta_kubernetes_endpoint_port_name,
            ]
          action: keep
          regex: default;kubernetes;https
    - job_name: "kubernetes-nodes"
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/$1/proxy/metrics
    - job_name: "kubernetes-nodes-cadvisor"
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
      metric_relabel_configs:
        - action: replace
          source_labels: [pod]
          regex: '(.+)'
          target_label: pod_name
          replacement: '${1}'
        - action: replace
          source_labels: [container]
          regex: '(.+)'
          target_label: container_name
          replacement: '${1}'
        - action: replace
          target_label: name
          replacement: k8s_stub
        - action: replace
          source_labels: [id]
          regex: '^/system\.slice/(.+)\.service$'
          target_label: systemd_service_name
          replacement: '${1}'
    - job_name: "kubernetes-service-endpoints"
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - action: drop
          source_labels: [__meta_kubernetes_pod_container_init]
          regex: true
        - action: keep_if_equal
          source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_container_port_number]
        - source_labels:
            [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels:
            [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels:
            [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels:
            [
              __address__,
              __meta_kubernetes_service_annotation_prometheus_io_port,
            ]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
        - source_labels: [__meta_kubernetes_pod_node_name]
          action: replace
          target_label: kubernetes_node
    - job_name: "kubernetes-service-endpoints-slow"
      scrape_interval: 5m
      scrape_timeout: 30s
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - action: drop
          source_labels: [__meta_kubernetes_pod_container_init]
          regex: true
        - action: keep_if_equal
          source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_container_port_number]
        - source_labels:
            [__meta_kubernetes_service_annotation_prometheus_io_scrape_slow]
          action: keep
          regex: true
        - source_labels:
            [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels:
            [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels:
            [
              __address__,
              __meta_kubernetes_service_annotation_prometheus_io_port,
            ]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
        - source_labels: [__meta_kubernetes_pod_node_name]
          action: replace
          target_label: kubernetes_node
    - job_name: "kubernetes-services"
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
        - role: service
      relabel_configs:
        - source_labels:
            [__meta_kubernetes_service_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          target_label: kubernetes_name
    - job_name: "kubernetes-pods"
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - action: drop
          source_labels: [__meta_kubernetes_pod_container_init]
          regex: true
        - action: keep_if_equal
          source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_container_port_number]
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels:
            [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
  • Install vmagent via Helm:
➜ helm install vmagent vm/victoria-metrics-agent -f  guide-vmcluster-vmagent-values.yaml
W0707 11:47:16.213788   30348 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0707 11:47:16.711587   30348 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: vmagent
LAST DEPLOYED: Thu Jul  7 11:47:16 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
  • Check the vmagent logs to make sure it is working:
➜ kubectl logs pod/vmagent-victoria-metrics-agent-57fd5c67d5-2jm5l
{"ts":"2022-07-07T04:47:32.559Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:12","msg":"build version: vmagent-20220621-071016-tags-v1.78.0-0-g091408be6"}
{"ts":"2022-07-07T04:47:32.559Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:13","msg":"command line flags"}
{"ts":"2022-07-07T04:47:32.559Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.enable\"=\"true\""}
{"ts":"2022-07-07T04:47:32.559Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.prefix\"=\"VM_\""}
{"ts":"2022-07-07T04:47:32.559Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"loggerFormat\"=\"json\""}
{"ts":"2022-07-07T04:47:32.560Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"promscrape.config\"=\"/config/scrape.yml\""}
{"ts":"2022-07-07T04:47:32.560Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"remoteWrite.tmpDataPath\"=\"/tmpData\""}
{"ts":"2022-07-07T04:47:32.560Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"remoteWrite.url\"=\"secret\""}
{"ts":"2022-07-07T04:47:32.560Z","level":"info","caller":"VictoriaMetrics/app/vmagent/main.go:101","msg":"starting vmagent at \":8429\"..."}
{"ts":"2022-07-07T04:47:32.560Z","level":"info","caller":"VictoriaMetrics/lib/memory/memory.go:42","msg":"limiting caches to 2476393267 bytes, leaving 1650928845 bytes to the OS according to -memory.allowedPercent=60"}
{"ts":"2022-07-07T04:47:32.575Z","level":"info","caller":"VictoriaMetrics/lib/persistentqueue/fastqueue.go:59","msg":"opened fast persistent queue at \"/tmpData/persistent-queue/1_AA56387F2518752A\" with maxInmemoryBlocks=400, it contains 0 pending bytes"}
{"ts":"2022-07-07T04:47:32.576Z","level":"info","caller":"VictoriaMetrics/app/vmagent/remotewrite/client.go:176","msg":"initialized client for -remoteWrite.url=\"1:secret-url\""}
{"ts":"2022-07-07T04:47:32.576Z","level":"info","caller":"VictoriaMetrics/app/vmagent/main.go:126","msg":"started vmagent in 0.016 seconds"}
{"ts":"2022-07-07T04:47:32.576Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/scraper.go:103","msg":"reading Prometheus configs from \"/config/scrape.yml\""}
{"ts":"2022-07-07T04:47:32.576Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:93","msg":"starting http server at http://127.0.0.1:8429/"}
{"ts":"2022-07-07T04:47:32.577Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:94","msg":"pprof handlers are exposed at http://127.0.0.1:8429/debug/pprof/"}
{"ts":"2022-07-07T04:47:32.583Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/config.go:114","msg":"starting service discovery routines..."}
  • Next we will install Grafana to visualize the metrics stored in VictoriaMetrics:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
  • Install Grafana, using service type LoadBalancer to expose the Grafana dashboard:
cat <<EOF | helm install my-grafana grafana/grafana -f -
  service:
    type: LoadBalancer
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: victoriametrics
          type: prometheus
          orgId: 1
          url: http://vmcluster-victoria-metrics-cluster-vmselect.default.svc.cluster.local:8481/select/0/prometheus/
          access: proxy
          isDefault: true
          updateIntervalSeconds: 10
          editable: true

  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: true
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default

  dashboards:
    default:
      victoriametrics:
        gnetId: 11176
        revision: 18
        datasource: victoriametrics
      vmagent:
        gnetId: 12683
        revision: 7
        datasource: victoriametrics
      kubernetes:
        gnetId: 14205
        revision: 1
        datasource: victoriametrics
EOF
W0707 12:45:12.018558   30830 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0707 12:45:12.095824   30830 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0707 12:45:13.962861   30830 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0707 12:45:13.965152   30830 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: my-grafana
LAST DEPLOYED: Thu Jul  7 12:45:11 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
  • Get the Grafana admin password:
kubectl get secret --namespace default my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
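
  • Since the service type is LoadBalancer, grab the external IP and log in as admin with that password (a sketch; the IP can take a minute to be assigned):

kubectl get svc my-grafana -o jsonpath='{.status.loadBalancer.ingress[0].ip}' ; echo
# then open http://<EXTERNAL-IP>/ and log in as "admin"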
  • kube-state-metric grafana dashboard: https://grafana.com/grafana/dashboards/13332

  • vmalert grafana dashboard: https://grafana.com/grafana/dashboards/14950

  • Create a ConfigMap with alert rules for vmalert:
cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: vmalert-victoria-metrics-alert-server-alert-rules-config
  namespace: default
data:
  alert-rules.yaml: |-
    groups:
    - name: k8s
      rules:
      - alert: KubernetesNodeReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          alert_level: high
          alert_type: state
          alert_source_type: k8s
        annotations:
          summary: "Kubernetes Node ready (instance {{ $labels.instance }})"
          description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: KubernetesMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 5m
        labels:
          alert_level: middle
          alert_type: mem
          alert_source_type: k8s
        annotations:
          summary: "Kubernetes memory pressure (instance {{ $labels.instance }})"
          description: "{{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: KubernetesPodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
        for: 5m
        labels:
          alert_level: middle
          alert_type: state
          alert_source_type: k8s
        annotations:
          summary: "Kubernetes pod crash looping (instance {{ $labels.instance }})"
          description: "Pod {{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    ## pod
    - name: pod
      rules:
      - alert: ContainerMemoryUsage
        expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
        for: 5m
        labels:
          alert_level: middle
          alert_type: mem
          alert_source_type: pod
        annotations:
          summary: "Container Memory usage (instance {{ $labels.instance }})"
          description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    ## kvm
    - name: kvm
      rules:
      - alert: VirtualMachineDown
        expr: up{machinetype="virtualmachine"} == 0
        for: 2m
        labels:
          alert_level: high
          alert_type: state
          alert_source_type: kvm
        annotations:
          summary: "Prometheus VirtualMachine target missing (instance {{ $labels.instance }})"
          description: "A Prometheus VirtualMachine target has disappeared. An exporter might have crashed.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostUnusualDiskWriteLatency
        expr: rate(node_disk_write_time_seconds_total{machinetype="virtualmachine"}[1m]) / rate(node_disk_writes_completed_total{machinetype="virtualmachine"}[1m]) > 100
        for: 5m
        labels:
          alert_level: middle
          alert_type: disk
          alert_source_type: kvm
        annotations:
          summary: "Host unusual disk write latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",machinetype="virtualmachine"}[5m])) * 100) > 80
        for: 5m
        labels:
          alert_level: middle
          alert_type: cpu
          alert_source_type: kvm
        annotations:
          summary: "Host high CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostSwapIsFillingUp
        expr: (1 - (node_memory_SwapFree_bytes{machinetype="virtualmachine"} / node_memory_SwapTotal_bytes{machinetype="virtualmachine"})) * 100 > 80
        for: 5m
        labels:
          alert_level: middle
          alert_type: mem
          alert_source_type: kvm
        annotations:
          summary: "Host swap is filling up (instance {{ $labels.instance }})"
          description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostUnusualNetworkThroughputIn
        expr: sum by (instance) (irate(node_network_receive_bytes_total{machinetype="virtualmachine"}[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          alert_level: middle
          alert_type: network
          alert_source_type: kvm
        annotations:
          summary: "Host unusual network throughput in (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes{machinetype="virtualmachine"} / node_memory_MemTotal_bytes{machinetype="virtualmachine"} * 100 < 10
        for: 5m
        labels:
          alert_level: middle
          alert_type: mem
          alert_source_type: kvm
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    # node-exporter
    - name: machine
      rules:
      - alert: MachineDown
        expr: up{machinetype="physicalmachine"} == 0
        for: 2m
        labels:
          alert_level: high
          alert_type: state
          alert_source_type: machine
        annotations:
          summary: "Prometheus Machine target missing (instance {{ $labels.instance }})"
          description: "A Prometheus Machine target has disappeared. An exporter might have crashed.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostUnusualDiskWriteLatency
        expr: rate(node_disk_write_time_seconds_total{machinetype="physicalmachine"}[1m]) / rate(node_disk_writes_completed_total{machinetype="physicalmachine"}[1m]) > 100
        for: 5m
        labels:
          alert_level: middle
          alert_type: disk
          alert_source_type: machine
        annotations:
          summary: "Host unusual disk write latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",machinetype="physicalmachine"}[5m])) * 100) > 80
        for: 5m
        labels:
          alert_level: middle
          alert_type: cpu
          alert_source_type: machine
        annotations:
          summary: "Host high CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostSwapIsFillingUp
        expr: (1 - (node_memory_SwapFree_bytes{machinetype="physicalmachine"} / node_memory_SwapTotal_bytes{machinetype="physicalmachine"})) * 100 > 80
        for: 5m
        labels:
          alert_level: middle
          alert_type: mem
          alert_source_type: machine
        annotations:
          summary: "Host swap is filling up (instance {{ $labels.instance }})"
          description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostUnusualNetworkThroughputIn
        expr: sum by (instance) (irate(node_network_receive_bytes_total{machinetype="physicalmachine"}[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          alert_level: middle
          alert_type: network
          alert_source_type: machine
        annotations:
          summary: "Host unusual network throughput in (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes{machinetype="physicalmachine"} / node_memory_MemTotal_bytes{machinetype="physicalmachine"} * 100 < 10
        for: 5m
        labels:
          alert_level: middle
          alert_type: mem
          alert_source_type: machine
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{machinetype="physicalmachine"} * 100) / node_filesystem_size_bytes{machinetype="physicalmachine"} < 10
        for: 5m
        labels:
          alert_level: middle
          alert_type: disk
          alert_source_type: machine
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostDiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs",machinetype="physicalmachine"}[1h], 4 * 3600) < 0
        for: 5m
        labels:
          alert_level: middle
          alert_type: disk
          alert_source_type: machine
        annotations:
          summary: "Host disk will fill in 4 hours (instance {{ $labels.instance }})"
          description: "Disk will fill in 4 hours at current write rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostOutOfInodes
        expr: node_filesystem_files_free{mountpoint="/rootfs",machinetype="physicalmachine"} / node_filesystem_files{mountpoint="/rootfs",machinetype="physicalmachine"} * 100 < 10
        for: 5m
        labels:
          alert_level: middle
          alert_type: disk
          alert_source_type: machine
        annotations:
          summary: "Host out of inodes (instance {{ $labels.instance }})"
          description: "Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostOomKillDetected
        expr: increase(node_vmstat_oom_kill{machinetype="physicalmachine"}[5m]) > 0
        for: 5m
        labels:
          alert_level: middle
          alert_type: state
          alert_source_type: machine
        annotations:
          summary: "Host OOM kill detected (instance {{ $labels.instance }})"
          description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostNetworkTransmitErrors
        expr: increase(node_network_transmit_errs_total{machinetype="physicalmachine"}[5m]) > 0
        for: 5m
        labels:
          alert_level: middle
          alert_type: network
          alert_source_type: machine
        annotations:
          summary: "Host Network Transmit Errors (instance {{ $labels.instance }})"
          description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
EOF
  • Next we install and configure vmalert:
cat <<EOF | helm install vmalert vm/victoria-metrics-alert -f -
server:
  datasource:
    url: "http://vmcluster-victoria-metrics-cluster-vmselect.default.svc.cluster.local:8481/select/0/prometheus/"
  remote:
    write:
      url: "http://vmcluster-victoria-metrics-cluster-vminsert.default.svc.cluster.local:8480/insert/0/prometheus/"
  extraArgs:
    envflag.enable: "true"
    envflag.prefix: VM_
    loggerFormat: json
  notifier:
    alertmanager:
      url: "http://vmalert-alertmanager.default.svc.cluster.local:9093"
  configMap: "vmalert-victoria-metrics-alert-server-alert-rules-config"
alertmanager:
  enabled: true
  image: prom/alertmanager
  tag: latest
  config:
    receivers:
      - name: devnull
        telegram_configs:
          - api_url: https://api.telegram.org
            bot_token: "xnxx"
            chat_id: 454062609
            parse_mode: "HTML"
EOF
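
  • Once the release settles, you can check that vmalert loaded the rule groups through its Prometheus-compatible API on port 8880 (the service name is the one used in the vmagent scrape config above; adjust if yours differs):

kubectl port-forward svc/vmalert-victoria-metrics-alert-server 8880
curl http://localhost:8880/api/v1/rules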
  • Optional: create a Karma dashboard for Alertmanager:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: karma
  name: karma
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: karma
  template:
    metadata:
      labels:
        app: karma
    spec:
      containers:
        - image: ghcr.io/prymitive/karma:v0.85
          name: karma
          ports:
            - containerPort: 8080
              name: http
          resources:
            limits:
              cpu: 400m
              memory: 400Mi
            requests:
              cpu: 200m
              memory: 200Mi
          env:
          - name: ALERTMANAGER_URI
            value: "http://vmalert-alertmanager.default.svc.cluster.local:9093"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: karma
  name: karma
  namespace: default
spec:
  ports:
    - name: http
      port: 8080
      targetPort: http
  selector:
    app: karma
  type: NodePort
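
  • Karma is exposed as a NodePort; for a quick look from a workstation you can also just port-forward the service and browse the alert overview:

kubectl port-forward svc/karma 8080:8080
# then open http://localhost:8080/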
  • Optional: create Promxy as a Prometheus-compatible proxy in front of vmselect, if you don't want to use vmui:
apiVersion: v1
data:
  config.yaml: |
    ### Promxy configuration: just point it at the VictoriaMetrics vmselect address and path prefix
    promxy:
      server_groups:
        - static_configs:
            - targets:
              - vmcluster-victoria-metrics-cluster-vmselect.default.svc.cluster.local:8481
          path_prefix: /select/0/prometheus
kind: ConfigMap
metadata:
  name: promxy-config
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: promxy
  name: promxy
  namespace: default
spec:
  ports:
  - name: promxy
    port: 8082
    protocol: TCP
    targetPort: 8082
  type: NodePort
  selector:
    app: promxy
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: promxy
  name: promxy
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: promxy
  template:
    metadata:
      labels:
        app: promxy
    spec:
      containers:
      - args:
        - "--config=/etc/promxy/config.yaml"
        - "--web.enable-lifecycle"
        command:
        - "/bin/promxy"
        image: quay.io/jacksontj/promxy:latest
        imagePullPolicy: Always
        livenessProbe:
          httpGet:
            path: "/-/healthy"
            port: 8082
          initialDelaySeconds: 3
        name: promxy
        ports:
        - containerPort: 8082
        readinessProbe:
          httpGet:
            path: "/-/ready"
            port: 8082
          initialDelaySeconds: 3
        volumeMounts:
        - mountPath: "/etc/promxy/"
          name: promxy-config
          readOnly: true

      - args:
        - "--volume-dir=/etc/promxy"
        - "--webhook-url=http://localhost:8082/-/reload"
        image: jimmidyson/configmap-reload:v0.1
        name: promxy-server-configmap-reload
        volumeMounts:
        - mountPath: "/etc/promxy/"
          name: promxy-config
          readOnly: true
      volumes:
      - configMap:
          name: promxy-config
        name: promxy-config
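
  • Once Promxy is running, anything that speaks the Prometheus HTTP API (Grafana included) can use it directly, without the tenant-prefixed vmselect path. A quick check:

kubectl port-forward svc/promxy 8082
curl 'http://localhost:8082/api/v1/query?query=up'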
