
Multi-Cluster Kubernetes Monitoring with Prometheus and Mimir

September 19, 2023 · 1,813 words · about 4 minutes to read

Introduction

As you may know, my previous monitoring stack was built on Prometheus + Thanos. But Thanos has too many components, with all kinds of interconnections between them, and once you move to multi-cluster monitoring that sprawl makes maintenance painful. So I decided to replace Thanos with Grafana's Mimir.

About Mimir

If you already know Thanos, Mimir is easy to pick up, because many of its components overlap with Thanos's. These are the Thanos components:

  • Compactor
  • Querier/Query
  • Query Frontend
  • Receiver
  • Rule (Aka Ruler)
  • Sidecar
  • Store
  • Tools

And these are Mimir's components:

  • Compactor
  • Distributor
  • Ingester
  • Querier
  • Query-frontend
  • Store-gateway
  • (Optional) Alertmanager
  • (Optional) Overrides-exporter
  • (Optional) Query-scheduler
  • (Optional) Ruler

With Thanos you have to understand exactly what each component does. When you start out with Mimir you can get away without that level of detail, because Mimir offers three deployment modes:

  • Monolithic mode
  • Microservices mode
  • Read-write mode

Monolithic mode is the simplest: every component runs in a single process. It suits monitoring systems with relatively low volume.

Microservices mode deploys each component separately. It is somewhat more complex and suits large-scale setups.

If you don't want that complexity but still have a fair amount of volume, try read-write mode, which resembles read/write splitting in databases.
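The mode is selected with Mimir's target option (the -target flag, or the target field in the config file). A quick sketch of the documented values, for orientation:

# Monolithic mode: run every component in one process
target: all

# Microservices mode: one component per process, for example
# target: distributor
# target: ingester

# Read-write mode: three logical processes
# target: read
# target: write
# target: backend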

Since I didn't want anything as complex as my old Thanos setup, I chose monolithic mode for the Kubernetes deployment; there is no need to start out complicated anyway. My Kubernetes YAML is adapted from

https://github.com/grafana/mimir/tree/main/docs/sources/mimir/get-started/play-with-grafana-mimir

Deployment overview

Everything here is deployed on Kubernetes: all YAML is built with kustomize, pushed to git, and rolled out by Argo CD, so I can easily see what changed with every commit. Here is the Mimir directory layout (an Argo CD sketch follows it):

mimir
|____mimir
| |____mimir-1
| |____mimir-0
| |____grafana
| |____minio
| |____mimir-2
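For reference, a minimal sketch of an Argo CD Application that could deploy this directory; the repo URL, path, and project are hypothetical placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: mimir
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/monitoring.git   # hypothetical repo
    targetRevision: main
    path: mimir/mimir
  destination:
    server: https://kubernetes.default.svc
    namespace: mimir
  syncPolicy:
    automated: {}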

Deploying MinIO

Mimir does not support as many storage backends as Thanos, but I don't think that matters much; S3 support alone is enough. These are the object stores Mimir supports:

  • Amazon S3 (and compatible implementations like MinIO)
  • Google Cloud Storage
  • Azure Blob Storage
  • Swift (OpenStack Object Storage)

More backends will surely be supported over time.

I defined MinIO as a StatefulSet. Pick whatever backing PVC suits your environment; there is nothing special here. Beyond the StatefulSet, just create a Service so that Mimir can connect to it.

sts.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: mimir
spec:
  selector:
    matchLabels:
      app: minio 
  serviceName: "minio"
  replicas: 1 
  template:
    metadata:
      labels:
        app: minio 
    spec:
      containers:
      - name: minio
        image: minio/minio:RELEASE.2023-08-23T10-07-06Z
        command:
          - sh
          - -c
          - mkdir -p /data/mimir && minio server /data --console-address :9001
        env:
        - name: MINIO_ROOT_USER
          value: mimir
        - name: MINIO_ROOT_PASSWORD
          value: xxxxxxxxxxxxxxxxx
        ports:
        - containerPort: 9000
          name: http
        - containerPort: 9001
          name: web
        volumeMounts:
        - name: minio-data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: minio-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "xxxxxxxxxxxxxxxxxxx"
      resources:
        requests:
          storage: 500Gi

svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: mimir
spec:
  selector:
    app: minio
  type: ClusterIP
  ports:
  - name: minio
    protocol: TCP
    port: 9000
    targetPort: 9000

kustomization.yaml

resources:
- ./sts.yaml
- ./svc.yaml

Deploying mimir-0

For Mimir itself I went with a Deployment plus a PVC.

deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name:  mimir-0
  namespace: mimir
  labels:
    name: mimir
    app: mimir-0
spec:
  selector:
    matchLabels:
      app: mimir-0
  replicas: 1
  template:
    metadata:
      labels:
        app:  mimir-0
        name: mimir
    spec:
      containers:
      - name: mimir
        image: grafana/mimir:2.9.0
        args:
          - -config.file=/etc/mimir.yaml
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 7946
          name: memberlist
        volumeMounts:
        - name: mimir-conf
          mountPath: /etc/mimir.yaml
          subPath: mimir.yaml
        - name:  mimir-data
          mountPath: /tsdb
      volumes:
        - name: mimir-conf
          configMap:
            name: mimir-conf
            items:
            - key: mimir.yaml
              path: mimir.yaml
        - name:  mimir-data
          persistentVolumeClaim:
            claimName: mimir-0
      restartPolicy: Always

pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mimir-0
  namespace: mimir
  labels:
    app: mimir-0
spec:
  storageClassName: xxxxxxxxxxxxxx
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: mimir-0
  namespace: mimir
spec:
  selector:
    app: mimir-0
  type: ClusterIP
  ports:
  - name: mimir
    protocol: TCP
    port: 7946
    targetPort: 7946

Deploying mimir-x

Because Mimir scales horizontally, you can keep adding nodes indefinitely; apart from the name, everything is identical. I deployed three nodes:

  • mimir-0
  • mimir-1
  • mimir-2

All three nodes use exactly the same configuration, shown below:

mimir.yaml

target: all

common:
  storage:
    backend: s3
    s3:
      endpoint: minio:9000
      access_key_id: mimir
      secret_access_key: xxxxxxxxxxxxxxx
      insecure: true
      bucket_name: mimir

blocks_storage:
  s3:
    bucket_name: mimir-blocks

memberlist:
  join_members: [mimir-0.mimir.svc.cluster.local, mimir-1.mimir.svc.cluster.local, mimir-2.mimir.svc.cluster.local]

server:
  log_level: info


limits:
  # Per-tenant ingestion limits: samples per second, plus allowed burst size
  ingestion_rate: 100000
  ingestion_burst_size: 2000000

The config is largely self-explanatory: common.storage points every component at the same MinIO endpoint, blocks_storage overrides the bucket used for TSDB blocks, and memberlist is how the three nodes discover each other and form the ring.

Next, define a Service covering the whole Mimir cluster, to bind to the Ingress:

apiVersion: v1
kind: Service
metadata:
  name: mimir
  namespace: mimir
spec:
  selector:
    name: mimir
  type: ClusterIP
  ports:
  - name: mimir
    protocol: TCP
    port: 8080
    targetPort: 8080

Because this is multi-cluster monitoring, the clusters do not share a network, so for safety the connection between Prometheus and Mimir should be authenticated. For simplicity I went with HTTP basic auth.

So first, create a Secret holding the basic auth username and password.

auth.yaml

apiVersion: v1
data:
  auth: xxxxxxxxxxxxxxxxxxx
kind: Secret
metadata:
  name: basic-auth
  namespace: mimir
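The auth value is the base64 of an htpasswd file, which is what the nginx ingress controller expects. Assuming a user called mimir, it can be generated like this (kubectl does the base64 encoding for you):

# Create the htpasswd file; you will be prompted for a password
htpasswd -c auth mimir

# Create the secret with the key "auth", matching the manifest above
kubectl -n mimir create secret generic basic-auth --from-file=auth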

Then the Ingress:

ingress.yaml

apiVersion: v1
kind: Secret
metadata:
  name: mimir
  namespace: mimir
type: kubernetes.io/tls
data:
  tls.crt: xxxxxxxxxxxxxxxxxx
  tls.key: xxxxxxxxxxxxxxxxxx

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mimir
  namespace: mimir
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "180"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "180"
    nginx.ingress.kubernetes.io/proxy-body-size: "1024m"
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
    nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required'
spec:
  tls:
  - hosts:
      - mimir.xxxxxxxxxxxx.com
    secretName: mimir
  rules:
  - host: mimir.xxxxxxxxxxxx.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: mimir
            port:
              number: 8080
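Once the Ingress is applied, a quick smoke test with curl confirms both the auth and the endpoint; Mimir exposes a /ready endpoint on the same port (the host and credentials below are placeholders):

# Without credentials this should return 401 from nginx
curl -i https://mimir.xxxxxxxxxxxx.com/ready

# With valid credentials it should return 200 and the body "ready"
curl -u mimir:password https://mimir.xxxxxxxxxxxx.com/ready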

Finally, the top-level kustomization ties it all together:

kustomization.yaml

resources:
- ./mimir-0
- ./mimir-1
- ./mimir-2
- ./minio

- ./ingress.yaml
- ./svc.yaml
- ./auth.yaml

namespace: mimir

configMapGenerator:
- name: mimir-conf
  files:
  - ./mimir.yaml
  options: 
    disableNameSuffixHash: true

Setting up Grafana

I will skip the Grafana setup itself, as there is not much to say; the only Mimir-specific part is the data source, sketched below.
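A sketch of what the data source could look like with Grafana's file-based provisioning; monolithic Mimir serves its Prometheus-compatible query API under /prometheus by default, and the X-Scope-OrgID header has to match the tenant ID you push with (all values here are placeholders):

apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    url: http://mimir.mimir.svc.cluster.local:8080/prometheus
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: xxxxxx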

Setting up Prometheus

Prometheus must run in agent mode here. Otherwise part of the data is stored directly in Prometheus itself, and after adding the Mimir data source in Grafana you are missing the most recent two hours of data by default.

I defined Prometheus as a Deployment as well.

deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name:  prometheus
  namespace: prometheus
  labels:
    app:  prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  replicas: 1
  template:
    metadata:
      labels:
        app:  prometheus
    spec:
      securityContext:
        runAsUser: 0
      hostAliases:
        - ip: 1.1.1.1
          hostnames:
            - mimir.xxxxxxxxx.com
      serviceAccountName: prometheus
      containers:
      - name:  prometheus
        image:  prom/prometheus:v2.46.0
        ports:
          - containerPort: 9090
            name: web
        args:
        - --enable-feature=agent
        - --config.file=/etc/prometheus.yaml
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus.yaml
          subPath: prometheus.yaml
        - name: prometheus-data
          mountPath: /prometheus
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name:  prometheus-data
        persistentVolumeClaim:
          claimName: prometheus
      restartPolicy: Always

I used hostAliases here so that no public DNS record is needed, which also tightens security a little.

Note:

--enable-feature=agent

This flag enables agent mode, in which Prometheus keeps only the WAL it needs for remote_write and disables local querying, alerting, and recording rules.

pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: prometheus
  labels:
    app: prometheus
spec:
  storageClassName: xxxxxxxxxxxx
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

And then the Prometheus configuration:

prometheus.yaml

global:
  scrape_interval: 5s
  external_labels:
    cluster: xxxxxxxxxxxxx

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
        - "192.168.1.1:9100"

remote_write:
  - url: https://xxxxx:[email protected]/api/v1/push
    headers:
      X-Scope-OrgID: xxxxxx
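Embedding the credentials in the URL works, but Prometheus's remote_write also supports a dedicated basic_auth block, which keeps the URL clean; an equivalent sketch (placeholders again):

remote_write:
  - url: https://xxxx.com/api/v1/push
    basic_auth:
      username: xxxxx
      password: xxxxx
    headers:
      X-Scope-OrgID: xxxxxx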

kustomization.yaml

resources:
- ./deploy.yaml
- ./pvc.yaml

namespace: prometheus
configMapGenerator:
- name: prometheus-config
  files:
  - ./prometheus.yaml
  options: 
    disableNameSuffixHash: true

Next steps

After that, it is just a matter of adding all sorts of exporters to collect data. You could also swap Prometheus for Grafana Agent, but Grafana Agent can only collect so much; it is not especially comprehensive, even if it covers most scenarios, so I find sticking with Prometheus the more elegant choice. (If you are curious anyway, a rough sketch follows.)
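For completeness, if you do want to try Grafana Agent, its static mode takes a config that looks roughly like this; a sketch under the assumption of Agent's static-mode schema, where wal_directory is required and the path is arbitrary:

metrics:
  wal_directory: /var/lib/agent/wal
  global:
    scrape_interval: 5s
    external_labels:
      cluster: xxxxxxxxxxxxx
  configs:
    - name: default
      scrape_configs:
        - job_name: node-exporter
          static_configs:
            - targets:
              - "192.168.1.1:9100"
      remote_write:
        - url: https://xxxx.com/api/v1/push
          basic_auth:
            username: xxxxx
            password: xxxxx
          headers:
            X-Scope-OrgID: xxxxxx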

Feel free to follow my blog: www.bboy.app

Have Fun