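A Deep Dive into the Kubernetes Monitoring Architecture and prometheus-adapter

This article was first published on Cylon的收藏册; please credit the original link when reposting.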

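Kubernetes Monitoring Architecture Design

Background of the Kubernetes monitoring design

According to the Kubernetes monitoring architecture ¹, metrics in a Kubernetes cluster fall into two groups: system metrics and service metrics. System metrics are generic metrics that are generally available from every monitored entity (for example, CPU and memory usage of containers and nodes). Service metrics are explicitly defined in application code and exported (for example, the number of 500 errors served by the API server).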

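Kubernetes further splits system metrics into two parts:

Core metrics are metrics that Kubernetes understands and uses for the operation of its internal components and core utilities. Examples include metrics used for scheduling (including the inputs to the algorithms for resource estimation, initial resources/vertical autoscaling (VPA), cluster autoscaling, and horizontal pod autoscaling excluding custom metrics), metrics used by Kube Dashboard, and metrics used by "kubectl top".
Non-core metrics are metrics that Kubernetes does not interpret. They are generally assumed to include the core metrics (though not necessarily in a format Kubernetes understands) plus additional metrics.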

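As a result, the Kubernetes monitoring architecture is designed with the following properties:

Core system metrics about nodes, pods, and containers are exposed through standard master APIs (today the master metrics API), so that core Kubernetes features do not depend on non-core components.
The kubelet exports only a limited set of metrics, namely those required for core Kubernetes components to operate correctly.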
…

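Monitoring pipelines

Kubernetes monitoring is split into two pipelines:

The core metrics pipeline consists of the kubelet, a resource estimator, a slimmed-down Heapster (metrics-server), and the master metrics API in the API server. Its metrics are consumed by core system components, such as scheduling logic (the scheduler and HPA based on system metrics) and simple UI components (such as kubectl top). This pipeline is not intended for integration with third-party monitoring systems.
The monitoring pipeline collects various metrics from the system and exposes them to end users and, through adapters, to the HPA (for custom metrics) and to Infrastore. Users can choose among many monitoring system vendors (for example Prometheus, metrics-server), or run none at all.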

Core Metrics Pipeline

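According to the Kubernetes monitoring design document, core metrics have the following properties:

They are collected by the kubelet and consumed only by Kubernetes system components, in support of "first-class resource isolation and utilization features".
They are not designed to be a user-facing API; instead they are kept as generic as possible to support future user-level components.

Core metrics fall into three categories:

CpuUsage: cumulative CPU usage time since the object was created.
MemoryUsage: working-set memory usage.
FilesystemUsage: filesystem usage, including bytes used and inodes used.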
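Because CpuUsage is cumulative, a consumer has to difference two samples to obtain a usage rate. The following is a minimal sketch of that arithmetic; the function name and the sample values are hypothetical and not taken from the kubelet.

```python
def cpu_cores_used(prev_seconds, prev_ts, curr_seconds, curr_ts):
    """Average number of CPU cores used between two cumulative samples.

    prev_seconds / curr_seconds: cumulative CPU time in seconds.
    prev_ts / curr_ts: wall-clock timestamps of the samples in seconds.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr_seconds - prev_seconds) / elapsed


# Two hypothetical samples taken 30 seconds apart: 15 CPU-seconds were
# consumed in that window, i.e. half a core on average.
print(cpu_cores_used(120.0, 1000.0, 135.0, 1030.0))  # 0.5
```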

Monitoring Pipeline

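According to the Kubernetes monitoring design document ¹, the monitoring pipeline is a system kept separate from the core Kubernetes components, which makes it more flexible. It can collect several types of metrics: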

Core system metrics
Non-core system metrics
Service metrics from user application containers
Service metrics from Kubernetes infrastructure containers (using Prometheus instrumentation)

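The monitoring pipeline is mainly used for HPA based on custom metrics: it provides a stateless API adapter that pulls metrics from the monitoring system and serves them to the HPA.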

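Metrics APIs

API categories

According to the monitoring architecture design document, Kubernetes defines two sets of metrics APIs: the Resource Metrics API and the Custom Metrics API. Kubernetes provides two implementations of the Resource Metrics API, Heapster and metrics-server, while the Custom Metrics API is implemented by the various monitoring vendors. Each API is described below.

Resource Metrics API: lets consumers access resource metrics (CPU and memory) for pods and nodes.
- The API is implemented by metrics-server and prometheus-adapter.
Custom Metrics API: lets consumers access arbitrary metrics that describe Kubernetes resources.
- Users can build their own API server on top of the kubernetes-sigs/custom-metrics-apiserver repository.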

Accessing the APIs

The Resource Metrics API is served under /apis/metrics.k8s.io/ and can be reached through a proxy started with kubectl proxy --port 8080:

$ kubectl proxy --port=8080
$ curl localhost:8080/apis/metrics.k8s.io/v1beta1/nodes

Alternatively, fetch it with kubectl get --raw:

$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq

The Custom Metrics API is served under /apis/custom.metrics.k8s.io/ and is accessed the same way, through that path.

# list the available metrics
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq
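When scripting against these endpoints it can help to build the request path in one place. The sketch below only covers the namespaced pod form used later in this article; the helper name and the sample namespace and metric are hypothetical, and the resulting string is meant to be passed to kubectl get --raw.

```python
CUSTOM_METRICS_ROOT = "/apis/custom.metrics.k8s.io/v1beta1"


def pod_metric_path(namespace, metric, pod="*"):
    """Path of a custom metric for one pod, or for all pods with "*"."""
    return f"{CUSTOM_METRICS_ROOT}/namespaces/{namespace}/pods/{pod}/{metric}"


# Hypothetical namespace and metric name, for illustration only.
print(pod_metric_path("default", "http_requests_per_second"))
```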

Prometheus-adapter

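The previous chapter introduced the Kubernetes monitoring architecture, which already makes the role of prometheus-adapter clear: it is an implementation of the custom.metrics.k8s.io API built on the Kubernetes custom-metrics-apiserver library, serving as a metrics adapter for the HPA that can turn any Prometheus metric into a metric the HPA can consume. Its full name is Kubernetes Custom Metrics Adapter for Prometheus.

The prometheus-adapter configuration file in detail

prometheus-adapter decides which metrics to expose and how to find them through a set of rules in its configuration file; each rule completes this "discovery" in four steps.

Each rule can be roughly divided into four parts, which map to the configuration file as follows:

Discovery, which specifies how the adapter should find all Prometheus metrics for this rule.
Association, which specifies how the adapter should determine which Kubernetes resources a particular metric is associated with.
Naming, which specifies how the adapter should expose the metric in the custom metrics API.
Querying, which specifies how a request for a particular metric on one or more Kubernetes objects should be turned into a query to Prometheus.

The configuration file looks like this; it is the official sample configuration (version 0.12 at the time of writing):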

rules:
# Each rule represents some naming and discovery logic.
# Each rule is executed independently of the others, so
# take care to avoid overlap.  As an optimization, rules
# with the same `seriesQuery` but different
# `name` or `seriesFilters` will use only one query to
# Prometheus for discovery.

# some of these rules are taken from the "default" configuration, which
# can be found in pkg/config/default.go

# this rule matches cumulative cAdvisor metrics measured in seconds
- seriesQuery: '{__name__=~"^container_.*",container!="POD",namespace!="",pod!=""}'
  resources:
    # skip specifying generic resource<->label mappings, and just
    # attach only pod and namespace resources by mapping label names to group-resources
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  # specify that the `container_` and `_seconds_total` suffixes should be removed.
  # this also introduces an implicit filter on metric family names
  name:
    # we use the value of the capture group implicitly as the API name
    # we could also explicitly write `as: "$1"`
    matches: "^container_(.*)_seconds_total$"
  # specify how to construct a query to fetch samples for a given series
  # This is a Go template where the `.Series` and `.LabelMatchers` string values
  # are available, and the delimiters are `<<` and `>>` to avoid conflicts with
  # the prometheus query language
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>,container!="POD"}[2m])) by (<<.GroupBy>>)'

# this rule matches cumulative cAdvisor metrics not measured in seconds
- seriesQuery: '{__name__=~"^container_.*_total",container!="POD",namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  seriesFilters:
  # since this is a superset of the query above, we introduce an additional filter here
  - isNot: "^container_.*_seconds_total$"
  name: {matches: "^container_(.*)_total$"}
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>,container!="POD"}[2m])) by (<<.GroupBy>>)'

# this rule matches cumulative non-cAdvisor metrics
- seriesQuery: '{namespace!="",__name__!~"^container_.*"}'
  name: {matches: "^(.*)_total$"}
  resources:
    # specify a generic mapping between resources and labels.  This
    # is a template, like the `metricsQuery` template, except with the `.Group`
    # and `.Resource` strings available.  It will also be used to match labels,
    # so avoid using template functions which truncate the group or resource.
    # Group will be converted to a form acceptable for use as a label automatically.
    template: "<<.Resource>>"
    # if we wanted to, we could also specify overrides here
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>,container!="POD"}[2m])) by (<<.GroupBy>>)'

# this rule matches only a single metric, explicitly naming it something else
# Its series query *must* return only a single metric family
- seriesQuery: 'cheddar{sharp="true"}'
  # this metric will appear as "cheesy_goodness" in the custom metrics API
  name: {as: "cheesy_goodness"}
  resources:
    overrides:
      # this should still resolve in our cluster
      brand: {group: "cheese.io", resource: "brand"}
  metricsQuery: 'count(cheddar{sharp="true"})'

# external rules are not tied to a Kubernetes resource and can reference any metric
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-metrics-not-related-to-kubernetes-objects
externalRules:
- seriesQuery: '{__name__="queue_consumer_lag",name!=""}'
  metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (name)
- seriesQuery: '{__name__="queue_depth",topic!=""}'
  metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (name)
  # Kubernetes metric queries include a namespace in the query by default
  # but you can explicitly disable namespaces if needed with "namespaced: false"
  # this is useful if you have an HPA with an external metric in namespace A
  # but want to query for metrics from namespace B
  resources:
    namespaced: false

# TODO: should we be able to map to a constant instance of a resource
# (e.g. `resources: {constant: [{resource: "namespace", name: "kube-system"}}]`)?

Discovery

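The Discovery part governs how the adapter finds the metrics to expose in the custom metrics API. It has two key fields: seriesQuery and seriesFilters.

seriesQuery specifies the Prometheus series query (as passed to the Prometheus /api/v1/series endpoint) used to find a set of Prometheus series. The adapter strips the label values from these series and uses the resulting metric-name/label-name combinations in later steps.

In many cases seriesQuery alone is enough to narrow down the list of Prometheus series. Sometimes, though (especially when two rules might otherwise overlap), additional filtering on metric names is useful, and that is what seriesFilters is for. After the list of series is returned from seriesQuery, the metric name of each series is run through every specified filter.

Filters come in two forms:

is: matches any series whose name matches the specified regular expression.
isNot: matches any series whose name does not match the specified regular expression.

For example: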

# match all cAdvisor metrics that aren't measured in seconds
seriesQuery: '{__name__=~"^container_.*_total",container!="POD",namespace!="",pod!=""}'
seriesFilters:
  - isNot: "^container_.*_seconds_total"
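The effect of is and isNot can be pictured as a name filter applied after the series query. The sketch below is an illustration only, not the adapter's code: the function name and the sample series names are hypothetical, and it assumes the regular expressions are matched unanchored unless they carry ^ or $ themselves.

```python
import re


def apply_series_filters(names, filters):
    """Keep the names that pass every filter.

    filters is a list of {"is": regex} or {"isNot": regex} dicts,
    mirroring the seriesFilters entries of a rule.
    """
    kept = []
    for name in names:
        passed = True
        for f in filters:
            if "is" in f and not re.search(f["is"], name):
                passed = False
            if "isNot" in f and re.search(f["isNot"], name):
                passed = False
        if passed:
            kept.append(name)
    return kept


# Hypothetical names returned by the seriesQuery.
names = [
    "container_cpu_usage_seconds_total",
    "container_network_receive_bytes_total",
    "container_fs_reads_total",
]
# Drops the series measured in seconds and keeps the other two.
print(apply_series_filters(names, [{"isNot": "^container_.*_seconds_total"}]))
```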

Association

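The Association part governs how the adapter determines which Kubernetes resources a given metric can be attached to. This is controlled by the resources field.

There are two ways to associate resources with a particular metric. In both cases, the value of the label becomes the name of the particular object.

One way is to declare that any label name matching a certain pattern refers to a "group-resource" derived from the label name. This is done with the template field. The pattern is a Go template in which the Group and Resource fields stand for the group and the resource.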

# any label `kube_<group>_<resource>` becomes <group>.<resource> in Kubernetes
resources:
  template: "kube_<<.Group>>_<<.Resource>>"

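The other way is to declare that one particular label represents one particular Kubernetes resource. This is done with the overrides field. Each override maps a Prometheus label to a Kubernetes group-resource. For example: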

# the microservice label corresponds to the apps.deployment resource
resources:
  overrides:
    microservice: 
      group: "apps"
      resource: "deployment"

Association thus offers two ways of relating Prometheus metrics to Kubernetes resources, and they can be combined as needed. This is a key step in implementing the custom metrics API.
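To make the two mechanisms concrete, the sketch below maps label names to group-resources using a template and overrides together. It is a simplified illustration, not the adapter's algorithm: the function names and the list of known resources are hypothetical, and it assumes that dots in a group become underscores in the label and that overrides take precedence over the template.

```python
def render_label(template, group, resource):
    """Fill a `resources.template` string for one group-resource."""
    # Assumption: dots in the group are turned into underscores so the
    # result is usable as a Prometheus label name.
    return (template
            .replace("<<.Group>>", group.replace(".", "_"))
            .replace("<<.Resource>>", resource))


def associate(labels, known_resources, template=None, overrides=None):
    """Map label names to (group, resource) pairs."""
    mapping = {}
    if template is not None:
        for group, resource in known_resources:
            label = render_label(template, group, resource)
            if label in labels:
                mapping[label] = (group, resource)
    # Assumption: explicit overrides win over the template.
    for label, group_resource in (overrides or {}).items():
        if label in labels:
            mapping[label] = group_resource
    return mapping


# Hypothetical label names on a series and resources known to the cluster.
labels = {"kube_apps_deployment", "microservice", "verb"}
known = [("apps", "deployment"), ("batch", "job")]
result = associate(
    labels,
    known,
    template="kube_<<.Group>>_<<.Resource>>",
    overrides={"microservice": ("apps", "deployment")},
)
print(result)
```

The verb label maps to nothing, so it is simply ignored when the metric is attached to Kubernetes objects.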

Naming

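The Naming part governs how Prometheus metric names are converted into metric names in the custom metrics API. This is controlled by the name field.

Naming works by specifying a pattern that extracts an API name from a Prometheus name, and optionally a transformation applied to the extracted value.

The pattern is given in the matches field as a regular expression. If not specified, it defaults to .* .

The transformation is given in the as field. You can reference any capture group defined in the matches field. If matches contains no capture groups, as defaults to $0. If it contains exactly one capture group, as defaults to $1. Otherwise, omitting as is an error. For example: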

# turn any name <name>_total into <name>_per_second
# e.g. http_requests_total becomes http_requests_per_second
name:
  matches: "^(.*)_total$"
  as: "${1}_per_second"
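The defaulting rules for matches and as can be sketched with Python's re module. This is an illustration under stated assumptions rather than the adapter's implementation: the function name is hypothetical, only the ${N} form of capture-group references is translated, and the result is built from the matched portion of the name.

```python
import re


def api_name(series, matches=".*", as_=None):
    """Return the custom metrics API name for a Prometheus series name."""
    pattern = re.compile(matches)
    if as_ is None:
        # Defaults described above: $0 without groups, $1 with one group.
        if pattern.groups == 0:
            as_ = "${0}"
        elif pattern.groups == 1:
            as_ = "${1}"
        else:
            raise ValueError("`as` is required with several capture groups")
    m = pattern.search(series)
    if m is None:
        return None  # the rule does not apply to this series
    # Translate Go-style ${N} references into Python's \g<N> syntax.
    python_template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", as_)
    return m.expand(python_template)


print(api_name("http_requests_total", "^(.*)_total$", "${1}_per_second"))
print(api_name("container_cpu_usage_seconds_total",
               "^container_(.*)_seconds_total$"))
```

The first call yields http_requests_per_second; the second relies on the single-capture-group default and yields cpu_usage.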

Querying

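The Querying part governs how the value of a particular metric is actually fetched. It is controlled by the metricsQuery field.

metricsQuery is a Go template that is turned into a Prometheus query using input from a particular call to the custom metrics API. A given call is reduced to a metric name, a group-resource, and one or more objects of that group-resource. These become the following template fields:

Series: the metric name.
LabelMatchers: a comma-separated list of label matchers that match the given objects. Currently this includes the label for the particular group-resource and, where applicable, the namespace label.
GroupBy: a comma-separated list of labels to group by. Currently this contains the group-resource label used in LabelMatchers.

For example, suppose we have a series http_requests_total (exposed in the API as http_requests_per_second) with the labels service, pod, ingress, namespace, and verb. The first four correspond to Kubernetes resources. If someone requests the metric pods/http_requests_per_second for the pods pod1 and pod2 in the somens namespace, we get:

Series: "http_requests_total"
LabelMatchers: pod=~"pod1|pod2",namespace="somens"
GroupBy: pod

The corresponding PromQL query is: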

sum(http_requests_total{pod=~"pod1|pod2",namespace="somens"}) by (pod)
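The substitution of Series, LabelMatchers, and GroupBy can be imitated with plain string replacement. The sketch below is not how the adapter renders its Go templates; the function name is hypothetical, and it assumes a single value produces an = matcher while several values are joined with | into an =~ matcher.

```python
def render_metrics_query(template, series, label_values, group_by):
    """Expand the <<.Series>>, <<.LabelMatchers>> and <<.GroupBy>> fields."""
    matchers = []
    for label, values in label_values.items():
        if len(values) == 1:
            matchers.append(f'{label}="{values[0]}"')
        else:
            joined = "|".join(values)
            matchers.append(f'{label}=~"{joined}"')
    fields = {
        "<<.Series>>": series,
        "<<.LabelMatchers>>": ",".join(matchers),
        "<<.GroupBy>>": ",".join(group_by),
    }
    query = template
    for placeholder, value in fields.items():
        query = query.replace(placeholder, value)
    return query


query = render_metrics_query(
    "sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)",
    series="http_requests_total",
    label_values={"pod": ["pod1", "pod2"], "namespace": ["somens"]},
    group_by=["pod"],
)
print(query)
```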

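In addition, there are two advanced fields that are "raw" forms of the other fields:

LabelValuesByName: a map from the labels in LabelMatchers to their values. The values are pre-joined with | (for use with the =~ matcher in Prometheus).
GroupBySlice: the slice form of GroupBy.

In general, you will probably want the Series, LabelMatchers, and GroupBy fields. The other two are for advanced usage.

The query is expected to return one value for each requested object. The adapter uses the labels on the returned series to associate each series back with its object. For example: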

# convert cumulative cAdvisor metrics into rates calculated over 2 minutes
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>,container!="POD"}[2m])) by (<<.GroupBy>>)'

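A complete configuration example

For example, suppose we want to compute memory utilization from the jvm_memory_used_bytes and jvm_memory_max_bytes metrics exposed by the Spring Boot actuator, as shown below: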

rules:
- seriesQuery: 'jvm_memory_used_bytes'
  resources:
    overrides:
      namespace:
        resource: "namespace"
      pod:
        resource: "pod"
  name:
    matches: 'jvm_memory_used_bytes'
    as: memory_percent
  metricsQuery: 'sum(jvm_memory_used_bytes{<<.LabelMatchers>>}) by (<<.GroupBy>>) / sum(jvm_memory_max_bytes{<<.LabelMatchers>>}) by (<<.GroupBy>>) * 100'
- seriesQuery: 'process_cpu_usage'
  resources:
    overrides:
      namespace:
        resource: "namespace"
      pod:
        resource: "pod"  
  name:
    matches: 'process_cpu_usage'
    as: process_cpu_percent
  metricsQuery: 'sum(avg_over_time(process_cpu_usage{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'

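This uses a trick: querying more than one metric in a single rule, following the prometheus-adapter documentation ⁴:

It is fairly straightforward, if a bit non-obvious at first.

Basically, you pick one metric to act as the "discovery" and "naming" metric and use it to configure the "discovery" and "naming" parts of the rule. After that, you can write whichever metrics you want in metricsQuery. The query can contain any metrics you like, as long as they carry the right set of labels.

For example, suppose you have two metrics, foo_total and foo_count, both with a label system_name that represents the node resource. The configuration would then be: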

rules:
- seriesQuery: 'foo_total'
  resources: {overrides: {system_name: {resource: "node"}}}
  name:
    matches: 'foo_total'
    as: 'foo'
  metricsQuery: 'sum(foo_total{<<.LabelMatchers>>}) by (<<.GroupBy>>) / sum(foo_count{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

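Because our rule needs both jvm_memory_used_bytes and jvm_memory_max_bytes, we use one of them for the "discovery" and "naming" parts and write the real expression over both metrics in the "querying" part.

Querying the metrics from Kubernetes

Once the configuration is in place, the available metrics can be listed with the following command: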

$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"|jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "pods/process_cpu_percent",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "pods/memory_percent",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/memory_percent",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/process_cpu_percent",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

The actual values can then be queried through the custom metrics API, as shown below:

$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/public/pods/*/process_cpu_percent"|jq
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "msg",
        "name": "message-gateway-api-78c4d5cdbf-9k2g7",
        "apiVersion": "/v1"
      },
      "metricName": "process_cpu_percent",
      "timestamp": "2024-05-31T11:40:25Z",
      "value": "404m",
      "selector": null
    },
    
    ...
    
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "msg",
        "name": "message-core-79fdc6fdd-lkpdm",
        "apiVersion": "/v1"
      },
      "metricName": "process_cpu_percent",
      "timestamp": "2024-05-31T11:40:25Z",
      "value": "31m",
      "selector": null
    },
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "msg",
        "name": "message-push-admin-554f5d96fd-xlnhj",
        "apiVersion": "/v1"
      },
      "metricName": "process_cpu_percent",
      "timestamp": "2024-05-31T11:40:25Z",
      "value": "487m",
      "selector": null
    }
  ]
}

As we can see, the returned values carry an "m" suffix. An answer on a related issue explains it:

The m-suffix means milli, Quantity Values are explained here: https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/walkthrough.md#quantity-values ⁵

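The most common suffix in the metrics API is m, which means milli-units, that is, thousandths of a unit. The value "487m" above is therefore 0.487. Assuming process_cpu_usage is reported as a fraction between 0 and 1, this corresponds to a CPU usage of roughly 48.7%.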
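Converting such values back into plain numbers is simple arithmetic. The sketch below handles only a few decimal suffixes and is not the Kubernetes quantity parser; the function name and the suffix table are chosen for illustration, and binary suffixes such as Ki or Mi are deliberately left out.

```python
from decimal import Decimal

# A subset of decimal suffixes; "m" is milli, one thousandth of a unit.
SUFFIXES = {
    "n": Decimal("1e-9"),
    "u": Decimal("1e-6"),
    "m": Decimal("1e-3"),
    "": Decimal(1),
    "k": Decimal("1e3"),
    "M": Decimal("1e6"),
    "G": Decimal("1e9"),
}


def parse_quantity(text):
    """Convert a quantity string such as "487m" into a Decimal."""
    suffix = text[-1] if text[-1].isalpha() else ""
    number = text[:-1] if suffix else text
    return Decimal(number) * SUFFIXES[suffix]


for raw in ("404m", "31m", "487m"):
    print(raw, "=", parse_quantity(raw))
```

Applied to the values returned above, this gives 0.404, 0.031, and 0.487.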

Installing prometheus-adapter

Here we install it with Helm; only the relevant parameters need to be adjusted:

helm install prometheus-adapter -n monitoring  prometheus-community/prometheus-adapter \
	--set prometheus.url=http://prometheus.default.svc \
	--set logLevel=2 \
	--set rules.existing=xxx # to replace the default config.yaml with your own rules, create a ConfigMap first and pass its name here