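Monitoring SNMP Devices with Prometheus


SNMP and Prometheus

Prometheus and SNMP use different data formats and protocols, so they cannot talk to each other directly. The bridge is snmp_exporter, an open-source exporter from the Prometheus project that collects data over SNMP and exposes it for Prometheus to scrape. It has two parts: the exporter itself, which performs the actual scraping, and a generator (which depends on NetSNMP) that builds the configuration the exporter uses.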

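Background

1. prometheus: the monitoring system; it collects the metrics and holds the alerting rule definitions

2. snmp-exporter: exposes the metrics of switches (and other devices) retrieved over the SNMP protocol

3. SNMP Exporter Config Generator: uses NetSNMP to parse MIBs and generates the snmp_exporter configuration file from them

4. MIB and OID: MIB stands for Management Information Base, a database for managing the entities in a communication network. The database is hierarchical (a tree), and every entry is addressed by an object identifier (OID)

5. SNMP protocol: SNMP is a standard protocol designed for managing network nodes (servers, workstations, routers, switches, hubs and so on) on an IP network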
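
Because OIDs form a tree, "walking" an OID means collecting every entry whose OID lies below it. The following is a minimal sketch of that prefix relationship; the helper names `parse_oid` and `is_under` are hypothetical and only illustrate the idea, they are not part of snmp_exporter.

```python
def parse_oid(text):
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def is_under(oid, subtree):
    """Return True if `oid` equals `subtree` or lies below it in the OID tree."""
    oid_t, sub_t = parse_oid(oid), parse_oid(subtree)
    return oid_t[:len(sub_t)] == sub_t


# The subtree walked in the examples of this document.
walked = "1.3.6.1.2.1.25.3.3.1.1"

# An instance below the walked column is part of the walk ...
print(is_under("1.3.6.1.2.1.25.3.3.1.1.196608", walked))
# ... while an instance of the neighbouring column is not.
print(is_under("1.3.6.1.2.1.25.3.3.1.2.196608", walked))
```

Comparing integer tuples rather than raw strings avoids false matches such as treating 1.3.6.10 as a child of 1.3.6.1.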

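Data flow:

Server (SNMP Agent) -> snmp_exporter -> Prometheus

Test environment:

The monitored SNMP target is the snmp service on a CentOS 7 host, used here to demonstrate collecting system metrics

Getting MIBs:

Ask the vendor for them, or look in the GitHub repository below, which contains many basic MIBs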

https://github.com/librenms/librenms/tree/master/mibs

Part 1. SNMP Service Configuration

1. Install the SNMP agent packages

yum install -y net-snmp-utils net-snmp net-snmp-devel net-snmp-libs net-snmp-perl mrtg

systemctl enable snmpd

systemctl start snmpd

2. Edit /etc/snmp/snmpd.conf, then restart the service

systemctl restart snmpd

3. Test the SNMP protocol

Local test: snmpwalk -v 2c -c public localhost 1.3.6.1.2.1.25.3.3.1.1

Remote test: snmpwalk -v 2c -c public 172.16.0.112 1.3.6.1.2.1.25.3.3.1.1

Here SNMPv2-SMI refers to the SNMPv2-SMI.txt file in the /usr/share/snmp/mibs directory

snmptranslate -Tz -m ./SNMPv2-SMI.txt

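Part 2. Installing the snmp_exporter Server

A single snmp_exporter instance can serve thousands of SNMP agents, and it does not have to run on the same machine as the agents

1. Install the snmp_exporter service

Download the snmp_exporter release package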

wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz

tar -zxvf snmp_exporter-0.20.0.linux-amd64.tar.gz

mv snmp_exporter-0.20.0.linux-amd64 /usr/local/snmp_exporter

Add a systemd service

vim /usr/lib/systemd/system/snmp_exporter.service

[Unit]

Description=SNMP Exporter

After=network-online.target

# This assumes you are running snmp_exporter under the user "prometheus"

[Service]

Type=simple

Restart=on-failure

ExecStart=/usr/local/snmp_exporter/snmp_exporter --config.file=/usr/local/snmp_exporter/snmp.yml

RestartSec=5

[Install]

WantedBy=multi-user.target

systemctl daemon-reload

systemctl start snmp_exporter

systemctl enable snmp_exporter

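Part 3. Customizing SNMP Collection with the Generator

The default snmp.yml shipped by the project already covers most use cases. If you need to generate your own configuration from MIBs, customize which objects are walked, or use non-public MIBs, use the generator

The snmp_exporter setup above used the official default snmp.yml. Some devices are not covered by that file, and for those the configuration has to be generated from the MIB files

1. Install the build dependencies

Install the dependencies:

yum -y install gcc gcc-c++ make net-snmp net-snmp-utils net-snmp-libs net-snmp-devel git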

2. Install Go

Option 1: install Go with yum

yum install -y epel-release

yum install -y golang

[root@docker02 generator]# go version

go version go1.16.13 linux/amd64

Option 2: install Go from the official binary tarball on the local system

wget https://golang.google.cn/dl/go1.15.4.linux-amd64.tar.gz

tar -C /usr/local -xzf go1.15.4.linux-amd64.tar.gz

export PATH=$PATH:/usr/local/go/bin

echo 'export PATH=$PATH:/usr/local/go/bin' >> /etc/profile

go version

go env -w GOPROXY=https://goproxy.io,direct

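3. Build and configure the generator

git clone https://github.com/prometheus/snmp_exporter.git

cd snmp_exporter/generator

go build

### go build downloads Go dependencies from servers outside China; if it hangs and the downloads never finish, set a module proxy and run it again

export GOPROXY=https://goproxy.io

export GO111MODULE=on

make mibs

After these steps a mibs directory appears in the current directory, containing the downloaded MIB files. If part of the OID tree is vendor-specific, ask the vendor for the MIB files (note that names inside a MIB file must not be in Chinese) and place them in the mibs directory.

To use a custom private MIB, generator.yml has to be changed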

mv generator.yml generator.yml.bak

vim generator.yml; after editing, the file reads:

modules:
  test:
    walk:
      - 1.3.6.1.2.1.25.3.3.1.1
    version: 2
    auth:
      community: public

Run the following to generate a new snmp.yml

export MIBDIRS=mibs

or: export MIBDIRS=/usr/share/snmp/mibs/

./generator generate

Replace the snmp.yml used by snmp_exporter with the newly generated one

cp /opt/snmp_exporter/generator/snmp.yml /usr/local/snmp_exporter/snmp.yml

systemctl restart snmp_exporter

4. Verify the service

Open the snmp_exporter web page on port 9116, enter the module defined in generator.yml, click Submit, and check the returned result

It should match the output of snmpwalk -v 2c -c public 172.16.0.112 1.3.6.1.2.1.25.3.3.1.1

5. Alternative: generating snmp.yml in a Docker environment

To run the generator in Docker to produce the snmp.yml configuration, use the commands below.

The Docker image expects a directory containing generator.yml and a subdirectory named mibs that holds the MIBs you wish to use.

This example generates the sample snmp.yml that is included at the top level of the snmp_exporter repo:

make mibs

docker build -t snmp-generator .

docker run -ti -v "${PWD}:/opt/" snmp-generator generate

=====================================================

Or

mkdir generator

cd generator

vi generator.yml

docker run -it -v "${PWD}:/opt/" snmp-generator generate

docker run -it -v "${PWD}:/opt/" prom/snmp-generator:master generate

Part 4. Prometheus Configuration

1. Add the following job to the Prometheus configuration file

  - job_name: 'snmp'
    scrape_interval: 10s
    static_configs:
      - targets:
        - 172.16.0.167  # switch IP address
    metrics_path: /snmp
    params:
      module: [test]  # module name defined in generator.yml / the snmp_exporter config file
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.16.0.112:9116  # snmp_exporter address
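
The three relabel rules above are what turn "scrape the switch" into "ask snmp_exporter about the switch". As a rough illustration only, the sketch below replays those rules in plain Python and prints the URL that ends up being requested. The function name `apply_relabeling` is hypothetical; real relabeling happens inside Prometheus.

```python
from urllib.parse import urlencode


def apply_relabeling(target, exporter_addr, module):
    """Replay the three relabel rules of the 'snmp' job on one target."""
    labels = {"__address__": target}
    # Rule 1: copy the device address into the URL parameter "target".
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the device address as the visible "instance" label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual HTTP request to snmp_exporter instead.
    labels["__address__"] = exporter_addr
    query = urlencode({"module": module, "target": labels["__param_target"]})
    url = f"http://{labels['__address__']}/snmp?{query}"
    return labels["instance"], url


instance, url = apply_relabeling("172.16.0.167", "172.16.0.112:9116", "test")
print(instance)
print(url)
```

The instance label therefore still names the monitored device, even though every request goes to the exporter on port 9116.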

2. Reload the Prometheus configuration

curl -X POST 127.0.0.1:9090/-/reload

3. Query the data

http://172.16.1.84:9090/targets

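Part 5. Connecting Genesys to Prometheus

Prerequisites:

  1. The Genesys license version must be 11.16 or later; with an older version no data can be retrieved
  2. A working snmp_exporter server and the generator tool

Step 1: verify that the SNMP service works

For example, connect to Genesys with a MIB browser, import the Genesys MIB file GENESYS-SML-MIB-G7, select gServersTable, right-click Table View, and check whether data comes back

Step 2: import the MIB file on the snmp_exporter server

Copy GENESYS-SML-MIB-G7.txt into /usr/share/snmp/mibs/ and verify with the following command that data can be retrieved

snmpwalk -v 2c -m all -c public 172.16.5.51 gServerVersion

or

snmpwalk -v 2c -m GENESYS-SML-MIB-G71 -c public 172.16.5.51 gServerVersion

Step 3: create the generator.yml file

vim generator.yml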

modules:
  genesys:
    walk:
      - servers
      - genericServer
      - gsCleanupTimeout
      - gServersTable
      - gServersEntry
      - gServerId
      - gServerName
      - gServerStatus
      - gServerType
      - gServerVersion
      - gServerWorkDir
      - gServerCommandLine
      - gServerPID
      - gServerCommand
      - gServerDeleteClient
      - gServerControlTable
      - gServerControlEntry
      - gsCtrlServerID
      - gsCtrlTableID
      - gsCtrlRefreshStatus
      - gsCtrlLastRefreshed
      - gsCtrlAutomaticRefresh
      - gsCtrlRowStatus
      - gsInfoTable
      - gsInfoEntry
      - gsClientsExistNum
      - gsClientsTotalNum
      - gsServerConfigFile
      - gsLogTable
      - gsLogEntry
      - gsPollingTable
      - gsPollingEntry
      - gsPollingStatus
      - gsPollingInterval
      - gsPollingID
      - gsPollingLastTrap
      - gsClientTable
      - gsClientEntry
      - gsClientSocket
      - gsClientAppName
      - gsClientAuthorized
      - gsClientType
      - gsClientGotEvents
      - gsClientSentReqs
      - specificServer
      - tServer
      - tsInfoTable
      - tsInfoEntry
      - tsCallsExistNum
      - tsCallsTotalNum
      - tsLinksCommand
      - tsLastChangedLinkStatus
      - tsCallTable
      - tsCallEntry
      - callInstanceID
      - callConnID
      - callState
      - callCallID
      - callType
      - callReferenceID
      - callTimeStamp
      - callDNIS
      - callANI
      - callNumParties
      - callPartiesList
      - callCustomerID
      - callFirstTransferLocation
      - callFirstTransferDN
      - callLastTransferLocation
      - callLastTransferDN
      - tsDtaTable
      - tsDtaEntry
      - tsDtaInstanceID
      - tsDtaDigits
      - tsDtaMode
      - tsDtaState
      - tsDtaType
      - tsLinkTable
      - tsLinkEntry
      - tsLinkID
      - tsLinkName
      - tsLinkStatus
      - tsLinkProtocol
      - tsLinkSocket
      - tsLinkPID
      - tsLinkDelay
      - tsLinkPort
      - tsLinkAddress
      - tsLinkX25LocalAddress
      - tsLinkMode
      - tsLinkX25Device
      - tsLinkDTEClass
      - tsLinkTemplate
      - tsCallFilterTable
      - tsCallFilterEntry
      - fltCallCreatedBefore
      - fltCallCreatedAfter
      - fltCallUpdatedBefore
      - fltCallUpdatedAfter
      - fltClearCallByConnId
      - tsCallInfoTable
      - tsCallInfoEntry
      - callInfoInstanceID
      - callInfoConnID
      - callInfoType
      - callInfoCreationTimestamp
      - callInfoLastUpdatedTimestamp
      - callInfoInternalParties
      - tsLinkStatsTable
      - tsLinkStatsEntry
      - linkId
      - timeElapsedSec
      - numberMessagesTx
      - rateMessagesTx
      - numberMessagesRx
      - rateMessagesRx
      - hosts
      - hostsTable
      - hostsEntry
      - hostId
      - hostName
      - hostStatus
      - hostIPAddress
      - hostOSType
      - hostLCAPort
      - solutions
      - solutionsTable
      - solutionsEntry
      - solutionId
      - solutionName
      - solutionType
      - solutionStatus
      - solutionControlServer
      - solutionsComponentsTable
      - solutionsComponentsEntry
      - componentId
      - componentName
      - notifications
      - gsAlarm
      - gsMLAlarm
      - gsServerUpTrap
      - gsServerDownTrap
      - gsPollingSignal
      - tsLinkStatusTrap
      - genesysmibConformance
      - genesysmibGroups
      - gServersListGroup
      - genericServerControlGroup
      - genericServerInfoGroup
      - genericServerLogGroup
      - genericServerPollingGroup
      - genericServerClientGroup
      - tServerInfoGroup
      - tServerCallGroup
      - tServerDtaGroup
      - tServerLinkGroup
      - genericAlarmObjectGroup
      - specificAlarmObjectGroup
      - hostsGroup
      - solutionsGroup
      - solutionsComponentsGroup
      - notificationGroup
      - tServerCallFilterGroup
      - tServerCallInfoGroup
      - genesysmibCompliances
      - genesysmibCompliance
    version: 2
    auth:
      community: public

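Generate the snmp.yml file

./generator generate

Replace the default snmp_exporter configuration file

cp /opt/snmp_exporter/generator/snmp.yml /usr/local/snmp_exporter/snmp.yml

systemctl restart snmp_exporter

If data comes back, the service is retrieving data correctly

Step 4: Prometheus configuration

Same as in Part 4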

  - job_name: 'snmp'
    scrape_interval: 10s
    static_configs:
      - targets:
        - 172.16.5.51  # Genesys address
    metrics_path: /snmp
    params:
      module: [genesys]  # module name defined in generator.yml / the snmp_exporter config file
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.16.0.112:9116  # snmp_exporter address

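Alerting on this data is somewhat tricky: the values do not come back as simple key/value samples, so alert rules have to be written as PromQL expressions that join and match the relevant series.

====================================================================================

genesys: Prometheus alerting metrics

gServerStatus takes the values below; alerts currently fire only for states 1, 2, 3 and 6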

1: statusUnknown

2: statusStopped

3: statusPending

4: statusRunning

5: statusInitializing

6: statusServiceUnavailable

7: statusSuspending

8: statusSuspended
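
The status table and the alerting choice above can be written down as a small lookup, which is handy when turning raw gServerStatus values into readable messages. This is only an illustrative sketch; the names `G_SERVER_STATUS`, `ALERTING_STATES`, `describe` and `should_alert` are made up for this example.

```python
# Numeric gServerStatus values and their names, as listed above.
G_SERVER_STATUS = {
    1: "statusUnknown",
    2: "statusStopped",
    3: "statusPending",
    4: "statusRunning",
    5: "statusInitializing",
    6: "statusServiceUnavailable",
    7: "statusSuspending",
    8: "statusSuspended",
}

# States that currently trigger an alert in this setup.
ALERTING_STATES = {1, 2, 3, 6}


def describe(code):
    """Return the symbolic name for a status code, or a marker for unknown codes."""
    return G_SERVER_STATUS.get(code, f"unrecognized({code})")


def should_alert(code):
    """Return True when the status code is one of the alerting states."""
    return code in ALERTING_STATES


for code in sorted(G_SERVER_STATUS):
    print(code, describe(code), "ALERT" if should_alert(code) else "ok")
```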

Alert expressions

0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==1

0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==2

0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==3

0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==6

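To exclude certain services from alerting, configure it as follows:

Suppose the series below should be excluded, starting with gServerId="106"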

{gServerId="106", gServerName="LOGDAP", instance="172.16.5.51", job="genesys_snmp"} 1

{gServerId="107", gServerName="ERSDAP", instance="172.16.5.51", job="genesys_snmp"} 1

{gServerId="135", gServerName="CFG_ICONDAP", instance="172.16.5.51", job="genesys_snmp"} 1

============================================================================================

0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus{gServerId!~"(106|107|135)"} ==1
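
Prometheus regex matchers such as `!~` are fully anchored, so the pattern `(106|107|135)` excludes exactly those three IDs and nothing that merely contains them. The sketch below imitates that behaviour with Python's `re.fullmatch`; Prometheus uses RE2 rather than Python's engine, so treat this as an approximation that holds for simple patterns like this one. The helper name is hypothetical.

```python
import re


def passes_negative_matcher(value, pattern):
    """Imitate label!~"pattern": keep the value when the WHOLE value does not match."""
    return re.fullmatch(pattern, value) is None


excluded = "(106|107|135)"
server_ids = ["106", "107", "116", "135", "1060"]

# IDs that would still be evaluated by the alert expression.
kept = [sid for sid in server_ids if passes_negative_matcher(sid, excluded)]
print(kept)
```

Note that "1060" is kept: anchoring means a partial match on "106" does not exclude it.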

Hand-written alert rules:

vim alert_genesys.rules

groups:
  - name: genesys-alert-rule
    rules:
      - alert: gsAlarm statusUnknown
        expr: 0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.gServerName }} statusUnknown"
          description: "Service {{ $labels.gServerName }} (ID {{ $labels.gServerId }}) is abnormal, please check it promptly!"
      - alert: gsAlarm statusStopped
        expr: 0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==2
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.gServerName }} statusStopped"
          description: "Service {{ $labels.gServerName }} (ID {{ $labels.gServerId }}) is abnormal, please check it promptly!"
      - alert: gsAlarm statusPending
        expr: 0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==3
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.gServerName }} statusPending"
          description: "Service {{ $labels.gServerName }} (ID {{ $labels.gServerId }}) is abnormal, please check it promptly!"
      - alert: gsAlarm statusServiceUnavailable
        expr: 0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus ==6
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.gServerName }} statusServiceUnavailable"
          description: "Service {{ $labels.gServerName }} (ID {{ $labels.gServerId }}) is abnormal, please check it promptly!"

=================================================================================

Alerting on traps

gsServersLastTrap{gsServersLastTrap=~'.*DOWN'}==1

gsServersLastAlarm{gsServersLastAlarm=~'.*'}==1

vim alert_genesys.rules

groups:
  - name: genesys-alert-rule
    rules:
      - alert: gsServersLastTrap
        expr: gsServersLastTrap{gsServersLastTrap=~'.*DOWN'}==1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "gsServersLastTrap notification"
          description: "{{ $labels.gsServersLastTrap }}"
      - alert: gsServersLastAlarm
        expr: gsServersLastAlarm{gsServersLastAlarm=~'.*'}==1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "gsServersLastAlarm notification"
          description: "{{ $labels.gsServersLastAlarm }}"

Alertmanager e-mail configuration:

vim prometheus.yml

global:
  scrape_interval: 60s
  evaluation_interval: 60s
  scrape_timeout: 30s

rule_files:
  - /etc/prometheus/*.rules

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['172.16.1.84:9093']

vim alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_require_tls: false
  smtp_smarthost: 'smtp.163.com:25'  # SMTP server address
  smtp_from: '15188257614@163.com'  # sender address
  smtp_auth_username: '15188257614@163.com'  # mailbox user
  smtp_auth_password: 'JPLGFRGKFPVJIXGU'  # mailbox password

templates:
  - '/etc/alertmanager/email.tmpl'

route:
  group_by: ["alertname"]  # grouping label
  group_wait: 60s  # after the first alert of a group arrives, wait 60 seconds for more alerts and send them together
  group_interval: 5m
  repeat_interval: 1h  # interval between repeated notifications
  receiver: 'email'  # default receiver; required, and must match a receiver name below

receivers:
  - name: 'email'  # receiver name
    email_configs:
      #- to: 'doufadong@wilcom.com.cn,yf-tangjun@wilcom.com.cn,luchengjun@wilcom.com.cn'  # recipient addresses
      - to: 'doufadong@wilcom.com.cn'  # recipient address
        html: '{{ template "email.to.html" . }}'
        send_resolved: true
        headers: {Subject: "Alert notification e-mail"}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

[root@harbor alertmanager]# cat email.tmpl

{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Local }} <br>
=========end==========<br>
{{ end }}
{{ end }}

========================================================

Integrating with Grafana alerting

  1. Change the Grafana SMTP settings, then restart the service

vi conf/defaults.ini

#################################### SMTP / Emailing #####################

[smtp]

enabled = true

host = smtp.163.com:25

user = 15188257614@163.com

# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""

password = JPLGFRGKFPVJIXGU

cert_file =

key_file =

skip_verify = false

from_address = 15188257614@163.com

from_name = Grafana

ehlo_identity =

startTLS_policy =

[emails]

welcome_email_on_sign_up = false

templates_pattern = emails/*.html

  2. Set the notification channel to e-mail and enter the recipient address

  3. Create the Genesys dashboards and alert rules

With the setup above, data can only be retrieved from gServersTable. The other tables return nothing because they have not been bound to the tserver. The binding is made on the rows .116.1-11 (116 is the tserver index, 1-11 are the different services, and the status must be set to 4); once this is done, the related data can all be retrieved

Automating the configuration with a script

vim gs_check_snmp.sh

#!/bin/bash

gs_snmp_ip=172.16.5.51

# Server ID: 15th dot-separated field of the last "SIPServer ... is UP" line
gs_server_id=$(snmpwalk -v 2c -m GENESYS-SML-MIB-G71 -c public $gs_snmp_ip | grep SIPServer | grep "is UP" | awk -F . '{print$15}' | tail -n1)

# Number of SIPServer instances reported as UP
gs_run_count=$(snmpwalk -v 2c -m GENESYS-SML-MIB-G71 -c public $gs_snmp_ip | grep SIPServer | grep "is UP" | wc -l)

# Number of existing gsCtrlRowStatus rows (0 means nothing is configured yet)
gs_conf_count=$(snmpwalk -v 2c -m GENESYS-SML-MIB-G71 -c public $gs_snmp_ip gsCtrlRowStatus | grep -v "gsCtrlRowStatus = No Such Object available on this agent at this OID" | wc -l)

### Check whether the gs sipserver is running
if [ "$gs_run_count" -eq "0" ]; then
    echo -e "\033[32m sipserver is not running \033[0m"
else
    if [ "$gs_conf_count" -eq "0" ]; then
        echo -e "\033[32m configuring gs_snmp \033[0m"
        for i in {1,2,4,5,6,7,8,9,10}; do snmpset -c private -v 2c $gs_snmp_ip 1.3.6.1.4.1.1729.100.1.3.1.6.$gs_server_id.$i i 4; done
        echo -e "\033[32m gs_snmp configured \033[0m"
    else
        echo -e "\033[32m gs_snmp is already configured \033[0m"
        exit 1
    fi
fi

# Delete operation
#snmpset -c private -v 2c 172.16.5.51 1.3.6.1.4.1.1729.100.1.3.1.6.116.8 i 6

###### The row status column object has 6 defined values:
#active(1): the row is available for use
#notInService(2): the row exists but is not available
#notReady(3): the row exists but cannot be used because required information is missing
#createAndGo(4): set by the manager to create a conceptual row and set its status column to active
#createAndWait(5): set by the manager to create a conceptual row that is not yet available
#destroy(6): delete the row
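
The six row-status values listed above are the standard SNMP RowStatus convention, and the script sets value 4 (createAndGo) on one OID per server and table index. A minimal sketch of how those pieces fit together is shown below. The OID prefix is copied from the script; the helper names and the reading of the two trailing indexes as "server ID, table ID" are assumptions based on the gsCtrlServerID and gsCtrlTableID object names.

```python
from enum import IntEnum


class RowStatus(IntEnum):
    """The six RowStatus values listed above."""
    ACTIVE = 1
    NOT_IN_SERVICE = 2
    NOT_READY = 3
    CREATE_AND_GO = 4
    CREATE_AND_WAIT = 5
    DESTROY = 6


# OID prefix used by the snmpset calls in gs_check_snmp.sh.
CTRL_ROW_STATUS_OID = "1.3.6.1.4.1.1729.100.1.3.1.6"


def row_status_oid(server_id, table_id):
    """Build the row-status OID for one (server, table) pair (assumed index order)."""
    return f"{CTRL_ROW_STATUS_OID}.{server_id}.{table_id}"


def snmpset_args(host, server_id, table_id, status, community="private"):
    """Return the argument list for the equivalent snmpset command line."""
    return ["snmpset", "-c", community, "-v", "2c", host,
            row_status_oid(server_id, table_id), "i", str(int(status))]


# Reproduces the commented-out delete command from the script.
print(" ".join(snmpset_args("172.16.5.51", 116, 8, RowStatus.DESTROY)))
```

The argument list can be passed to `subprocess.run` if the command should actually be executed; the sketch only builds it.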

Managed objects: https://docs.genesys.com/Documentation/FR/latest/MLUG/SNMPobjs

Part 6. Installing and Using Pushgateway

wget https://github.com/prometheus/pushgateway/releases/download/v1.0.0/pushgateway-1.0.0.linux-amd64.tar.gz

tar -zxvf pushgateway-1.0.0.linux-amd64.tar.gz

mv pushgateway-1.0.0.linux-amd64 /usr/local/pushgateway

Add the service file

cat > /usr/lib/systemd/system/pushgateway.service <<EOF

[Unit]

Description=pushgateway

Documentation=https://github.com/prometheus/pushgateway

After=network.target

[Service]

ExecStart=/usr/local/pushgateway/pushgateway

Restart=on-failure

[Install]

WantedBy=multi-user.target

EOF

systemctl daemon-reload

systemctl start pushgateway

systemctl enable pushgateway

Verify the service

http://172.16.0.167:9091/

Push a test sample

echo "some_metric 3.14" | curl --data-binary @- http://172.16.0.167:9091/metrics/job/some_job
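
The curl command above posts plain text to a URL whose path carries the job name and, optionally, further grouping labels. The sketch below builds the same URL and body in Python without sending anything. The function name `build_push` is hypothetical, and label values containing "/" are rejected here because they need special encoding that this sketch does not implement.

```python
from urllib.parse import quote


def build_push(base_url, job, metrics, grouping=None):
    """Return (url, body) for a Pushgateway push of simple untyped samples."""
    path = f"/metrics/job/{quote(job, safe='')}"
    for name, value in (grouping or {}).items():
        if "/" in value:
            raise ValueError("label values containing '/' are not handled in this sketch")
        path += f"/{quote(name, safe='')}/{quote(value, safe='')}"
    # One "name value" line per sample; the body must end with a newline.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return base_url.rstrip("/") + path, body.encode("utf-8")


url, body = build_push("http://172.16.0.167:9091", "some_job", {"some_metric": 3.14})
print(url)
print(body)
```

To actually send the data, the pair can be handed to `urllib.request.Request(url, data=body, method="POST")` and opened with `urllib.request.urlopen`.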

Part 7. Using SNMPTrapServer (abandoned)

git clone https://toscode.gitee.com/sunzheng86/snmptrap-server.git

cd snmptrap-server

yum -y install go

export GOPROXY=https://goproxy.io

export GO111MODULE=on

go get

go build -o snmptrap-server main.go

mv miblist.txt miblist.txt.bak

snmptranslate -Tz -m /usr/share/snmp/mibs/GENESYS-SML-MIB-G7.txt >miblist.txt

vim config.yml

LOGCONF:

level: info

format: text

director: logs

link-name: current.log

TrapServer:

ip: "172.16.5.51"

port: 161

version: "v2c" #v1, v2c, v3

community: "public"

timeout: 2 #seconds

maxoids: 60

mib_map_file: "miblist.txt"

ApiServer:

api_port: 8070 # listening port of the API server

api_read_timeout: 5 # API server read timeout, in seconds

api_write_timeout: 5 # API server write timeout, in seconds

api_web_root: "./webapp"

sender:

pushgateway_url: "http://172.16.0.167:9091"

job_name: "snmp_trap"

# zabbix_host: "10.187.32.8"

# zabbix_port: 10051

# zbx_itme_key: "snmptraper.fallback"

# sender_dir: "./zabbix_sender.exe"

senders:

# - "zabbix"

- "pushgateway"

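Run the service

./snmptrap-server -c config.yml

Send a test trap

snmptrap -v 2c -c public 172.16.0.167 '1234567' 1.3 sysLocation.0 s "test"

One open problem: the expected series snmp_trap_pdu{host="<sender IP>" OID=".x.x.x.x." value="xxxx" type="xxxxx" ts="<time the message was received>"} cannot be retrieved, so this approach is shelved for now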

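Part 8. Notes on AudioCodes SBC SNMP

Note: according to the quick-start document only a few MIB files are needed to retrieve metrics, but testing showed that loading only those few returns no data. The entire Mibs7.4.20220524 set of MIB files has to be copied over the default mib directory of the ManageEngine MibBrowser client. ManageEngine MibBrowser recognizes MIB files that have no file extension. To retrieve the metrics with the CentOS 7 snmp tools, strip the extension in the same way, copy all files into /usr/share/snmp/mibs, and restart the snmp service

If the MIBs fail to be recognized, the whole set of MIB files needs to be replaced as described above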

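Part 9. A Ping Exporter for Network Monitoring with Prometheus

Project: https://github.com/2767321434/ping_exporter

A Prometheus exporter that tests the network with ping. It can watch a specific HTTP endpoint or a public site such as Baidu to track network fluctuations

Clone the project: git clone https://github.com/2767321434/ping_exporter

Starting it
./main -port 8889 -pingaddr www.baidu.com -count 4
After startup, open 127.0.0.1:8889/metrics
to see the exported metrics, which report the number of failed pings and the average latency

Metric data (ping_avg_time is the average latency, ping_lost the packet loss):

# TYPE ping_avg_time gauge

ping_avg_time 0.22

# TYPE ping_lost gauge

ping_lost 0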
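
The output above is in the Prometheus text exposition format: comment lines start with `#`, and each sample line is a metric name followed by a value. The sketch below parses only that simple shape (no labels containing spaces, no timestamps), which is enough for a quick check of this exporter's output; `parse_samples` is a hypothetical helper name.

```python
def parse_samples(text):
    """Parse 'name value' sample lines, skipping blank lines and # comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last run of whitespace: everything before it is the series name.
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


exposition = """\
# TYPE ping_avg_time gauge
ping_avg_time 0.22
# TYPE ping_lost gauge
ping_lost 0
"""

print(parse_samples(exposition))
```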

Part 10. Monitoring Packet Loss with Prometheus and Pushgateway

Packet-loss monitoring script

[root@gtcq-gt-monitor-prometheus-01 ~]# timeout 50 ping -q -A -s 500 -W 1000 -c 1000 10.1.32.95|grep transmitted|awk '{print $6}'

[root@gtcq-gt-monitor-prometheus-01 shell_script]# more icmp_gpu_monitor.sh

#!/bin/bash
#
#####################################
#@brief Monitors packet loss and latency. -s is the size of each ping packet, -W the reply timeout, -c the number of packets to send
#@author xiajing
#@version 1.0
#@date 2021/01/13
#@log no
#####################################
# shell env
# number of ping packets to send
c_times=200
# array of target IPs
ip_arr=( 10.1.33.188 )

for (( i = 0; i < ${#ip_arr[@]}; ++i ))
do
    # Field 6 of the summary line is the loss percentage, field 10 the total elapsed time
    result=$(timeout 16 ping -q -A -s 200 -W 250 -c $c_times ${ip_arr[i]} | grep transmitted | awk '{print $6,$10}')
    if [ -z "$result" ]
    then
        # No summary line (ping failed or was killed by timeout): push sentinel values
        value_lostpk=101
        value_rrt=1000
        echo "ykt_lostpk_gt_jd ${value_lostpk}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/ykt_icmp/instance/${ip_arr[i]}
        echo "ykt_rrt_gt_jd ${value_rrt}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/ykt_icmp/instance/${ip_arr[i]}
    else
        lostpk=$(echo $result | awk '{print $1}')
        rrt=$(echo $result | awk '{print $2}')
        value_lostpk=$(echo $lostpk | sed 's/%//g')
        value_rrt=$(echo $rrt | sed 's/ms//g')
        #value_rrt=$(($value_rrt/$c_times))
        # Average time per packet: total elapsed time divided by the packet count
        value_rrt=$(printf "%.5f" $(echo "scale=5;$value_rrt/$c_times" | bc))
        echo "ykt_lostpk_gt_jd ${value_lostpk}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/ykt_icmp/instance/${ip_arr[i]}
        echo "ykt_rrt_gt_jd ${value_rrt}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/ykt_icmp/instance/${ip_arr[i]}
    fi
    echo ${ip_arr[i]}"==="$value_lostpk"==="$value_rrt
done
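
The core of the script is picking two numbers out of ping's summary line and dividing the total elapsed time by the packet count. The sketch below does the same parsing with a regular expression, which is less sensitive to extra fields (such as an error count) than fixed awk column positions. The function name and the exact summary-line wording it expects (Linux iputils style) are assumptions; like the script, it reports elapsed time per packet rather than a true round-trip time.

```python
import re

# Matches e.g. "200 packets transmitted, 200 received, 0% packet loss, time 4000ms"
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received,"
    r".*?([\d.]+)% packet loss, time (\d+)ms"
)


def parse_ping_summary(line, count):
    """Return (loss_percent, elapsed_ms_per_packet) for one ping summary line."""
    match = SUMMARY.search(line)
    if match is None:
        # Same sentinel values the shell script pushes when ping gives no summary.
        return 101.0, 1000.0
    loss = float(match.group(3))
    total_ms = float(match.group(4))
    return loss, round(total_ms / count, 5)


print(parse_ping_summary(
    "200 packets transmitted, 200 received, 0% packet loss, time 4000ms", 200))
print(parse_ping_summary(
    "200 packets transmitted, 190 received, +3 errors, 5% packet loss, time 4000ms", 200))
```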


Part 11. Process Monitoring with Prometheus and process-exporter

1. Download process-exporter

wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz

2. Install and deploy process-exporter

Install process-exporter on the machine to be monitored.

tar -xvf process-exporter-0.7.10.linux-amd64.tar.gz

mv process-exporter-0.7.10.linux-amd64 /usr/local/process-exporter

3. Write the configuration file

vi process-name.yaml

process_names:
  - name: "{{.Matches}}"
    cmdline:
    - 'ces-wd.jar'
  - name: "{{.Matches}}"
    cmdline:
    - 'freeswitch'

4. Start the service

vim /usr/lib/systemd/system/process-exporter.service

[Unit]

Description=process_exporter

After=network.target

[Service]

User=root

Type=simple

ExecStart=/usr/local/process-exporter/process-exporter -config.path /usr/local/process-exporter/process-name.yaml

Restart=on-failure

[Install]

WantedBy=multi-user.target

systemctl daemon-reload

systemctl start process-exporter.service

systemctl status process-exporter.service

systemctl enable process-exporter.service # start at boot

curl localhost:9256/metrics # verify the service

5. Configure Prometheus

Edit the Prometheus configuration file

  - job_name: 'process'
    static_configs:
      - targets: ['172.**.**.**:9256']

Example configuration for multiple processes

…..

  - job_name: 'process'
    scrape_interval: 2m
    scrape_timeout: 120s
    file_sd_configs:
      - files:
        - /data/prometheus/config/process_exporter.json

…..

[root@harbor config]# cat process_exporter.json

[

{

"labels": {

"desc": "ces-wd",

"group": "ces-wd.jar",

"host_ip": "172.16.1.254",

"hostname": "abc_baidu"

},

"targets": [

"172.16.1.254:9256"

]

},

{

"labels": {

"desc": "freeswitch",

"group": "freeswitch",

"host_ip": "172.16.1.254",

"hostname": "abc_baidu"

},

"targets": [

"172.16.1.254:9256"

]

}

]
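
A file_sd JSON file like the one above is a list of groups, each with a list of target strings and an optional map of string labels. Since a typo in this file silently drops targets, a small check before reloading Prometheus can help. The following is a sketch under those structural assumptions; `validate_file_sd` is a hypothetical helper, not a Prometheus tool.

```python
import json


def validate_file_sd(text):
    """Parse a file_sd JSON document and check its basic structure."""
    groups = json.loads(text)
    if not isinstance(groups, list):
        raise ValueError("top level must be a JSON list")
    for index, group in enumerate(groups):
        if not isinstance(group, dict):
            raise ValueError(f"group {index} must be a JSON object")
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            raise ValueError(f"group {index}: 'targets' must be a list of strings")
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
        ):
            raise ValueError(f"group {index}: 'labels' must map strings to strings")
    return groups


sample = """
[
  {"labels": {"desc": "ces-wd", "group": "ces-wd.jar"}, "targets": ["172.16.1.254:9256"]},
  {"labels": {"desc": "freeswitch", "group": "freeswitch"}, "targets": ["172.16.1.254:9256"]}
]
"""

for group in validate_file_sd(sample):
    print(group["labels"]["desc"], group["targets"])
```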

Configure the alert rules:

groups:
  - name: dol_alert_process_rule
    rules:
      - alert: Dolphinscheduler Alert ProcessDown  # alert name
        expr: (namedprocess_namegroup_num_procs{groupname="map[:alertlib]"}) == 0
        for: 1m  # how long the condition must hold before the alert is sent
        labels:  # labels
          severity: error
        annotations:  # annotations that describe the alert in detail
          summary: "dolphinscheduler Alert {{ $labels.instance }} has been down for more than 1 minutes"
          description: "dolphinscheduler Alert has been down, This requires immediate action!"

Hot-reload the Prometheus alert rules

./promtool check config prometheus.yml

systemctl reload prometheus.service

Metrics being monitored

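Part 12. Installing and Monitoring PostgreSQL

yum install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

yum install postgresql12-server postgresql12

Initialize the database. The init command creates a directory named 12 under /var/lib/pgsql; 12 is the database version, and other versions get a directory named after their own version number (9.4, 9.5)

/usr/pgsql-12/bin/postgresql-12-setup initdb

Enable start at boot and start the service

systemctl enable postgresql-12 && systemctl start postgresql-12

Installation creates a Linux login user named postgres by default

By default PostgreSQL does not accept remote connections; change the configuration so that remote users can connect

vim /var/lib/pgsql/12/data/postgresql.conf

listen_addresses = '*'

vim /var/lib/pgsql/12/data/pg_hba.conf

# IPv4 local connections:

host all all 0.0.0.0/0 md5

Restart for the change to take effect

systemctl restart postgresql-12

Change the database user password

Log in as the postgres user and set the password of the postgres database user

su - postgres

psql

postgres=# ALTER USER postgres WITH PASSWORD 'pgsql123';

postgres=# select usename,passwd from pg_shadow;

postgres=# SELECT datname FROM pg_database;

Use \q to leave the psql terminal

Connection test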

bash-4.2$ psql -d postgres -U postgres -h 172.16.1.111

Password for user postgres:

psql (9.2.24, server 12.11)

WARNING: psql version 9.2, server version 12.0.

Some psql features might not work.

Type "help" for help.

postgres=#

Deploying the monitoring component

wget https://github.com/wrouesnel/postgres_exporter/releases/download/v0.5.1/postgres_exporter_v0.5.1_linux-amd64.tar.gz

mkdir /usr/local/postgres_exporter && tar -zxvf postgres_exporter_v0.5.1_linux-amd64.tar.gz -C /usr/local/postgres_exporter

cd /usr/local/postgres_exporter && mv postgres_exporter_v0.5.1_linux-amd64/* ./ &&

rm -fr postgres_exporter_v0.5.1_linux-amd64/

vim postgres_exporter.env

DATA_SOURCE_NAME="postgresql://postgres:pgsql123@172.16.1.111:5432/?sslmode=disable"
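
The DATA_SOURCE_NAME value above is an ordinary URL, so its parts (user, host, port, options) can be checked before starting the exporter. The sketch below splits it with Python's standard library; the function name `describe_dsn` is hypothetical and the output deliberately reports only whether a password is present.

```python
from urllib.parse import urlsplit, parse_qs


def describe_dsn(dsn):
    """Break a postgresql:// connection URL into its components."""
    parts = urlsplit(dsn)
    return {
        "user": parts.username,
        "password_set": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        # An empty path means no database name was given in the URL.
        "database": parts.path.lstrip("/") or None,
        "options": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


info = describe_dsn("postgresql://postgres:pgsql123@172.16.1.111:5432/?sslmode=disable")
for key, value in info.items():
    print(key, value)
```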

vim /etc/systemd/system/postgres_exporter.service

[Unit]

Description=Prometheus exporter for Postgresql

Wants=network-online.target

After=network-online.target

[Service]

User=postgres

Group=postgres

WorkingDirectory=/usr/local/postgres_exporter

EnvironmentFile=/usr/local/postgres_exporter/postgres_exporter.env

ExecStart=/usr/local/postgres_exporter/postgres_exporter

Restart=always

[Install]

WantedBy=multi-user.target

Start the service

systemctl start postgres_exporter.service

systemctl enable postgres_exporter.service

systemctl status postgres_exporter.service

Access test

curl http://172.16.1.111:9187/metrics

Prometheus configuration

  - job_name: 'postgres'
    static_configs:
      - targets: ['172.16.1.111:9187']
        labels:
          instance: postgresdb1
      # To add another postgres_exporter:
      # - targets: ['10.10.xxx.56:9287']
      #   labels:
      #     instance: db2

Reload the configuration: curl -X POST 172.16.1.84:9090/-/reload

Grafana dashboard IDs:

9628

455

Prometheus alert rules

Alert rules

vi /data/prometheus/conf/rules/postgres.rules

groups:
  - name: postgresql-monitoring-alerts
    rules:
      - alert: PostgresqlDown
        expr: pg_up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql down"
          description: "Postgresql instance is down\n VALUE = {{ $value }}"
      - alert: PostgresqlRestarted
        expr: time() - pg_postmaster_start_time_seconds < 60
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql restarted"
          description: "Postgresql restarted\n VALUE = {{ $value }}"
      - alert: PostgresqlExporterError
        expr: pg_exporter_last_scrape_error > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql exporter error"
          description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}"
      - alert: PostgresqlReplicationLag
        expr: pg_replication_lag > 30 and ON(instance) pg_replication_is_replica == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql replication lag"
          description: "PostgreSQL replication lag is going up (> 30s)\n VALUE = {{ $value }}"
      - alert: PostgresqlTableNotVacuumed
        expr: time() - pg_stat_user_tables_last_autovacuum > 60 * 60 * 24
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql table not vacuumed"
          description: "Table has not been vacuumed for 24 hours\n VALUE = {{ $value }}"
      - alert: PostgresqlTableNotAnalyzed
        expr: time() - pg_stat_user_tables_last_autoanalyze > 60 * 60 * 24
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql table not analyzed"
          description: "Table has not been analyzed for 24 hours\n VALUE = {{ $value }}"
      - alert: PostgresqlTooManyConnections
        expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql too many connections"
          description: "PostgreSQL instance has too many connections (> 80%).\n VALUE = {{ $value }}"
      - alert: PostgresqlNotEnoughConnections
        expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql not enough connections"
          description: "PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}"
      - alert: PostgresqlDeadLocks
        expr: increase(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 5
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql dead locks"
          description: "PostgreSQL has dead-locks\n VALUE = {{ $value }}"
      - alert: PostgresqlSlowQueries
        expr: pg_slow_queries > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql slow queries"
          description: "PostgreSQL executes slow queries\n VALUE = {{ $value }}"
      - alert: PostgresqlHighRollbackRate
        expr: rate(pg_stat_database_xact_rollback{datname!~"template.*"}[3m]) / rate(pg_stat_database_xact_commit{datname!~"template.*"}[3m]) > 0.02
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql high rollback rate"
          description: "Ratio of transactions being aborted compared to committed is > 2 %\n VALUE = {{ $value }}"
      - alert: PostgresqlCommitRateLow
        expr: rate(pg_stat_database_xact_commit[1m]) < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql commit rate low"
          description: "Postgres seems to be processing very few transactions\n VALUE = {{ $value }}"
      - alert: PostgresqlLowXidConsumption
        expr: rate(pg_txid_current[1m]) < 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql low XID consumption"
          description: "Postgresql seems to be consuming transaction IDs very slowly\n VALUE = {{ $value }}"
      - alert: PostgresqlLowXlogConsumption
        expr: rate(pg_xlog_position_bytes[1m]) < 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql low XLOG consumption"
          description: "Postgres seems to be consuming XLOG very slowly\n VALUE = {{ $value }}"
      - alert: PostgresqlWaleReplicationStopped
        expr: rate(pg_xlog_position_bytes[1m]) == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql WALE replication stopped"
          description: "WAL-E replication seems to be stopped\n VALUE = {{ $value }}"
      - alert: PostgresqlHighRateStatementTimeout
        expr: rate(postgresql_errors_total{type="statement_timeout"}[1m]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql high rate statement timeout"
          description: "Postgres transactions showing high rate of statement timeouts\n VALUE = {{ $value }}"
      - alert: PostgresqlHighRateDeadlock
        expr: increase(postgresql_errors_total{type="deadlock_detected"}[1m]) > 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql high rate deadlock"
          description: "Postgres detected deadlocks\n VALUE = {{ $value }}"
      - alert: PostgresqlReplicationLagBytes
        expr: (pg_xlog_position_bytes and pg_replication_is_replica == 0) - on(environment) group_right(instance) (pg_xlog_position_bytes and pg_replication_is_replica == 1) > 1e+09
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql replication lag bytes"
          description: "Postgres Replication lag (in bytes) is high\n VALUE = {{ $value }}"
      - alert: PostgresqlUnusedReplicationSlot
        expr: pg_replication_slots_active == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql unused replication slot"
          description: "Unused Replication Slots\n VALUE = {{ $value }}"
      - alert: PostgresqlTooManyDeadTuples
        expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 unless ON(instance) (pg_replication_is_replica == 1)
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql too many dead tuples"
          description: "PostgreSQL dead tuples is too large\n VALUE = {{ $value }}"
      - alert: PostgresqlSplitBrain
        expr: count(pg_replication_is_replica == 0) != 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql split brain"
          description: "Split Brain, too many primary Postgresql databases in read-write mode\n VALUE = {{ $value }}"
      - alert: PostgresqlPromotedNode
        expr: pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}} Postgresql promoted node"
          description: "Postgresql standby server has been promoted as primary node\n VALUE = {{ $value }}"
      - alert: PostgresqlSslCompressionActive
        expr: sum(pg_stat_ssl_compression) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql SSL compression active"
          description: "Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}"
      - alert: PostgresqlTooManyLocksAcquired
        expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.instance}} Postgresql too many locks acquired"
          description: "Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n VALUE = {{ $value }}"

13. Asterisk Monitoring

Prerequisite: load the relevant modules so that the query commands below work.

Run asterisk -rvv to enter the CLI.

sip show peers lists all defined SIP peers.

sip show peers

No such command 'sip show peers' (type 'core show help sip show' for other possible commands)

Fix:

module load chan_sip.so

module reload chan_sip.so

sip show peers

sip show channels lists all active SIP channels.

Option 1

git clone https://github.com/robinmarechal/asterisk_exporter.git

cd asterisk_exporter

vim main.go

enableAgentsCollector = kingpin.Flag("collector.agents", "Enable agents collector").Default("false").Bool()

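Edit the default main.go so that the agents collector is disabled by default, as in the line above. Asterisk 16 does not provide this query, and enabling the collector causes scraping to fail.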

make

./asterisk_exporter <flags>

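./asterisk_exporter -h

Note: make requires a Go toolchain, and the following commands must be run first; otherwise the build fails with a "no such command" error for /bin/promu.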

touch /root/go/bin/promu

ln -s /root/go/bin/promu /bin/promu

rm -f /root/go/bin/promu

Start the exporter

./asterisk_exporter --collector.bridges --collector.calendars --collector.confbridges --collector.modules

Configure the system service

vim asterisk_exporter.service

[Unit]
Description=Asterisk call center system exporter for Prometheus
After=network.target

[Service]
User=asterisk
Group=asterisk
Type=simple
ExecStart=/usr/local/asterisk_exporter/asterisk_exporter \
    --web.listen-address=":9815" \
    --web.telemetry-path="/metrics" \
    --metrics.prefix="asterisk" \
    --asterisk.path="/usr/sbin/asterisk" \
    --collector.core \
    --collector.sip \
    --collector.bridges \
    --collector.calendars \
    --collector.confbridges \
    --collector.modules
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target

Start the service

cp asterisk_exporter.service /usr/lib/systemd/system/asterisk_exporter.service

systemctl enable asterisk_exporter

systemctl start asterisk_exporter

systemctl status asterisk_exporter

If the service fails to start with the unit above, try replacing ExecStart with the following single-line form:

ExecStart=/usr/local/asterisk_exporter/asterisk_exporter --asterisk.path=/usr/sbin/asterisk --collector.core --collector.sip --collector.bridges --collector.calendars --collector.confbridges --collector.modules

Option 2

git clone https://github.com/tainguyenbp/asterisk_exporter.git

14. JVM Monitoring

1. Overview

JMX Exporter

https://github.com/prometheus/jmx_exporter

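It is an official Prometheus component that runs as a Java agent, collects metrics from the local JVM, and exposes them over HTTP. This is the officially recommended approach, and it can report process information such as CPU and memory usage.

jmx_exporter collects the target application's JMX metrics as an agent, so the target application itself needs no changes.

To run the JMX exporter: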

java XXX -javaagent:/root/jmx_exporter/jmx_prometheus_javaagent-0.12.0.jar=3010:/root/jmx_exporter/config.yaml -jar XXX.jar

Installation and deployment

cd /usr/local

mkdir jvm_exporter && cd jvm_exporter

wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.12.0/jmx_prometheus_javaagent-0.12.0.jar

vim simple-config.yml

---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
whitelistObjectNames: ["java.lang:type=OperatingSystem"]
rules:
  - pattern: 'java.lang<type=OperatingSystem><>((?!process_cpu_time)\w+):'
    name: os_$1
    type: GAUGE
    attrNameSnakeCase: true

Alternatively, use the following configuration:

---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
whitelistObjectNames: ["java.lang:type=OperatingSystem"]
blacklistObjectNames: []
rules:
  - pattern: 'java.lang<type=OperatingSystem><>(committed_virtual_memory|free_physical_memory|free_swap_space|total_physical_memory|total_swap_space)_size:'
    name: os_$1_bytes
    type: GAUGE
    attrNameSnakeCase: true
  - pattern: 'java.lang<type=OperatingSystem><>((?!process_cpu_time)\w+):'
    name: os_$1
    type: GAUGE
    attrNameSnakeCase: true

For example, suppose a Java application named rms is started like this:

java -jar /data/rms/RMS.jar

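To collect data with the JMX Exporter agent, change the command to:

java -javaagent:/usr/local/jvm_exporter/jmx_prometheus_javaagent-0.12.0.jar=3010:/usr/local/jvm_exporter/simple-config.yml -jar /data/rms/RMS.jar

Note: 3010 is the port the agent listens on; any free port can be used.

Or, taking the Zunyi Housing Provident Fund IVR flow as an example: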

java -javaagent:/usr/local/jvm_exporter/jmx_prometheus_javaagent-0.12.0.jar=3310:/usr/local/jvm_exporter/simple-config.yml -jar ./laihu-ivr-zygjj.jar

The JVM metrics can then be retrieved normally.
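To check the agent's output by hand, the text served on the agent port (3310 in the example above) can be parsed with a few lines of Python. This is a minimal sketch, not part of jmx_exporter: the helper name parse_metrics and the sample lines are made up for illustration, and it assumes samples carry no trailing timestamp.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative sample only; real output depends on the application and config.
SAMPLE = """\
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 42.0
jvm_memory_bytes_used{area="heap",} 1.048576E7
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

In practice the text would come from an HTTP GET against the agent, for example http://172.16.1.254:3310/metrics for the target used in this document.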

Integrating with Prometheus

Add a job:

  - job_name: 'jvm_exporter'
    static_configs:
      - targets: ['172.16.1.254:3310']
        labels:
          instance: jvm_exporter

Integrating with Grafana

https://grafana.com/grafana/dashboards/8563

Import dashboard template 8563.

Appendix 1: Generator YAML configuration module reference

modules:
  module_name: # The module name. You can have as many modules as you want.
    walk: # List of OIDs to walk. Can also be SNMP object names or specific instances.
      - 1.3.6.1.2.1.2 # Same as "interfaces"
      - sysUpTime # Same as "1.3.6.1.2.1.1.3"
      - 1.3.6.1.2.1.31.1.1.1.6.40 # Instance of "ifHCInOctets" with index "40"

    version: 2 # SNMP version to use. Defaults to 2.
    # 1 will use GETNEXT, 2 and 3 use GETBULK.
    max_repetitions: 25 # How many objects to request with GET/GETBULK, defaults to 25.
    # May need to be reduced for buggy devices.
    retries: 3 # How many times to retry a failed request, defaults to 3.
    timeout: 5s # Timeout for each individual SNMP request, defaults to 5s.

    auth:
      # Community string is used with SNMP v1 and v2. Defaults to "public".
      community: public

      # v3 has different and more complex settings.
      # Which are required depends on the security_level.
      # The equivalent options on NetSNMP commands like snmpbulkwalk
      # and snmpget are also listed. See snmpcmd(1).
      username: user # Required, no default. -u option to NetSNMP.
      security_level: noAuthNoPriv # Defaults to noAuthNoPriv. -l option to NetSNMP.
      # Can be noAuthNoPriv, authNoPriv or authPriv.
      password: pass # Has no default. Also known as authKey, -A option to NetSNMP.
      # Required if security_level is authNoPriv or authPriv.
      auth_protocol: MD5 # MD5, SHA, SHA224, SHA256, SHA384, or SHA512. Defaults to MD5. -a option to NetSNMP.
      # Used if security_level is authNoPriv or authPriv.
      priv_protocol: DES # DES, AES, AES192, or AES256. Defaults to DES. -x option to NetSNMP.
      # Used if security_level is authPriv.
      priv_password: otherPass # Has no default. Also known as privKey, -X option to NetSNMP.
      # Required if security_level is authPriv.
      context_name: context # Has no default. -n option to NetSNMP.
      # Required if context is configured on the device.

    lookups: # Optional list of lookups to perform.
      # The default for `keep_source_indexes` is false. Indexes must be unique for this option to be used.

      # If the index of a table is bsnDot11EssIndex, usually that'd be the label
      # on the resulting metrics from that table. Instead, use the index to
      # lookup the bsnDot11EssSsid table entry and create a bsnDot11EssSsid label
      # with that value.
      - source_indexes: [bsnDot11EssIndex]
        lookup: bsnDot11EssSsid
        drop_source_indexes: false # If true, delete source index labels for this lookup.
        # This avoids label clutter when the new index is unique.

    overrides: # Allows for per-module overrides of bits of MIBs
      metricName:
        ignore: true # Drops the metric from the output.
        regex_extracts:
          Temp: # A new metric will be created appending this to the metricName to become metricNameTemp.
            - regex: '(.*)' # Regex to extract a value from the returned SNMP walk's value.
              value: '$1' # The result will be parsed as a float64, defaults to $1.
          Status:
            - regex: '.*Example'
              value: '1' # The first entry whose regex matches and whose value parses wins.
            - regex: '.*'
              value: '0'
        type: DisplayString # Override the metric type, possible types are:
        # gauge: An integer with type gauge.
        # counter: An integer with type counter.
        # OctetString: A bit string, rendered as 0xff34.
        # DateAndTime: An RFC 2579 DateAndTime byte sequence. If the device has no time zone data, UTC is used.
        # DisplayString: An ASCII or UTF-8 string.
        # PhysAddress48: A 48 bit MAC address, rendered as 00:01:02:03:04:ff.
        # Float: A 32 bit floating-point value with type gauge.
        # Double: A 64 bit floating-point value with type gauge.
        # InetAddressIPv4: An IPv4 address, rendered as 1.2.3.4.
        # InetAddressIPv6: An IPv6 address, rendered as 0102:0304:0506:0708:090A:0B0C:0D0E:0F10.
        # InetAddress: An InetAddress per RFC 4001. Must be preceded by an InetAddressType.
        # InetAddressMissingSize: An InetAddress that violates section 4.1 of RFC 4001 by
        # not having the size in the index. Must be preceded by an InetAddressType.
        # EnumAsInfo: An enum for which a single timeseries is created. Good for constant values.
        # EnumAsStateSet: An enum with a time series per state. Good for variable low-cardinality enums.
        # Bits: An RFC 2578 BITS construct, which produces a StateSet with a time series per bit.

Appendix 2: Configuring snmptrapd on CentOS 7

vi /etc/snmp/snmptrapd.conf

authCommunity log,execute,net public

# systemctl start snmptrapd

# systemctl enable snmptrapd

Or:

snmptrapd -C -c /etc/snmp/snmptrapd.conf -Lf /var/log/net-snmptrap.log

netstat -nlp|grep 162

pkill -9 snmptrapd

With the setup complete, send an SNMP trap to test whether it is received.

The example below specifies the OID of linkDown.

snmptrap -v 2c -c public 127.0.0.1 '' .1.3.6.1.6.3.1.1.5.3

Then check the log:

cat /var/log/snmptrap/snmptrap.log

Testing with the SnmpB tool

First log in to the SNMP server.

snmptrap -v 2c -c public 172.16.3.28 '1234567' .1.3.6.1.6.3.1.1.5.3 sysLocation.0 s "test"

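172.16.3.28 is the IP address of the client.

Appendix 3: Adding MIB libraries in SnmpB for testing

Under C:\Program Files (x86)\SnmpB\, create a project directory (for example, genesys) and place the MIB files in it.

Restart SnmpB, add the new directory to the modules path settings, and then select the modules to load.

Appendix 4: Prometheus queries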

gServerStatus {job="genesys_snmp"} !=4

gServerName{gServerId="130",gServerName="MCP_2"}

{__name__=~"gServerName|gServerStatus"}

{__name__=~"gServerName|gServerStatus",gServerId=~"130"}

gsServersLastTrap{job="genesys_snmp"} == 1

gsServersLastTrap{gsServersLastTrap=~"[a-zA-Z_][a-zA-Z0-9_]*"} == 1

gsServersLastTrap{job=~"[a-zA-Z_][a-zA-Z0-9_]*"} == 1

gsServersLastTrap{gsServersLastTrap=~"[a-zA-Z_]_[a-zA-Z0-9_]:*"} == 1

gsServersLastTrap{gsServersLastTrap=~"[a-zA-Z_][a-zA-Z0-9_]* * *"} == 1

gsServersLastTrap{gsServersLastTrap="MCP_2:server is UP", instance="172.16.5.51", job="genesys_snmp"}

gsServersLastTrap{gsServersLastTrap="confserv:server is DOWN", instance="172.16.5.51", job="genesys_snmp"}

sum ({__name__=~"gServerName|gServerStatus",gServerId=~"130"} )-1

Aggregating without the instance label (the other labels are kept):

sum ({__name__=~"gServerName|gServerStatus",gServerId=~"130"} ) without (instance)

sum ({__name__=~"gServerName|gServerStatus",gServerId=~"130"} ) without (instance) <4

sum by (gServerName) ({__name__=~"gServerName|gServerStatus",gServerId=~"130"} ) !=4 >1

gServerStatus{gServerId=~'.*'} <4

{__name__=~"gServerName|gServerStatus",gServerId=~".*"}

Query only 106 and 107:

gServerStatus{gServerId=~'.*',gServerId=~"(106|107)"} <4

===============

Excluding 106, 107, and 118:

gServerStatus{gServerId=~'.*',gServerId!~"(106|107|118)"} <4
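Prometheus regex matchers (=~ and !~) are fully anchored, which is why "(106|107)" selects exactly those two IDs and nothing that merely contains them. The following minimal Python sketch mimics that behavior. The ID list and the helper names are hypothetical, and Python's re module stands in for the RE2 syntax that Prometheus uses, which behaves the same for these simple patterns.

```python
import re


def matches(pattern, value):
    # fullmatch mirrors the implicit ^...$ anchoring of PromQL regex matchers.
    return re.fullmatch(pattern, value) is not None


def select(ids, include=None, exclude=None):
    """Keep IDs that match `include` (=~) and do not match `exclude` (!~)."""
    kept = []
    for server_id in ids:
        if include is not None and not matches(include, server_id):
            continue
        if exclude is not None and matches(exclude, server_id):
            continue
        kept.append(server_id)
    return kept


# Hypothetical gServerId values; "1067" shows that partial matches do not count.
IDS = ["106", "107", "118", "130", "1067"]

if __name__ == "__main__":
    print(select(IDS, include="(106|107)"))      # ['106', '107']
    print(select(IDS, exclude="(106|107|118)"))  # ['130', '1067']
```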

============

Join queries

Format:

0*gServerName + on(tempIndex) group_left(gServerStatus) gServerStatus

============

0*gServerName+ on(gServerId,instance,job) group_left(gServerStatus) gServerStatus

0*gServerStatus+ on(gServerId,instance,job) group_left(gServerName) gServerName
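The join pattern above can be hard to picture, so here is a simplified Python sketch of what `0*gServerName + on(gServerId,instance,job) group_left(gServerStatus) gServerStatus` computes: for each left-hand series, find the right-hand series with the same join labels, keep the left-hand labels, and take the value from the right. The function name, the sample label sets, and the status value 4 are assumptions for illustration. The sketch also assumes each right-hand series is unique per join key, and it ignores details such as the metric name being dropped.

```python
def join_group_left(left, right, on, copy=()):
    """Emulate `0*left + on(<on>) group_left(<copy>) right`.

    left, right: lists of (labels_dict, value) pairs.
    """
    # Index the right-hand ("one") side by its join key.
    index = {tuple(labels[k] for k in on): (labels, value) for labels, value in right}
    joined = []
    for labels, value in left:
        match = index.get(tuple(labels[k] for k in on))
        if match is None:
            continue  # series without a partner are dropped
        r_labels, r_value = match
        merged = dict(labels)
        # group_left(...) copies the listed labels from the right side, if present.
        merged.update({k: r_labels[k] for k in copy if k in r_labels})
        joined.append((merged, 0 * value + r_value))
    return joined


# Hypothetical samples modeled on the label sets used in this appendix.
BASE = {"gServerId": "130", "instance": "172.16.5.51", "job": "genesys_snmp"}
NAMES = [({**BASE, "gServerName": "MCP_2"}, 1)]
STATUS = [(dict(BASE), 4)]

if __name__ == "__main__":
    for labels, value in join_group_left(
        NAMES, STATUS, on=("gServerId", "instance", "job"), copy=("gServerStatus",)
    ):
        print(labels, value)
```

The result carries the gServerName label together with the status value, which is the point of multiplying the left side by zero before adding.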

============

1-sum(gServerName{gServerId=~"130"}) + sum({__name__=~"gServerName|gServerStatus",gServerId=~"130"}) - 1 != 4


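Appendix 5: SIP trunk monitoring for the Benz project

Prometheus configuration:

Note: 14, 18, and 19 are three different trunk interfaces. An alert fires when utilization exceeds 80%.

Alert expression: pstn_channel_use / (pstn_channel_idle + pstn_channel_use) > 0.8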
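As a worked example of the alert expression, the sketch below computes the same ratio in Python. The channel counts are made-up numbers, and the function names are illustrative only. Note that the comparison is strict, so exactly 80% does not fire.

```python
def trunk_utilization(use, idle):
    """Return use / (idle + use), or None when the trunk reports no channels."""
    total = use + idle
    if total == 0:
        # PromQL would produce NaN for 0/0, which the > comparison filters out.
        return None
    return use / total


def should_alert(use, idle, threshold=0.8):
    utilization = trunk_utilization(use, idle)
    return utilization is not None and utilization > threshold


if __name__ == "__main__":
    # Hypothetical 30-channel trunk.
    print(should_alert(use=25, idle=5))  # True: 25/30 is about 0.83
    print(should_alert(use=24, idle=6))  # False: 24/30 is exactly 0.8
```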

######################

- job_name: snmp

  honor_timestamps: true

  scrape_interval: 30s

  scrape_timeout: 10s

  metrics_path: /snmp

  scheme: http

  static_configs:

  - targets:

    - "14"

    - "18"

    - "19"

  relabel_configs:

  - source_labels: [__address__]

    separator: ;

    regex: (.*)

    target_label: __param_target

    replacement: $1

    action: replace

  - source_labels: [__param_target]

    separator: ;

    regex: (.*)

    target_label: instance

    replacement: $1

    action: replace

  - separator: ;

    regex: (.*)

    target_label: __address__

    replacement: 192.168.185.85:9116

    action: replace

##################################
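The three relabel rules above are easier to follow when traced step by step. The following Python sketch re-implements only the `replace` action, under simplifying assumptions, and applies the rules to the target "14". The function name is hypothetical, and Python's re module stands in for RE2.

```python
import re


def apply_relabel(labels, rules):
    """Apply Prometheus-style `replace` relabel rules to a label dict."""
    labels = dict(labels)
    for rule in rules:
        sources = rule.get("source_labels", [])
        separator = rule.get("separator", ";")
        value = separator.join(labels.get(name, "") for name in sources)
        # Relabel regexes are fully anchored.
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if match is None:
            continue
        # Convert Prometheus $1 references to Python's \g<1> form.
        template = re.sub(r"\$(\d+)", r"\\g<\1>", rule.get("replacement", "$1"))
        labels[rule["target_label"]] = match.expand(template)
    return labels


# The same three rules as in the job above.
RULES = [
    {"source_labels": ["__address__"], "regex": "(.*)",
     "target_label": "__param_target", "replacement": "$1"},
    {"source_labels": ["__param_target"], "regex": "(.*)",
     "target_label": "instance", "replacement": "$1"},
    {"regex": "(.*)", "target_label": "__address__",
     "replacement": "192.168.185.85:9116"},
]

if __name__ == "__main__":
    print(apply_relabel({"__address__": "14"}, RULES))
```

After relabeling, the scrape goes to the snmp_exporter at 192.168.185.85:9116 with target=14 as a URL parameter, while the instance label keeps the original target value.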

Attachments: GENESYS-SML-MIB-G7.zip, 奔驰snmp.zip

adouk, January 12, 2023, 10:19