vpsAdmin/vpsf-status metrics for Prometheus

It occurred to me that we could provide a Prometheus exporter that you could scrape from your own monitoring, whether you run it in a VPS or elsewhere.

For vpsAdmin it could look something like this:

# GET https://api.vpsfree.cz/metrics?token=<unique access token>

# HELP 0 = online, 1 = down
# TYPE gauge
vpsadmin_node_status{location_id, location_name, node_id, node_name}

# HELP 0 = no, 1 = under maintenance
# TYPE gauge
vpsadmin_node_maintenance_on{location_id, location_name, node_id, node_name}

# HELP CPU idle in percent
# TYPE gauge
vpsadmin_node_cpu_idle_percent{location_id, location_name, node_id, node_name}

# HELP 0 = online, 1 = degraded, ...?
# TYPE gauge
vpsadmin_node_pool_status{location_id, location_name, node_id, node_name, pool_id, pool_name}

# HELP 0 = none, 1 = scrub, 2 = resilver
# TYPE gauge
vpsadmin_node_pool_scan{location_id, location_name, node_id, node_name, pool_id, pool_name}

# HELP Scan progress
# TYPE gauge
vpsadmin_node_pool_scan_percent{location_id, location_name, node_id, node_name, pool_id, pool_name}

# HELP 1 if the VPS is running, 0 if stopped
# TYPE gauge
vpsadmin_vps_running{vps_id}

# HELP Number of seconds since the VPS was started
# TYPE gauge
vpsadmin_vps_boot_time_seconds{vps_id}

# HELP Load averages
# TYPE gauge
vpsadmin_vps_load1{vps_id}
vpsadmin_vps_load5{vps_id}
vpsadmin_vps_load15{vps_id}

# HELP Number of processes
# TYPE gauge
vpsadmin_vps_processes_pids{vps_id}

# TYPE gauge
vpsadmin_vps_memory_used_bytes{vps_id}

# TYPE gauge
vpsadmin_vps_memory_total_bytes{vps_id}

# TYPE gauge
vpsadmin_vps_swap_used_bytes{vps_id}

# TYPE gauge
vpsadmin_vps_swap_total_bytes{vps_id}

# TYPE gauge
vpsadmin_vps_cpu_cores{vps_id}

# TYPE gauge
vpsadmin_vps_cpu_percent{vps_id, mode=user|system|idle}

# TYPE counter
vpsadmin_vps_cpu_nanoseconds_total{vps_id, mode=user|system|idle}

# HELP Number of transferred bytes
# TYPE counter
vpsadmin_vps_transferred_bytes{vps_id, netif_id, netif_name, direction=sent|received, year, month}

# HELP Number of transferred packets
# TYPE counter
vpsadmin_vps_transferred_packets{vps_id, netif_id, netif_name, direction=sent|received, year, month}

# HELP Number of available bytes in a dataset
# TYPE gauge
vpsadmin_dataset_avail_bytes{vps_id, dataset_id, dataset_name}

# HELP Number of used bytes in a dataset
# TYPE gauge
vpsadmin_dataset_used_bytes{vps_id, dataset_id, dataset_name}

# HELP Number of referenced bytes in a dataset
# TYPE gauge
vpsadmin_dataset_referenced_bytes{vps_id, dataset_id, dataset_name}

# HELP Dataset quota in bytes
# TYPE gauge
vpsadmin_dataset_quota_bytes{vps_id, dataset_id, dataset_name}

# HELP Dataset reference quota in bytes
# TYPE gauge
vpsadmin_dataset_refquota_bytes{vps_id, dataset_id, dataset_name}

# HELP Compression ratio of used bytes
# TYPE gauge
vpsadmin_dataset_compressratio{vps_id, dataset_id, dataset_name}

# HELP Compression ratio of referenced bytes
# TYPE gauge
vpsadmin_dataset_refcompressratio{vps_id, dataset_id, dataset_name}

# HELP Number of OOM reports
# TYPE counter
vpsadmin_oom_report_count{vps_id, cgroup, invoked_by_process, killed_process}

# HELP Number of incident reports
# TYPE counter
vpsadmin_incident_report_count{vps_id, codename}

And for status.vpsf.cz like this:

# GET https://status.vpsf.cz/metrics

# HELP 1 if the service is up, 0 if down
# TYPE gauge
vpsfstatus_vpsadmin_up{service=api|console|webui}

# HELP 1 if vpsAdmin on node is up, 0 if down
# TYPE gauge
vpsfstatus_node_vpsadmin_up{location_id, location_label, node_id, node_name}

# HELP 0 if node is responding to ping, 1 if there is packet loss, 2 if it is not responding
# TYPE gauge
vpsfstatus_node_ping{location_id, location_label, node_id, node_name}

# HELP 1 if node is under maintenance, 0 if not
# TYPE gauge
vpsfstatus_node_maintenance_on{location_id, location_label, node_id, node_name}

# HELP 1 if pool status is known, 0 if not
# TYPE gauge
vpsfstatus_node_pool_up{location_id, location_label, node_id, node_name}

# HELP 0 = online, 1 = degraded
# TYPE gauge
vpsfstatus_node_pool_state{location_id, location_label, node_id, node_name}

# HELP 0 = none, 1 = scrub, 2 = resilver
# TYPE gauge
vpsfstatus_node_pool_scan{location_id, location_label, node_id, node_name}

# HELP Scan progress
# TYPE gauge
vpsfstatus_node_pool_scan_percent{location_id, location_label, node_id, node_name}

# HELP 0 = responding to ping, 1 = packet loss, 2 = not responding
# TYPE gauge
vpsfstatus_dns_resolver_ping{name}

# HELP 1 = DNS lookup operational, 0 = not working
# TYPE gauge
vpsfstatus_dns_resolver_lookup{name}

# HELP 0 = operational, 1 = not working
# TYPE gauge
vpsfstatus_web_service_status{name}

# HELP 0 = responding to ping, 1 = packet loss, 2 = not responding
# TYPE gauge
vpsfstatus_nameserver_ping{name}

# HELP 1 = DNS lookup operational, 0 = not working
# TYPE gauge
vpsfstatus_nameserver_lookup{name}

Would anyone have a use for this? And would any other metrics be useful to you?


Hi, I think it's a good idea. I have my VPS mostly for fun, but at work I monitor mainly infrastructure with Zabbix (which should be able to read Prometheus metrics). If it's not a huge amount of work, I'd be all for it.

I like how you don't rest on your laurels and keep innovating :wink:

Thanks
s.


Hi,

yes, we're interested.

I see it includes external data that we can't get from inside a regular VPS, such as the OOM killer, so alerting will be faster than with the e-mail. Admittedly it hasn't killed anything of mine for a long time, but still :smiley:

Thanks


Thanks for the replies!

After logging in to vpsAdmin, click Edit profile → Metrics access tokens, create an access token, and the URL will be shown to you. I'll write a longer announcement later, but I'll let it mature for a while first. There are metrics for node status, VPS, data transfers, datasets, OOM kills, incidents, maintenance/outages affecting your VPS, and also the date you have paid until x)

Metrics for vpsf-status are here: https://status.vpsf.cz/metrics
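
For anyone who wants to look at the output without a full Prometheus setup, here is a minimal Python sketch that parses the text exposition format into tuples. It is only an illustration: the function name `parse_metrics` and the sample values are made up, label escape sequences are left as-is, and the real endpoint output may differ in detail.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Made-up sample in the shape of the vpsf-status metrics above.
SAMPLE = '''\
# HELP vpsfstatus_vpsadmin_up 1 if the service is up, 0 if down
# TYPE vpsfstatus_vpsadmin_up gauge
vpsfstatus_vpsadmin_up{service="api"} 1
vpsfstatus_vpsadmin_up{service="console"} 0
vpsfstatus_vpsadmin_up{service="webui"} 1
'''

# List the services that report as down.
down = [
    labels["service"]
    for name, labels, value in parse_metrics(SAMPLE)
    if name == "vpsfstatus_vpsadmin_up" and value == 0
]
print(down)
```

To try it against the live endpoint, the text could be fetched with `urllib.request.urlopen(url).read().decode()` and passed to `parse_metrics`.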

Re: OOM kills, that was available even before this, in the root directories under /sys/fs/cgroup (it depends on whether the node uses cgroup v1 or v2): either in memory/ or directly there, you'll find memory.stat, and in it the oom_kill field

and from kernel 6.6.28 on (I'm just preparing it for staging) it will also be virtualized directly in /proc/vmstat
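
Reading that counter from inside the VPS could look roughly like the Python sketch below. It assumes the file consists of plain `key value` lines, which is how /proc/vmstat is laid out; the helper name `read_counter` and the sample numbers are made up for illustration.

```python
def read_counter(text, key):
    """Return the integer value of `key` from 'key value' lines, or None if absent."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == key:
            return int(fields[1])
    return None


# Made-up excerpt in the 'key value' layout used by /proc/vmstat.
SAMPLE = "pgfault 123456\noom_kill 3\npgmajfault 42\n"

oom_kills = read_counter(SAMPLE, "oom_kill")
print(oom_kills)
```

On a real system the text would come from something like `open("/proc/vmstat").read()`, and an alert would fire when the value is higher than at the previous check.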

Hi, as an exercise I tried pulling in the date I've paid until, and it works beautifully.
In case anyone else here also runs mostly Zabbix and is a bigger noob at it than I am:

on the host, add a macro {$PROMETHEUS_API_KEY} holding the key

create a web check of type text for the address https://api.vpsfree.cz/metrics?access_token={$PROMETHEUS_API_KEY}
add a dependent item on it
type float, unit unixtime
in preprocessing, with type numeric (float),
use Prometheus pattern

https://www.zabbix.com/documentation/current/en/manual/config/items/itemtypes/prometheus
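
For comparison, the same idea outside Zabbix (pick one sample out of the scraped text and treat its value as a Unix timestamp) can be sketched in Python. The metric name `example_paid_until_seconds` and its label are hypothetical placeholders; check the actual endpoint output for the real name.

```python
import re
from datetime import datetime, timezone


def metric_value(text, name):
    """Return the value of the first sample called `name`, or None if not found."""
    pattern = re.compile(
        r'^' + re.escape(name) + r'(?:\{.*\})?\s+(\S+)', re.MULTILINE
    )
    match = pattern.search(text)
    return float(match.group(1)) if match else None


def days_until(timestamp, now):
    """Whole days from `now` until a Unix `timestamp` (negative if already past)."""
    delta = datetime.fromtimestamp(timestamp, tz=timezone.utc) - now
    return delta.days


# Hypothetical sample line; 1717200000 is 2024-06-01 00:00:00 UTC.
SAMPLE = 'example_paid_until_seconds{user_id="1"} 1717200000\n'

paid_until = metric_value(SAMPLE, "example_paid_until_seconds")
now = datetime(2024, 5, 1, tzinfo=timezone.utc)
remaining = days_until(paid_until, now)
print(remaining)  # 31
```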

And I'm adding a discovery rule for this to my long-term TODO; if I manage it I'll share, but knowing myself, don't expect it too soon.
https://www.zabbix.com/documentation/current/en/manual/discovery/low_level_discovery/examples/prometheus

Thanks
S.

On Fri, 19 Apr 2024 at 15:23, aither via vpsFree.cz Discourse <discourse@vpsfree.cz> wrote:
