Trying to understand the different selfhosted monitoring solutions

dr_robot@kbin.social · 1 year ago

Trying to understand the different selfhosted monitoring solutions

Dran@lemmy.world · 1 year ago

check_mk is what I use at home and at work, it’s a fork of nagios/icinga, works with agents, nagios plugins, or snmp, and if somehow you can’t find what you want to monitor, writing custom checks is as easy as writing a bash script

1 year ago

I’m also using checkmk and have been happy with it. I had been using zabbix prior but found it so be cumbersome and sluggish.

Capillary7379@lemmy.world · 1 year ago

I opted for checkmk as well and don’t want to switch. It’s got a good default for Linux monitoring and it will tell me about random things to fix after reboots, or that memory/disc is getting low so I can fix it quickly.

When monitoring 15 virtual machines on one physical the default of checking every minute for all machines raised the temp over 80 degrees Celsius on the physical machine and triggered a warning. Checking every five minutes is more that I need, so I went with that change.

Dran@lemmy.world · 1 year ago

I have a little 4 core/ 8gb ram VM running my work instance that monitors over a thousand clients on 60s check intervals, you may want to look into your config. I honestly have no idea what could cause 15 machines to cost that much computationally

Capillary7379@lemmy.world · 1 year ago

Sorry, 70 degrees, not 80. The load was fine. It’s a machine to test things, but I kept using checkmk since I really liked it. All on one server, both monitor server and all clients.It’s an old workstation - it runs around 60 degrees normally.

That said, it could very much be a config issue, I installed with the ansible role and left most everything as default. A very easy installation, and with ansible very easy to add new hosts to monitor as well. I’m up to 36 now, including some docker containers.

I switched back to 1 minute to test, and is warned for temp within 20 minutes, from 60 degrees to hovering around 70. Load from 2 to 3.5, threads from 1k to 1.2k all on the physical side. There’s also a small change in IO that seems to be the checkmk server writing more to disk - the cpu on that host is only slighty.

I’m guessing that the temp going over is hardware related, a better fan might fix that issue.

I don’t know if the load/thread increase is reasonable, but given the amount of checks done in the agent I’m perfectly OK with giving those resources to have all the data points checkmk collects available. It’s helped a lot being able to go into details to see what’s going on, checkmk makes that so easy.