How do you observe your server functions?

Fermiverse@kbin.social · edit-2 1 year ago

How do you observe your server functions?

tko@tkohhh.social · 1 year ago

Regarding your edit: people are answering the question you posed in your post title, not necessarily giving you advice about how you should do it.

Decronym@lemmy.decronym.xyz · edit-2 1 year ago

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

Fewer Letters	More Letters
DNS	Domain Name Service/System
HA	Home Assistant automation software
~	High Availability
SSD	Solid State Drive mass storage
VPN	Virtual Private Network

4 acronyms in this thread; the most compressed thread commented on today has 14 acronyms.

[Thread #11 for this sub, first seen 19th Jul 2023, 17:40] [FAQ] [Full list] [Contact] [Source code]

SheeEttin@lemmy.world · 1 year ago

I’ll keep it very simple: I don’t.

If I’m trying to do something and I notice an issue, then I’ll investigate it. But if it’s not affecting anything, is it really a problem?

mea_rah@lemmy.world · 1 year ago

I was kind of the same, but I still collected metrics, because I just love graphs.

Over time I ended up setting alerts for failures I wish I was aware of earlier. Some examples:

HDD monitoring - usually drive is showing signs of failure couple days before it fails, so I have time to shop around for replacement. If I had no alert set, I’d probably only notice when both sides of a mirror failed which would mean couple days of downtime, lot of work with backup restoration and very limited time to find drive for reasonable price
networking issues - especially VPN, it’s much better to know that it is broken before you leave house
some core services like DNS. With two Adguard instances it’s much better to be alerted when one is down, than to realize that you suddenly have no DNS when both fail and you can’t even google stuff without messing with your connection settings.
SSD writes - same as HDDs, but in this case the alert is around 90% declared TBW lifetime claimed by manufacturer and I tend to replace them proactively as they are usually used as system disk without mirror, which holds no valuable data, but would again lead to extended unplanned downtime
CPU usage being maxed out for long time - I had one service fail in a way where it consumed 100% of all cores. This had no impact on other services because process scheduler did its job, but I ended up burning kilowats of electricity as this continued unnoticed for weeks. This was before energy prices went up, but it was still noticeable power consumption. (Had double CPU server back then, that consumed a lot of juice when maxed out)

H2iK@lemmy.world · 1 year ago

What do you use to collect these metrics?

mea_rah@lemmy.world · 1 year ago

I use Telegraf for most of the metrics.

FrostyCaveman@lemm.ee · 1 year ago

Prometheus, Loki and Grafana.

And so so many Prometheus metric exporters.

Observability is such an endless rabbit hole, it’s so easy for me to spend huge amounts of time accomplishing not that much lol. But very enjoyable and cool to see it all come together.

My pro tips: using Kubernetes actually makes this stuff a heck of a lot easier to set up thanks to the common patterns that k8s has - lots of turnkey helm charts out there that make it all so easy and are powerful. Another tip would be to use Prometheus service discovery if you can. Also, Loki/Promtail is actually quite easy to set up - but using LogQL queries can be very tricky. Just be warned, observability is a full time hobby in itself lol

easeKItMAn@lemmy.world · edit-2 1 year ago

I set up custom bash scripts collecting information (df, docker json, smartCTL etc) Either parse existing json info or assemble json strings and push it to Homeassistant REST api (cron) In Homeassistant data is turned into sensors and displayed. HA sends messages of sensors fail.
Info served in HA:

HDD/SSD (size, smartCTL errors, spin up/down, temperature etc)
Availability/health of docker services
CPU usage/RAM/temperature
Network interface/throughput/speed/connections
fail2ban jails

Trying to keep my servers as barebones as possible. Additional services/apps put strain on CPU/RAM etc. Found out most of data necessary for monitoring is either available (docker json, smartCTL json) or can be easily caught, e.g.

df -Pht ext4 | tail -n +2 | awk '{ print $1}

It was fun learning and defining what must be monitored or not, and building a custom interface in HA.

tychosmoose@lemm.ee · 1 year ago

Monit for simple stuff and daemon restart on failure. LibreNMS for SNMP polling, graphing, logging, & alerting.

pythia@lemmy.dbzer0.com · 1 year ago

I used monitorix a long time ago. now netdata.

m_randall@sh.itjust.works · 1 year ago

I’ve used monit for maybe 2 decades now. Works great and simple to use.

https://mmonit.com/monit/

vegetaaaaaaa@lemmy.world · edit-2 1 year ago

https://github.com/awesome-foss/awesome-sysadmin#monitoring

I use netdata (agent only, not the cloud/SaaS stuff)