Initial
parent 3fc0100dcf
commit 9fe5418c62
README.md
@@ -1,2 +1,100 @@
# prtg-smart-health-check

# PRTG SMART Monitoring Script (`prtg-smart-health-check.v1.sh`)

This Bash script monitors the S.M.A.R.T. values of a physical hard disk or SSD (SATA, SAS, or NVMe) and outputs the data in PRTG-compatible XML. It is integrated through an **SSH Script Advanced** sensor and automatically detects whether the device is a classic SATA/SAS drive or an NVMe device.

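The automatic detection can be pictured with a minimal sketch. It assumes, as the script does, that NVMe output from `smartctl` contains the string `NVMe Version`. The helper name `detect_type` and both sample texts are made up for illustration and are not part of the script.

```bash
#!/usr/bin/env bash
# Sketch: classify a device from captured smartctl output.
# detect_type is a hypothetical helper name; the marker string "NVMe Version"
# is the one the script greps for.
detect_type() {
    local smart_output="$1"
    if grep -q "NVMe Version" <<<"$smart_output"; then
        echo "nvme"
    else
        echo "sata"
    fi
}

# Abbreviated sample text standing in for real smartctl output (assumption).
nvme_sample=$'Model Number:  Example NVMe SSD\nNVMe Version:  1.4'
sata_sample=$'Device Model:  Example SATA HDD\nSATA Version is:  SATA 3.3, 6.0 Gb/s'

detect_type "$nvme_sample"   # prints: nvme
detect_type "$sata_sample"   # prints: sata
```

Anything that is not recognised as NVMe falls through to the SATA branch, so SAS drives are handled by the same code path as SATA drives.
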
## 🔧 Prerequisites

- A Linux system (e.g. Debian/Ubuntu/Proxmox)
- The `smartmontools` package must be installed:
  ```bash
  sudo apt install smartmontools
  ```
- The script must be run with `bash`
- PRTG needs SSH access with a user that is allowed to run the script

## 📁 Installation

1. Create the target directory for the script (this is the directory the SSH Script Advanced sensor reads scripts from):
   ```bash
   sudo mkdir -p /var/prtg/scriptsxml/
   ```

2. Copy the script into this directory and make it executable:
   ```bash
   sudo cp prtg-smart-health-check.v1.sh /var/prtg/scriptsxml/
   sudo chmod +x /var/prtg/scriptsxml/prtg-smart-health-check.v1.sh
   ```

3. Allow the PRTG SSH user to run `smartctl` without a password:
   ```bash
   echo "prtguser ALL=(ALL) NOPASSWD: /usr/sbin/smartctl" | sudo tee /etc/sudoers.d/prtg-smart
   ```
   > **Note**: Replace `prtguser` with the actual user that PRTG uses for SSH.

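A syntax error in a sudoers drop-in can lock you out of `sudo`, so it can be worth validating the rule before it is installed. The following is a minimal sketch and not part of the script: the helper names `sudoers_rule` and `install_sudoers_rule` are hypothetical, and it assumes `visudo` and `install` are available on the host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the sudoers rule for a given SSH user.
sudoers_rule() {
    local user="$1"
    printf '%s ALL=(ALL) NOPASSWD: /usr/sbin/smartctl\n' "$user"
}

# Hypothetical helper: validate the rule in a temporary file, then install it.
# It is only defined here, not called, because it needs root privileges.
install_sudoers_rule() {
    local user="$1" target="/etc/sudoers.d/prtg-smart" tmp
    tmp=$(mktemp) || return 1
    sudoers_rule "$user" > "$tmp"
    # visudo -c -f checks the syntax of the given file without editing it
    if sudo visudo -c -f "$tmp" >/dev/null; then
        sudo install -m 0440 -o root -g root "$tmp" "$target"
    else
        echo "sudoers rule failed validation" >&2
        rm -f "$tmp"
        return 1
    fi
    rm -f "$tmp"
}

sudoers_rule prtguser   # prints the rule without touching the system
```

Calling `install_sudoers_rule prtguser` would then write `/etc/sudoers.d/prtg-smart` with mode `0440` only if the syntax check passes.
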
## 🧪 Test Run

You can test the script manually beforehand:

```bash
sudo /var/prtg/scriptsxml/prtg-smart-health-check.v1.sh /dev/nvme0n1
```

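To judge the result of a test run quickly, the output can be checked for the `<prtg>` envelope and for the `<error>1</error>` marker that the script emits on failure. This is a minimal sketch: `check_prtg_output` is a hypothetical helper, and the success sample is shortened for illustration.

```bash
#!/usr/bin/env bash
# Sketch: classify script output as "ok", "error", or "invalid".
# check_prtg_output is a hypothetical helper, not part of the script.
check_prtg_output() {
    local output="$1"
    case "$output" in
        "<prtg>"*"</prtg>") ;;             # envelope present, keep checking
        *) echo "invalid"; return 0 ;;
    esac
    if [[ "$output" == *"<error>1</error>"* ]]; then
        echo "error"
    else
        echo "ok"
    fi
}

# Shortened success sample (assumption) and an error message the script emits.
ok_sample='<prtg><result><channel>Temperature (°C)</channel><value>35</value></result></prtg>'
err_sample='<prtg><error>1</error><text>Kein Gerät angegeben</text></prtg>'

check_prtg_output "$ok_sample"        # prints: ok
check_prtg_output "$err_sample"       # prints: error
check_prtg_output "sudo: a password is required"   # prints: invalid
```

On a real host the helper could be fed with `check_prtg_output "$(sudo /var/prtg/scriptsxml/prtg-smart-health-check.v1.sh /dev/nvme0n1)"`.
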
## ⚙️ PRTG Integration

1. Open your PRTG web interface
2. Select the desired device (the Linux host)
3. Click **"Add Sensor" → "SSH Script Advanced"**
4. Select the script `prtg-smart-health-check.v1.sh` from the dropdown
5. Enter the physical device as the **parameter**, e.g.:
   ```
   /dev/sda
   ```
   or
   ```
   /dev/nvme0n1
   ```

6. Leave the **"Mutex Name"** field empty
7. Save, and you are done ✅

> ❗ **Important**:
> The user that PRTG connects with over SSH must be able to read the specified device (e.g. `/dev/nvme0n1`). This is usually only possible by running `smartctl` through `sudo`, which is why the `sudoers` file is required.

## 📊 What Is Monitored?

**SATA/SAS:**
- Temperature
- Power-on hours
- Reallocated sectors
- Pending sectors
- Uncorrectable sectors
- CRC errors and more

**NVMe:**
- Temperature
- Wear level (percentage used)
- Data written/read (TB)
- Unsafe shutdowns
- Error logs
- Power cycles
- Wear trend shown directly in the sensor text

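For SATA drives, each channel comes from the raw value of one SMART attribute, looked up by its ID (for example 9 for power-on hours and 194 for temperature). The sketch below shows one way to do that lookup. `raw_value` is a hypothetical helper that finds the `RAW_VALUE` column from the table header rather than using a fixed column number, and the sample table is made up, not real drive data.

```bash
#!/usr/bin/env bash
# Sketch: read the raw value of a SMART attribute by ID.
# raw_value is a hypothetical helper; it locates the RAW_VALUE column from the
# table header, so it does not depend on a fixed column count.
raw_value() {
    local id="$1"
    awk -v id="$id" '
        /^ID#/ { for (i = 1; i <= NF; i++) if ($i == "RAW_VALUE") col = i; next }
        col && $1 == id { print $col; exit }
    '
}

# Made-up sample table in the classic ten-column layout (assumption).
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
194 Temperature_Celsius     0x0022   064   052   000    Old_age   Always       -       36 (Min/Max 20/48)'

raw_value 9   <<<"$sample"   # prints: 4321
raw_value 194 <<<"$sample"   # prints: 36
```

Only the first token of the raw value is printed, which is why the temperature comes out as `36` even though the drive appends minimum and maximum readings.
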
## 📁 Caching for the Wear-Level Trend

The script stores the last wear-level value in a cache file named after the device path, with every `/` replaced by `_`. For `/dev/nvme0n1` this is:
```
/var/prtg/scriptsxml/.smart_wear__dev_nvme0n1.cache
```
The PRTG SSH user needs write access to this file. The sensor text then shows the trend, for example (the script's sensor text is in German; "seit" means "since"):
```
Wear: 9 % – +1 % seit 2025-03-27
```

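The bookkeeping behind the trend text can be sketched as follows. This is an illustration under assumptions, not the script itself: the function names `cache_path` and `wear_trend` are hypothetical, the messages are in English, and the demo writes to a temporary directory instead of `/var/prtg/scriptsxml`.

```bash
#!/usr/bin/env bash
# Sketch of the wear-trend bookkeeping. Function names and the temporary
# cache directory are assumptions for illustration.
cache_path() {
    local dir="$1" device="$2"
    # Replace every "/" in the device path with "_"
    echo "$dir/.smart_wear_${device//\//_}.cache"
}

wear_trend() {
    local cache_file="$1" current="$2" today="$3"
    local last_level last_date diff
    if [ -f "$cache_file" ]; then
        # The cache holds one line in the form "level:date"
        IFS=: read -r last_level last_date < "$cache_file"
        diff=$((current - last_level))
        if [ "$diff" -ne 0 ]; then
            echo "$current:$today" > "$cache_file"
            printf '%+d %% since %s\n' "$diff" "$last_date"
        else
            echo "unchanged since $last_date"
        fi
    else
        echo "$current:$today" > "$cache_file"
        echo "first data point"
    fi
}

demo_dir=$(mktemp -d)
demo_cache=$(cache_path "$demo_dir" /dev/nvme0n1)
wear_trend "$demo_cache" 8 2025-03-27   # prints: first data point
wear_trend "$demo_cache" 9 2025-03-28   # prints: +1 % since 2025-03-27
wear_trend "$demo_cache" 9 2025-03-29   # prints: unchanged since 2025-03-28
```

The cache is rewritten only when the level changes, so the date in the text always marks the last change rather than the last poll.
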
---

🧠 Ideal for admins who want to detect wear early and see clean NVMe SMART data in their monitoring.

Author: Patrick Asmus – [www.techniverse.net](https://www.techniverse.net)

assets/screenshot_1.png
Binary file not shown. Size: 19 KiB
assets/screenshot_2.png
Binary file not shown. Size: 163 KiB
prtg-smart-health-check.v1.sh
@@ -0,0 +1,135 @@
#!/usr/bin/env bash
# Description: Monitors the S.M.A.R.T. values of a physical device and outputs them as XML for PRTG
# Parameter: /dev/sda
# Author: Patrick Asmus
# Web: https://www.techniverse.net
# Version: 1.0
# Date: 28.03.2025
# Modification: Initial
#####################################################

DEVICE="$1"

# Error text: "No device specified"
if [ -z "$DEVICE" ]; then
    echo "<prtg><error>1</error><text>Kein Gerät angegeben</text></prtg>"
    exit 1
fi

# Error text: "smartctl not installed"
if ! command -v smartctl &>/dev/null; then
    echo "<prtg><error>1</error><text>smartctl nicht installiert</text></prtg>"
    exit 1
fi

# Error text: "Device ... not found" (the parameter must be a block device)
if [ ! -b "$DEVICE" ]; then
    echo "<prtg><error>1</error><text>Gerät $DEVICE nicht gefunden</text></prtg>"
    exit 1
fi

SMART_OUTPUT=$(sudo smartctl -x "$DEVICE")

# Extract manufacturer information
MODEL=$(echo "$SMART_OUTPUT" | awk -F: '/Model Number|Device Model/ {print $2}' | xargs)
SERIAL=$(echo "$SMART_OUTPUT" | awk -F: '/Serial Number/ {print $2}' | xargs)
# Anchored so that the NVMe line "PCI Vendor/Subsystem ID" is not mistaken for the vendor
VENDOR=$(echo "$SMART_OUTPUT" | awk -F: '/^Vendor:/ {print $2}' | xargs)
# Fall back to the first word of the model name
[ -z "$VENDOR" ] && VENDOR=$(echo "$MODEL" | awk '{print $1}')
MODEL=${MODEL:-Unbekanntes Modell}
SERIAL=${SERIAL:-Keine Seriennummer}
VENDOR=${VENDOR:-Unbekannter Hersteller}

# Detect the device type
if echo "$SMART_OUTPUT" | grep -q "NVMe Version"; then
    TYPE="nvme"
else
    TYPE="sata"
fi

XML="<prtg>"

if [ "$TYPE" = "sata" ]; then
    # Print the raw value of the SMART attribute with the given ID.
    # smartctl -x prints the attribute table in brief format, so the RAW_VALUE
    # column is located from the table header instead of assuming column 10.
    get_value() {
        echo "$SMART_OUTPUT" | awk -v id="$1" '
            /^ID#/ { for (i = 1; i <= NF; i++) if ($i == "RAW_VALUE") col = i; next }
            col && $1 == id { print $col; exit }'
    }

    TEMP=$(get_value 194)
    HOURS=$(get_value 9)
    REALLOC=$(get_value 5)
    REALLOC_EVENT=$(get_value 196)
    PENDING=$(get_value 197)
    UNCORRECTABLE=$(get_value 198)
    CRC_ERROR=$(get_value 199)
    REPORTED_UNCORRECT=$(get_value 187)

    # Default attributes the drive does not report to 0
    for var in TEMP HOURS REALLOC REALLOC_EVENT PENDING UNCORRECTABLE CRC_ERROR REPORTED_UNCORRECT; do
        [ -z "${!var}" ] && printf -v "$var" '%s' 0
    done

    XML+="
    <result><channel>Temperature (°C)</channel><value>$TEMP</value><unit>Temperature</unit><limitmode>1</limitmode><limitmaxwarning>45</limitmaxwarning><limitmaxerror>55</limitmaxerror></result>
    <result><channel>Power-On Hours</channel><value>$HOURS</value><unit>TimeHours</unit></result>
    <result><channel>Reallocated Sectors</channel><value>$REALLOC</value><unit>Count</unit><limitmode>1</limitmode><limitmaxerror>10</limitmaxerror></result>
    <result><channel>Reallocated Events</channel><value>$REALLOC_EVENT</value><unit>Count</unit></result>
    <result><channel>Pending Sectors</channel><value>$PENDING</value><unit>Count</unit><limitmode>1</limitmode><limitmaxerror>1</limitmaxerror></result>
    <result><channel>Offline Uncorrectable</channel><value>$UNCORRECTABLE</value><unit>Count</unit><limitmode>1</limitmode><limitmaxerror>1</limitmaxerror></result>
    <result><channel>Reported Uncorrect</channel><value>$REPORTED_UNCORRECT</value><unit>Count</unit></result>
    <result><channel>CRC Error Count</channel><value>$CRC_ERROR</value><unit>Count</unit></result>
    <text>$VENDOR | $MODEL | SN: $SERIAL | $DEVICE</text>"

elif [ "$TYPE" = "nvme" ]; then
    # Print the value of the first "Label: value" line whose label matches $1
    # ($1 is used as an extended regular expression)
    get_nvme_value() {
        echo "$SMART_OUTPUT" | grep -E "^$1:" | head -n1 | awk -F: '{gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2}'
    }

    # smartctl groups digits with a locale-dependent separator, so both "."
    # and "," are stripped from the integer counters.
    TEMP=$(get_nvme_value "Temperature" | awk '{print $1}')
    HOURS=$(get_nvme_value "Power On Hours" | tr -d '.,')
    PERCENT_USED=$(get_nvme_value "Percentage Used" | sed 's/%//')
    MEDIA_ERRORS=$(get_nvme_value "Media and Data Integrity Errors" | tr -d '.,')
    ERROR_LOGS=$(get_nvme_value "Error Information Log Entries" | tr -d '.,')
    UNSAFE_SHUTDOWNS=$(get_nvme_value "Unsafe Shutdowns" | tr -d '.,')
    # " +" tolerates the alignment padding smartctl may print after "Warning"
    WARN_TEMP_TIME=$(get_nvme_value "Warning +Comp\. Temperature Time" | tr -d '.,')
    CRIT_TEMP_TIME=$(get_nvme_value "Critical Comp\. Temperature Time" | tr -d '.,')
    POWER_CYCLES=$(get_nvme_value "Power Cycles" | tr -d '.,')
    # Only matches when smartctl reports the total in TB; otherwise the value defaults to 0
    WRITTEN_TB=$(get_nvme_value "Data Units Written" | grep -o '\[[0-9.]* TB\]' | tr -d '[]TB ')
    READ_TB=$(get_nvme_value "Data Units Read" | grep -o '\[[0-9.]* TB\]' | tr -d '[]TB ')

    # Default missing values to 0 before the trend calculation, so the
    # arithmetic and the cache file never see an empty value
    for var in TEMP HOURS PERCENT_USED MEDIA_ERRORS ERROR_LOGS UNSAFE_SHUTDOWNS WARN_TEMP_TIME CRIT_TEMP_TIME POWER_CYCLES WRITTEN_TB READ_TB; do
        [ -z "${!var}" ] && printf -v "$var" '%s' 0
    done

    # === Track the wear-level trend ===
    CACHE_FILE="/var/prtg/scriptsxml/.smart_wear_${DEVICE//\//_}.cache"
    NOW_DATE=$(date '+%Y-%m-%d')
    if [ -f "$CACHE_FILE" ]; then
        LAST_LEVEL=$(awk -F: '{print $1}' "$CACHE_FILE")
        LAST_DATE=$(awk -F: '{print $2}' "$CACHE_FILE")
        DIFF=$((PERCENT_USED - ${LAST_LEVEL:-0}))

        if [ "$DIFF" -ne 0 ]; then
            # %+d prints an explicit sign, so a decrease is not shown as "+-1"
            printf -v TREND ' – %+d %% seit %s' "$DIFF" "$LAST_DATE"
            echo "$PERCENT_USED:$NOW_DATE" > "$CACHE_FILE"
        else
            TREND=" – unverändert seit $LAST_DATE"
        fi
    else
        echo "$PERCENT_USED:$NOW_DATE" > "$CACHE_FILE"
        TREND=" – erster Messpunkt"
    fi

    # NVMe reports the temperature times in minutes, and the TB values can be
    # fractional, so those channels use a custom unit and the float flag.
    XML+="
    <result><channel>Temperature (°C)</channel><value>$TEMP</value><unit>Temperature</unit><limitmode>1</limitmode><limitmaxwarning>65</limitmaxwarning><limitmaxerror>80</limitmaxerror></result>
    <result><channel>Power-On Hours</channel><value>$HOURS</value><unit>TimeHours</unit></result>
    <result><channel>Wear Level (Percentage Used)</channel><value>$PERCENT_USED</value><unit>Percent</unit><limitmode>1</limitmode><limitmaxwarning>70</limitmaxwarning><limitmaxerror>90</limitmaxerror></result>
    <result><channel>Power Cycles</channel><value>$POWER_CYCLES</value><unit>Count</unit></result>
    <result><channel>Media/Data Errors</channel><value>$MEDIA_ERRORS</value><unit>Count</unit><limitmode>1</limitmode><limitmaxerror>1</limitmaxerror></result>
    <result><channel>SMART Error Logs</channel><value>$ERROR_LOGS</value><unit>Count</unit></result>
    <result><channel>Unsafe Shutdowns</channel><value>$UNSAFE_SHUTDOWNS</value><unit>Count</unit></result>
    <result><channel>Warning Temp Time</channel><value>$WARN_TEMP_TIME</value><unit>Custom</unit><customunit>min</customunit></result>
    <result><channel>Critical Temp Time</channel><value>$CRIT_TEMP_TIME</value><unit>Custom</unit><customunit>min</customunit><limitmode>1</limitmode><limitmaxerror>10</limitmaxerror></result>
    <result><channel>Data Written (TB)</channel><value>$WRITTEN_TB</value><float>1</float><unit>Custom</unit><customunit>TB</customunit></result>
    <result><channel>Data Read (TB)</channel><value>$READ_TB</value><float>1</float><unit>Custom</unit><customunit>TB</customunit></result>
    <text>$VENDOR | $MODEL | SN: $SERIAL | $DEVICE | Wear: $PERCENT_USED %$TREND</text>"
else
    # Error text: "Unknown device type"
    XML+="<error>1</error><text>Unbekannter Gerätetyp</text>"
fi

XML+="</prtg>"
echo "$XML"