diff --git a/docs/PLUGIN_DOC.md b/docs/PLUGIN_DOC.md
index 7636a3c..317031a 100644
--- a/docs/PLUGIN_DOC.md
+++ b/docs/PLUGIN_DOC.md
@@ -11,7 +11,7 @@
 | DimmPlugin | sh -c 'dmidecode -t 17 \| tr -s " " \| grep -v "Volatile\\|None\\|Module" \| grep Size' 2>/dev/null
dmidecode
wmic memorychip get Capacity | - | **Collection Args:**
- `skip_sudo`: bool | [DimmDataModel](#DimmDataModel-Model) | [DimmCollector](#Collector-Class-DimmCollector) | - |
 | DkmsPlugin | dkms status
dkms --version | **Analyzer Args:**
- `dkms_status`: Union[str, list]
- `dkms_version`: Union[str, list]
- `regex_match`: bool | - | [DkmsDataModel](#DkmsDataModel-Model) | [DkmsCollector](#Collector-Class-DkmsCollector) | [DkmsAnalyzer](#Data-Analyzer-Class-DkmsAnalyzer) |
 | DmesgPlugin | dmesg --time-format iso -x
ls -1 /var/log/dmesg* 2>/dev/null \| grep -E '^/var/log/dmesg(\.[0-9]+(\.gz)?)?$' \|\| true | **Built-in Regexes:**
- Out of memory error: `(?:oom_kill_process.*)\|(?:Out of memory.*)`
- I/O Page Fault: `IO_PAGE_FAULT`
- Kernel Panic: `\bkernel panic\b.*`
- SQ Interrupt: `sq_intr`
- SRAM ECC: `sram_ecc.*`
- Failed to load driver. IP hardware init error.: `\[amdgpu\]\] \*ERROR\* hw_init of IP block.*`
- Failed to load driver. IP software init error.: `\[amdgpu\]\] \*ERROR\* sw_init of IP block.*`
- Real Time throttling activated: `sched: RT throttling activated.*`
- RCU preempt detected stalls: `rcu_preempt detected stalls.*`
- RCU preempt self-detected stall: `rcu_preempt self-detected stall.*`
- QCM fence timeout: `qcm fence wait loop timeout.*`
- General protection fault: `(?:[\w-]+(?:\[[0-9.]+\])?\s+)?general protectio...`
- Segmentation fault: `(?:segfault.*in .*\[)\|(?:[Ss]egmentation [Ff]au...`
- Failed to disallow cf state: `amdgpu: Failed to disallow cf state.*`
- Failed to terminate tmr: `\*ERROR\* Failed to terminate tmr.*`
- Suspend of IP block failed: `\*ERROR\* suspend of IP block <\w+> failed.*`
- amdgpu Page Fault: `(amdgpu \w{4}:\w{2}:\w{2}\.\w:\s+amdgpu:\s+\[\S...`
- Page Fault: `page fault for address.*`
- Fatal error during GPU init: `(?:amdgpu)(.*Fatal error during GPU init)\|(Fata...`
- PCIe AER Error Status: `(pcieport [\w:.]+: AER: aer_status:[^\n]*(?:\n[...`
- PCIe AER Correctable Error Status: `(.*aer_cor_status: 0x[0-9a-fA-F]+, aer_cor_mask...`
- PCIe AER Uncorrectable Error Status: `(.*aer_uncor_status: 0x[0-9a-fA-F]+, aer_uncor_...`
- PCIe AER Uncorrectable Error Severity with TLP Header: `(.*aer_uncor_severity: 0x[0-9a-fA-F]+.*)(\n.*TL...`
- Failed to read journal file: `Failed to read journal file.*`
- Journal file corrupted or uncleanly shut down: `journal corrupted or uncleanly shut down.*`
- ACPI BIOS Error: `ACPI BIOS Error`
- ACPI Error: `ACPI Error`
- Filesystem corrupted!: `EXT4-fs error \(device .*\):`
- Error in buffered IO, check filesystem integrity: `(Buffer I\/O error on dev)(?:ice)? (\w+)`
- PCIe card no longer present: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`
- PCIe Link Down: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`
- Mismatched clock configuration between PCIe device and host: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(curren...`
- RAS Correctable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Uncorrectable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Deferred Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- RAS Corrected PCIe Error: `((?:\[Hardware Error\]:\s+)?event severity: cor...`
- GPU Reset: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- GPU reset failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`
- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`
- MCE Error: `\[Hardware Error\]:.+MC\d+_STATUS.*(?:\n.*){0,5}`
- Mode 2 Reset Failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)? (...`
- RAS Corrected Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`
- SGX Error: `x86/cpu: SGX disabled by BIOS`
- MMP Error: `Failed to load MMP firmware qat_4xxx_mmp.bin`
- GPU Throttled: `amdgpu \w{4}:\w{2}:\w{2}.\w: amdgpu: WARN: GPU ...`
- RAS Poison Consumed: `amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...`
- RAS Poison created: `amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...`
- Bad page threshold exceeded: `(amdgpu: Saved bad pages (\d+) reaches threshol...`
- RAS Hardware Error: `Hardware error from APEI Generic Hardware Error...`
- Error Address: `Error Address.*(?:\s.*)`
- RAS EDR Event: `EDR: EDR event received`
- DPC Event: `DPC: .*`
- LNet: ko2iblnd has no matching interfaces: `(?:\[[^\]]+\]\s*)?LNetError:.*ko2iblnd:\s*No ma...`
- LNet: Error starting up LNI: `(?:\[[^\]]+\]\s*)?LNetError:\s*.*Error\s*-?\d+\...`
- Lustre: network initialisation failed: `LustreError:.*ptlrpc_init_portals\(\).*network ...` | **Collection Args:**
- `collect_rotated_logs`: bool
- `skip_sudo`: bool
- `log_dmesg_data`: bool | [DmesgData](#DmesgData-Model) | [DmesgCollector](#Collector-Class-DmesgCollector) | [DmesgAnalyzer](#Data-Analyzer-Class-DmesgAnalyzer) |
-| FabricsPlugin | ibstat
ibv_devinfo
ls -l /sys/class/infiniband/*/device/net
mst start
mst status -v
ofed_info -s | - | - | [FabricsDataModel](#FabricsDataModel-Model) | [FabricsCollector](#Collector-Class-FabricsCollector) | - |
+| FabricsPlugin | lspci \| grep -i cassini
lsmod \| grep cxi
cxi_stat
ibstat
ibv_devinfo
ls -l /sys/class/infiniband/*/device/net
fi_info -p cxi
mst start
mst status -v
ip link show
ofed_info -s | - | - | [FabricsDataModel](#FabricsDataModel-Model) | [FabricsCollector](#Collector-Class-FabricsCollector) | - |
 | JournalPlugin | journalctl --no-pager --system --output=short-iso
journalctl --no-pager --system --output=json | **Analyzer Args:**
- `check_priority`: Optional[int]
- `group`: bool | **Collection Args:**
- `boot`: Optional[int] | [JournalData](#JournalData-Model) | [JournalCollector](#Collector-Class-JournalCollector) | [JournalAnalyzer](#Data-Analyzer-Class-JournalAnalyzer) |
 | KernelPlugin | sh -c 'uname -a'
sh -c 'cat /proc/sys/kernel/numa_balancing'
wmic os get Version /Value | **Analyzer Args:**
- `exp_kernel`: Union[str, list]
- `exp_numa`: Optional[int]
- `regex_match`: bool | - | [KernelDataModel](#KernelDataModel-Model) | [KernelCollector](#Collector-Class-KernelCollector) | [KernelAnalyzer](#Data-Analyzer-Class-KernelAnalyzer) |
 | KernelModulePlugin | cat /proc/modules
modinfo amdgpu
wmic os get Version /Value | **Analyzer Args:**
- `kernel_modules`: dict[str, dict]
- `regex_filter`: list[str] | - | [KernelModuleDataModel](#KernelModuleDataModel-Model) | [KernelModuleCollector](#Collector-Class-KernelModuleCollector) | [KernelModuleAnalyzer](#Data-Analyzer-Class-KernelModuleAnalyzer) |
@@ -25,7 +25,7 @@
 | RdmaPlugin | rdma link -j
rdma dev
rdma link
rdma statistic -j | - | - | [RdmaDataModel](#RdmaDataModel-Model) | [RdmaCollector](#Collector-Class-RdmaCollector) | [RdmaAnalyzer](#Data-Analyzer-Class-RdmaAnalyzer) |
 | RocmPlugin | {rocm_path}/opencl/bin/*/clinfo
env \| grep -Ei 'rocm\|hsa\|hip\|mpi\|openmp\|ucx\|miopen'
ls /sys/class/kfd/kfd/proc/
grep -i -E 'rocm' /etc/ld.so.conf.d/*
{rocm_path}/bin/rocminfo
ls -v -d {rocm_path}*
ls -v -d {rocm_path}-[3-7]* \| tail -1
ldconfig -p \| grep -i -E 'rocm'
grep . -r {rocm_path}/.info/* | **Analyzer Args:**
- `exp_rocm`: Union[str, list]
- `exp_rocm_latest`: str
- `exp_rocm_sub_versions`: dict[str, Union[str, list]] | **Collection Args:**
- `rocm_path`: str | [RocmDataModel](#RocmDataModel-Model) | [RocmCollector](#Collector-Class-RocmCollector) | [RocmAnalyzer](#Data-Analyzer-Class-RocmAnalyzer) |
 | StoragePlugin | sh -c 'df -lH -B1 \| grep -v 'boot''
wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace | - | **Collection Args:**
- `skip_sudo`: bool | [StorageDataModel](#StorageDataModel-Model) | [StorageCollector](#Collector-Class-StorageCollector) | [StorageAnalyzer](#Data-Analyzer-Class-StorageAnalyzer) |
-| SysSettingsPlugin | cat /sys/{}
ls -1 /sys/{} | **Analyzer Args:**
- `checks`: Optional[list[nodescraper.plugins.inband.sys_settings.analyzer_args.SysfsCheck]] | **Collection Args:**
- `paths`: list[str]
- `directory_paths`: list[str] | [SysSettingsDataModel](#SysSettingsDataModel-Model) | [SysSettingsCollector](#Collector-Class-SysSettingsCollector) | [SysSettingsAnalyzer](#Data-Analyzer-Class-SysSettingsAnalyzer) |
+| SysSettingsPlugin | cat /sys/{}
ls -1 /sys/{}
ls -l /sys/{} | **Analyzer Args:**
- `checks`: Optional[list[nodescraper.plugins.inband.sys_settings.analyzer_args.SysfsCheck]] | **Collection Args:**
- `paths`: list[str]
- `directory_paths`: list[str] | [SysSettingsDataModel](#SysSettingsDataModel-Model) | [SysSettingsCollector](#Collector-Class-SysSettingsCollector) | [SysSettingsAnalyzer](#Data-Analyzer-Class-SysSettingsAnalyzer) |
 | SysctlPlugin | sysctl -n | **Analyzer Args:**
- `exp_vm_swappiness`: Optional[int]
- `exp_vm_numa_balancing`: Optional[int]
- `exp_vm_oom_kill_allocating_task`: Optional[int]
- `exp_vm_compaction_proactiveness`: Optional[int]
- `exp_vm_compact_unevictable_allowed`: Optional[int]
- `exp_vm_extfrag_threshold`: Optional[int]
- `exp_vm_zone_reclaim_mode`: Optional[int]
- `exp_vm_dirty_background_ratio`: Optional[int]
- `exp_vm_dirty_ratio`: Optional[int]
- `exp_vm_dirty_writeback_centisecs`: Optional[int]
- `exp_kernel_numa_balancing`: Optional[int] | - | [SysctlDataModel](#SysctlDataModel-Model) | [SysctlCollector](#Collector-Class-SysctlCollector) | [SysctlAnalyzer](#Data-Analyzer-Class-SysctlAnalyzer) |
 | SyslogPlugin | ls -1 /var/log/syslog* 2>/dev/null \| grep -E '^/var/log/syslog(\.[0-9]+(\.gz)?)?$' \|\| true | - | - | [SyslogData](#SyslogData-Model) | [SyslogCollector](#Collector-Class-SyslogCollector) | - |
 | UptimePlugin | uptime | - | - | [UptimeDataModel](#UptimeDataModel-Model) | [UptimeCollector](#Collector-Class-UptimeCollector) | - |
@@ -257,6 +257,11 @@ Collect InfiniBand/RDMA fabrics configuration details
 - **CMD_OFED_INFO**: `ofed_info -s`
 - **CMD_MST_START**: `mst start`
 - **CMD_MST_STATUS**: `mst status -v`
+- **CMD_CASSINI_PCI**: `lspci | grep -i cassini`
+- **CMD_NET_LINK**: `ip link show`
+- **CMD_LIBFABRIC_INFO**: `fi_info -p cxi`
+- **CMD_CXI_STAT**: `cxi_stat`
+- **CMD_CXI_MODULES**: `lsmod | grep cxi`

 ### Provides Data

@@ -264,11 +269,16 @@ FabricsDataModel

 ### Commands

+- lspci | grep -i cassini
+- lsmod | grep cxi
+- cxi_stat
 - ibstat
 - ibv_devinfo
 - ls -l /sys/class/infiniband/*/device/net
+- fi_info -p cxi
 - mst start
 - mst status -v
+- ip link show
 - ofed_info -s

 ## Collector Class JournalCollector

@@ -729,6 +739,7 @@ Collect sysfs settings from user-specified paths.
 - **SUPPORTED_OS_FAMILY**: `{}`
 - **CMD**: `cat /sys/{}`
 - **CMD_LS**: `ls -1 /sys/{}`
+- **CMD_LS_LONG**: `ls -l /sys/{}`

 ### Provides Data

@@ -738,6 +749,7 @@ SysSettingsDataModel

 - cat /sys/{}
 - ls -1 /sys/{}
+- ls -l /sys/{}

 ## Collector Class SysctlCollector

@@ -928,6 +940,7 @@ Complete InfiniBand/RDMA fabrics configuration data
 - **ibdev_netdev_mappings**: `List[nodescraper.plugins.inband.fabrics.fabricsdata.IbdevNetdevMapping]`
 - **ofed_info**: `Optional[nodescraper.plugins.inband.fabrics.fabricsdata.OfedInfo]`
 - **mst_status**: `Optional[nodescraper.plugins.inband.fabrics.fabricsdata.MstStatus]`
+- **slingshot_data**: `Optional[nodescraper.plugins.inband.fabrics.fabricsdata.SlingshotData]`

 ## JournalData Model