# Fault detection, isolation, and recovery (FDIR)
This document defines how **FiWiControl** detects faults, **isolates** their impact (fail-safe defaults, timeouts, bounded inputs), and how operators **recover**. It complements **`docs/architecture.md`** and the **`[FiWi-FDIR]`** log prefix implemented in **`fiwicontrol.fabric.fdir`**.

**Scope:** Lab automation, fabric JSON/INI workflows, USB discovery, and related CLIs. It does **not** define network datapath FDIR (that lives in the FiWi system design; see **`html/Fi-Wi-L4S.html`**).

---
## 1. Principles
1. **Detect early** — Compare **live** USB topology to **saved** `discovery_fingerprint` before trusting power or harness steps.
2. **Fail closed for strict paths:** `Fabric.realize(strict=True)` and `fabric_realize.py --realize` (strict by default) **abort** on fingerprint mismatch.
3. **Bound resource use** — Patch-panel JSON loads are **size- and entry-capped** (`fiwicontrol.fabric.patch_panel_json`).
4. **Observable** — Use exit codes, stderr lines prefixed with **`[FiWi-FDIR]`**, and structured log records for aggregation.
5. **No silent hardware actions** — Harnesses use **`--dry-run`** until power paths are explicitly wired; see harness **DESIGN_GAPS** in script docstrings.
---
## 2. Fault taxonomy
| Class | Meaning | Typical cause |
|-------|---------|----------------|
| **F-ENV** | Environment / dependency | Missing `[power]` / `brainstem`, USB permissions, empty Acroname enumeration. |
| **F-CFG** | Configuration | Bad INI/JSON, schema validation, missing `[fabric.rrh.*]`, strict INI verify failure without `--force`. |
| **F-FAB-STALE** | Fabric fingerprint mismatch | Wrong hub, cable, or JSON from another bench. |
| **F-FAB-DISC** | Discovery fault / timeout | Stuck USB enumeration, driver issue, **`--realize-discovery-timeout`** exceeded. |
| **F-SSH** | Remote execution | Unreachable host, auth failure, timeout — surfaced by `ssh_node` / `rexec` (see **`docs/node-control-asyncio-design.md`**). |
| **F-INV** | Inventory vs reality | `verify-inventory` / `--strict-ini` mismatches (Acroname / Monsoon counts). |
---
## 3. Detection mechanisms
### 3.1 `Fabric.binding_cache_status(path)`
| Status | Detection |
|--------|-----------|
| **MISSING** | JSON path does not exist. |
| **INVALID** | Unreadable or invalid JSON / schema validation failure. |
| **UNKNOWN** | Live fingerprint unavailable (e.g. discovery not possible). |
| **STALE** | Live fingerprint ≠ on-disk `discovery_fingerprint`. |
| **READY** | Match. |

Logs (via **`log_fdir`**) record **STALE** at **WARNING**, **INVALID** and **UNKNOWN** at **INFO**, and **READY** at **DEBUG**.
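
The status table above can be read as a short decision chain. The following is a minimal sketch of that chain, not the real `Fabric.binding_cache_status`: `BindingStatus`, `classify_binding`, and the check order are assumptions made for illustration, and schema validation is reduced to requiring a `discovery_fingerprint` key.

```python
import json
from enum import Enum
from pathlib import Path
from typing import Optional


class BindingStatus(Enum):
    """Hypothetical enum naming the five statuses from the table."""
    MISSING = "missing"
    INVALID = "invalid"
    UNKNOWN = "unknown"
    STALE = "stale"
    READY = "ready"


def classify_binding(path: Path, live_fingerprint: Optional[str]) -> BindingStatus:
    """Hypothetical helper: classify a saved fabric JSON against a live fingerprint."""
    if not path.exists():
        return BindingStatus.MISSING
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        saved = data["discovery_fingerprint"]
    except (OSError, ValueError, KeyError, TypeError):
        # Unreadable file, malformed JSON, or missing fingerprint field.
        return BindingStatus.INVALID
    if live_fingerprint is None:
        # Assumption: None stands for "discovery was not possible".
        return BindingStatus.UNKNOWN
    if saved != live_fingerprint:
        return BindingStatus.STALE
    return BindingStatus.READY
```

In this sketch a caller would treat only `BindingStatus.READY` as safe to proceed, matching the exit-0 rule of the `status` CLI.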
### 3.2 `Fabric.realize(strict=…)`
- Runs **live** Acroname discovery (optional **`discovery_timeout`**).
- **strict=True:** raises **`ValueError`** on fingerprint mismatch (preceded by **`[FiWi-FDIR]`** WARNING log).
- **strict=False:** continues; emits **`[FiWi-FDIR]`** WARNING with **expected** and **live** fingerprints (audit trail).
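
The strict and non-strict behaviours differ only in what happens after the WARNING is logged. A rough sketch of that split, assuming a hypothetical `check_fingerprint` helper and logger name (the real logic sits inside `Fabric.realize`):

```python
import logging

# Assumed logger name; the real module may use a different one.
logger = logging.getLogger("fiwicontrol.fabric.fdir.sketch")
FDIR_PREFIX = "[FiWi-FDIR]"


def check_fingerprint(expected: str, live: str, strict: bool = True) -> bool:
    """Return True on match. On mismatch, log a WARNING, then raise if strict."""
    if expected == live:
        return True
    # Both modes log expected and live fingerprints first (audit trail).
    logger.warning(
        "%s fingerprint mismatch expected=%s live=%s strict=%s",
        FDIR_PREFIX, expected, live, strict,
    )
    if strict:
        raise ValueError(
            f"fabric fingerprint mismatch: expected {expected}, live {live}"
        )
    return False
```

Logging before raising means the mismatch is recorded even when a caller swallows the `ValueError`.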
### 3.3 `python3 -m fiwicontrol.fabric status`
Uses binding cache semantics; exit **0** only when **READY** (see **`docs/fabric-builder.md`**).
### 3.4 `scripts/system/fabric_realize.py`
- **Compose** path: runs discovery once; failures exit with code **1** (F-ENV) or **2** (F-CFG), as listed in §6.
- **`--realize`:** second discovery pass inside **`Fabric.realize`** with **`--realize-discovery-timeout`** (default **120** s).
---
## 4. Isolation
| Mechanism | Effect |
|-----------|--------|
| **Strict realize** | Prevents proceeding with a **stale** fabric definition when strict mode is on. |
| **Discovery timeout** | Cancels hung enumeration during **`Fabric.realize`** when a timeout is set (CLI: **`--realize-discovery-timeout`**). |
| **Patch panel JSON limits** | Ignores oversized or over-count **`bdf_to_patch`** maps (see module docstring in **`patch_panel_json.py`**). |
| **Concentrator report** | Human report isolates concentrator probe failures: snapshot errors become a **parenthesized** line instead of aborting the whole fabric summary. |
| **pytest remote tests** | **`FIWI_RUN_REMOTE_TESTS`** gates SSH integration so CI does not hit live rigs by default. |
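
The patch-panel limits can be pictured as two guards applied before any entry is trusted. The sketch below assumes a top-level `bdf_to_patch` object in the JSON file; the cap values and the `load_bdf_to_patch` name are illustrative, and the real limits are documented in `patch_panel_json.py`.

```python
import json
from pathlib import Path
from typing import Dict

MAX_BYTES = 64 * 1024  # assumed size cap, not the real value
MAX_ENTRIES = 256      # assumed entry cap, not the real value


def load_bdf_to_patch(path: Path) -> Dict[str, str]:
    """Return the bdf_to_patch map, or {} if the file is missing,
    oversized, malformed, or holds too many entries."""
    try:
        # Check the size first so a huge file is never read into memory.
        if path.stat().st_size > MAX_BYTES:
            return {}
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    mapping = data.get("bdf_to_patch") if isinstance(data, dict) else None
    if not isinstance(mapping, dict) or len(mapping) > MAX_ENTRIES:
        return {}
    return {str(k): str(v) for k, v in mapping.items()}
```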
---
## 5. Recovery (operator playbook)
| Symptom | Likely class | Recovery steps |
|---------|--------------|----------------|
| Exit **1** / “No Acroname modules” | F-ENV | Install the power extra (**`pip install -e ".[power]"`**), fix USB/cable, check permissions. |
| Exit **2** / RRH_BINDING_HELP | F-CFG | Add **`[fabric.rrh.<id>]`** with **`acroname_port`**, or run **`python3 -m fiwicontrol.fabric build`**. |
| Exit **3** / fingerprint mismatch | F-FAB-STALE | Re-run **`fabric build`** / **`fabric_realize --json`** on the **correct** host; fix cabling; use **`--no-strict`** only with operational approval. |
| Exit **4** / discovery timeout | F-FAB-DISC | Increase **`--realize-discovery-timeout`**; power-cycle hub; check **`brainstem`** / USB stability. |
| **`STALE` from `status`** | F-FAB-STALE | Same as exit **3**; treat harness start as **unsafe** until **READY** or explicit override (e.g. **`--strict-fabric-ready`** on harness). |
| SSH hangs / failures | F-SSH | Verify **`FIWI_REMOTE_IP`**, keys, **`FIWI_SSH_CONFIG`**; see **`docs/install.md`**. |
| INI verify failures | F-INV | Run **`python3 -m fiwicontrol.power --verify-inventory`**; align **`acroname`** / **`monsoon`** tokens with **`--discovery-json`**. |
---
## 6. Exit codes — `fabric_realize.py`
Aligned with **`fiwicontrol.fabric.fdir.FabricExitCode`**; **`fabric_realize.py --help`** prints the same table in its epilog.

| Code | Constant | Meaning |
|------|----------|---------|
| **0** | `SUCCESS` | Success. |
| **1** | `ENVIRONMENT_OR_DISCOVERY` | Missing power extra, discovery exception, no modules. |
| **2** | `CONFIGURATION` | INI/JSON/validation, strict-ini without `--force`, missing INI, no RRH rows. |
| **3** | `FABRIC_STALE` | **`--realize`**, strict fingerprint mismatch. |
| **4** | `FABRIC_DISCOVERY_FAULT` | **`--realize`**, discovery timeout or wrapped discovery error. |
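
For scripts that wrap the CLI, the codes can be mirrored as an enum and mapped back to the fault classes of §2. The constant names and values below come from the table; the use of `IntEnum`, the `FAULT_CLASS` mapping, and `describe_exit` are assumptions for illustration, and the authoritative definition remains `fiwicontrol.fabric.fdir.FabricExitCode`.

```python
from enum import IntEnum


class FabricExitCode(IntEnum):
    """Sketch of the exit-code table; the real enum may differ in base type."""
    SUCCESS = 0
    ENVIRONMENT_OR_DISCOVERY = 1
    CONFIGURATION = 2
    FABRIC_STALE = 3
    FABRIC_DISCOVERY_FAULT = 4


# Fault class most directly signalled by each code, following the playbook in section 5.
FAULT_CLASS = {
    FabricExitCode.ENVIRONMENT_OR_DISCOVERY: "F-ENV",
    FabricExitCode.CONFIGURATION: "F-CFG",
    FabricExitCode.FABRIC_STALE: "F-FAB-STALE",
    FabricExitCode.FABRIC_DISCOVERY_FAULT: "F-FAB-DISC",
}


def describe_exit(returncode: int) -> str:
    """Hypothetical helper: turn a process return code into a short label."""
    try:
        code = FabricExitCode(returncode)
    except ValueError:
        return f"exit {returncode}: unknown"
    if code is FabricExitCode.SUCCESS:
        return "exit 0: SUCCESS"
    return f"exit {int(code)}: {code.name} ({FAULT_CLASS[code]})"
```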
---
## 7. Logging and monitoring
- **Filter:** `grep` / log pipelines for **`[FiWi-FDIR]`**.
- **Levels:** **WARNING** for stale fabric and non-strict continuation; **INFO** for invalid/unknown binding status; **DEBUG** for READY binding checks.
- **Recommendation:** In production log aggregation, alert on **STALE** binding status and on **`--realize`** failures (exit **3** or **4**), and key each alert to a **campaign ID** or **`fabric_id`**.
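
A prefix-based filter is enough to route FDIR records to a dedicated handler. The sketch below uses only the standard `logging` module; `FdirOnlyFilter` and `log_fdir_sketch` are hypothetical names, and the real `log_fdir` signature and message layout are not given in this document.

```python
import logging

FDIR_PREFIX = "[FiWi-FDIR]"


class FdirOnlyFilter(logging.Filter):
    """Pass only records whose rendered message carries the FDIR prefix."""

    def filter(self, record: logging.LogRecord) -> bool:
        return FDIR_PREFIX in record.getMessage()


def log_fdir_sketch(
    logger: logging.Logger, level: int, event: str, **fields: object
) -> None:
    """Hypothetical stand-in for log_fdir: prefix, event, sorted key=value fields."""
    detail = " ".join(f"{k}={v}" for k, v in sorted(fields.items()))
    logger.log(level, "%s %s %s", FDIR_PREFIX, event, detail)


# Usage: attach the filter to a handler that feeds the alerting pipeline.
# handler = logging.StreamHandler()
# handler.addFilter(FdirOnlyFilter())
# log_fdir_sketch(logging.getLogger("fabric"), logging.WARNING,
#                 "STALE", fabric_id="bench-7")  # fabric_id value is made up
```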
---
## 8. Out of scope
- **Datapath** packet loss, L4S marking, or WiFi MAC recovery — see **`html/Fi-Wi-L4S.html`**.
- **Automatic power cycling** in PCIe harness — not enabled in default **`--dry-run`** skeleton; see harness **DESIGN_GAPS**.
- **Medical / life-safety** FDIR — explicitly disclaimed in **`README.md`** and **`docs/install.md`**.
---
## 9. Maintenance
When adding CLIs or changing exit codes:
1. Update **`FabricExitCode`** in **`src/fiwicontrol/fabric/fdir.py`**.
2. Update **`fabric_realize.py`** epilog and this document **§6**.
3. Add or adjust **pytest** coverage for new fault paths where practical.