# Fault detection, isolation, and recovery (FDIR)

This document defines how **FiWiControl** detects faults, **isolates** their impact (fail-safe defaults, timeouts, bounded inputs), and how operators **recover**. It complements **`docs/architecture.md`** and the **`[FiWi-FDIR]`** log prefix implemented in **`fiwicontrol.fabric.fdir`**.

**Scope:** Lab automation, fabric JSON/INI workflows, USB discovery, and related CLIs. It does **not** define network datapath FDIR (that lives in the Fi‑Wi system design; see **`html/Fi-Wi-L4S.html`**).

---

## 1. Principles

1. **Detect early:** Compare **live** USB topology to **saved** `discovery_fingerprint` before trusting power or harness steps.
2. **Fail closed for strict paths:** `Fabric.realize(strict=True)` and `fabric_realize.py --realize` (default strict) **abort** on fingerprint mismatch.
3. **Bound resource use:** Patch-panel JSON loads are **size- and entry-capped** (`fiwicontrol.fabric.patch_panel_json`).
4. **Observable:** Use exit codes, stderr lines prefixed with **`[FiWi-FDIR]`**, and structured log records for aggregation.
5. **No silent hardware actions:** Harnesses use **`--dry-run`** until power paths are explicitly wired; see harness **DESIGN_GAPS** in script docstrings.

---

## 2. Fault taxonomy

| Class | Meaning | Typical cause |
|-------|---------|----------------|
| **F-ENV** | Environment / dependency | Missing `[power]` / `brainstem`, USB permissions, empty Acroname enumeration. |
| **F-CFG** | Configuration | Bad INI/JSON, schema validation, missing `[fabric.rrh.*]`, strict INI verify failure without `--force`. |
| **F-FAB-STALE** | Fabric fingerprint mismatch | Wrong hub, cable, or JSON from another bench. |
| **F-FAB-DISC** | Discovery fault / timeout | Stuck USB enumeration, driver issue, **`--realize-discovery-timeout`** exceeded. |
| **F-SSH** | Remote execution | Unreachable host, auth failure, timeout; surfaced by `ssh_node` / `rexec` (see **`docs/node-control-asyncio-design.md`**). |
| **F-INV** | Inventory vs reality | `verify-inventory` / `--strict-ini` mismatches (Acroname / Monsoon counts). |

---

## 3. Detection mechanisms

### 3.1 `Fabric.binding_cache_status(path)`

| Status | Detection |
|--------|-----------|
| **MISSING** | JSON path does not exist. |
| **INVALID** | Unreadable or invalid JSON / schema validation failure. |
| **UNKNOWN** | Live fingerprint unavailable (e.g. discovery not possible). |
| **STALE** | Live fingerprint ≠ on-disk `discovery_fingerprint`. |
| **READY** | Match. |

Logs (via **`log_fdir`**) record **INVALID**, **UNKNOWN**, and **STALE** at **INFO**/**WARNING** as appropriate; **READY** at **DEBUG**.

### 3.2 `Fabric.realize(strict=…)`

- Runs **live** Acroname discovery (optional **`discovery_timeout`**).
- **strict=True:** raises **`ValueError`** on fingerprint mismatch (preceded by **`[FiWi-FDIR]`** WARNING log).
- **strict=False:** continues; emits **`[FiWi-FDIR]`** WARNING with **expected** and **live** fingerprints (audit trail).

### 3.3 `python3 -m fiwicontrol.fabric status`

Uses binding cache semantics; exit **0** only when **READY** (see **`docs/fabric-builder.md`**).

### 3.4 `scripts/system/fabric_realize.py`

- **Compose** path: discovery once; failures return **F-ENV**/**F-CFG** exit codes (below).
- **`--realize`:** second discovery pass inside **`Fabric.realize`** with **`--realize-discovery-timeout`** (default **120** s).

---

## 4. Isolation

| Mechanism | Effect |
|-----------|--------|
| **Strict realize** | Prevents proceeding with a **stale** fabric definition when strict mode is on. |
| **Discovery timeout** | Cancels hung enumeration during **`Fabric.realize`** when a timeout is set (CLI: **`--realize-discovery-timeout`**). |
| **Patch panel JSON limits** | Ignores oversized or over-count **`bdf_to_patch`** maps (see module docstring in **`patch_panel_json.py`**). |
| **Concentrator report** | Human report isolates concentrator probe failures: snapshot errors become a **parenthesized** line instead of aborting the whole fabric summary. |
| **pytest remote tests** | **`FIWI_RUN_REMOTE_TESTS`** gates SSH integration so CI does not hit live rigs by default. |

---

## 5. Recovery (operator playbook)

| Symptom | Likely class | Recovery steps |
|---------|--------------|----------------|
| Exit **1** / “No Acroname modules” | F-ENV | Install **`pip install -e ".[power]"`**, fix USB/cable, check permissions. |
| Exit **2** / RRH_BINDING_HELP | F-CFG | Add **`[fabric.rrh.]`** with **`acroname_port`**, or run **`python3 -m fiwicontrol.fabric build`**. |
| Exit **3** / fingerprint mismatch | F-FAB-STALE | Re-run **`fabric build`** / **`fabric_realize --json`** on the **correct** host; fix cabling; use **`--no-strict`** only with operational approval. |
| Exit **4** / discovery timeout | F-FAB-DISC | Increase **`--realize-discovery-timeout`**; power-cycle hub; check **`brainstem`** / USB stability. |
| **`STALE` from `status`** | F-FAB-STALE | Same as exit **3**; treat harness start as **unsafe** until **READY** or explicit override (e.g. **`--strict-fabric-ready`** on harness). |
| SSH hangs / failures | F-SSH | Verify **`FIWI_REMOTE_IP`**, keys, **`FIWI_SSH_CONFIG`**; see **`docs/install.md`**. |
| INI verify failures | F-INV | Run **`python3 -m fiwicontrol.power --verify-inventory`**; align **`acroname`** / **`monsoon`** tokens with **`--discovery-json`**. |

---

## 6. Exit codes: `fabric_realize.py`

Aligned with **`fiwicontrol.fabric.fdir.FabricExitCode`**:

| Code | Constant | Meaning |
|------|----------|---------|
| **0** | `SUCCESS` | Success. |
| **1** | `ENVIRONMENT_OR_DISCOVERY` | Missing power extra, discovery exception, no modules. |
| **2** | `CONFIGURATION` | INI/JSON/validation, strict-ini without `--force`, missing INI, no RRH rows. |
| **3** | `FABRIC_STALE` | **`--realize`**, strict fingerprint mismatch. |
| **4** | `FABRIC_DISCOVERY_FAULT` | **`--realize`**, discovery timeout or wrapped discovery error. |

Use **`fabric_realize.py --help`** for the same table (epilog).

---

## 7. Logging and monitoring

- **Filter:** `grep` / log pipelines for **`[FiWi-FDIR]`**.
- **Levels:** **WARNING** for stale fabric and non-strict continuation; **INFO** for invalid/unknown binding status; **DEBUG** for READY binding checks.
- **Recommendation:** In production log aggregation, alert on **STALE** + **`--realize`** failures (exit **3**/**4**) tied to **campaign IDs** or **`fabric_id`**.

---

## 8. Out of scope

- **Datapath** packet loss, L4S marking, or Wi‑Fi MAC recovery; see **`html/Fi-Wi-L4S.html`**.
- **Automatic power cycling** in PCIe harness: not enabled in default **`--dry-run`** skeleton; see harness **DESIGN_GAPS**.
- **Medical / life-safety** FDIR: explicitly disclaimed in **`README.md`** and **`docs/install.md`**.

---

## 9. Maintenance

When adding CLIs or changing exit codes:

1. Update **`FabricExitCode`** in **`src/fiwicontrol/fabric/fdir.py`**.
2. Update **`fabric_realize.py`** epilog and this document **§6**.
3. Add or adjust **pytest** coverage for new fault paths where practical.
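
The maintenance steps above can be backed by a small consistency check. What follows is a minimal sketch, not the project's implementation: `FabricExitCode` here is a stand-in `IntEnum` that mirrors the §6 table (the real class lives in **`src/fiwicontrol/fabric/fdir.py`** and may differ), and `DOCUMENTED_CODES` and `check_exit_codes_match_docs` are hypothetical names introduced only for this example.

```python
from enum import IntEnum


class FabricExitCode(IntEnum):
    """Stand-in mirroring the section 6 table (assumption, not the real class)."""

    SUCCESS = 0
    ENVIRONMENT_OR_DISCOVERY = 1
    CONFIGURATION = 2
    FABRIC_STALE = 3
    FABRIC_DISCOVERY_FAULT = 4


# Hypothetical copy of the documented table. A real test could parse the
# Markdown file or the CLI epilog instead of hard-coding the mapping.
DOCUMENTED_CODES = {
    0: "SUCCESS",
    1: "ENVIRONMENT_OR_DISCOVERY",
    2: "CONFIGURATION",
    3: "FABRIC_STALE",
    4: "FABRIC_DISCOVERY_FAULT",
}


def check_exit_codes_match_docs(enum_cls, documented):
    """Return a list of human-readable mismatches (empty when aligned)."""
    problems = []
    actual = {member.value: member.name for member in enum_cls}
    # Codes the docs promise but the enum lacks or names differently.
    for code, name in sorted(documented.items()):
        if code not in actual:
            problems.append(f"code {code} ({name}) documented but not defined")
        elif actual[code] != name:
            problems.append(
                f"code {code}: docs say {name}, enum says {actual[code]}"
            )
    # Codes the enum defines but the docs never mention.
    for code, name in sorted(actual.items()):
        if code not in documented:
            problems.append(f"code {code} ({name}) defined but not documented")
    return problems


if __name__ == "__main__":
    for line in check_exit_codes_match_docs(FabricExitCode, DOCUMENTED_CODES):
        print(line)
```

Wrapped in a pytest function that asserts the returned list is empty, a check like this would fail as soon as the enum and the documented table drift apart, which is the situation steps 1 and 2 are meant to prevent.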