# Fault detection, isolation, and recovery (FDIR)

This document defines how **FiWiControl** detects faults and **isolates** their impact (fail-safe defaults, timeouts, bounded inputs), and how operators **recover**. It complements **`docs/architecture.md`** and the **`[FiWi-FDIR]`** log prefix implemented in **`fiwicontrol.fabric.fdir`**.

**Scope:** Lab automation, fabric JSON/INI workflows, USB discovery, and related CLIs. It does **not** define network datapath FDIR, which lives in the Fi‑Wi system design (see **`html/Fi-Wi-L4S.html`**).

---

## 1. Principles

1. **Detect early:** Compare the **live** USB topology to the **saved** `discovery_fingerprint` before trusting power or harness steps.
2. **Fail closed for strict paths:** `Fabric.realize(strict=True)` and `fabric_realize.py --realize` (strict by default) **abort** on fingerprint mismatch.
3. **Bound resource use:** Patch-panel JSON loads are **size- and entry-capped** (`fiwicontrol.fabric.patch_panel_json`).
4. **Observable:** Use exit codes, stderr lines prefixed with **`[FiWi-FDIR]`**, and structured log records for aggregation.
5. **No silent hardware actions:** Harnesses use **`--dry-run`** until power paths are explicitly wired; see harness **DESIGN_GAPS** in script docstrings.

---

## 2. Fault taxonomy

| Class | Meaning | Typical cause |
|-------|---------|----------------|
| **F-ENV** | Environment / dependency | Missing `[power]` / `brainstem`, USB permissions, empty Acroname enumeration. |
| **F-CFG** | Configuration | Bad INI/JSON, schema validation, missing `[fabric.rrh.*]`, strict INI verify failure without `--force`. |
| **F-FAB-STALE** | Fabric fingerprint mismatch | Wrong hub, cable, or JSON from another bench. |
| **F-FAB-DISC** | Discovery fault / timeout | Stuck USB enumeration, driver issue, **`--realize-discovery-timeout`** exceeded. |
| **F-SSH** | Remote execution | Unreachable host, auth failure, or timeout; surfaced by `ssh_node` / `rexec` (see **`docs/node-control-asyncio-design.md`**). |
| **F-INV** | Inventory vs reality | `verify-inventory` / `--strict-ini` mismatches (Acroname / Monsoon counts). |

---

## 3. Detection mechanisms

### 3.1 `Fabric.binding_cache_status(path)`

| Status | Detection |
|--------|-----------|
| **MISSING** | JSON path does not exist. |
| **INVALID** | Unreadable or invalid JSON / schema validation failure. |
| **UNKNOWN** | Live fingerprint unavailable (e.g. discovery not possible). |
| **STALE** | Live fingerprint ≠ on-disk `discovery_fingerprint`. |
| **READY** | Live fingerprint matches on-disk `discovery_fingerprint`. |

Logs (via **`log_fdir`**) record **INVALID** and **UNKNOWN** at **INFO**, **STALE** at **WARNING**, and **READY** at **DEBUG** (see §7).

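
The status table above can be read as a short decision chain. The following is a minimal sketch of that chain, not the actual `Fabric.binding_cache_status` implementation: `BindingStatus` and `classify_binding` are hypothetical names, the check order is assumed to follow the table rows, and schema validation is reduced to checking that the `discovery_fingerprint` key is present.

```python
import json
from enum import Enum
from pathlib import Path
from typing import Optional


class BindingStatus(Enum):
    """Hypothetical mirror of the statuses listed in the table."""
    MISSING = "MISSING"
    INVALID = "INVALID"
    UNKNOWN = "UNKNOWN"
    STALE = "STALE"
    READY = "READY"


def classify_binding(path: Path, live_fingerprint: Optional[str]) -> BindingStatus:
    """Classify a saved fabric JSON against a live fingerprint (illustrative only)."""
    if not path.exists():
        return BindingStatus.MISSING
    try:
        doc = json.loads(path.read_text(encoding="utf-8"))
        saved = doc["discovery_fingerprint"]
    except (OSError, ValueError, KeyError, TypeError):
        # Unreadable file, malformed JSON, or a missing key all count as INVALID.
        return BindingStatus.INVALID
    if live_fingerprint is None:
        # Assumption: None stands for "discovery not possible".
        return BindingStatus.UNKNOWN
    if live_fingerprint != saved:
        return BindingStatus.STALE
    return BindingStatus.READY
```

A caller that wants the fail-closed behaviour described in §1 would treat every result other than `READY` as a reason not to proceed.
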
### 3.2 `Fabric.realize(strict=…)`

- Runs **live** Acroname discovery (optional **`discovery_timeout`**).
- **strict=True:** raises **`ValueError`** on fingerprint mismatch (preceded by **`[FiWi-FDIR]`** WARNING log).
- **strict=False:** continues; emits **`[FiWi-FDIR]`** WARNING with **expected** and **live** fingerprints (audit trail).

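
The strict and non-strict branches can be sketched as a single comparison helper. This is an illustration under stated assumptions, not the body of `Fabric.realize`: `check_fingerprint`, the logger name, and the message wording are hypothetical; only the `[FiWi-FDIR]` prefix, the WARNING level, and the `ValueError` come from the description above.

```python
import logging

log = logging.getLogger("fiwi.fdir.sketch")  # hypothetical logger name
FDIR_PREFIX = "[FiWi-FDIR]"


def check_fingerprint(expected: str, live: str, strict: bool = True) -> bool:
    """Return True on match; on mismatch log a WARNING, then raise if strict."""
    if expected == live:
        return True
    # Both modes log expected and live values so the mismatch leaves an audit trail.
    log.warning("%s fabric fingerprint mismatch: expected=%s live=%s",
                FDIR_PREFIX, expected, live)
    if strict:
        raise ValueError(
            f"fabric fingerprint mismatch: expected {expected}, live {live}")
    return False
```

Logging before raising means the mismatch is recorded even when a caller catches the `ValueError` and maps it to an exit code.
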
### 3.3 `python3 -m fiwicontrol.fabric status`

Uses binding cache semantics; exit **0** only when **READY** (see **`docs/fabric-builder.md`**).

### 3.4 `scripts/system/fabric_realize.py`

- **Compose** path: runs discovery once; failures return the **F-ENV**/**F-CFG** exit codes (see §6).
- **`--realize`:** runs a second discovery pass inside **`Fabric.realize`**, bounded by **`--realize-discovery-timeout`** (default **120** s).

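
One way to bound a discovery pass is to wrap it in a timeout that cancels the pending work. The sketch below assumes discovery is exposed as a coroutine function and uses `asyncio.wait_for`; how `Fabric.realize` actually enforces its timeout is not specified here, and `discover_with_timeout`, `DiscoveryTimeout`, and the two stand-in discovery functions are hypothetical. A caller could map `DiscoveryTimeout` to exit **4**.

```python
import asyncio


class DiscoveryTimeout(RuntimeError):
    """Hypothetical error for a discovery pass that exceeded its time budget."""


async def discover_with_timeout(discover, timeout_s: float = 120.0):
    """Await discover(), cancelling it if it runs longer than timeout_s seconds."""
    try:
        return await asyncio.wait_for(discover(), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise DiscoveryTimeout(f"discovery exceeded {timeout_s} s") from exc


async def hung_discovery():
    """Stand-in for a stuck USB enumeration."""
    await asyncio.sleep(3600)


async def fast_discovery():
    """Stand-in for a healthy enumeration; the module list is placeholder data."""
    return ["hub-0"]
```
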
---
## 4. Isolation

| Mechanism | Effect |
|-----------|--------|
| **Strict realize** | Prevents proceeding with a **stale** fabric definition when strict mode is on. |
| **Discovery timeout** | Cancels hung enumeration during **`Fabric.realize`** when a timeout is set (CLI: **`--realize-discovery-timeout`**). |
| **Patch panel JSON limits** | Ignores oversized or over-count **`bdf_to_patch`** maps (see module docstring in **`patch_panel_json.py`**). |
| **Concentrator report** | Human report isolates concentrator probe failures: snapshot errors become a **parenthesized** line instead of aborting the whole fabric summary. |
| **pytest remote tests** | **`FIWI_RUN_REMOTE_TESTS`** gates SSH integration so CI does not hit live rigs by default. |

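
The patch-panel limits in the table can be illustrated with a loader that checks the file size before parsing and the entry count after. This is a sketch only: the cap values and the `load_bdf_to_patch` name are hypothetical (the real limits are in the `patch_panel_json.py` docstring), and "ignore" is modelled here as returning an empty map.

```python
import json
import os

MAX_BYTES = 1_000_000  # hypothetical size cap, not the module's real value
MAX_ENTRIES = 4096     # hypothetical entry cap, not the module's real value


def load_bdf_to_patch(path: str, max_bytes: int = MAX_BYTES,
                      max_entries: int = MAX_ENTRIES) -> dict:
    """Return the bdf_to_patch map, or {} when the file or the map exceeds a cap."""
    # Check the on-disk size first so an oversized file is never parsed.
    if os.path.getsize(path) > max_bytes:
        return {}
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    mapping = doc.get("bdf_to_patch", {}) if isinstance(doc, dict) else {}
    if not isinstance(mapping, dict) or len(mapping) > max_entries:
        return {}
    return mapping
```
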
---
## 5. Recovery (operator playbook)

| Symptom | Likely class | Recovery steps |
|---------|--------------|----------------|
| Exit **1** / “No Acroname modules” | F-ENV | Run **`pip install -e ".[power]"`**, fix the USB connection or cable, and check permissions. |
| Exit **2** / RRH_BINDING_HELP | F-CFG | Add **`[fabric.rrh.<id>]`** with **`acroname_port`**, or run **`python3 -m fiwicontrol.fabric build`**. |
| Exit **3** / fingerprint mismatch | F-FAB-STALE | Re-run **`fabric build`** / **`fabric_realize --json`** on the **correct** host; fix cabling; use **`--no-strict`** only with operational approval. |
| Exit **4** / discovery timeout | F-FAB-DISC | Increase **`--realize-discovery-timeout`**; power-cycle the hub; check **`brainstem`** / USB stability. |
| **`STALE` from `status`** | F-FAB-STALE | Same as exit **3**; treat harness start as **unsafe** until **READY** or explicit override (e.g. **`--strict-fabric-ready`** on harness). |
| SSH hangs / failures | F-SSH | Verify **`FIWI_REMOTE_IP`**, keys, and **`FIWI_SSH_CONFIG`**; see **`docs/install.md`**. |
| INI verify failures | F-INV | Run **`python3 -m fiwicontrol.power --verify-inventory`**; align **`acroname`** / **`monsoon`** tokens with **`--discovery-json`**. |

---

## 6. Exit codes: `fabric_realize.py`

Aligned with **`fiwicontrol.fabric.fdir.FabricExitCode`**:

| Code | Constant | Meaning |
|------|----------|---------|
| **0** | `SUCCESS` | Success. |
| **1** | `ENVIRONMENT_OR_DISCOVERY` | Missing power extra, discovery exception, no modules. |
| **2** | `CONFIGURATION` | Bad INI/JSON or validation failure, `--strict-ini` without `--force`, missing INI, no RRH rows. |
| **3** | `FABRIC_STALE` | **`--realize`** in strict mode: fingerprint mismatch. |
| **4** | `FABRIC_DISCOVERY_FAULT` | **`--realize`**: discovery timeout or wrapped discovery error. |

Use **`fabric_realize.py --help`** for the same table (epilog).

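
For wrapper scripts, the exit-code table and the fault classes from §5 can be combined into a small lookup. The enum below is a local mirror written for illustration and may drift from the real `FabricExitCode`, so production code should import that instead. `FAULT_CLASS` and `describe_exit` are hypothetical helpers; a wrapper might pass them the `returncode` of a `subprocess.run` call.

```python
from enum import IntEnum


class FabricExitCode(IntEnum):
    """Local mirror of the table; the real enum lives in fiwicontrol.fabric.fdir."""
    SUCCESS = 0
    ENVIRONMENT_OR_DISCOVERY = 1
    CONFIGURATION = 2
    FABRIC_STALE = 3
    FABRIC_DISCOVERY_FAULT = 4


# Exit code to fault class, as paired in the recovery playbook (§5).
FAULT_CLASS = {
    FabricExitCode.ENVIRONMENT_OR_DISCOVERY: "F-ENV",
    FabricExitCode.CONFIGURATION: "F-CFG",
    FabricExitCode.FABRIC_STALE: "F-FAB-STALE",
    FabricExitCode.FABRIC_DISCOVERY_FAULT: "F-FAB-DISC",
}


def describe_exit(returncode: int) -> str:
    """Translate a fabric_realize.py return code into a short label."""
    try:
        code = FabricExitCode(returncode)
    except ValueError:
        return f"unknown exit code {returncode}"
    if code is FabricExitCode.SUCCESS:
        return "success"
    return f"{code.name} ({FAULT_CLASS[code]})"
```
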
---
## 7. Logging and monitoring

- **Filter:** `grep` / log pipelines for **`[FiWi-FDIR]`**.
- **Levels:** **WARNING** for stale fabric and non-strict continuation; **INFO** for invalid/unknown binding status; **DEBUG** for READY binding checks.
- **Recommendation:** In production log aggregation, alert on **STALE** + **`--realize`** failures (exit **3**/**4**) tied to **campaign IDs** or **`fabric_id`**.

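
Where a full log pipeline is not available, the filter and level rules above can be approximated with a line scanner. This sketch assumes each record is one text line that contains the level name as a separate word; the actual record format depends on the logging configuration, and `fdir_alerts` is a hypothetical helper.

```python
import re

FDIR_PREFIX = "[FiWi-FDIR]"
# Assumption: the level name appears as a standalone word somewhere in the line.
_LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")
_ALERT_LEVELS = {"WARNING", "ERROR", "CRITICAL"}


def fdir_alerts(lines):
    """Yield (level, line) for [FiWi-FDIR] records at WARNING or above."""
    for line in lines:
        if FDIR_PREFIX not in line:
            continue  # not an FDIR record
        match = _LEVEL_RE.search(line)
        level = match.group(1) if match else "UNKNOWN"
        if level in _ALERT_LEVELS:
            yield level, line.rstrip("\n")
```
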
---
## 8. Out of scope

- **Datapath** packet loss, L4S marking, or Wi‑Fi MAC recovery: see **`html/Fi-Wi-L4S.html`**.
- **Automatic power cycling** in the PCIe harness: not enabled in the default **`--dry-run`** skeleton; see harness **DESIGN_GAPS**.
- **Medical / life-safety** FDIR: explicitly disclaimed in **`README.md`** and **`docs/install.md`**.

---

## 9. Maintenance

When adding CLIs or changing exit codes:

1. Update **`FabricExitCode`** in **`src/fiwicontrol/fabric/fdir.py`**.
2. Update **`fabric_realize.py`** epilog and this document **§6**.
3. Add or adjust **pytest** coverage for new fault paths where practical.