Fault detection, isolation, and recovery (FDIR)
This document defines how FiWiControl detects faults and isolates their impact (fail-safe defaults, timeouts, bounded inputs), and how operators recover. It complements docs/architecture.md and the [FiWi-FDIR] log prefix implemented in fiwicontrol.fabric.fdir.
Scope: Lab automation, fabric JSON/INI workflows, USB discovery, and related CLIs. It does not define network datapath FDIR (that lives in the Fi‑Wi system design — see html/Fi-Wi-L4S.html).
1. Principles
- Detect early: Compare live USB topology to the saved discovery_fingerprint before trusting power or harness steps.
- Fail closed for strict paths: Fabric.realize(strict=True) and fabric_realize.py --realize (default strict) abort on fingerprint mismatch.
- Bound resource use: Patch-panel JSON loads are size- and entry-capped (fiwicontrol.fabric.patch_panel_json).
- Observable: Use exit codes, stderr lines prefixed with [FiWi-FDIR], and structured log records for aggregation.
- No silent hardware actions: Harnesses use --dry-run until power paths are explicitly wired; see harness DESIGN_GAPS in script docstrings.
2. Fault taxonomy
| Class | Meaning | Typical cause |
|---|---|---|
| F-ENV | Environment / dependency | Missing [power] / brainstem, USB permissions, empty Acroname enumeration. |
| F-CFG | Configuration | Bad INI/JSON, schema validation, missing [fabric.rrh.*], strict INI verify failure without --force. |
| F-FAB-STALE | Fabric fingerprint mismatch | Wrong hub, cable, or JSON from another bench. |
| F-FAB-DISC | Discovery fault / timeout | Stuck USB enumeration, driver issue, --realize-discovery-timeout exceeded. |
| F-SSH | Remote execution | Unreachable host, auth failure, timeout — surfaced by ssh_node / rexec (see docs/node-control-asyncio-design.md). |
| F-INV | Inventory vs reality | verify-inventory / --strict-ini mismatches (Acroname / Monsoon counts). |
3. Detection mechanisms
3.1 Fabric.binding_cache_status(path)
| Status | Detection |
|---|---|
| MISSING | JSON path does not exist. |
| INVALID | Unreadable or invalid JSON / schema validation failure. |
| UNKNOWN | Live fingerprint unavailable (e.g. discovery not possible). |
| STALE | Live fingerprint ≠ on-disk discovery_fingerprint. |
| READY | Match. |
Logs (via log_fdir) record INVALID, UNKNOWN, and STALE at INFO/WARNING as appropriate; READY at DEBUG.
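The status table above reads naturally as an ordered decision chain. The following is a minimal sketch of that ordering, not the actual Fabric.binding_cache_status implementation: BindingStatus, classify_binding_cache, and the live_fingerprint parameter are hypothetical names introduced for illustration, and schema validation is reduced to checking that the discovery_fingerprint key is present.

```python
import json
import os
from enum import Enum
from typing import Optional


class BindingStatus(Enum):
    """Hypothetical mirror of the five statuses in the table above."""
    MISSING = "MISSING"
    INVALID = "INVALID"
    UNKNOWN = "UNKNOWN"
    STALE = "STALE"
    READY = "READY"


def classify_binding_cache(path: str, live_fingerprint: Optional[str]) -> BindingStatus:
    """Classify a fabric JSON file against a live fingerprint (illustrative only)."""
    if not os.path.exists(path):
        return BindingStatus.MISSING
    try:
        with open(path, "r", encoding="utf-8") as fh:
            doc = json.load(fh)
        # Stand-in for real schema validation: require the fingerprint key.
        expected = doc["discovery_fingerprint"]
    except (OSError, ValueError, KeyError, TypeError):
        return BindingStatus.INVALID
    if live_fingerprint is None:
        # Discovery was not possible, so the file cannot be confirmed or refuted.
        return BindingStatus.UNKNOWN
    if live_fingerprint != expected:
        return BindingStatus.STALE
    return BindingStatus.READY
```

Checking the file before the live fingerprint means a missing or corrupt JSON is reported as such even on a host where discovery is unavailable.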
3.2 Fabric.realize(strict=…)
- Runs live Acroname discovery (optional discovery_timeout).
- strict=True: raises ValueError on fingerprint mismatch (preceded by a [FiWi-FDIR] WARNING log).
- strict=False: continues; emits a [FiWi-FDIR] WARNING with expected and live fingerprints (audit trail).
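The strict and non-strict behaviours differ only in what happens after the warning is logged. A minimal sketch of that branch follows; check_fingerprint and the logger name are assumptions made for illustration and are not taken from fiwicontrol.fabric.fdir.

```python
import logging

# Logger name is an assumption; the real module may use a different one.
log = logging.getLogger("fiwicontrol.fabric.fdir")
FDIR_PREFIX = "[FiWi-FDIR]"


def check_fingerprint(expected: str, live: str, strict: bool = True) -> bool:
    """Return True on match; on mismatch warn, then raise (strict) or continue."""
    if expected == live:
        return True
    # The warning is emitted in both modes so the mismatch is always auditable.
    log.warning(
        "%s fingerprint mismatch: expected=%s live=%s strict=%s",
        FDIR_PREFIX, expected, live, strict,
    )
    if strict:
        raise ValueError(
            f"fabric fingerprint mismatch: expected {expected}, live {live}"
        )
    return False
```

Logging before raising keeps the expected and live values in the log stream even when a caller catches the ValueError and reports only an exit code.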
3.3 python3 -m fiwicontrol.fabric status
Uses binding cache semantics; exit 0 only when READY (see docs/fabric-builder.md).
3.4 scripts/system/fabric_realize.py
- Compose path: discovery once; failures return F-ENV/F-CFG exit codes (below).
- --realize: second discovery pass inside Fabric.realize with --realize-discovery-timeout (default 120 s).
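One way to bound a discovery pass is to wrap it in asyncio.wait_for. The sketch below assumes discovery is exposed as a coroutine, which this document does not state; discover_with_timeout and DiscoveryTimeout are hypothetical names, and only the 120 s default comes from the text above.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class DiscoveryTimeout(RuntimeError):
    """Hypothetical error for a discovery pass that exceeded its time budget."""


async def discover_with_timeout(
    discover: Callable[[], Awaitable[dict]],
    timeout_s: Optional[float] = 120.0,
) -> dict:
    """Await one discovery pass; cancel it if timeout_s elapses (None disables)."""
    try:
        return await asyncio.wait_for(discover(), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        # A caller could map this to the discovery-fault exit code.
        raise DiscoveryTimeout(f"discovery exceeded {timeout_s} s") from exc
```

Cancellation only takes effect at an await point, so a blocking USB call running in a worker thread would not be interrupted by this pattern and would need its own timeout handling.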
4. Isolation
| Mechanism | Effect |
|---|---|
| Strict realize | Prevents proceeding with a stale fabric definition when strict mode is on. |
| Discovery timeout | Cancels hung enumeration during Fabric.realize when a timeout is set (CLI: --realize-discovery-timeout). |
| Patch panel JSON limits | Ignores oversized or over-count bdf_to_patch maps (see module docstring in patch_panel_json.py). |
| Concentrator report | Human report isolates concentrator probe failures: snapshot errors become a parenthesized line instead of aborting the whole fabric summary. |
| pytest remote tests | FIWI_RUN_REMOTE_TESTS gates SSH integration so CI does not hit live rigs by default. |
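The patch-panel limits in the table above can be pictured as two checks, one on file size before parsing and one on entry count after. This is a sketch under assumptions: load_bdf_to_patch and both cap values are hypothetical, and the real limits are documented in the patch_panel_json.py module docstring.

```python
import json
import os

MAX_BYTES = 1_000_000  # illustrative cap, not the module's real limit
MAX_ENTRIES = 4096     # illustrative cap, not the module's real limit


def load_bdf_to_patch(
    path: str,
    max_bytes: int = MAX_BYTES,
    max_entries: int = MAX_ENTRIES,
) -> dict:
    """Return the bdf_to_patch map, or {} when the file or map exceeds a cap."""
    # Check the size first so an oversized file is never parsed at all.
    if os.path.getsize(path) > max_bytes:
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        doc = json.load(fh)
    mapping = doc.get("bdf_to_patch", {}) if isinstance(doc, dict) else {}
    if not isinstance(mapping, dict) or len(mapping) > max_entries:
        return {}
    return mapping
```

Returning an empty map rather than raising matches the "ignores" wording in the table: an oversized patch panel degrades the report instead of aborting it.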
5. Recovery (operator playbook)
| Symptom | Likely class | Recovery steps |
|---|---|---|
| Exit 1 / “No Acroname modules” | F-ENV | Install the power extra (pip install -e ".[power]"), fix USB/cable, check permissions. |
| Exit 2 / RRH_BINDING_HELP | F-CFG | Add [fabric.rrh.<id>] with acroname_port, or run python3 -m fiwicontrol.fabric build. |
| Exit 3 / fingerprint mismatch | F-FAB-STALE | Re-run fabric build / fabric_realize --json on the correct host; fix cabling; use --no-strict only with operational approval. |
| Exit 4 / discovery timeout | F-FAB-DISC | Increase --realize-discovery-timeout; power-cycle hub; check brainstem / USB stability. |
| STALE from status | F-FAB-STALE | Same as exit 3; treat harness start as unsafe until READY or explicit override (e.g. --strict-fabric-ready on harness). |
| SSH hangs / failures | F-SSH | Verify FIWI_REMOTE_IP, keys, FIWI_SSH_CONFIG; see docs/install.md. |
| INI verify failures | F-INV | Run python3 -m fiwicontrol.power --verify-inventory; align acroname / monsoon tokens with --discovery-json. |
6. Exit codes — fabric_realize.py
Aligned with fiwicontrol.fabric.fdir.FabricExitCode:
| Code | Constant | Meaning |
|---|---|---|
| 0 | SUCCESS | Success. |
| 1 | ENVIRONMENT_OR_DISCOVERY | Missing power extra, discovery exception, no modules. |
| 2 | CONFIGURATION | INI/JSON/validation, strict-ini without --force, missing INI, no RRH rows. |
| 3 | FABRIC_STALE | --realize, strict fingerprint mismatch. |
| 4 | FABRIC_DISCOVERY_FAULT | --realize, discovery timeout or wrapped discovery error. |
Use fabric_realize.py --help for the same table (epilog).
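For wrapper scripts that triage a fabric_realize.py run, the table can be mirrored as an enum. The sketch below copies the codes and constant names from the table and the fault classes from the recovery playbook in section 5; the use of IntEnum, the FAULT_CLASS mapping, and describe_exit are assumptions for illustration, and the authoritative definition remains fiwicontrol.fabric.fdir.FabricExitCode.

```python
from enum import IntEnum


class FabricExitCode(IntEnum):
    """Illustrative mirror of the table above; the real enum lives in fdir.py."""
    SUCCESS = 0
    ENVIRONMENT_OR_DISCOVERY = 1
    CONFIGURATION = 2
    FABRIC_STALE = 3
    FABRIC_DISCOVERY_FAULT = 4


# Fault classes from section 2, keyed by exit code as paired in section 5.
FAULT_CLASS = {
    FabricExitCode.ENVIRONMENT_OR_DISCOVERY: "F-ENV",
    FabricExitCode.CONFIGURATION: "F-CFG",
    FabricExitCode.FABRIC_STALE: "F-FAB-STALE",
    FabricExitCode.FABRIC_DISCOVERY_FAULT: "F-FAB-DISC",
}


def describe_exit(returncode: int) -> str:
    """Translate a fabric_realize.py return code into a short triage label."""
    try:
        code = FabricExitCode(returncode)
    except ValueError:
        return f"unknown exit code {returncode}"
    if code is FabricExitCode.SUCCESS:
        return "SUCCESS"
    return f"{code.name} ({FAULT_CLASS[code]})"
```

A wrapper would pass subprocess.run(...).returncode to describe_exit and look up the matching row in the section 5 playbook.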
7. Logging and monitoring
- Filter: grep / log pipelines for [FiWi-FDIR].
- Levels: WARNING for stale fabric and non-strict continuation; INFO for invalid/unknown binding status; DEBUG for READY binding checks.
- Recommendation: In production log aggregation, alert on STALE + --realize failures (exit 3/4) tied to campaign IDs or fabric_id.
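Where grep is not enough, the same filter is easy to express in a small script. The sketch below counts [FiWi-FDIR] lines per level; it assumes the level name appears as a whitespace-separated token on each line, which depends on the logging format actually configured, and count_fdir_levels is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable

FDIR_PREFIX = "[FiWi-FDIR]"
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def count_fdir_levels(lines: Iterable[str]) -> Counter:
    """Count [FiWi-FDIR] lines per log level; other lines are skipped."""
    counts: Counter = Counter()
    for line in lines:
        if FDIR_PREFIX not in line:
            continue
        tokens = line.split()
        # Assumption: the level is a standalone token somewhere on the line.
        level = next((lv for lv in LEVELS if lv in tokens), "UNKNOWN")
        counts[level] += 1
    return counts
```

An alert rule could then fire when the WARNING count for a given campaign is non-zero.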
8. Out of scope
- Datapath packet loss, L4S marking, or Wi‑Fi MAC recovery: see html/Fi-Wi-L4S.html.
- Automatic power cycling in PCIe harness: not enabled in default --dry-run skeleton; see harness DESIGN_GAPS.
- Medical / life-safety FDIR: explicitly disclaimed in README.md and docs/install.md.
9. Maintenance
When adding CLIs or changing exit codes:
- Update FabricExitCode in src/fiwicontrol/fabric/fdir.py.
- Update fabric_realize.py epilog and this document §6.
- Add or adjust pytest coverage for new fault paths where practical.