FiWiControl/docs/fdir.md


Fault detection, isolation, and recovery (FDIR)

This document defines how FiWiControl detects faults, how it isolates their impact (fail-safe defaults, timeouts, bounded inputs), and how operators recover. It complements docs/architecture.md and the [FiWi-FDIR] log prefix implemented in fiwicontrol.fabric.fdir.

Scope: Lab automation, fabric JSON/INI workflows, USB discovery, and related CLIs. It does not define network datapath FDIR (that lives in the FiWi system design — see html/Fi-Wi-L4S.html).


1. Principles

  1. Detect early: Compare the live USB topology to the saved discovery_fingerprint before trusting power or harness steps.
  2. Fail closed for strict paths: Fabric.realize(strict=True) and fabric_realize.py --realize (default strict) abort on fingerprint mismatch.
  3. Bound resource use: Patch-panel JSON loads are size- and entry-capped (fiwicontrol.fabric.patch_panel_json).
  4. Observable: Use exit codes, stderr lines prefixed with [FiWi-FDIR], and structured log records for aggregation.
  5. No silent hardware actions: Harnesses use --dry-run until power paths are explicitly wired; see harness DESIGN_GAPS in script docstrings.

2. Fault taxonomy

| Class | Meaning | Typical cause |
|---|---|---|
| F-ENV | Environment / dependency | Missing [power] / brainstem, USB permissions, empty Acroname enumeration. |
| F-CFG | Configuration | Bad INI/JSON, schema validation, missing [fabric.rrh.*], strict INI verify failure without --force. |
| F-FAB-STALE | Fabric fingerprint mismatch | Wrong hub, cable, or JSON from another bench. |
| F-FAB-DISC | Discovery fault / timeout | Stuck USB enumeration, driver issue, --realize-discovery-timeout exceeded. |
| F-SSH | Remote execution | Unreachable host, auth failure, timeout; surfaced by ssh_node / rexec (see docs/node-control-asyncio-design.md). |
| F-INV | Inventory vs reality | verify-inventory / --strict-ini mismatches (Acroname / Monsoon counts). |

3. Detection mechanisms

3.1 Fabric.binding_cache_status(path)

| Status | Detection |
|---|---|
| MISSING | JSON path does not exist. |
| INVALID | Unreadable or invalid JSON / schema validation failure. |
| UNKNOWN | Live fingerprint unavailable (e.g. discovery not possible). |
| STALE | Live fingerprint ≠ on-disk discovery_fingerprint. |
| READY | Match. |

Logs (via log_fdir) record INVALID and UNKNOWN at INFO, STALE at WARNING, and READY at DEBUG.
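
The status table above can be read as an ordered series of checks. The following is a minimal sketch of that ordering, not the actual Fabric.binding_cache_status implementation: the names BindingStatus and classify_binding_cache are hypothetical, the check order is assumed to follow the table rows, and a lookup of discovery_fingerprint stands in for full schema validation.

```python
import json
from enum import Enum
from pathlib import Path
from typing import Optional


class BindingStatus(Enum):
    """Hypothetical mirror of the five statuses in the table."""
    MISSING = "missing"
    INVALID = "invalid"
    UNKNOWN = "unknown"
    STALE = "stale"
    READY = "ready"


def classify_binding_cache(path: Path, live_fingerprint: Optional[str]) -> BindingStatus:
    """Illustrative only: map an on-disk fabric JSON to one status."""
    if not path.exists():
        return BindingStatus.MISSING
    try:
        doc = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: a key lookup stands in for real schema validation.
        saved = doc["discovery_fingerprint"]
    except (OSError, ValueError, KeyError, TypeError):
        return BindingStatus.INVALID
    if live_fingerprint is None:
        # Live discovery was not possible, so no comparison can be made.
        return BindingStatus.UNKNOWN
    if saved != live_fingerprint:
        return BindingStatus.STALE
    return BindingStatus.READY
```

Under these assumptions, a caller that cannot run discovery passes None as the live fingerprint and receives UNKNOWN rather than a false READY.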

3.2 Fabric.realize(strict=…)

  • Runs live Acroname discovery (optional discovery_timeout).
  • strict=True: raises ValueError on fingerprint mismatch (preceded by [FiWi-FDIR] WARNING log).
  • strict=False: continues; emits [FiWi-FDIR] WARNING with expected and live fingerprints (audit trail).
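
The strict and non-strict behaviours differ only in what happens after the warning is logged. A minimal sketch of that branch follows; check_fingerprint and the logger name are hypothetical and do not reflect the real Fabric.realize signature or the real log_fdir helper.

```python
import logging

# Hypothetical logger name; the real module logs through log_fdir.
log = logging.getLogger("fiwi.fdir.sketch")


def check_fingerprint(expected: str, live: str, strict: bool = True) -> bool:
    """Return True on match; on mismatch, warn, then raise only if strict."""
    if expected == live:
        return True
    # The warning is emitted in both modes so there is always an audit trail.
    log.warning(
        "[FiWi-FDIR] fingerprint mismatch: expected=%s live=%s strict=%s",
        expected, live, strict,
    )
    if strict:
        raise ValueError(
            f"fabric fingerprint mismatch: expected {expected}, live {live}"
        )
    return False
```

Logging before raising means the expected and live fingerprints reach the log even when the caller swallows or reformats the exception.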

3.3 python3 -m fiwicontrol.fabric status

Uses binding cache semantics; exit 0 only when READY (see docs/fabric-builder.md).

3.4 scripts/system/fabric_realize.py

  • Compose path: discovery runs once; failures return the F-ENV/F-CFG exit codes 1 and 2 (see §6).
  • --realize: second discovery pass inside Fabric.realize with --realize-discovery-timeout (default 120 s).
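
One way to bound a discovery pass is to wrap it in a timeout. This sketch assumes discovery can be expressed as a coroutine, which may not match how Fabric.realize actually runs it. DiscoveryTimeout and discover_with_timeout are hypothetical names, and the 120 s default only echoes the CLI default mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class DiscoveryTimeout(RuntimeError):
    """Hypothetical error for a discovery pass that exceeded its budget."""


async def discover_with_timeout(
    discover: Callable[[], Awaitable[T]], timeout_s: float = 120.0
) -> T:
    """Run one discovery pass; cancel it and raise if it exceeds timeout_s."""
    try:
        # wait_for cancels the inner task when the timeout expires.
        return await asyncio.wait_for(discover(), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise DiscoveryTimeout(f"discovery exceeded {timeout_s} s") from exc
```

A caller would translate DiscoveryTimeout into exit code 4. If discovery is a blocking call into a vendor library rather than a coroutine, a thread or subprocess boundary would be needed, because a blocked thread cannot be cancelled this way.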

4. Isolation

| Mechanism | Effect |
|---|---|
| Strict realize | Prevents proceeding with a stale fabric definition when strict mode is on. |
| Discovery timeout | Cancels hung enumeration during Fabric.realize when a timeout is set (CLI: --realize-discovery-timeout). |
| Patch panel JSON limits | Ignores oversized or over-count bdf_to_patch maps (see module docstring in patch_panel_json.py). |
| Concentrator report | Human report isolates concentrator probe failures: snapshot errors become a parenthesized line instead of aborting the whole fabric summary. |
| pytest remote tests | FIWI_RUN_REMOTE_TESTS gates SSH integration so CI does not hit live rigs by default. |
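
The patch-panel limits can be illustrated with a bounded loader. This is a sketch under stated assumptions: load_bdf_to_patch and both cap values are hypothetical, and the real limits and behaviour are documented in the patch_panel_json.py module docstring. "Ignoring" an oversized map is modelled here as returning an empty dict.

```python
import json
from pathlib import Path

# Hypothetical caps; the real values live in patch_panel_json.py.
MAX_BYTES = 1_000_000
MAX_ENTRIES = 4096


def load_bdf_to_patch(
    path: Path, max_bytes: int = MAX_BYTES, max_entries: int = MAX_ENTRIES
) -> dict:
    """Return the bdf_to_patch map, or {} if oversized, over-count, or malformed."""
    try:
        # Check the size before reading so a huge file is never loaded.
        if path.stat().st_size > max_bytes:
            return {}
        doc = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    mapping = doc.get("bdf_to_patch") if isinstance(doc, dict) else None
    if not isinstance(mapping, dict) or len(mapping) > max_entries:
        return {}
    return mapping
```

Checking the file size before parsing keeps memory use bounded even when the JSON itself is well formed.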

5. Recovery (operator playbook)

| Symptom | Likely class | Recovery steps |
|---|---|---|
| Exit 1 / “No Acroname modules” | F-ENV | Install the power extra (pip install -e ".[power]"), fix USB/cable, check permissions. |
| Exit 2 / RRH_BINDING_HELP | F-CFG | Add [fabric.rrh.<id>] with acroname_port, or run python3 -m fiwicontrol.fabric build. |
| Exit 3 / fingerprint mismatch | F-FAB-STALE | Re-run fabric build / fabric_realize --json on the correct host; fix cabling; use --no-strict only with operational approval. |
| Exit 4 / discovery timeout | F-FAB-DISC | Increase --realize-discovery-timeout; power-cycle hub; check brainstem / USB stability. |
| STALE from status | F-FAB-STALE | Same as exit 3; treat harness start as unsafe until READY or explicit override (e.g. --strict-fabric-ready on harness). |
| SSH hangs / failures | F-SSH | Verify FIWI_REMOTE_IP, keys, FIWI_SSH_CONFIG; see docs/install.md. |
| INI verify failures | F-INV | Run python3 -m fiwicontrol.power --verify-inventory; align acroname / monsoon tokens with --discovery-json. |

6. Exit codes — fabric_realize.py

Aligned with fiwicontrol.fabric.fdir.FabricExitCode:

| Code | Constant | Meaning |
|---|---|---|
| 0 | SUCCESS | Success. |
| 1 | ENVIRONMENT_OR_DISCOVERY | Missing power extra, discovery exception, no modules. |
| 2 | CONFIGURATION | INI/JSON/validation, strict-ini without --force, missing INI, no RRH rows. |
| 3 | FABRIC_STALE | --realize, strict fingerprint mismatch. |
| 4 | FABRIC_DISCOVERY_FAULT | --realize, discovery timeout or wrapped discovery error. |

Use fabric_realize.py --help for the same table (epilog).
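
For wrapper scripts that need to interpret these codes, the table can be mirrored locally. The enum below copies the names and values listed in this section, but it is only an illustrative stand-in: the authoritative definition is fiwicontrol.fabric.fdir.FabricExitCode. FAULT_CLASS and describe_exit are hypothetical helpers, with the class mapping taken from the recovery playbook in §5.

```python
from enum import IntEnum


class FabricExitCode(IntEnum):
    """Local mirror of the exit-code table; not the authoritative enum."""
    SUCCESS = 0
    ENVIRONMENT_OR_DISCOVERY = 1
    CONFIGURATION = 2
    FABRIC_STALE = 3
    FABRIC_DISCOVERY_FAULT = 4


# Fault class per exit code, as listed in the operator playbook.
FAULT_CLASS = {
    FabricExitCode.ENVIRONMENT_OR_DISCOVERY: "F-ENV",
    FabricExitCode.CONFIGURATION: "F-CFG",
    FabricExitCode.FABRIC_STALE: "F-FAB-STALE",
    FabricExitCode.FABRIC_DISCOVERY_FAULT: "F-FAB-DISC",
}


def describe_exit(returncode: int) -> str:
    """Turn a process return code into a short human-readable label."""
    try:
        code = FabricExitCode(returncode)
    except ValueError:
        return f"unknown exit code {returncode}"
    if code is FabricExitCode.SUCCESS:
        return "success"
    return f"{code.name} ({FAULT_CLASS[code]})"
```

A wrapper could pass the returncode of a subprocess.run call on fabric_realize.py to describe_exit and include the result in its own log line.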


7. Logging and monitoring

  • Filter: grep / log pipelines for [FiWi-FDIR].
  • Levels: WARNING for stale fabric and non-strict continuation; INFO for invalid/unknown binding status; DEBUG for READY binding checks.
  • Recommendation: In production log aggregation, alert on STALE + --realize failures (exit 3/4) tied to campaign IDs or fabric_id.
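
Inside a Python log pipeline, the same prefix filter can be applied with the standard logging module. This is a sketch, not part of fiwicontrol: FdirOnly and is_alert are hypothetical names, and treating every WARNING-or-higher FDIR record as alert-worthy is an assumption based on the levels listed above.

```python
import logging

FDIR_PREFIX = "[FiWi-FDIR]"


class FdirOnly(logging.Filter):
    """Pass only records whose rendered message starts with the FDIR prefix."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.getMessage().startswith(FDIR_PREFIX)


def is_alert(record: logging.LogRecord) -> bool:
    """Assumption: FDIR records at WARNING or above warrant an alert."""
    return (
        record.levelno >= logging.WARNING
        and record.getMessage().startswith(FDIR_PREFIX)
    )
```

Attaching FdirOnly to a dedicated handler (handler.addFilter(FdirOnly())) routes FDIR records to a separate sink without changing how other modules log.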

8. Out of scope

  • Datapath packet loss, L4S marking, or WiFi MAC recovery — see html/Fi-Wi-L4S.html.
  • Automatic power cycling in PCIe harness — not enabled in default --dry-run skeleton; see harness DESIGN_GAPS.
  • Medical / life-safety FDIR — explicitly disclaimed in README.md and docs/install.md.

9. Maintenance

When adding CLIs or changing exit codes:

  1. Update FabricExitCode in src/fiwicontrol/fabric/fdir.py.
  2. Update the fabric_realize.py epilog and §6 of this document.
  3. Add or adjust pytest coverage for new fault paths where practical.