FiWiControl/docs/system-test-scripts.md


System test scripts (hardware harnesses)

Audience: software test engineers writing lab and bench automation in this repo.
Example: scripts/system/pcie_hotswap_harness.py — a small, readable pattern you can copy.
PCIe hot-swap setup (install, INI, JSON, commands): docs/pcie-hotswap-setup.md.


Pytest vs system scripts

| | tests/ + pytest | scripts/system/ |
|---|---|---|
| Goal | Fast feedback, CI, mocks, gated remote tests | Long runs, real power, cables, enumeration |
| When it runs | python3 -m pytest tests/ on every change | When the bench is wired and someone invokes the script |
| Failure meaning | Regression in code or contract | Often environment (wrong port, flaky USB, SSH) — design logs accordingly |
| Concurrency | Usually isolated tests | Often many logical paths sharing one USB tree or one SSH host |

Keep pytest strict and deterministic. Keep system scripts explicit about assumptions (CLI flags, env vars, dry-run) and safe defaults (no silent hardware actions).


What the example script does

scripts/system/pcie_hotswap_harness.py models a fronthaul (PCIe) hot-swap campaign:

  1. Build a Fabric: either load --fabric-json (FabricDefinition from disk → Fabric.rrhs, rrh_power_ports, fingerprint) or build N placeholder RadioHead instances (each with a FrontHaul) via --paths and wrap them in Fabric (optional concentrator ssh_node, power_lock).
  2. For each iteration, run asyncio.TaskGroup: every RRH runs one_cycle concurrently (stressing shared-resource design: one BrainStem, one rig SSH target, and so on).
  3. Each cycle: log the remove/restore phases (placeholders for future Power calls under --dry-run), then optionally SSH to the concentrator for a minimal smoke command (uname, sample lspci output).
  4. Exit non-zero if the async campaign raises (including TaskGroup child failures), using except* Exception so ExceptionGroup surfaces every underlying error.

The script's module docstring lists DESIGN_GAPS — known extension points — so the harness scope stays explicit.


Fabric JSON (discovery + bindings, one pass)

Full workflow (INI → discovery → prompts → JSON): docs/fabric-builder.md.

pip install -e ".[power]" on the workstation that sees the Acroname hub.

  1. Fabric builder — use build when a lab INI must be loaded first; bind is the same command but treats the INI as optional when the default path is missing:

    python3 -m fiwicontrol.fabric build -o configs/my-fabric.json -c configs/default.ini
    python3 -m fiwicontrol.fabric bind -o configs/my-fabric.json -c configs/default.ini
    
  2. Check freshness — exit 0 only if on-disk fingerprint matches live USB discovery:

    python3 -m fiwicontrol.fabric status -f configs/my-fabric.json
    
  3. Harness — load that graph (optional --strict-fabric-ready to require READY status):

    python3 scripts/system/pcie_hotswap_harness.py --fabric-json configs/my-fabric.json --dry-run
    

Types live under fiwicontrol.fabric (FabricDefinition, FabricRRHBinding, Fabric.binding_cache_status).


Concentrator dump (scripts/system/dump_concentrator.py)

Purpose: capture this machine's concentrator-relevant facts in one place: a CPU summary from /proc/cpuinfo, and (by default) a local host probe — lspci -tv, /sys/bus/pci/devices/*/current_link_width (and related link fields), and dmidecode -t baseboard when the binary succeeds (often after sudo, because SMBIOS is not always readable as a normal user).

Default output is human text, not JSON:

  • a short CPU block;
  • one line with the total count of sysfs PCI devices that expose negotiated link width/speed;
  • a WiFi / wireless-only table ("K of N") for PCI class 0x028… (network + wireless) with w/W lanes, current/max GT/s, class, and a chip column from lspci -nn (preferred) or the sysfs vendor / device hex pair (long chip strings are truncated);
  • a peek at the first --lspci-lines rows of lspci -tv (default 18, remainder summarized);
  • the first 14 lines of dmidecode -t baseboard when that command succeeds (often requires sudo on Fedora).
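The sysfs side of that probe can be sketched as a helper that reads negotiated link fields under a configurable root, the same root --pci-sysdir overrides. The function name and return shape here are assumptions for illustration, not the script's API:

```python
from pathlib import Path


def read_pci_links(sysdir: str) -> dict[str, dict[str, str]]:
    """Collect link fields per BDF from a /sys/bus/pci/devices-style root.

    Hypothetical helper: only devices that expose at least one of the
    negotiated-link files appear in the result.
    """
    out: dict[str, dict[str, str]] = {}
    for dev in sorted(Path(sysdir).iterdir()):
        fields: dict[str, str] = {}
        for name in ("current_link_width", "max_link_width",
                     "current_link_speed", "max_link_speed"):
            f = dev / name
            if f.is_file():
                fields[name] = f.read_text().strip()
        if fields:
            out[dev.name] = fields
    return out
```

Pointing it at a temp directory of fake device nodes is enough to exercise it without hardware, which is exactly what --pci-sysdir exists for.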

| Flag | Meaning |
|---|---|
| --json | Emit the full ConcentratorPlatformSnapshot.to_json_dict() document (large): CPU fields, optional lspci_tree, compact pci_device_links as {"cols":[...],"rows":[...]} (columns bdf, w, W, s, S, c = lanes and GT/s tokens and class), optional dmidecode_baseboard string. |
| --no-host-probe | CPU-only; skip lspci, sysfs PCI enumeration, and dmidecode. |
| --pci-sysdir DIR | Override /sys/bus/pci/devices (testing or nonstandard roots). |
| --pci-all | After the WiFi table, append a second table of other "interesting" non-wireless links (wide ports / downgrades), still capped by --pci-max-rows. |
| --pci-max-rows N | Cap for the optional second table (default 40). |
| --lspci-lines N | Lines of lspci -tv in human output (0 = omit that block; default 18). |
| --label NAME | Shown in the human header only. |
| --proc-cpuinfo PATH | Override /proc/cpuinfo (tests or chroots). |

Examples:

# Human summary (default); WiFi table + short lspci tree + DMI if allowed
python3 scripts/system/dump_concentrator.py

# Same with baseboard text (often needs root on Fedora)
sudo python3 scripts/system/dump_concentrator.py

# Machine JSON for tooling / CI artifacts
python3 scripts/system/dump_concentrator.py --json > /tmp/concentrator.json

Python API: fiwicontrol.concentrator.ConcentratorPlatform, ConcentratorPlatformSnapshot, PciDeviceLinkSnapshot, format_concentrator_platform_snapshot_human() (same layout as the script's default text; optional lspci_nn_by_bdf= for tests). Implementation lives in src/fiwicontrol/concentrator/host.py (package fiwicontrol.concentrator — local workstation facts, parallel to fiwicontrol.radio for RRH aggregates; not part of fabric JSON).

When the harness (or your script) loads --fabric-json, it merges the lab INI by default (the same file fiwicontrol.lab uses: FIWI_LAB_INI, else configs/default.ini if present). Pass --lab-ini PATH to point at another file, or --no-lab-ini to skip the merge. Merged keys include the optional [fabric] section (fabric_id, a concentrator [machine.*] SSH target) and optional [fabric.rrh.<radio_id>] sections to override the Acroname port / patch panel / module serial for rows already present in the JSON. The JSON supplies discovery_fingerprint and the RRH binding list (key rrhs; Python: FabricDefinition.rrhs) from fabric build / bind or fabric_realize.py --json.
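That precedence — INI sections overlay rows that already exist in the JSON, and unknown radio IDs are ignored — can be illustrated with configparser. The helper and its key names are hypothetical; the real merge lives in the harness's --lab-ini handling:

```python
import configparser


def merge_lab_ini(rows: dict[str, dict], ini_text: str) -> dict[str, dict]:
    """Overlay [fabric.rrh.<radio_id>] INI sections onto JSON-derived RRH rows.

    Illustrative sketch: INI values win, but only for radio IDs the JSON
    already knows about — the INI cannot add new RRHs.
    """
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    prefix = "fabric.rrh."
    for section in cfg.sections():
        if not section.startswith(prefix):
            continue
        radio_id = section[len(prefix):]
        if radio_id in rows:
            rows[radio_id].update(cfg[section])
    return rows
```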


Acroname discovery smoke test (scripts/system/test_acroname_usb_discovery.py)

Runs BrainStem USB enumeration per [machine.*] row in the lab INI: usb=local on the workstation you run from, usb=remote over SSH (same interpreter contract as fiwicontrol.power --discovery-json). Prints a short table per machine, brainstem_version from discovery JSON (with an SSH fallback pip probe when the remote build omits that field), and a total module count across hosts.

python3 scripts/system/test_acroname_usb_discovery.py
python3 scripts/system/test_acroname_usb_discovery.py -c configs/default.ini --json
python3 scripts/system/test_acroname_usb_discovery.py --local-only

Use --local-only to skip the INI and probe only this machine's USB. See docs/power-control-and-inventory.md for INI fields.


Fabric compose + realize (scripts/system/fabric_realize.py --realize)

Loads the lab INI, runs local Acroname discovery, compose_definition, builds Fabric, then await fab.realize() (strict fingerprint check against live USB). Default stdout is an OK line plus print(fabric) (human Fabric.__str__ summary). Pass --json for stdout-only FabricDefinition JSON after a successful realize. -v adds discovery / pre-realize fabric lines on stderr; --no-strict passes strict=False into Fabric.realize(). --realize-discovery-timeout SEC bounds Acroname discovery during --realize (default 120). Exit codes and FDIR semantics: docs/fdir.md and fabric_realize.py --help (epilog).

Without --realize, fabric_realize.py only composes the definition and prints a human workstation report (or --json / -o for definition JSON without calling Fabric.realize()). The human report can merge patch-panel labels into the WiFi PCIe table when --patch-panel-json PATH is set or when <lab_ini_stem>_panel.json exists beside the lab INI (see fiwicontrol.fabric.patch_panel_json).


Prerequisites

  1. Editable install from the repo root (see docs/install.md):

    cd ~/Code/FiWiControl
    python3 -m pip install -e ".[dev]"
    
  2. Python 3.11+ — the example uses asyncio.TaskGroup and except* Exception.

  3. Optional SSH to the rig — same contract as elsewhere: passwordless root@<host> for sshtype="ssh". Optional FIWI_SSH_CONFIG is documented in docs/node-control-asyncio-design.md.

  4. Power / Acroname — not wired in the example yet. When you add fiwicontrol.power, use pip install -e ".[power]" and follow docs/power-control-and-inventory.md.


How to run the example

From the repository root (the script prepends src to sys.path if needed):

# Safe: no SSH, no hardware — exercises structure only
python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2 --iterations 1

# With saved fabric JSON (after build/bind; merge lab INI at run time)
python3 scripts/system/pcie_hotswap_harness.py --fabric-json configs/my-fabric.json --lab-ini configs/default.ini --dry-run

# With SSH smoke on the concentrator (replace IP)
FIWI_REMOTE_IP=192.168.1.39 python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2
# or
python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2 --rig-ip 192.168.1.39

| Flag | Meaning |
|---|---|
| --fabric-json PATH | Load FabricDefinition from JSON; sets Fabric.rrhs and rrh_power_ports. Without it, uses --paths placeholders. |
| --lab-ini PATH | Lab INI merged after JSON (default: FIWI_LAB_INI, else configs/default.ini if present). |
| --no-lab-ini | Skip INI merge; JSON only. |
| --strict-fabric-ready | Exit 2 unless Fabric.binding_cache_status is READY (requires live Acroname discovery). Only meaningful with --fabric-json. |
| --dry-run | Log only; no programmable power (none hooked up in this skeleton). |
| --paths N | Placeholder RRH count (ignored when --fabric-json is set). |
| --iterations M | Outer loop: run M sequential TaskGroup rounds. |
| --settle SEC | Sleep between conceptual phases inside one_cycle. |
| --rig-ip | SSH target; defaults to FIWI_REMOTE_IP. Overrides the JSON concentrator when set. If unset and the JSON has no IP, remote checks are skipped. |

Patterns to reuse in your own harness

1. Thin main() — parse, configure logging, call asyncio.run

Keep I/O policy (flags, env) in main(). Keep async logic in async def functions so tests or imports can reuse the coroutines without a second event loop.

2. One coroutine per “story”: one_cycle, run_campaign

Name coroutines after user-visible steps (cycle, campaign, smoke). Pass explicit parameters (dry_run, settle_s, label) instead of hidden globals.

3. Concurrency with TaskGroup

When multiple RRHs run together, async with asyncio.TaskGroup() as tg: + tg.create_task(...) fails fast and bundles errors in an ExceptionGroup. Catch with except* Exception at the boundary that owns asyncio.run, log each sub-exception, and return a process exit code.

4. Dry-run first

Always provide a path that does not touch hardware so engineers can validate logging, SSH, and timing on a laptop. Real power transitions should be clearly gated (extra flag or explicit “I know this is live”).
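One way to gate live actions, sketched with a hypothetical helper (the example harness has no live power path yet, so both the name and the gating scheme are assumptions):

```python
def power_cycle(port: int, *, dry_run: bool = True, confirm_live: bool = False) -> str:
    """Require two explicit opt-ins before touching hardware.

    Hypothetical sketch: dry-run is the default, and even with
    dry_run=False a second "I know this is live" flag is required.
    """
    if dry_run:
        return f"[dry-run] would power-cycle port {port}"
    if not confirm_live:
        raise SystemExit("refusing live power action without confirm_live=True")
    # A real BrainStem / Power call would go here.
    return f"power-cycling port {port}"
```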

5. Domain types from the library

Attach FrontHaul to RadioHead even when fields are None — it documents intent and keeps the harness aligned with production models. Pass a Fabric into the async campaign so shared resources (concentrator SSH, bench Power, asyncio.Lock, rrh_power_ports) have one home. Prefer --fabric-json (bound once via python3 -m fiwicontrol.fabric bind) over ad hoc placeholders; reserve --paths for laptop-only smoke.

6. Remote checks via ssh_node

Use await node.rexec(cmd="...", ...) for one-shot remote work. For periodic sampling, prefer Command / CommandManager from fiwicontrol.commands (see docs/node-control-asyncio-design.md).

7. Document gaps in the script

A short DESIGN_GAPS or TODO block at the top of the harness documents how enumeration, telemetry, or SPC relate to this script.


Checklist for a new system script

  1. Lives under scripts/system/ with a #!/usr/bin/env python3 shebang.
  2. argparse (or equivalent) documents every assumption; --help is accurate.
  3. --dry-run (or equivalent) when hardware is involved.
  4. logging at INFO for operator visibility; avoid print for control flow.
  5. Async entry is async def + single asyncio.run(...) from main().
  6. Concurrent work uses TaskGroup (or gather with a documented error policy).
  7. Non-zero exit on failure; ExceptionGroup handled if you use TaskGroup.
  8. README or this doc updated if you add a new category of harness or dependency.

  • docs/pcie-hotswap-setup.md — PCIe harness prerequisites and JSON generation.
  • docs/fabric-builder.md — lab INI + python3 -m fiwicontrol.fabric build / bind.
  • docs/install.md — workstation and rig setup, pip install -e.
  • docs/node-control-asyncio-design.md — ssh_node, Command, timeouts, running tests.
  • docs/power-control-and-inventory.md — Acroname / Monsoon, INI, --verify-inventory.
  • docs/spc.md — when campaigns need statistical control charts after KPI extraction.
  • README.md — scripts/system/ vs tests/ overview.