FiWiControl/docs/system-test-scripts.md


System test scripts (hardware harnesses)

Audience: software test engineers writing lab and bench automation in this repo.
Example: **scripts/system/pcie_hotswap_harness.py** — a small, readable pattern you can copy.
PCIe hot-swap setup (install, INI, JSON, commands): **docs/pcie-hotswap-setup.md**.


Pytest vs system scripts

|  | **tests/ + pytest** | **scripts/system/** |
|---|---|---|
| Goal | Fast feedback, CI, mocks, gated remote tests | Long runs, real power, cables, enumeration |
| When it runs | python3 -m pytest tests/ on every change | When the bench is wired and someone invokes the script |
| Failure meaning | Regression in code or contract | Often environment (wrong port, flaky USB, SSH) — design logs accordingly |
| Concurrency | Usually isolated tests | Often many logical paths sharing one USB tree or one SSH host |

Keep pytest strict and deterministic. Keep system scripts explicit about assumptions (CLI flags, env vars, dry-run) and safe defaults (no silent hardware actions).


What the example script does

**scripts/system/pcie_hotswap_harness.py** models a fronthaul (PCIe) hot-swap campaign:

  1. Build a **Fabric**: either load **--fabric-json** (**FabricDefinition** from disk → **Fabric.rrhs**, **rrh_power_ports**, fingerprint) or build N placeholder **RadioHead** instances (each with a **FrontHaul**) via **--paths** and wrap them in **Fabric** (optional concentrator **ssh_node**, **power_lock**).
  2. For each iteration, run **asyncio.TaskGroup**: every RRH runs **one_cycle** concurrently (stressing shared-resource design: one BrainStem, one rig SSH target, and so on).
  3. Each cycle: log remove/restore phases (**--dry-run**) or placeholders for future **Power** calls, then optionally SSH to the concentrator for a minimal smoke command (uname, sample lspci output).
  4. Exit non-zero if the async campaign raises (including **TaskGroup** child failures), using **except\* Exception** so **ExceptionGroup** surfaces every underlying error.

The script's module docstring lists DESIGN_GAPS — known extension points so the harness scope stays explicit.


Fabric JSON (discovery + bindings, one pass)

Full workflow (INI → discovery → prompts → JSON): **docs/fabric-builder.md**.

**pip install -e ".[power]"** on the workstation that sees the Acroname hub.

  1. Fabric builder — use **build** when a lab INI must be loaded first; **bind** is the same with the INI optional if the default path is missing:
 python3 -m fiwicontrol.fabric build -o configs/my-fabric.json -c configs/default.ini
 python3 -m fiwicontrol.fabric bind -o configs/my-fabric.json -c configs/default.ini
  2. Check freshness — exit 0 only if the on-disk fingerprint matches live USB discovery:
 python3 -m fiwicontrol.fabric status -f configs/my-fabric.json
  3. Harness — load that graph (optional **--strict-fabric-ready** to require **READY** status):
 python3 scripts/system/pcie_hotswap_harness.py --fabric-json configs/my-fabric.json --dry-run

Types live under **fiwicontrol.fabric** (**FabricDefinition**, **FabricRRHBinding**, **Fabric.binding_cache_status**).


Concentrator dump (scripts/system/dump_concentrator.py)

Purpose: capture this machine's concentrator-relevant facts in one place: a CPU summary from **/proc/cpuinfo**, and (by default) a local host probe — **lspci -tv**, **/sys/bus/pci/devices/*/current_link_width** (and related link fields), and **dmidecode -t baseboard** when the binary succeeds (often after **sudo**, because SMBIOS is not always readable as a normal user).

Default output is human text, not JSON:

  • a short CPU block;
  • one line with the total count of sysfs PCI devices that expose negotiated link width/speed;
  • a WiFi / wireless-only table (**K of N**) for PCI class **0x028…** (network + wireless) with **w/W** lanes, current/max GT/s, **class**, and a chip column from **lspci -nn** (preferred) or the sysfs **vendor** / **device** hex pair (long chip strings are truncated);
  • a peek at the first **--lspci-lines** rows of **lspci -tv** (default 18, remainder summarized);
  • the first 14 lines of **dmidecode -t baseboard** when that command succeeds (often requires **sudo** on Fedora).

| Flag | Meaning |
|---|---|
| **--json** | Emit the full **ConcentratorPlatformSnapshot.to_json_dict()** document (large): CPU fields, optional **lspci_tree**, compact **pci_device_links** as **{"cols":[...],"rows":[...]}** (columns **bdf**, **w**, **W**, **s**, **S**, **c** = lanes, GT/s tokens, and class), optional **dmidecode_baseboard** string. |
| **--no-host-probe** | CPU-only; skip **lspci**, sysfs PCI enumeration, and **dmidecode**. |
| **--pci-sysdir DIR** | Override **/sys/bus/pci/devices** (testing or nonstandard roots). |
| **--pci-all** | After the WiFi table, append a second table of other “interesting” non-wireless links (wide ports / downgrades), still capped by **--pci-max-rows**. |
| **--pci-max-rows N** | Cap for the optional second table (default 40). |
| **--lspci-lines N** | Lines of **lspci -tv** in human output (0 = omit that block; default 18). |
| **--label NAME** | Shown in the human header only. |
| **--proc-cpuinfo PATH** | Override **/proc/cpuinfo** (tests or chroots). |
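
The sysfs side of the probe can be sketched like this. It is illustrative only — the real parsing in **fiwicontrol.concentrator** is richer — but the attribute names (**current_link_width** and friends) are the standard Linux PCI sysfs fields the doc refers to, and parameterizing the root directory is what makes **--pci-sysdir** testable.

```python
# Illustrative reader for the sysfs PCI link fields the dump script consumes.
# The sysdir parameter mirrors --pci-sysdir (default /sys/bus/pci/devices).
from pathlib import Path

def read_pci_links(sysdir: str) -> dict[str, dict[str, str]]:
    """Map BDF -> {field: value} for devices exposing a negotiated link width."""
    fields = ("current_link_width", "max_link_width",
              "current_link_speed", "max_link_speed", "class")
    out: dict[str, dict[str, str]] = {}
    for dev in sorted(Path(sysdir).glob("*")):
        if not (dev / "current_link_width").is_file():
            continue  # some functions (e.g. virtual ones) omit link attributes
        out[dev.name] = {f: (dev / f).read_text().strip()
                         for f in fields if (dev / f).is_file()}
    return out
```

On a real host, `read_pci_links("/sys/bus/pci/devices")` yields one entry per device with negotiated link data; in tests, a temporary directory with fake device subdirectories stands in for sysfs.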

Examples:

# Human summary (default); WiFi table + short lspci tree + DMI if allowed
python3 scripts/system/dump_concentrator.py

# Same with baseboard text (often needs root on Fedora)
sudo python3 scripts/system/dump_concentrator.py

# Machine JSON for tooling / CI artifacts
python3 scripts/system/dump_concentrator.py --json > /tmp/concentrator.json

Python API: **fiwicontrol.concentrator.ConcentratorPlatform**, **ConcentratorPlatformSnapshot**, **PciDeviceLinkSnapshot**, **format_concentrator_platform_snapshot_human()** (same layout as the script's default text; optional **lspci_nn_by_bdf=** for tests). Implementation lives in **src/fiwicontrol/concentrator/host.py** (package **fiwicontrol.concentrator** — local workstation facts, parallel to **fiwicontrol.radio** for RRH aggregates; not part of fabric JSON).

When the harness (or your script) loads **--fabric-json**, it merges the lab INI by default (same file as **fiwicontrol.lab**: **FIWI_LAB_INI**, else **configs/default.ini** if present). Pass **--lab-ini PATH** to point at another file. Merged keys include the optional **[fabric]** section (**fabric_id**, **concentrator** naming a **[machine.*]** SSH target) and optional **[fabric.rrh.<radio_id>]** sections to override Acroname port / patch panel / module serial for rows already present in the JSON. Use **--no-lab-ini** to skip. The JSON supplies **discovery_fingerprint** and the RRH binding list (key **rrhs**; Python: **FabricDefinition.rrhs**) from **fabric build** / **bind** or **fabric_realize.py --json**.
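
The shape of the per-radio overrides can be sketched with **configparser**. Only the section and key names come from this doc; the helper below is a hypothetical stand-in for the real merge logic inside **fiwicontrol.fabric**.

```python
# Sketch only: section/key names follow the doc ([fabric.rrh.<radio_id>]);
# the real merge lives in fiwicontrol.fabric and is not reproduced here.
import configparser

def rrh_overrides(ini_text: str) -> dict[str, dict[str, str]]:
    """Collect per-radio override dicts keyed by radio_id."""
    cp = configparser.ConfigParser()
    cp.read_string(ini_text)
    prefix = "fabric.rrh."
    return {s[len(prefix):]: dict(cp[s])
            for s in cp.sections() if s.startswith(prefix)}
```

Given an INI with `[fabric.rrh.rrh-03]` containing `acroname_port = 3`, the helper returns `{"rrh-03": {"acroname_port": "3"}}`, which a merge step could then apply to matching rows already present in the fabric JSON.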


Acroname discovery smoke test (scripts/system/test_acroname_usb_discovery.py)

Runs BrainStem USB enumeration per **[machine.*]** row in the lab INI: **usb=local** on the workstation you run from, **usb=remote** over SSH (same interpreter contract as **fiwicontrol.power --discovery-json**). Prints a short table per machine, **brainstem_version** from the discovery JSON (with an SSH fallback pip probe when the remote build omits that field), and a total module count across hosts.

python3 scripts/system/test_acroname_usb_discovery.py
python3 scripts/system/test_acroname_usb_discovery.py -c configs/default.ini --json
python3 scripts/system/test_acroname_usb_discovery.py --local-only

Use **--local-only** to skip the INI and probe only this machine's USB. See **docs/power-control-and-inventory.md** for INI fields.


Wi-Fi PCI chip discovery (scripts/system/discover_wifi_pci.py)

Prints Wi-Fi / MediaTek-looking **lspci -nn** lines with suggested **pcie_bdf** values.

# Just discover chips/BDFs visible right now
python3 scripts/system/discover_wifi_pci.py

# JSON output for tooling
python3 scripts/system/discover_wifi_pci.py --json

Optional: power all RRHs ON first

When radios are currently off, add **--power-all-on** so PCI devices are present before discovery:

python3 scripts/system/discover_wifi_pci.py --power-all-on -f configs/clubhouse-uax-24.json

**--power-all-on** powers both local and relay-host RRH ports (using **-c/--lab-ini** for relay resolution):

python3 scripts/system/discover_wifi_pci.py --power-all-on -f configs/clubhouse-uax-24.json -c configs/default.ini

Safety preview (no toggles):

python3 scripts/system/discover_wifi_pci.py --power-all-on -f configs/clubhouse-uax-24.json -c configs/default.ini --dry-run

**--power-all-on** uses RRH bindings from the fabric JSON (**acroname_module_serial** + **acroname_port**) and drives Acroname ports locally via BrainStem and remotely via SSH (relay hosts from the lab INI).
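
Picking Wi-Fi candidates out of **lspci -nn** output reduces to matching the BDF and the bracketed class code per line. The sketch below is illustrative (the sample line is a typical MediaTek entry, not captured from the real script), but the **lspci -nn** line format it parses is standard.

```python
# Illustrative parse of `lspci -nn` lines: extract BDF + chip string for
# network-class (0x02xx) devices. Not the script's actual implementation.
import re

LSPCI_NN = re.compile(
    r"^(?P<bdf>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+"   # e.g. 03:00.0
    r"[^\[]+\[(?P<cls>[0-9a-f]{4})\]:\s+"               # class name [0280]:
    r"(?P<chip>.+)$",                                    # vendor/device text
    re.IGNORECASE,
)

def wifi_candidates(lspci_nn_output: str) -> list[tuple[str, str]]:
    """Return (bdf, chip) pairs for network-class devices."""
    out = []
    for line in lspci_nn_output.splitlines():
        m = LSPCI_NN.match(line.strip())
        if m and m.group("cls").startswith("02"):
            out.append((m.group("bdf"), m.group("chip")))
    return out
```

The extracted **bdf** values are exactly the **pcie_bdf** suggestions the discovery script prints.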


Interactive RRH -> PCI mapping (scripts/system/assign_rrh_pcie_bdf.py)

Use this when you want to assign **pcie_bdf** values to each **radio_id** and write them to the lab INI:

python3 scripts/system/assign_rrh_pcie_bdf.py --power-on-first -f configs/clubhouse-uax-24.json -c configs/default.ini

Behavior:

  • Powers local + relay RRHs on first (same logic as **discover_wifi_pci.py --power-all-on**).
  • Lists Wi-Fi PCI candidates from local **lspci -nn**.
  • Prompts once per **radio_id** to pick a candidate index or type a BDF directly.
  • Writes/updates **[fabric.rrh.<radio_id>] pcie_bdf = ...** in the INI.

Prompt shortcuts:

  • **Enter** — keep existing value
  • **-** — clear pcie_bdf
  • **<number>** — select from candidate list
  • **<bdf>** — set explicit value (e.g. **03:00.0**)

Fabric compose + realize (scripts/system/fabric_realize.py --realize)

Loads the lab INI, runs local Acroname discovery, **compose_definition**, builds **Fabric**, then **await fab.realize()** (strict fingerprint check against live USB). Default stdout is an OK line plus **print(fabric)** (the human **Fabric.__str__** summary). Pass **--json** for stdout-only **FabricDefinition** JSON after a successful realize. **-v** adds discovery / pre-realize fabric lines on stderr; **--no-strict** passes **strict=False** into **Fabric.realize()**. **--realize-discovery-timeout SEC** bounds Acroname discovery during **--realize** (default 120). Exit codes and FDIR semantics: **docs/fdir.md** and **fabric_realize.py --help** (epilog).

Without **--realize**, **fabric_realize.py** only composes the definition and prints a human workstation report (or **--json** / **-o** for definition JSON without calling **Fabric.realize()**). The human report can merge patch-panel labels into the WiFi PCIe table when **--patch-panel-json PATH** is set or when **<lab_ini_stem>_panel.json** exists beside the lab INI (see **fiwicontrol.fabric.patch_panel_json**).


Prerequisites

  1. Editable install from the repo root (see **docs/install.md**):
 cd ~/Code/FiWiControl
 python3 -m pip install -e ".[dev]"
  2. Python 3.11+ — the example uses **asyncio.TaskGroup** and **except\* Exception**.
  3. Optional SSH to the rig — same contract as elsewhere: passwordless **root@<host>** for **sshtype="ssh"**. Optional **FIWI_SSH_CONFIG** is documented in **docs/node-control-asyncio-design.md**.
  4. Power / Acroname — not wired in the example yet. When you add **fiwicontrol.power**, use **pip install -e ".[power]"** and follow **docs/power-control-and-inventory.md**.

How to run the example

From the repository root (the script prepends **src** to **sys.path** if needed):

# Safe: no SSH, no hardware — exercises structure only
python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2 --iterations 1

# With saved fabric JSON (after build/bind; merge lab INI at run time)
python3 scripts/system/pcie_hotswap_harness.py --fabric-json configs/my-fabric.json --lab-ini configs/default.ini --dry-run

# With SSH smoke on the concentrator (replace IP)
FIWI_REMOTE_IP=192.168.1.39 python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2
# or
python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2 --rig-ip 192.168.1.39
| Flag | Meaning |
|---|---|
| **--fabric-json PATH** | Load **FabricDefinition** from JSON; sets **Fabric.rrhs** and **rrh_power_ports**. Without it, uses **--paths** placeholders. |
| **--lab-ini PATH** | Lab INI merged after JSON (default: **FIWI_LAB_INI**, else **configs/default.ini** if present). |
| **--no-lab-ini** | Skip INI merge; JSON only. |
| **--strict-fabric-ready** | Exit 2 unless **Fabric.binding_cache_status** is **READY** (requires live Acroname discovery). Only meaningful with **--fabric-json**. |
| **--dry-run** | Log only; no programmable power (none hooked up in this skeleton). |
| **--paths N** | Placeholder RRH count (ignored when **--fabric-json** is set). |
| **--iterations M** | Outer loop: run **M** sequential **TaskGroup** rounds. |
| **--settle SEC** | Sleep between conceptual phases inside **one_cycle**. |
| **--rig-ip** | SSH target; defaults to **FIWI_REMOTE_IP**. Overrides JSON concentrator when set. If unset and JSON has no IP, remote checks are skipped. |

Patterns to reuse in your own harness

1. Thin main() — parse, configure logging, call asyncio.run

Keep I/O policy (flags, env) in **main()**. Keep async logic in **async def** functions so tests or imports can reuse the coroutines without a second event loop.

2. One coroutine per “story”: one_cycle, run_campaign

Name coroutines after user-visible steps (cycle, campaign, smoke). Pass explicit parameters (dry_run, settle_s, label) instead of hidden globals.

3. Concurrency with TaskGroup

When multiple RRHs run together, **async with asyncio.TaskGroup() as tg:** + **tg.create_task(...)** fails fast and bundles errors in an **ExceptionGroup**. Catch with **except* Exception** at the boundary that owns **asyncio.run**, log each sub-exception, and return a process exit code.

4. Dry-run first

Always provide a path that does not touch hardware so engineers can validate logging, SSH, and timing on a laptop. Real power transitions should be clearly gated (extra flag or explicit “I know this is live”).

5. Domain types from the library

Attach **FrontHaul** to **RadioHead** even when fields are **None** — it documents intent and keeps the harness aligned with production models. Pass a **Fabric** into the async campaign so shared resources (concentrator SSH, bench **Power**, **asyncio.Lock**, **rrh_power_ports**) have one home. Prefer **--fabric-json** (bound once via **python3 -m fiwicontrol.fabric bind**) over ad hoc placeholders; reserve **--paths** for laptop-only smoke.

6. Remote checks via ssh_node

Use **await node.rexec(cmd="...", ...)** for one-shot remote work. For periodic sampling, prefer **Command** / **CommandManager** from **fiwicontrol.commands** (see **docs/node-control-asyncio-design.md**).

7. Document gaps in the script

A short DESIGN_GAPS or TODO block at the top of the harness documents how enumeration, telemetry, or SPC relate to this script.


Checklist for a new system script

  1. Lives under **scripts/system/** with a **#!/usr/bin/env python3** shebang.
  2. **argparse** (or equivalent) documents every assumption; **--help** is accurate.
  3. **--dry-run** (or equivalent) when hardware is involved.
  4. **logging** at INFO for operator visibility; avoid **print** for control flow.
  5. Async entry is **async def** + single **asyncio.run(...)** from **main()**.
  6. Concurrent work uses **TaskGroup** (or **gather** with a documented error policy).
  7. Non-zero exit on failure; **ExceptionGroup** handled if you use **TaskGroup**.
  8. README or this doc updated if you add a new category of harness or dependency.

Related docs

  • **docs/pcie-hotswap-setup.md** — PCIe harness prerequisites and JSON generation.
  • **docs/fabric-builder.md** — lab INI + **python3 -m fiwicontrol.fabric build** / **bind**.
  • **docs/install.md** — workstation and rig setup, **pip install -e**.
  • **docs/node-control-asyncio-design.md** — **ssh_node**, **Command**, timeouts, running tests.
  • **docs/power-control-and-inventory.md** — Acroname / Monsoon, INI, **--verify-inventory**.
  • **docs/spc.md** — when campaigns need statistical control charts after KPI extraction.
  • **README.md** — **scripts/system/** vs **tests/** overview.