# System test scripts (hardware harnesses)
**Audience:** software test engineers writing **lab and bench automation** in this repo.
**Example:** **`scripts/system/pcie_hotswap_harness.py`** — a small, readable pattern you can copy.
**PCIe hot-swap setup (install, INI, JSON, commands):** **`docs/pcie-hotswap-setup.md`**.
---
## Pytest vs system scripts
| | **`tests/` + pytest** | **`scripts/system/`** |
|--|------------------------|------------------------|
| **Goal** | Fast feedback, CI, mocks, gated remote tests | Long runs, real power, cables, enumeration |
| **When it runs** | `python3 -m pytest tests/` on every change | When the bench is wired and someone invokes the script |
| **Failure meaning** | Regression in code or contract | Often **environment** (wrong port, flaky USB, SSH) — design logs accordingly |
| **Concurrency** | Usually isolated tests | Often **many logical paths** sharing one USB tree or one SSH host |
Keep **pytest** strict and deterministic. Keep **system scripts** explicit about assumptions (CLI flags, env vars, dry-run) and safe defaults (no silent hardware actions).
---
## What the example script does
**`scripts/system/pcie_hotswap_harness.py`** models a **fronthaul (PCIe) hot-swap campaign**:
1. Build a **`Fabric`**: either load **`--fabric-json`** (**`FabricDefinition`** from disk → **`Fabric.rrhs`**, **`rrh_power_ports`**, fingerprint) or build **N placeholder** **`RadioHead`** instances (each with a **`FrontHaul`**) via **`--paths`** and wrap them in **`Fabric`** (optional concentrator **`ssh_node`**, **`power_lock`**).
2. For each **iteration**, run **`asyncio.TaskGroup`**: every RRH runs **`one_cycle`** **concurrently** (stressing shared-resource design: one BrainStem, one rig SSH target, and so on).
3. Each cycle: **log** remove/restore phases (**`--dry-run`**) or placeholders for future **`Power`** calls, then optionally **SSH** to the concentrator for a minimal **smoke** command (`uname`, sample `lspci` output).
4. Exit **non-zero** if the async campaign raises (including **`TaskGroup`** child failures), using **`except* Exception`** so **`ExceptionGroup`** surfaces every underlying error.
The script's module docstring lists **DESIGN_GAPS** — known extension points so harness scope stays explicit.
---
## Fabric JSON (discovery + bindings, one pass)
Full workflow (INI → discovery → prompts → JSON): **`docs/fabric-builder.md`**.
**`pip install -e ".[power]"`** on the workstation that sees the Acroname hub.
1. **Fabric builder** — use **`build`** when a lab INI must be loaded first; **`bind`** does the same but treats the INI as optional when the default path is missing:
```bash
python3 -m fiwicontrol.fabric build -o configs/my-fabric.json -c configs/default.ini
python3 -m fiwicontrol.fabric bind -o configs/my-fabric.json -c configs/default.ini
```
2. **Check freshness** — exit **0** only if on-disk fingerprint matches **live** USB discovery:
```bash
python3 -m fiwicontrol.fabric status -f configs/my-fabric.json
```
3. **Harness** — load that graph (optional **`--strict-fabric-ready`** to require **`READY`** status):
```bash
python3 scripts/system/pcie_hotswap_harness.py --fabric-json configs/my-fabric.json --dry-run
```
Types live under **`fiwicontrol.fabric`** (**`FabricDefinition`**, **`FabricRRHBinding`**, **`Fabric.binding_cache_status`**).
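To make the freshness check concrete, the sketch below shows one way a fingerprint comparison could work. It is not the algorithm **`fiwicontrol.fabric`** uses: the hashing scheme, the **`serial`** / **`port_count`** fields, and the function names are all hypothetical, and the exit value **1** for a stale file is an assumption (the document only states that **0** means a match).

```python
import hashlib
import json


def fingerprint(modules: list[dict]) -> str:
    """Hash a discovery result so that module ordering does not matter.

    Hypothetical scheme: canonical JSON of modules sorted by serial, then SHA-256.
    """
    canonical = json.dumps(
        sorted(modules, key=lambda m: m["serial"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def status_exit_code(saved_fingerprint: str, live_modules: list[dict]) -> int:
    """Return 0 only when the saved fingerprint matches live discovery."""
    return 0 if saved_fingerprint == fingerprint(live_modules) else 1


# Made-up discovery rows, used only to exercise the comparison.
saved = fingerprint([{"serial": "A1", "port_count": 8}, {"serial": "B2", "port_count": 8}])
reordered = [{"serial": "B2", "port_count": 8}, {"serial": "A1", "port_count": 8}]
missing_hub = [{"serial": "A1", "port_count": 8}]
```

With these rows, a reordered but identical discovery still matches, while a missing hub makes the saved file stale.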
---
## Concentrator dump (`scripts/system/dump_concentrator.py`)
**Purpose:** capture **this machine's** concentrator-relevant facts in one place: CPU summary from **`/proc/cpuinfo`**, and (by default) a **local host probe** consisting of **`lspci -tv`**, **`/sys/bus/pci/devices/*/current_link_width`** (and related link fields), and **`dmidecode -t baseboard`** when the binary succeeds (often after **`sudo`**, because SMBIOS is not always readable as a normal user).
**Default output is human text**, not JSON: a short CPU block; one line with the **total count** of sysfs PCI devices that expose negotiated link width/speed; a **WiFi / wireless-only** table (**`K of N`**) for PCI class **`0x028…`** (network + wireless) with **`w`/`W`** lanes, **GT/s** current/max, **`class`**, and a **chip** column from **`lspci -nn`** (preferred) or sysfs **`vendor`** / **`device`** hex pair (long chip strings are truncated); a **peek** at the first **`--lspci-lines`** rows of **`lspci -tv`** (default **18**, remainder summarized); and the **first 14 lines** of **`dmidecode -t baseboard`** when that command succeeds (often requires **`sudo`** on Fedora).
| Flag | Meaning |
|------|---------|
| **`--json`** | Emit the full **`ConcentratorPlatformSnapshot.to_json_dict()`** document (large): CPU fields, optional **`lspci_tree`**, compact **`pci_device_links`** as **`{"cols":[...],"rows":[...]}`** (columns **`bdf`**, **`w`**, **`W`**, **`s`**, **`S`**, **`c`** = lanes and GT/s tokens and class), optional **`dmidecode_baseboard`** string. |
| **`--no-host-probe`** | CPU-only; skip **`lspci`**, sysfs PCI enumeration, and **`dmidecode`**. |
| **`--pci-sysdir DIR`** | Override **`/sys/bus/pci/devices`** (testing or nonstandard roots). |
| **`--pci-all`** | After the WiFi table, append a second table of **other** “interesting” non-wireless links (wide ports / downgrades), still capped by **`--pci-max-rows`**. |
| **`--pci-max-rows N`** | Cap for the optional second table (default **40**). |
| **`--lspci-lines N`** | Lines of **`lspci -tv`** in human output (**0** = omit that block; default **18**). |
| **`--label NAME`** | Shown in the human header only. |
| **`--proc-cpuinfo PATH`** | Override **`/proc/cpuinfo`** (tests or chroots). |
**Examples:**
```bash
# Human summary (default); WiFi table + short lspci tree + DMI if allowed
python3 scripts/system/dump_concentrator.py
# Same with baseboard text (often needs root on Fedora)
sudo python3 scripts/system/dump_concentrator.py
# Machine JSON for tooling / CI artifacts
python3 scripts/system/dump_concentrator.py --json > /tmp/concentrator.json
```
**Python API:** **`fiwicontrol.concentrator.ConcentratorPlatform`**, **`ConcentratorPlatformSnapshot`**, **`PciDeviceLinkSnapshot`**, **`format_concentrator_platform_snapshot_human()`** (same layout as the script's default text; optional **`lspci_nn_by_bdf=`** for tests). Implementation lives in **`src/fiwicontrol/concentrator/host.py`** (package **`fiwicontrol.concentrator`** — local workstation facts, parallel to **`fiwicontrol.radio`** for RRH aggregates; not part of fabric JSON).
When the harness (or your script) loads **`--fabric-json`**, it **merges lab INI by default** (same file as **`fiwicontrol.lab`**: **`FIWI_LAB_INI`**, else **`configs/default.ini`** if present). Pass **`--lab-ini PATH`** to point at another file. Merged keys include optional **`[fabric]`** (**`fabric_id`**, **`concentrator`** → **`[machine.*]`** SSH target) and optional **`[fabric.rrh.<radio_id>]`** to override Acroname port / patch panel / module serial for rows already present in the JSON. Use **`--no-lab-ini`** to skip. JSON supplies **`discovery_fingerprint`** and the RRH binding list (key **`rrhs`**; Python: **`FabricDefinition.rrhs`**) from **`fabric build`** / **`bind`** or **`fabric_realize.py --json`**.
---
## Acroname discovery smoke test (`scripts/system/test_acroname_usb_discovery.py`)
Runs BrainStem USB enumeration **per `[machine.*]` row** in the lab INI: **`usb=local`** on the workstation you run from, **`usb=remote`** over SSH (same interpreter contract as **`fiwicontrol.power --discovery-json`**). Prints a short table per machine, **`brainstem_version`** from discovery JSON (with an SSH fallback pip probe when the remote build omits that field), and a **total module count** across hosts.
```bash
python3 scripts/system/test_acroname_usb_discovery.py
python3 scripts/system/test_acroname_usb_discovery.py -c configs/default.ini --json
python3 scripts/system/test_acroname_usb_discovery.py --local-only
```
Use **`--local-only`** to skip the INI and probe only this machine's USB. See **`docs/power-control-and-inventory.md`** for INI fields.
---
## Fabric compose + realize (`scripts/system/fabric_realize.py --realize`)
Loads the lab INI, runs **local** Acroname discovery, **`compose_definition`**, builds **`Fabric`**, then **`await fab.realize()`** (strict fingerprint check against live USB). Default **stdout** is an **OK** line plus **`print(fabric)`** (human **`Fabric.__str__`** summary). Pass **`--json`** for **stdout**-only **`FabricDefinition`** JSON after a successful realize. **`-v`** adds discovery / pre-realize fabric lines on **stderr**; **`--no-strict`** passes **`strict=False`** into **`Fabric.realize()`**. **`--realize-discovery-timeout SEC`** bounds Acroname discovery during **`--realize`** (default **120**). **Exit codes** and FDIR semantics: **`docs/fdir.md`** and **`fabric_realize.py --help`** (epilog).
Without **`--realize`**, **`fabric_realize.py`** only composes the definition and prints a **human** workstation report (or **`--json`** / **`-o`** for definition JSON **without** calling **`Fabric.realize()`**). The human report can merge patch-panel labels into the WiFi PCIe table when **`--patch-panel-json PATH`** is set or when **`<lab_ini_stem>_panel.json`** exists beside the lab INI (see **`fiwicontrol.fabric.patch_panel_json`**).
---
## Prerequisites
1. **Editable install** from the repo root (see **`docs/install.md`**):
```bash
cd ~/Code/FiWiControl
python3 -m pip install -e ".[dev]"
```
2. **Python 3.11+** — the example uses **`asyncio.TaskGroup`** and **`except* Exception`**.
3. **Optional SSH to the rig** — same contract as elsewhere: passwordless **`root@<host>`** for **`sshtype="ssh"`**. Optional **`FIWI_SSH_CONFIG`** is documented in **`docs/node-control-asyncio-design.md`**.
4. **Power / Acroname** — not wired in the example yet. When you add **`fiwicontrol.power`**, use **`pip install -e ".[power]"`** and follow **`docs/power-control-and-inventory.md`**.
---
## How to run the example
From the **repository root** (the script prepends **`src`** to **`sys.path`** if needed):
```bash
# Safe: no SSH, no hardware — exercises structure only
python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2 --iterations 1
# With saved fabric JSON (after build/bind; merge lab INI at run time)
python3 scripts/system/pcie_hotswap_harness.py --fabric-json configs/my-fabric.json --lab-ini configs/default.ini --dry-run
# With SSH smoke on the concentrator (replace IP)
FIWI_REMOTE_IP=192.168.1.39 python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2
# or
python3 scripts/system/pcie_hotswap_harness.py --dry-run --paths 2 --rig-ip 192.168.1.39
```
| Flag | Meaning |
|------|---------|
| **`--fabric-json PATH`** | Load **`FabricDefinition`** from JSON; sets **`Fabric.rrhs`** and **`rrh_power_ports`**. Without it, uses **`--paths`** placeholders. |
| **`--lab-ini PATH`** | Lab INI merged after JSON (default: **`FIWI_LAB_INI`**, else **`configs/default.ini`** if present). |
| **`--no-lab-ini`** | Skip INI merge; JSON only. |
| **`--strict-fabric-ready`** | Exit **2** unless **`Fabric.binding_cache_status`** is **`READY`** (requires live Acroname discovery). Only meaningful with **`--fabric-json`**. |
| **`--dry-run`** | Log only; no programmable power (none hooked up in this skeleton). |
| **`--paths N`** | Placeholder RRH count (ignored when **`--fabric-json`** is set). |
| **`--iterations M`** | Outer loop: run **`M`** sequential **`TaskGroup`** rounds. |
| **`--settle SEC`** | Sleep between conceptual phases inside **`one_cycle`**. |
| **`--rig-ip`** | SSH target; defaults to **`FIWI_REMOTE_IP`**. Overrides JSON concentrator when set. If unset and JSON has no IP, remote checks are skipped. |
---
## Patterns to reuse in your own harness
### 1. Thin `main()` — parse, configure logging, call `asyncio.run`
Keep **I/O policy** (flags, env) in **`main()`**. Keep **async** logic in **`async def`** functions so tests or imports can reuse the coroutines without a second event loop.
### 2. One coroutine per “story”: `one_cycle`, `run_campaign`
Name coroutines after **user-visible steps** (cycle, campaign, smoke). Pass **explicit** parameters (`dry_run`, `settle_s`, `label`) instead of hidden globals.
### 3. Concurrency with `TaskGroup`
When multiple RRHs run together, **`async with asyncio.TaskGroup() as tg:`** + **`tg.create_task(...)`** fails fast and bundles errors in an **`ExceptionGroup`**. Catch with **`except* Exception`** at the boundary that owns **`asyncio.run`**, log each sub-exception, and return a process exit code.
### 4. Dry-run first
Always provide a path that **does not touch hardware** so engineers can validate **logging, SSH, and timing** on a laptop. Real power transitions should be clearly gated (extra flag or explicit “I know this is live”).
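One way to gate live actions is an explicit opt-in flag that cannot be combined with **`--dry-run`**. In the sketch below, **`--live`** is a hypothetical flag that the example harness does not define, and **`power_actions_allowed`** is an illustrative helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Dry-run gating sketch")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", action="store_true", help="log only; never touch hardware")
    # Hypothetical opt-in: live power transitions require this flag.
    mode.add_argument("--live", action="store_true", help="I know this is live hardware")
    return parser


def power_actions_allowed(args: argparse.Namespace) -> bool:
    """Safe default: hardware actions are off unless --live is given."""
    return bool(args.live and not args.dry_run)


parser = build_parser()
```

With no flags the script behaves like a dry run, so a forgotten argument can never switch power; passing both flags is rejected by `argparse` before any code runs.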
### 5. Domain types from the library
Attach **`FrontHaul`** to **`RadioHead`** even when fields are **`None`** — it documents **intent** and keeps the harness aligned with production models. Pass a **`Fabric`** into the async campaign so **shared** resources (concentrator SSH, bench **`Power`**, **`asyncio.Lock`**, **`rrh_power_ports`**) have one home. Prefer **`--fabric-json`** (bound once via **`python3 -m fiwicontrol.fabric bind`**) over ad hoc placeholders; reserve **`--paths`** for laptop-only smoke.
### 6. Remote checks via `ssh_node`
Use **`await node.rexec(cmd="...", ...)`** for one-shot remote work. For **periodic** sampling, prefer **`Command`** / **`CommandManager`** from **`fiwicontrol.commands`** (see **`docs/node-control-asyncio-design.md`**).
### 7. Document gaps in the script
A short **DESIGN_GAPS** or **TODO** block at the top of the harness documents how **enumeration**, **telemetry**, or **SPC** relate to this script.
---
## Checklist for a new system script
1. [ ] Lives under **`scripts/system/`** with a **`#!/usr/bin/env python3`** shebang.
2. [ ] **`argparse`** (or equivalent) documents every assumption; **`--help`** is accurate.
3. [ ] **`--dry-run`** (or equivalent) when hardware is involved.
4. [ ] **`logging`** at INFO for operator visibility; avoid **`print`** for control flow.
5. [ ] Async entry is **`async def`** + single **`asyncio.run(...)`** from **`main()`**.
6. [ ] Concurrent work uses **`TaskGroup`** (or **`gather`** with a documented error policy).
7. [ ] Non-zero exit on failure; **`ExceptionGroup`** handled if you use **`TaskGroup`**.
8. [ ] README or this doc updated if you add a **new** category of harness or dependency.
---
## Related docs
- **`docs/pcie-hotswap-setup.md`** — PCIe harness prerequisites and JSON generation.
- **`docs/fabric-builder.md`** — lab INI + **`python3 -m fiwicontrol.fabric build`** / **`bind`**.
- **`docs/install.md`** — workstation and rig setup, **`pip install -e`**.
- **`docs/node-control-asyncio-design.md`** — **`ssh_node`**, **`Command`**, timeouts, running tests.
- **`docs/power-control-and-inventory.md`** — Acroname / Monsoon, INI, **`--verify-inventory`**.
- **`docs/spc.md`** — when campaigns need statistical control charts after KPI extraction.
- **`README.md`** — **`scripts/system/`** vs **`tests/`** overview.