PLAN-v1.6.6 — UX patches + error handling for watch_run
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md — the implementation process
- PLANS.md — plan structure and best practices
Status: Backlog
Goal: Land the UX gaps surfaced by the live v1.6.x smoke runs as a single coherent patch. Four items: command-header line, add-service step counter, watch_run polling-dot cleanup, and watch_run failure handling that fetches the ADO timeline + prints actionable hints instead of just a URL.
Last Updated: 2026-05-29
Driver: Live-smoke session 2026-05-29 (v1.6.0 → 1.6.5). The smoke surfaced 5 real-az-API bugs (shipped as v1.6.1 → 1.6.5) AND multiple UX gaps. The bugs are patched; the UX gaps are this plan.
Branch: standalone — feat/v1.6.6-ux-errors. Single PR, single version bump (1.6.5 → 1.6.6).
Out of scope: PR-merge-failure handling (separate plan: PLAN-v1.6.7), SCRIPT_EXAMPLE_OUTPUT field (separate plan: PLAN-help-example-output), v1-cleanup (separate plan: PLAN-G).
What changes
1. Command header line on every command
Every bin/<cmd>.sh prints, immediately after arg-parsing:
noclickops <cmd> v<version> — <one-line summary of what's about to happen>
Examples:
noclickops info v1.6.6 — frontend (test)
noclickops logs v1.6.6 — frontend (test, --tail 5)
noclickops add-service v1.6.6 — scaffolding 'smk1' (~1-3 min, 4 steps)
noclickops deploy v1.6.6 — 'smk1' → test (FIRST-TIME, 4 pipelines, ~6-10 min)
Implementation: a single helper in lib/metadata.sh:
nco_command_header() {
  local summary="$1"
  printf '\n%s — %s\n\n' "$NCO_HEADER_PREFIX" "$summary"
}
# NCO_HEADER_PREFIX expands to "noclickops <SCRIPT_NAME> v<version>"
# Computed once from SCRIPT_NAME + nco_load_version.
Each bin/<cmd>.sh calls nco_command_header "<summary>" once.
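A minimal sketch of how the prefix could be computed once and reused, assuming each bin/<cmd>.sh sets SCRIPT_NAME before the call. The nco_load_version_stub function and the NCO_VERSION_FILE variable are hypothetical stand-ins introduced only for this example; the real nco_load_version may have a different interface.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for nco_load_version: prints the first line of a
# version file. NCO_VERSION_FILE is an assumed variable, not part of the plan.
nco_load_version_stub() {
  local file="${NCO_VERSION_FILE:-version.txt}"
  if [ -r "$file" ]; then
    head -n 1 "$file" | tr -d '[:space:]'
  else
    printf 'unknown'
  fi
}

nco_command_header() {
  local summary="$1"
  # Compute the prefix on first use; ":=" assigns only when unset or empty,
  # so later calls in the same shell reuse the cached value.
  : "${NCO_HEADER_PREFIX:=noclickops ${SCRIPT_NAME:-unknown} v$(nco_load_version_stub)}"
  printf '\n%s — %s\n\n' "$NCO_HEADER_PREFIX" "$summary"
}
```

With SCRIPT_NAME=info and a version file containing 1.6.6, calling nco_command_header "frontend (test)" would print the first example line shown above, surrounded by blank lines.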
2. Step counter for add-service
Mirror deploy's [1/4] ... [4/4] pattern. Each step prints estimated duration inline based on observed times from the v1.6.x smoke:
[1/4] Trigger add-service pipeline (~30-60s) run 28578 … succeeded (0m 23s)
[2/4] Wait for PR-A in source repo (~10-30s) found #4849 … merged
[3/4] Wait for PR-B in IaC/platform-infra (~10-30s) found #4850 … merged
[4/4] Sync local main (~5s) ok
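One way the per-step output could be produced is a small wrapper that prints the label and estimate on a partial line, runs the step, then appends the outcome and elapsed time. This is only a sketch: run_step and fmt_elapsed are hypothetical names, and the real steps would report richer outcomes (run ids, PR numbers) than the plain ok/FAILED used here.

```shell
#!/usr/bin/env bash
# fmt_elapsed <seconds>: render a duration as "Xm Ys".
fmt_elapsed() {
  printf '%dm %ds' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# run_step <n> <total> <label> <estimate> <command...>
# Hypothetical helper: prints "[n/total] label (~estimate) " without a newline,
# runs the command, then finishes the line with the outcome and elapsed time.
run_step() {
  local n="$1" total="$2" label="$3" estimate="$4"
  shift 4
  printf '[%d/%d] %s (~%s) ' "$n" "$total" "$label" "$estimate"
  local start end status
  start="$(date +%s)"
  if "$@" >/dev/null 2>&1; then status="ok"; else status="FAILED"; fi
  end="$(date +%s)"
  printf '%s (%s)\n' "$status" "$(fmt_elapsed "$(( end - start ))")"
  [ "$status" = "ok" ]
}

# Example: the last step, with "true" standing in for the real sync command.
run_step 4 4 'Sync local main' '5s' true
```

Returning the step's success as the function's exit status lets the caller stop at the first failed step.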
3. watch_run polling-dot cleanup
Currently dots get printed inline on the same line as the step label, then the succeeded (Xm Ys) summary appears on a new line. Looks like:
[1/4] FrontendPlatform/<repo>-<svc>-build run 28580 … ...
succeeded (1m 3s)
Fix: dots go on their own line after the label, summary replaces the dots when done. Target:
[1/4] FrontendPlatform/<repo>-<svc>-build run 28580 ▶ in-progress (45s)…
[1/4] FrontendPlatform/<repo>-<svc>-build run 28580 ✓ succeeded (1m 3s)
(Or simpler: rewrite the same line with \r and overwrite on each poll.)
4. watch_run failure handling — fetch timeline + actionable hint
When a pipeline reports failed, watch_run currently prints the elapsed time + a URL. Replace with the full block:
✗ FAILED: TCH900001-mrkmedlem-smkpub-deploy-test (run 28585, 2m 25s)
Step: AzureResourceManagerTemplateDeployment
Error: Template validation failed.
The deployment 'acaService' in rg-test-nrx-tch900001 cannot be
saved because an existing deployment is still active
(started 21:09:40 UTC).
Action: → Wait ~3 min for the prior deploy to complete, then retry:
noclickops deploy smkpub test
Full log: https://dev.azure.com/RedCrossNorway/IaC/_build/results?buildId=28585
Implementation: new helper in lib/service-v2.sh:
report_pipeline_failure <project> <run-id>
# 1. GET /_apis/build/builds/<run-id>/timeline?api-version=7.0 via _nco_ado_rest_get
# 2. Extract records with result=failed && issues != null
# 3. From each record's issues[], pick the last non-empty message (most specific)
# 4. Match the message against the action-pattern table; emit Action line
# 5. Print the formatted block (header / step / error / action / URL)
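Steps 2 and 3 could be sketched with python3 as follows. This assumes the timeline response has the shape records[].result and records[].issues[].message; the function name extract_failed_issues and the tab-separated output format are hypothetical choices for illustration, not part of the plan.

```shell
#!/usr/bin/env bash
# extract_failed_issues: reads timeline JSON on stdin and prints one
# "<record name><TAB><last non-empty issue message>" line per failed record.
# Records without any non-empty issue message are skipped.
extract_failed_issues() {
  python3 -c '
import json, sys

timeline = json.load(sys.stdin)
for record in timeline.get("records", []):
    if record.get("result") != "failed":
        continue
    messages = [
        (issue.get("message") or "").strip()
        for issue in (record.get("issues") or [])
    ]
    messages = [m for m in messages if m]
    if messages:
        # Collapse internal whitespace so each record stays on one line.
        print(record.get("name", "?") + "\t" + " ".join(messages[-1].split()))
'
}
```

If the function prints nothing, the caller can treat that as the "no failed-with-issues records" case and fall back to the plain failed + URL message.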
Action-pattern table (initial set — extensible):
| Signature in error message | Action |
|---|---|
| cannot be saved, because this would overwrite an existing deployment | Wait ~3 min and retry: noclickops deploy <svc> <env> (where <svc> and <env> are placeholders for the actual service + env) |
| ContainerAppInvalidName | Service name too long. Full container app name ca-env-TENANT-svc must fit 32 chars; rename service to ≤ 20 chars. |
| AuthorizationFailed | You don't have the required role on the subscription. Ask the admin or check PIM eligibility for the subscription named in the error. |
| ResourceGroupNotFound | The IaC PR-B may not be merged yet. Check IaC/platform-infrastructure for an open PR titled "Add service <svc>". |
| Trivy ... CRITICAL (image scan) | Image has critical CVEs. Bump the base image in services/<svc>/Dockerfile and re-deploy. |
| (no match) | See full log for details. If this happens repeatedly, file a finding. |
watch_run calls report_pipeline_failure instead of just printf 'failed (...)'.
Phase 1: Header + version variable — DONE
Tasks
- 1.1 Add nco_command_header <summary> helper to lib/metadata.sh (or a new lib/header.sh if metadata.sh gets crowded). Uses nco_load_version for the version.
- 1.2 Add a nco_command_header call to each bin/<cmd>.sh (11 commands; noclickops-lister doesn't need it). Pick a meaningful summary per command:
  - info: <svc> (<env>)
  - logs: <svc> (<env>, --tail N[, --follow][, --system])
  - shell: <svc> (<env>, cmd: <cmd>)
  - status: recent runs in <repo> (or run <id> when an id is given)
  - deploy: <svc> → <env> (FIRST-TIME, 4 pipelines, ~6-10 min) OR <svc> → <env> (subsequent, ~30-45s)
  - add-service: scaffolding '<svc>' (~1-3 min, 4 steps)
  - clean-sample: strip Express+OIDC sample from services/<svc>/app/
  - create-pr: '<title>' (<branch> → main) in <repo>
  - merge-pr: PR #<id> in <repo>
  - update: pull latest noclickops
  - sync-lovable: sync <source> → <svc>
Validation
for cmd in info logs shell status deploy add-service clean-sample create-pr merge-pr update sync-lovable; do
  noclickops "$cmd" --help | head -2
done
Each must show the header line right after the lister banner (or instead of, when --help doesn't dispatch to the function).
Wait — --help exits early. Decide: should the header print on --help too, or only when the command is actually doing work? Reasonable answer: only when doing work (header is "what's about to happen", and --help isn't doing it). Detailed in 1.3.
- 1.3 Decision: header prints AFTER the case ... in -h|--help) show_help early-exit. Document.
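The ordering decided in 1.3 could look like the sketch below. The show_help and nco_command_header bodies are stand-ins so the example runs on its own, and the usage text, the v0.0.0 version, and the main function name are all hypothetical.

```shell
#!/usr/bin/env bash
# Stand-ins so the sketch is self-contained; real scripts would source lib/.
show_help() { printf 'usage: noclickops info <svc> <env>\n'; }
nco_command_header() { printf '\nnoclickops info v0.0.0 — %s\n\n' "$1"; }

main() {
  # 1. Early exits first: --help returns before any header is printed.
  case "${1:-}" in
    -h|--help) show_help; return 0 ;;
  esac

  # 2. Remaining arg parsing.
  local svc="$1" env="$2"

  # 3. Header, now that the command is actually going to do work.
  nco_command_header "$svc ($env)"

  # ... the command's real work would follow here ...
}
```

Calling main --help prints only the usage text, while main frontend test prints the header first.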
User confirms phase is complete.
Phase 2: add-service step counter — DONE
Tasks
- 2.1 Rewrite the step output in bin/add-service.sh to use the [N/4] format with inline estimated durations matching deploy. Each step:
  - Print step label + estimate on a partial line: [1/4] Trigger add-service pipeline (~30-60s)
  - Do the work
  - Replace the (~estimate) portion with the actual outcome: run 28578 … succeeded (0m 23s)
  - Newline.
- 2.2 The four step labels:
  - [1/4] Trigger add-service pipeline (~30-60s)
  - [2/4] Wait for PR-A in source repo (~10-30s)
  - [3/4] Wait for PR-B in IaC/platform-infra (~10-30s)
  - [4/4] Sync local main (~5s)
- 2.3 Update test fixtures + assertions in tests/test-PLAN-F-add-service.sh to expect the new format.
Validation
bash tests/run-all.sh green. Manual run shows the new step format.
User confirms phase is complete.
Phase 3: watch_run polling-dot cleanup — DONE
Tasks
- 3.1 In lib/service-v2.sh's watch_run, replace the inline dot pattern with line-rewrite via \r + printf (when stdout is a tty). Format:
  - In-progress poll: [N/4] <label> run <id> ▶ in-progress (<elapsed-s>s)…\r
  - Terminal: [N/4] <label> run <id> ✓ succeeded (Xm Ys)\n (overwrites the in-progress line; the \n commits it)
- 3.2 When stdout is NOT a tty (CI / piped output), fall back to the current dot pattern or print each poll on its own line; the line-rewrite trick doesn't work without a tty.
- 3.3 Update tests that grep for the old dot output. Most tests use NCO_WATCH_INTERVAL=0 so the polling loop runs once and the format change is contained.
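A possible shape for the tty / non-tty split in 3.1 and 3.2, shown as a sketch. The function names and the NCO_FORCE_TTY override are hypothetical (the override exists only so tests can exercise the tty branch). The ESC[K sequence assumes an ANSI-capable terminal; it erases leftovers when a shorter line overwrites a longer one.

```shell
#!/usr/bin/env bash
# nco_is_tty: true when stdout is a terminal, or when the hypothetical
# NCO_FORCE_TTY=1 override is set.
nco_is_tty() {
  [ "${NCO_FORCE_TTY:-}" = "1" ] || [ -t 1 ]
}

# watch_progress <text>: emit one in-progress update.
watch_progress() {
  if nco_is_tty; then
    printf '\r%s\033[K' "$1"    # redraw in place, erase to end of line
  else
    printf '%s\n' "$1"          # piped / CI: one line per poll
  fi
}

# watch_final <text>: emit the terminal state; always ends with a newline.
watch_final() {
  if nco_is_tty; then
    printf '\r%s\033[K\n' "$1"  # overwrite the in-progress line and commit it
  else
    printf '%s\n' "$1"
  fi
}
```

Printing each poll on its own line in the non-tty branch keeps piped output greppable, which is one of the two fallbacks named in 3.2.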
Validation
bash tests/run-all.sh green. Manual deploy run shows live progress overwriting the same line instead of accumulating dots.
User confirms phase is complete.
Phase 4: report_pipeline_failure helper + wire into watch_run — DONE
Tasks
- 4.1 Add report_pipeline_failure <project> <run-id> to lib/service-v2.sh:
  - Calls _nco_ado_rest_get on /_apis/build/builds/<id>/timeline?api-version=7.0.
  - Parses the JSON (uses python3, which is already a documented dependency for scripts/generate-docs.sh; alternative: use jq if available, or a careful awk/sed pipeline).
  - For each failed record with issues:
    - Pick the last non-empty issues[].message (most specific).
    - Pattern-match against the action table (Phase 4.2 below).
    - Format the block.
  - If no failed-with-issues records: fall back to printing the original "failed + URL" message (so we never make things worse).
- 4.2 Action-pattern table: initially defined inline in the helper. Each entry: a regex/glob and a hint template (templates can include <svc>, <env>, <sub>, passed in or pulled from env). Five initial patterns per the table above.
- 4.3 Update watch_run to call report_pipeline_failure on terminal-failed instead of printf 'failed (...)'. Keep the elapsed time in the summary so we still see "how long until it failed".
- 4.4 Tests:
  - Unit-test report_pipeline_failure against canned timeline JSON (fixture file). Cover: each known pattern matches, unknown pattern falls through, empty issues array falls back.
  - Update watch_run failure tests to expect the new output (or wrap the watch_run fail-path test to assert on the formatted block).
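The hint templates in 4.2 carry placeholders such as <svc> and <env>. Filling them could be as small as the sketch below, which uses bash pattern substitution. The render_hint name is hypothetical, and the sketch assumes the values contain no & character (bash 5.2 and later can expand an unquoted & in the replacement to the matched text).

```shell
#!/usr/bin/env bash
# render_hint <template> <svc> <env>: fill the <svc> and <env> placeholders.
# Hypothetical helper; a <sub> placeholder would be handled the same way.
render_hint() {
  local hint="$1" svc="$2" env="$3"
  hint="${hint//<svc>/$svc}"   # replace every literal "<svc>"
  hint="${hint//<env>/$env}"   # replace every literal "<env>"
  printf '%s\n' "$hint"
}

# Example with the first hint from the action-pattern table.
render_hint 'Wait ~3 min and retry: noclickops deploy <svc> <env>' smkpub test
```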
Validation
bash tests/run-all.sh green. Manual smoke: trigger a known-failing pipeline (e.g. re-deploy smkpub before the "active deployment" lock clears) — expect the formatted block + "wait + retry" hint.
User confirms phase is complete.
Phase 5: Docs + version bump — DONE
Tasks
- 5.1 Update website/docs/contributors/lib-service-v2.md: add report_pipeline_failure to the Public API table.
- 5.2 Update website/docs/contributors/v2-smoke-test.md: each pipeline-failure case in the smoke now expects the formatted-failure block; update assertions if any.
- 5.3 Bump version.txt → 1.6.6.
Validation
cd website && npm run build clean.
User confirms phase is complete.
Acceptance Criteria
- Every bin/<cmd>.sh (except the lister) prints noclickops <cmd> v<version> — <summary> when doing work (where each placeholder in angle brackets is filled at runtime)
- add-service shows [1/4]…[4/4] with inline (~estimate) and final actual time
- watch_run polling progress overwrites the same line (tty mode) instead of accumulating dots
- watch_run on terminal-failed prints the formatted block (Step / Error / Action / Full log), at minimum for the 5 known patterns
- tests/run-all.sh green
- version.txt shows 1.6.6
Files to Modify
- lib/metadata.sh (add nco_command_header)
- lib/service-v2.sh (add report_pipeline_failure; update watch_run)
- bin/info.sh, logs.sh, shell.sh, status.sh, deploy.sh, add-service.sh, clean-sample.sh, create-pr.sh, merge-pr.sh, update.sh, sync-lovable.sh (header calls)
- bin/add-service.sh (step counter rewrite)
- tests/test-PLAN-A-service-discovery.sh (lib-level tests for report_pipeline_failure)
- tests/test-PLAN-F-add-service.sh (step-format assertions)
- tests/test-PLAN-C-deploy.sh (watch-format assertions)
- website/docs/contributors/lib-service-v2.md
- version.txt
Implementation Notes
Why one PR for all four items
Each item alone is too small to justify its own PR (sub-100-line patches). Bundling them into v1.6.6 keeps the version-bump rhythm matching the patch density. If any item turns out larger than expected during implementation, split at that point.
python3 for JSON parsing
scripts/generate-docs.sh already uses Python for JSON parsing. report_pipeline_failure can do the same. Alternative: require jq (already de-facto on every dev machine but not currently a noclickops dep). Decide during implementation; lean toward python3 to avoid adding a hard dep.
Action-pattern table extensibility
Initial 5 patterns cover what we've seen in the live smoke. Real-world use will surface more. Format the patterns + actions as a simple data structure (e.g. a bash assoc array or a heredoc parsed at runtime) so adding a 6th pattern is one new entry, not a code change.
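The heredoc option might look like the sketch below, where adding a pattern means adding one line to the table. The function names are hypothetical, only two of the five entries are shown, and matching here is plain substring matching rather than the regex/glob the plan mentions. The | separator means neither field may contain that character.

```shell
#!/usr/bin/env bash
# nco_action_table: one "signature|action" entry per line.
nco_action_table() {
  cat <<'EOF'
AuthorizationFailed|You don't have the required role on the subscription. Ask the admin or check PIM eligibility for the subscription named in the error.
ResourceGroupNotFound|The IaC PR-B may not be merged yet. Check IaC/platform-infrastructure for an open PR titled "Add service <svc>".
EOF
}

# match_action <error message>: print the action of the first entry whose
# signature occurs in the message, or the no-match fallback.
match_action() {
  local message="$1" signature action
  while IFS='|' read -r signature action; do
    [ -n "$signature" ] || continue
    case "$message" in
      *"$signature"*) printf '%s\n' "$action"; return 0 ;;
    esac
  done <<EOF
$(nco_action_table)
EOF
  printf '%s\n' 'See full log for details. If this happens repeatedly, file a finding.'
}
```

Because entries are checked in order, more specific signatures should sit above more general ones.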
Out of scope for v1.6.6
- report_pr_merge_failure + report_rest_failure (PLAN-v1.6.7)
- SCRIPT_EXAMPLE_OUTPUT field (separate plan)
- The smoke-test doc's Run history block for the v1.6.x session; that ships whenever the user signs off on the captured results.