PLAN-v1.6.6 — UX patches + error handling for `watch_run`

IMPLEMENTATION RULES: Before implementing this plan, read and follow:

WORKFLOW.md — the implementation process

PLANS.md — plan structure and best practices

Status: Backlog

Goal: Land the UX gaps surfaced by the live v1.6.x smoke runs as a single coherent patch. Four items: command-header line, add-service step counter, watch_run polling-dot cleanup, and watch_run failure handling that fetches the ADO timeline + prints actionable hints instead of just an URL.

Last Updated: 2026-05-29

Driver: Live-smoke session 2026-05-29 (v1.6.0 → 1.6.5). The smoke surfaced 5 real-az-API bugs (shipped as v1.6.1 → 1.6.5) AND multiple UX gaps. The bugs are patched; the UX gaps are this plan.

Branch: standalone — feat/v1.6.6-ux-errors. Single PR, single version bump (1.6.5 → 1.6.6).

Out of scope: PR-merge-failure handling (separate plan: PLAN-v1.6.7), SCRIPT_EXAMPLE_OUTPUT field (separate plan: PLAN-help-example-output), v1-cleanup (separate plan: PLAN-G).

What changes

1. Command header line on every command

Every bin/<cmd>.sh prints, immediately after arg-parsing:

noclickops <cmd> v<version> — <one-line summary of what's about to happen>

Examples:

noclickops info v1.6.6 — frontend (test)
noclickops logs v1.6.6 — frontend (test, --tail 5)
noclickops add-service v1.6.6 — scaffolding 'smk1' (~1-3 min, 4 steps)
noclickops deploy v1.6.6 — 'smk1' → test (FIRST-TIME, 4 pipelines, ~6-10 min)

Implementation: a single helper in lib/metadata.sh:

nco_command_header() {
  local summary="$1"
  printf '\n%s — %s\n\n' "$NCO_HEADER_PREFIX" "$summary"
}
# NCO_HEADER_PREFIX expands to "noclickops <SCRIPT_NAME> v<version>"
# Computed once from SCRIPT_NAME + nco_load_version.

Each bin/<cmd>.sh calls nco_command_header "<summary>" once.

2. Step counter for `add-service`

Mirror deploy's [1/4] ... [4/4] pattern. Each step prints estimated duration inline based on observed times from the v1.6.x smoke:

  [1/4] Trigger add-service pipeline             (~30-60s)   run 28578 … succeeded (0m 23s)
  [2/4] Wait for PR-A in source repo             (~10-30s)   found #4849 … merged
  [3/4] Wait for PR-B in IaC/platform-infra      (~10-30s)   found #4850 … merged
  [4/4] Sync local main                          (~5s)       ok

3. `watch_run` polling-dot cleanup

Currently dots get printed inline on the same line as the step label, then the succeeded (Xm Ys) summary appears on a new line. Looks like:

  [1/4] FrontendPlatform/<repo>-<svc>-build   run 28580 … ...
succeeded (1m 3s)

Fix: dots go on their own line after the label, summary replaces the dots when done. Target:

  [1/4] FrontendPlatform/<repo>-<svc>-build   run 28580 ▶ in-progress (45s)…
  [1/4] FrontendPlatform/<repo>-<svc>-build   run 28580 ✓ succeeded (1m 3s)

(Or simpler: rewrite the same line with \r and overwrite on each poll.)

4. `watch_run` failure handling — fetch timeline + actionable hint

When a pipeline reports failed, watch_run currently prints the elapsed time + a URL. Replace with the full block:

✗ FAILED: TCH900001-mrkmedlem-smkpub-deploy-test (run 28585, 2m 25s)

  Step:    AzureResourceManagerTemplateDeployment

  Error:   Template validation failed.
           The deployment 'acaService' in rg-test-nrx-tch900001 cannot be
           saved because an existing deployment is still active
           (started 21:09:40 UTC).

  Action:  → Wait ~3 min for the prior deploy to complete, then retry:
               noclickops deploy smkpub test

  Full log: https://dev.azure.com/RedCrossNorway/IaC/_build/results?buildId=28585

Implementation: new helper in lib/service-v2.sh:

report_pipeline_failure <project> <run-id>
  # 1. GET /_apis/build/builds/<run-id>/timeline?api-version=7.0 via _nco_ado_rest_get
  # 2. Extract records with result=failed && issues != null
  # 3. From each record's issues[], pick the last non-empty message (most specific)
  # 4. Match the message against the action-pattern table; emit Action line
  # 5. Print the formatted block (header / step / error / action / URL)

Action-pattern table (initial set — extensible):

Signature in error message	Action
`cannot be saved, because this would overwrite an existing deployment`	Wait ~3 min and retry: `noclickops deploy svc env` (where `<svc>` and `<env>` are placeholders for the actual service + env)
`ContainerAppInvalidName`	Service name too long. Full container app name `ca-env-TENANT-svc` must fit 32 chars; rename service to ≤ 20 chars.
`AuthorizationFailed`	You don't have the required role on the subscription. Ask the admin or check PIM eligibility for the subscription named in the error.
`ResourceGroupNotFound`	The IaC PR-B may not be merged yet. Check `IaC/platform-infrastructure` for an open PR titled "Add service <svc>".
`Trivy ... CRITICAL` (image scan)	Image has critical CVEs. Bump the base image in `services/<svc>/Dockerfile` and re-deploy.
(no match)	See full log for details. If this happens repeatedly, file a finding.

watch_run calls report_pipeline_failure instead of just printf 'failed (...)'.

Phase 1: Header + version variable — DONE

Tasks

1.1 Add nco_command_header <summary> helper to lib/metadata.sh (or a new lib/header.sh if metadata.sh gets crowded). Uses nco_load_version for the version.
1.2 Add a nco_command_header call to each bin/<cmd>.sh (11 commands; noclickops-lister doesn't need it). Pick a meaningful summary per command:
- info: <svc> (<env>)
- logs: <svc> (<env>, --tail N[, --follow][, --system])
- shell: <svc> (<env>, cmd: <cmd>)
- status: recent runs in <repo> (or run <id> when an id is given)
- deploy: <svc> → <env> (FIRST-TIME, 4 pipelines, ~6-10 min) OR <svc> → <env> (subsequent, ~30-45s)
- add-service: scaffolding '<svc>' (~1-3 min, 4 steps)
- clean-sample: strip Express+OIDC sample from services/<svc>/app/
- create-pr: '<title>' (<branch> → main) in <repo>
- merge-pr: PR #<id> in <repo>
- update: pull latest noclickops
- sync-lovable: sync <source> → <svc>

Validation

for cmd in info logs shell status deploy add-service clean-sample create-pr merge-pr update sync-lovable; do
  noclickops $cmd --help | head -2
done

Each must show the header line right after the lister banner (or instead of, when --help doesn't dispatch to the function).

Wait — --help exits early. Decide: should the header print on --help too, or only when the command is actually doing work? Reasonable answer: only when doing work (header is "what's about to happen", and --help isn't doing it). Detailed in 1.3.

1.3 Decision: header prints AFTER case ... in -h|--help) show_help early-exit. Document.