Security & Secrets Handling
This page documents how opcgw expects operators to inject credentials and how the gateway protects them at runtime. It is the single source of truth for the secret-handling contract introduced in Story 7-1 (Epic 7 — Security Hardening).
If you are setting up a fresh deployment, jump to Quick start. For a deeper dive on the contract, read the rest of the page in order.
The env-var convention
opcgw loads its configuration from config/config.toml and merges
environment variables on top, so any field can be overridden at startup.
The canonical name for an env var is
OPCGW_<SECTION>__<FIELD_UPPERCASE>
(double-underscore between section and field — figment splits on __ to
walk into nested TOML keys).
| Field | Env var | Required for new deployments? |
|---|---|---|
| chirpstack.api_token | OPCGW_CHIRPSTACK__API_TOKEN | yes (placeholder rejected at startup) |
| opcua.user_password | OPCGW_OPCUA__USER_PASSWORD | yes (placeholder rejected at startup) |
| chirpstack.tenant_id | OPCGW_CHIRPSTACK__TENANT_ID | optional (the placeholder UUID is a valid format; ChirpStack will reject calls until set) |
| chirpstack.server_address | OPCGW_CHIRPSTACK__SERVER_ADDRESS | optional |
| opcua.host_port | OPCGW_OPCUA__HOST_PORT | optional |
| [logging].dir | OPCGW_LOGGING__DIR or OPCGW_LOG_DIR (bootstrap short form) | optional |
| [logging].level | OPCGW_LOGGING__LEVEL or OPCGW_LOG_LEVEL (bootstrap short form) | optional |
The bootstrap short forms (OPCGW_LOG_DIR, OPCGW_LOG_LEVEL) exist only
because the logging subsystem starts before figment runs (Story 6-1/6-2).
Do not introduce a third short form for any other field unless it has the
same bootstrap-phase requirement.
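The naming rule can be sketched in a few lines of Rust. This is only a model of the convention described above, not figment's implementation; the function names `canonical_env_var` and `key_path` are hypothetical.

```rust
/// Builds the canonical env-var name for a `[section].field` TOML key,
/// following the OPCGW_<SECTION>__<FIELD_UPPERCASE> convention.
fn canonical_env_var(section: &str, field: &str) -> String {
    format!("OPCGW_{}__{}", section.to_uppercase(), field.to_uppercase())
}

/// Reverses the mapping: strips the OPCGW_ prefix and splits on `__` to
/// recover the nested key path. Returns None if the prefix is missing.
fn key_path(env_var: &str) -> Option<Vec<String>> {
    let rest = env_var.strip_prefix("OPCGW_")?;
    Some(rest.split("__").map(|part| part.to_lowercase()).collect())
}
```

Under this model, `canonical_env_var("chirpstack", "api_token")` yields `OPCGW_CHIRPSTACK__API_TOKEN`, while a short form such as `OPCGW_LOG_DIR` contains no `__` and so does not resolve to a nested `[logging]` key, which is why the short forms need their own bootstrap handling.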
Precedence rules
Configuration values are resolved in this order (highest priority last):
1. Defaults: hard-coded in src/config.rs.
2. config/config.toml: values from the TOML file.
3. Environment variables: figment merges env on top of TOML, so an env var with the canonical name above always wins.
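The layering can be pictured as a last-writer-wins merge. The sketch below assumes flat string keys and a hypothetical `resolve` helper; the real loader works on nested typed values through figment.

```rust
use std::collections::HashMap;

/// Merges configuration layers in priority order: later layers override
/// earlier ones (defaults first, then TOML, then environment variables).
fn resolve(layers: &[HashMap<&str, &str>]) -> HashMap<String, String> {
    let mut resolved = HashMap::new();
    for layer in layers {
        for (key, value) in layer {
            // A later layer silently replaces any earlier value for the key.
            resolved.insert(key.to_string(), value.to_string());
        }
    }
    resolved
}
```

Passing the layers as `[defaults, toml, env]` reproduces the order above: a key present only in the defaults survives, and a key set in all three ends up with the env value.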
Placeholder detection
The shipped config/config.toml contains placeholder values for
api_token and user_password:
api_token = "REPLACE_ME_WITH_OPCGW_CHIRPSTACK__API_TOKEN_ENV_VAR"
user_password = "REPLACE_ME_WITH_OPCGW_OPCUA__USER_PASSWORD_ENV_VAR"
AppConfig::validate runs after the env-merge step, so:
- If the TOML still has a REPLACE_ME_WITH_* value and no env var override is supplied, the gateway exits with an actionable error like:
  ```
  Configuration validation failed:
    - chirpstack.api_token: placeholder value detected (starts with “REPLACE_ME_WITH_”). Set OPCGW_CHIRPSTACK__API_TOKEN to inject the real secret. See docs/security.md.
  ```
  The error names the field, the env var to set, and points back here. The operator’s literal value is never echoed back into the error message (this avoids a log-injection-style risk if a near-miss real secret is pasted in).
- If the TOML has a REPLACE_ME_WITH_* value and the env var is set to a real secret, validation passes: env precedence beats the placeholder check.
This means the placeholder is a red flag for “operator forgot to set the env var”, not a blanket ban on the literal string ever appearing.
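A minimal sketch of such a check, assuming it runs on the already-merged value. The helper name `check_placeholder` and its signature are illustrative, not the actual AppConfig::validate code.

```rust
const PLACEHOLDER_PREFIX: &str = "REPLACE_ME_WITH_";

/// Returns an actionable error if `value` still carries the template
/// placeholder. The offending value itself is deliberately left out of
/// the message; only the field name and env-var name are reported.
fn check_placeholder(field: &str, env_var: &str, value: &str) -> Result<(), String> {
    if value.starts_with(PLACEHOLDER_PREFIX) {
        return Err(format!(
            "{}: placeholder value detected (starts with \"{}\"). \
             Set {} to inject the real secret. See docs/security.md.",
            field, PLACEHOLDER_PREFIX, env_var
        ));
    }
    Ok(())
}
```

Because the check sees only the post-merge value, a placeholder left in the TOML is harmless once the env var supplies a real secret.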
Quick start
1. Local / cargo run
export OPCGW_CHIRPSTACK__API_TOKEN='paste-your-token-here'
export OPCGW_OPCUA__USER_PASSWORD='paste-your-password-here'
cargo run
2. Docker / Compose recipe
The shipped docker-compose.yml references .env so secrets stay outside
the image. Workflow:
cp .env.example .env # creates a placeholder-only .env
chmod 600 .env # tighten file permissions
$EDITOR .env # replace each REPLACE_ME_WITH_* with the real secret
docker compose up
The Compose service block:
environment:
  - OPCGW_CHIRPSTACK__API_TOKEN=${OPCGW_CHIRPSTACK__API_TOKEN}
  - OPCGW_OPCUA__USER_PASSWORD=${OPCGW_OPCUA__USER_PASSWORD}
Compose reads .env from the project directory and substitutes the
host-side value into the container’s environment. .env itself is
ignored by git (.gitignore “# Config & Secrets” block); the committed
.env.example file ships placeholders only.
3. Kubernetes recipe
Mount each secret as an env var via valueFrom.secretKeyRef. Same env-var
names work:
env:
  - name: OPCGW_CHIRPSTACK__API_TOKEN
    valueFrom:
      secretKeyRef:
        name: opcgw-secrets
        key: chirpstack-api-token
  - name: OPCGW_OPCUA__USER_PASSWORD
    valueFrom:
      secretKeyRef:
        name: opcgw-secrets
        key: opcua-user-password
Migration path (existing deployments)
The committed config/config.toml shipped with previous opcgw releases
contained real ChirpStack JWTs, real tenant UUIDs, real device EUIs, and a
literal user_password = "user1". After Story 7-1 lands, operators who
git pull will get a conflict on config/config.toml if they have local
edits. The recipe:
⚠️ Step 3 below is destructive: it overwrites your local config/config.toml with the new template. Do not skip step 1’s backup. If you’d rather keep the merge reversible, use the git stash alternative shown after step 6.
1. Before pulling: back up your local copy. Verify the backup file exists before continuing.
    cp config/config.toml ~/opcgw-config-backup.toml
    ls -l ~/opcgw-config-backup.toml # confirm the backup is on disk
2. Pull the change. A conflict on config/config.toml is expected.
    git pull
3. Resolve by keeping the new template. This discards your local config/config.toml; your backup from step 1 is the only copy.
    git checkout --theirs config/config.toml
4. Restore your application list. Copy your [[application]] blocks from the backup into the new config/config.toml. Leave the api_token / user_password fields with their REPLACE_ME_WITH_* placeholders.
5. Move secrets to env vars. Create .env from .env.example, fill in the real values from your backup, then tighten permissions.
    cp .env.example .env
    chmod 600 .env
    $EDITOR .env
6. Verify. cargo run (or docker compose up) should start cleanly. If it exits with a placeholder error, you missed step 5.
Reversible alternative — git stash workflow
If you’d prefer to keep the original config/config.toml in your working
tree until you’ve manually merged the changes, use git stash instead of
checkout --theirs:
# Save your local config (the pathspec limits the stash to config/config.toml; other uncommitted edits stay in the working tree).
git stash push -m "pre-7-1 config" config/config.toml
# Pull the new template cleanly — no conflict because the file is stashed.
git pull
# Diff the stashed version against the new template to plan the merge.
git stash show -p --name-only stash@{0}
git diff stash@{0} -- config/config.toml
# Manually merge your `[[application]]` blocks into the new template,
# then drop the stash when you're done.
$EDITOR config/config.toml
git stash drop stash@{0}
This path leaves both versions recoverable until you explicitly drop the stash. It costs one extra command vs. step 3 above and is the safer default if you’re not sure about the merge.
A one-shot helper (scripts/migrate-config-7-1.sh) was considered and deferred. The manual steps above are short and one-time per operator.
What the gateway will / won’t redact
The hand-written Debug impls on ChirpstackPollerConfig and
OpcUaConfig (Story 7-1, AC#3) emit ***REDACTED*** for the two fields
classified as secrets by the epic spec. Everything else uses the default
Debug formatting so existing log lines are unchanged.
| Struct | Field | Redacted in Debug? | Why |
|---|---|---|---|
| ChirpstackPollerConfig | api_token | yes | NFR7 secret |
| ChirpstackPollerConfig | tenant_id | no | Not classified as a secret by the epic spec. Substituted with the all-zeros placeholder UUID in the shipped template (so the operator’s tenant identity isn’t published) but not redacted in logs. Tracked as a follow-up enhancement (see _bmad-output/implementation-artifacts/deferred-work.md). |
| ChirpstackPollerConfig | server_address | no | Already in startup info! line; well-established as non-secret |
| OpcUaConfig | user_password | yes | NFR7 secret |
| OpcUaConfig | user_name | no | Not a secret in the OPC UA model |
| OpcUaConfig | certificate_path, private_key_path | no | Paths, not key material. The content of private_key_path is sensitive; file-permission enforcement is Story 7-2 (NFR9) |
Anything not in this table is not secret-protected. If you add a new
sensitive field, extend the table here and the Debug impl in
src/config.rs together.
The redaction protects against format!("{:?}", config) and
tracing::trace!(?config, ...) reaching any appender. It does not by
itself protect against a future contributor wiring a tower-http /
tonic middleware that logs gRPC request metadata at trace level — see
the next section.
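For readers unfamiliar with the pattern, a hand-written Debug impl that redacts one field can look like the sketch below. `PollerConfig` is an illustrative stand-in with two fields, not the real ChirpstackPollerConfig, and the exact output shape of the real impl may differ.

```rust
use std::fmt;

/// Illustrative stand-in for a config struct that holds one secret.
struct PollerConfig {
    server_address: String,
    api_token: String,
}

impl fmt::Debug for PollerConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("PollerConfig")
            // Non-secret fields keep the default Debug formatting.
            .field("server_address", &self.server_address)
            // The real token is never handed to the formatter; a fixed
            // marker is printed in its place.
            .field("api_token", &format_args!("***REDACTED***"))
            .finish()
    }
}
```

With this impl, both `format!("{:?}", cfg)` and any tracing field recorded via `?cfg` go through the same `fmt` method, so the token cannot appear in either.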
Anti-patterns
- Do not bake secrets into Docker images. Build the image once, inject secrets at runtime via env vars.
- Do not commit .env to git. The shipped .gitignore excludes it in the “# Config & Secrets” block; do not add overrides.
- Do not paste tokens into bug reports, Slack threads, or screenshots. If a token leaks, rotate it on the ChirpStack side first, then update the env var.
- Do not introduce a parallel short-form env var for api_token / user_password (e.g. OPCGW_API_TOKEN). The figment nested form (OPCGW_CHIRPSTACK__API_TOKEN) is the canonical name and is pinned by regression tests in src/config.rs.
- Do not wire tower_http::trace::TraceLayer or any tonic interceptor that logs request metadata. The Debug redaction above only protects ChirpstackPollerConfig; the api_token is also copied into AuthInterceptor.api_token (src/chirpstack.rs) and inserted as Bearer {token} into the gRPC authorization metadata header on every outbound call. Wiring a TraceLayer re-opens the bearer-token leak vector that Story 7-1 audits and avoids. Tracked as a follow-up GitHub issue (see _bmad-output/implementation-artifacts/deferred-work.md).
- Do not rewrite the figment loader. The two-phase bootstrap in src/main.rs is correct and pinned by tests.
Audit findings: tonic 0.14.5 metadata logging (Story 7-1, AC#5)
opcgw uses tonic 0.14.5 for the ChirpStack gRPC client. Audit results at
the time of Story 7-1 implementation:
- tonic 0.14.5 has eight tracing::*! sites, all on error conditions (connection errors, accept-loop errors, TLS errors, grpc-timeout parse errors, reconnect errors). None of them include request headers or metadata in the event fields.
- No #[instrument] attributes capture request fields.
- grep -rnE 'TraceLayer|trace_layer|tower_http' src/ Cargo.toml returned nothing: opcgw does not depend on tower-http and does not wire any TraceLayer.
Conclusion: at the time of writing, no EnvFilter mitigation is
needed. If a future opcgw change adds tower-http TraceLayer wiring or
upgrades to a tonic version that logs request metadata, add an
EnvFilter directive in src/main.rs clamping tonic and
tonic::transport targets to info level so trace-level header dumps
are filtered before reaching any appender, and update this section.
A proactive mitigation (a tower::Layer that strips the authorization
header before logging) is tracked as a follow-up GitHub issue.
OPC UA security endpoints and authentication
Story 7-2 hardens the OPC UA server’s exposure surface so a default
deployment is safe to expose on a LAN. The endpoint plumbing was already
in place from earlier epics; Story 7-2 pins the contract by tests, adds
a custom audit-trail authenticator, enforces filesystem permissions on
the private key, and ships a sane create_sample_keypair default.
Endpoint matrix
The gateway advertises three endpoints on the same path (/) and
the same TCP port (4840 by default):
| Endpoint id | Security policy | Security mode | Security level | Intended use |
|---|---|---|---|---|
| null | None | None | 0 | Development and first-run smoke tests on trusted LANs / behind VPN. |
| basic256_sign | Basic256 | Sign | 3 | Signed traffic, no encryption. Useful when LAN traffic must remain inspectable. |
| basic256_sign_encrypt | Basic256 | SignAndEncrypt | 13 | Production default. Highest level the gateway advertises today. |
Endpoint ids and security levels are pinned by the integration test
tests/opc_ua_security_endpoints.rs::test_three_endpoints_accept_correct_credentials
— changes to configure_end_points in src/opc_ua.rs that drift any of
the three tuples will fail this test.
User-token model
The gateway uses a single user/password (Story 7-2 Out of Scope: multi-user RBAC). Configure via:
| Field | Env var | Notes |
|---|---|---|
| [opcua].user_name | OPCGW_OPCUA__USER_NAME | Display name. |
| [opcua].user_password | OPCGW_OPCUA__USER_PASSWORD | Always set via env var: the placeholder in the shipped TOML is rejected at startup. |
Internally the user-token id is default-user
(crate::utils::OPCUA_USER_TOKEN_ID). It is decoupled from the operator’s
configured user_name so a future multi-user expansion has a clean
single-tenant baseline.
PKI directory layout
pki_dir (default ./pki) must contain four subdirectories:
pki/
├── own/ # 0o755 — server's own certificate (cert.der)
├── private/ # 0o700 — server's private key (private.pem, mode 0o600)
├── trusted/ # 0o755 — client certificates accepted without prompt
└── rejected/ # 0o755 — client certificates rejected on first connect
If any subdirectory is missing, OpcUa::create_server auto-creates it
with the correct mode (src/security.rs::ensure_pki_directories).
Loose modes on private/ are tightened to 0o700 automatically.
The private/*.pem file mode is checked at startup. The gateway
refuses to start if any private-key file is not at 0o600 (NFR9).
Error text includes the observed mode and the chmod recipe.
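The private-key rule boils down to a comparison on the permission bits. The sketch below is an assumption about the shape of that check, not the code in src/security.rs; `check_private_key_mode` and its error wording are hypothetical.

```rust
/// Returns Ok if the permission bits are exactly 0o600; otherwise an
/// error naming the observed mode and the chmod command that fixes it.
fn check_private_key_mode(path: &str, mode: u32) -> Result<(), String> {
    // Mask off the file-type bits so only rwx permissions are compared.
    let perms = mode & 0o777;
    if perms == 0o600 {
        Ok(())
    } else {
        Err(format!(
            "{}: mode {:#o} is not 0o600; run `chmod 600 {}`",
            path, perms, path
        ))
    }
}
```

On Unix, the `mode` argument would typically come from `std::fs::metadata(path)?.permissions().mode()`, with `std::os::unix::fs::PermissionsExt` in scope.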
Production setup recipe
# 1. Generate a self-signed keypair (or supply a CA-signed equivalent).
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
-keyout pki/private/private.pem -out pki/own/cert.der -outform DER \
-subj "/CN=opcgw" -addext "subjectAltName=URI:urn:chirpstack:opcua:gateway"
# 2. Tighten file/directory permissions.
chmod 600 pki/private/private.pem
chmod 700 pki/private
# 3. Set create_sample_keypair = false in config/config.toml (the
# shipped default since Story 7-2 — verify it has not been flipped).
# 4. Inject the OPC UA password via env var.
export OPCGW_OPCUA__USER_PASSWORD='your-real-password-here'
# 5. Start the gateway and confirm the boot log shows
# `event="pki_dir_initialised"` events with the correct modes.
cargo run --release
grep 'pki_dir_initialised' log/opc_ua_gw.log
Upgrading from Story 7-1
Story 7-1 left pki/private/private.pem at mode 0o644 (async-opcua’s
auto-generation default). Story 7-2’s startup file-permission check is a
hard error — a Story-7-1 deployment will refuse to start until the
operator runs:
find pki/private -type f -name '*.pem' -exec chmod 600 {} \;
chmod 700 pki/private
The fail-closed behaviour is intentional: silently running with a world-readable private key is worse than refusing to start.
Audit trail
Every failed OPC UA authentication emits a structured warn! event in
log/opc_ua.log:
2026-04-28T14:22:18.041234Z WARN opcgw::opc_ua_auth: OPC UA authentication failed event="opcua_auth_failed" user="alice" endpoint="/"
The submitted username is sanitised (control characters escaped, truncated to 64 chars) before logging so a malicious client cannot inject fake log lines or ANSI escapes. The attempted password is never logged.
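One way to implement this kind of sanitisation is sketched below. It assumes that escaping happens before truncation and that the 64-character cap applies to the escaped form; the gateway's actual ordering and escape syntax are not specified here, and `sanitise_username` is a hypothetical name.

```rust
const MAX_USERNAME_CHARS: usize = 64;

/// Escapes control characters and caps the result at 64 characters so a
/// hostile username cannot forge log lines or emit terminal escapes.
fn sanitise_username(raw: &str) -> String {
    let mut escaped = String::new();
    for c in raw.chars() {
        if c.is_control() {
            // A newline becomes the two characters `\` and `n`;
            // ESC (0x1b) becomes the literal text `\u{1b}`.
            escaped.extend(c.escape_default());
        } else {
            escaped.push(c);
        }
    }
    // Truncate on character boundaries, never in the middle of a code point.
    escaped.chars().take(MAX_USERNAME_CHARS).collect()
}
```

An ordinary name such as `alice` passes through unchanged, while an embedded newline is rendered as visible text and therefore cannot start a second log line.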
Source IP is not in the auth event — async-opcua 0.17.1’s
AuthManager trait does not receive the peer’s SocketAddr. NFR12 is
satisfied via two-event correlation: async-opcua emits an info! event
on connection accept that includes the peer address, then milliseconds
later the gateway emits the auth-failed event. Operators correlate by
timestamp:
# Step 1: find auth failures.
grep 'event="opcua_auth_failed"' log/opc_ua.log
# Step 2: find the matching accept event (typically <100ms before).
grep 'Accept new connection from' log/opc_ua.log | tail -50
# 2026-04-28T14:22:18.039012Z INFO opcua_server::server: Accept new connection from 192.168.1.42:54321 (3)
The audit-event redaction matrix:
| Field | Logged? | Notes |
|---|---|---|
| user | yes | Sanitised: control chars escaped, capped at 64 chars |
| endpoint | yes | Endpoint path (always /) |
| attempted_password | never | Hard rule: no level, no redaction placeholder |
| source_ip | no (correlate) | Carried by async-opcua’s accept event |
A first-class source-IP-in-the-auth-event is tracked as an upstream
follow-up against async-opcua (see
_bmad-output/implementation-artifacts/deferred-work.md).
Required log levels for NFR12 correlation
The two-event correlation only works when both events reach the log
sink. async-opcua emits the connection-accept event at info!
level on the opcua_server::server target; the gateway emits the
auth-failed event at warn! level on the opcgw::opc_ua_auth
target. Both targets must be logged at info! level or more verbose
(info, debug, or trace) for NFR12 to hold. Concretely:
- The default OPCGW_LOG_LEVEL=info is sufficient; do not raise it to warn or error on the global console.
- The per-module file appender for opc_ua.log already captures async-opcua at DEBUG and opcgw::opc_ua at TRACE (see config/config.example.toml “Logging configuration”), so the on-disk audit trail is unaffected by the global console level.
- If you set OPCGW_LOG_LEVEL=warn to reduce console volume, the console will still receive the auth-failed event but not the preceding accept event. Operators must rely on log/opc_ua.log (the file appender) for the correlation in that case; the global console becomes a “username only” view.
Loud check at startup: as of issue #91 (Epic 7 retrospective action
item, 2026-04-29), the gateway emits a one-shot
warn!(operation="nfr12_correlation_check", level=...) immediately
after the Resolved global log level info line whenever the resolved
level is more restrictive than info. The warn is visible at
OPCGW_LOG_LEVEL=warn (the most common volume-reduction case) but
filtered at error / off — operators choosing to silence everything
below ERROR are presumed to know they’re trading off the audit trail.
The startup warn does not fail-fast (operators may legitimately want
quieter console output when running headless under systemd). The
correlation recipe above tells operators which log file to grep when
console output is intentionally minimal.
Verifying OPC UA security
A small smoke-test client ships under examples/opcua_client_smoke.rs:
# Connect to None endpoint with valid credentials.
cargo run --example opcua_client_smoke -- \
--endpoint none --user opcua-user --password "$OPCGW_OPCUA__USER_PASSWORD"
# Expected: prints "Session established on endpoint=None" and exits 0.
# Connect to Basic256 SignAndEncrypt with valid credentials.
cargo run --example opcua_client_smoke -- \
--endpoint sign-encrypt --user opcua-user --password "$OPCGW_OPCUA__USER_PASSWORD"
# Expected: prints "Session established on endpoint=Basic256/SignAndEncrypt" and exits 0.
# Wrong password — expect failure + a warn line in log/opc_ua.log.
cargo run --example opcua_client_smoke -- \
--endpoint none --user opcua-user --password wrong
# Expected: exits with non-zero status. Tail log/opc_ua.log:
# grep 'event="opcua_auth_failed"' log/opc_ua.log
Docker deployment
When pki/ is mounted as a Docker volume, host-side file permissions
are authoritative. The container’s UID must own (or have the right
group on) the mounted files. The ensure_pki_directories chmod runs
inside the container — it only succeeds if the container user can chmod
the host files, which is typically true when the host volume is owned by
the container’s UID. If you run rootless Docker or with a non-default UID
mapping, ensure the UID alignment before mounting.
Anti-patterns
- Do not run with create_sample_keypair = true in production. The shipped default since Story 7-2 is false. Release builds emit a startup warn! if the flag is true.
- Do not rely on create_sample_keypair = true to “fix” a missing keypair on a running deployment. When the configured private-key file is absent and create_sample_keypair = true, async-opcua regenerates the keypair on next start under the default umask (the resulting file mode is typically 0o644, which is world-readable). The startup file-permission check short-circuits on the missing-file path and does not catch it; the next-restart validation does, but the gateway runs once with a world-readable key in the meantime. Production deployments must provision the keypair manually with chmod 600 and ship with create_sample_keypair = false so this regen path can never trigger. This is intentional: the alternative (post-create chmod or hard fail) would prevent operators from using create_sample_keypair for development, where the world-readable window is acceptable.
- Do not leave private/*.pem at 0o644. The startup check is a hard error; fix the mode rather than relaxing the check.
- Do not configure the null endpoint as the only available endpoint on a network reachable from outside the LAN. Operators on the same trust domain can use it; remote clients should always go through basic256_sign_encrypt.
- Do not add multi-user support, mTLS, or rate-limiting failed attempts as part of casual changes; those are tracked separately (see _bmad-output/implementation-artifacts/deferred-work.md and the follow-up GitHub issues opened with Story 7-2).
OPC UA connection limiting
Story 7-3 caps the number of concurrent OPC UA client sessions the gateway will host so a misbehaving SCADA client (runaway reconnect loop, leaked sessions, deliberate flood) cannot exhaust file descriptors, memory, or CPU. This closes FR44 and the OT Security / Connection rate limiting PRD line item.
What it is
A configurable cap on concurrent OPC UA sessions (not raw TCP
connections — async-opcua’s enforcement point is CreateSession,
which is the first wire-level signal that the peer is a real OPC UA
client). New sessions beyond the cap are rejected by async-opcua with
the OPC UA status code BadTooManySessions. Existing sessions are
unaffected — the cap is checked on the (N+1)th attempt only.
Default: 10 concurrent sessions. Range: 1 to 4096 (the upper bound is a “you almost certainly want a deployment review” guard against fd-exhaustion DoS; see the Story 7-3 spec for the back-of-envelope rationale).
Configuration
# config/config.toml
[opcua]
max_connections = 10
Env-var override (figment __-split convention):
OPCGW_OPCUA__MAX_CONNECTIONS=20 cargo run
max_connections = 0 and values above 4096 are rejected at startup
by AppConfig::validate with a clear error message. Single-client
lockdown (max_connections = 1) is a legitimate “engineering-only-access”
configuration for a final commissioning window.
Worked sizing example. 10 SCADA clients × 1 session each = 10. Reserve 2-3 slots for overlap during reconfiguration / failover, so 12-13 is a typical Phase A choice. Going above 50 should prompt a deployment review — most LAN-internal SCADA scenarios saturate well before that point.
What you’ll see in the logs
Two events, both on the opcgw::opc_ua_session_monitor target:
- event="opcua_session_count" current=N limit=L at info! level, every 5 seconds (a gauge; operators graph this for capacity planning). Period controlled by OPCUA_SESSION_GAUGE_INTERVAL_SECS.
- event="opcua_session_count_at_limit" source_ip=<addr> limit=L current=N at warn! level, fired on every TCP accept while the gateway is at the cap. The source_ip field comes from async-opcua’s pre-existing info!("Accept new connection from {addr}") line; we correlate to it from a tracing-Layer (same NFR12 two-event pattern Story 7-2 used for failed-auth audit).
Grep recipes
# See current utilisation.
grep 'event="opcua_session_count"' log/opc_ua.log | tail -5
# Find at-limit rejections.
grep 'event="opcua_session_count_at_limit"' log/opc_ua.log
# 2026-04-29T10:14:22.105Z WARN opcgw::opc_ua_session_monitor: ... source_ip=192.168.1.42:54311 limit=10 current=10
Anti-patterns
- Do not set max_connections = 0. Refuses operators too; startup will fail-fast.
- Do not set above 4096. File-descriptor exhaustion risk on default Linux ulimits; startup will fail-fast.
- Do not combine max_connections = <any> with diagnostics_enabled = false. The session-count gauge and the at-limit warn both read async-opcua’s CurrentSessionCount diagnostics variable; with diagnostics disabled the counter never increments, the gauge logs current=0 forever, and the at-limit warn never fires (the cap is still enforced via SessionManager.sessions.len(), but operator observability is silent). Startup will fail-fast with a remediation hint.
- Do not rely on the cap as a brute-force defence. Per-IP throttling is a separate, deferred concern (issue #88). The cap stops a single misbehaving SCADA but does not stop a distributed flood.
Expected at-limit log noise
When the gateway is at the cap, every TCP accept fires an
event="opcua_session_count_at_limit" warn — including port scans
and partial-handshake probes that never request a session. This is
the correct trade-off (operators want full visibility into
rejection-window connection attempts) but means a misconfigured
upstream firewall, a busy nmap scan, or a confused SCADA reconnect
loop can produce a high rate of warns. The warn event is the
symptom; investigate the source IPs and either tighten the firewall
or raise the cap.
Tuning checklist
- Inventory expected SCADA clients × sessions each.
- Add 20% headroom.
- Gauge over a representative day.
- Raise the cap if current is consistently at or above 90% of limit.
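The checklist arithmetic can be written down as two helpers. Both names are hypothetical, and the 90% rule is read here as "current at or above 90% of limit" for a single gauge sample; deciding whether that holds consistently over a day is left to the operator.

```rust
/// Expected sessions plus 20% headroom, rounded up to a whole session.
fn suggested_cap(clients: u32, sessions_per_client: u32) -> u32 {
    let expected = clients * sessions_per_client;
    // Integer form of ceil(expected * 1.2).
    (expected * 12 + 9) / 10
}

/// True when one gauge sample sits at or above 90% of the configured limit.
fn near_limit(current: u32, limit: u32) -> bool {
    current * 10 >= limit * 9
}
```

For the worked example above (10 clients with 1 session each), `suggested_cap(10, 1)` gives 12, which matches the lower end of the 12-13 range.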
What’s out of scope
- Per-source-IP rate limiting / token-bucket throttling. Tracked at issue #88.
- Per-endpoint or per-user session caps. Differentiated quotas (e.g. “5 SignAndEncrypt + 5 None”) are not in scope.
- Hot-reload of the cap at runtime. Currently read at startup only — Phase B Epic 9 hot-reload covers runtime reconfiguration (issue #90).
Subscription and message-size limits
Story 8-2 (Phase B) extends the connection-limiting surface with four
configurable Limits knobs that shape subscription / message-size
load. They share the validation pattern, env-var convention, and
hard-cap shape established by max_connections.
What they are
| Knob | Purpose | Default | Range |
|---|---|---|---|
| max_subscriptions_per_session | Per-session cap on simultaneous subscriptions. The (cap+1)th CreateSubscription from a session is rejected with BadTooManySubscriptions. | 10 | 1–1000 |
| max_monitored_items_per_sub | Per-subscription cap on monitored items. Past the cap, async-opcua returns BadTooManyMonitoredItems (service-level error in 0.17.1, observed empirically). | 1000 | 1–100 000 |
| max_message_size | Per-message byte ceiling (inbound + outbound, including DataChangeNotification payloads). | 327 675 (= 65 535 × 5) | 1–268 431 360 (≈ 256 MiB; = 4096 × 65 535) |
| max_chunk_count | Per-message chunk count ceiling. Together with max_message_size, bounds per-message resource cost. | 5 | 1–4096 |
The two subscription-related defaults match async-opcua 0.17.1’s
library defaults (MAX_SUBSCRIPTIONS_PER_SESSION = 10,
DEFAULT_MAX_MONITORED_ITEMS_PER_SUB = 1000); the two message-size
defaults match opcua_types::constants::MAX_MESSAGE_SIZE /
MAX_CHUNK_COUNT. Unsetting in TOML is a true no-op against the
library.
Configuration
[opcua]
# Subscription / message-size limits — uncomment only if a deployment
# scenario requires tuning. All four default to the async-opcua
# library defaults.
#max_subscriptions_per_session = 10 # Range: 1-1000
#max_monitored_items_per_sub = 1000 # Range: 1-100000
#max_message_size = 327675 # Range: 1-268431360 (≈ 256 MiB)
#max_chunk_count = 5 # Range: 1-4096
Env-var overrides (figment __-split convention):
OPCGW_OPCUA__MAX_SUBSCRIPTIONS_PER_SESSION=20
OPCGW_OPCUA__MAX_MONITORED_ITEMS_PER_SUB=500
OPCGW_OPCUA__MAX_MESSAGE_SIZE=131072
OPCGW_OPCUA__MAX_CHUNK_COUNT=10
Validation (AppConfig::validate) rejects a knob set to Some(0)
(a misconfiguration that would refuse all subscriptions / items /
messages, including operators’ clients) or to Some(n) with n above the
knob’s hard cap (a structural ceiling; values above it signal a
misconfiguration rather than a deliberate sizing). Errors accumulate
so a single startup pass surfaces every violation.
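The accumulate-then-report pattern might look like the following sketch. The hard caps are taken from the range column of the table above; the function names, signatures, and error strings are assumptions for illustration, not the real AppConfig::validate.

```rust
/// Checks one optional knob. None means "use the library default" and
/// always passes; violations are appended to `errors` instead of
/// returning early, so every problem is reported in one pass.
fn check_knob(name: &str, value: Option<usize>, hard_cap: usize, errors: &mut Vec<String>) {
    match value {
        Some(0) => errors.push(format!("opcua.{}: 0 is not allowed", name)),
        Some(n) if n > hard_cap => {
            errors.push(format!("opcua.{}: {} exceeds hard cap {}", name, n, hard_cap))
        }
        _ => {}
    }
}

fn validate_limits(
    max_subscriptions_per_session: Option<usize>,
    max_monitored_items_per_sub: Option<usize>,
    max_message_size: Option<usize>,
    max_chunk_count: Option<usize>,
) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    check_knob("max_subscriptions_per_session", max_subscriptions_per_session, 1_000, &mut errors);
    check_knob("max_monitored_items_per_sub", max_monitored_items_per_sub, 100_000, &mut errors);
    check_knob("max_message_size", max_message_size, 4_096 * 65_535, &mut errors);
    check_knob("max_chunk_count", max_chunk_count, 4_096, &mut errors);
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}
```

Collecting into a `Vec<String>` rather than returning on the first failure is what lets an operator fix all misconfigured knobs after a single failed start.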
What you’ll see in the logs
At startup, the gateway emits a one-shot diagnostic event with the resolved values for all five session / subscription / message-size limits:
grep 'event="opcua_limits_configured"' log/opcgw.log | tail -1
# 2026-04-30T08:14:22Z INFO opcgw::opc_ua: event="opcua_limits_configured"
# max_sessions=10 max_subscriptions_per_session=10
# max_monitored_items_per_sub=1000 max_message_size=327675
# max_chunk_count=5 "OPC UA limits configured"
Operators grep this line on every restart to verify the resolved configuration matches expectations.
Subscription-flood / monitored-item-flood rejections are silent
in async-opcua 0.17.1 — SubscriptionService::create_subscription
returns BadTooManySubscriptions and MonitoredItemService returns
BadTooManyMonitoredItems without log emission. The contract is
the OPC UA status code on the wire, not a log line. Tracked as a
candidate for an upstream feature request (analogous to issue #94’s
session-rejected-callback gap).
Stale-status notifications and the DataChangeFilter contract
Story 5-2’s stale-status logic propagates through subscription
notifications only when the client supplies a DataChangeFilter
with trigger: StatusValue or StatusValueTimestamp (OPC UA
Part 4 §7.22.2 DataChangeFilter). The library default for
DataChangeTrigger is Status (annotated #[opcua(default)] on
DataChangeTrigger::Status in async-opcua-types) — that default
would fire only on status changes and miss value-only changes, so
compliant SCADA clients like FUXA, Ignition, and UaExpert override
the trigger to StatusValue or StatusValueTimestamp to fire on
either. With the filter present, async-opcua’s is_changed() in
async-opcua-types::data_change detects status-only transitions
even when the numeric value is unchanged, so a Good→Uncertain
transition during a ChirpStack outage fires a notification and
SCADA dashboards show the stale state.
If a client supplies no filter (ExtensionObject::null()),
async-opcua falls into the unfiltered path in
MonitoredItem::notify_data_value
(async-opcua-server::subscriptions::monitored_item) which dedupes
on value.value only — status-only transitions are silently
suppressed and dashboards would freeze on the last-good value. This
Plan-A fallback is pinned by
tests/opcua_subscription_spike.rs::test_subscription_unfiltered_dedupes_status_only_transitions
as a regression baseline against issue #94.
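The difference between the filtered and unfiltered paths can be modelled in a few lines. This is a deliberately simplified model of the behaviour described above, not async-opcua's code: the types are stand-ins, timestamps (and therefore StatusValueTimestamp) are omitted, and `notifies` is a hypothetical name.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Good,
    Uncertain,
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct Sample {
    value: f64,
    status: Status,
}

#[derive(Clone, Copy)]
enum Trigger {
    Status,
    StatusValue,
}

/// Does moving from `prev` to `next` produce a notification?
/// `filter` is None when the client supplied no DataChangeFilter.
fn notifies(filter: Option<Trigger>, prev: Sample, next: Sample) -> bool {
    match filter {
        // Unfiltered path: dedupe on the value alone, ignoring status.
        None => prev.value != next.value,
        // Status trigger: only a status change fires.
        Some(Trigger::Status) => prev.status != next.status,
        // StatusValue trigger: a change in either field fires.
        Some(Trigger::StatusValue) => {
            prev.status != next.status || prev.value != next.value
        }
    }
}
```

In this model, a Good to Uncertain transition with an unchanged value fires under `StatusValue` but is suppressed when no filter is supplied, which is the frozen-dashboard case the regression test pins.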
Anti-patterns
- Setting any knob to 0: refuses all subscriptions / items / messages, including operators’. Validation rejects it.
- Setting max_message_size above max_chunk_count × 65535 without understanding the chunk geometry: see async-opcua docs.
- Relying on max_subscriptions_per_session for distributed-flood defence. It is a per-session cap, not a per-IP cap. Per-IP throttling is deferred (issue #88).
Tuning checklist
- Inventory expected SCADA clients × subscriptions per client (typically 1–3); add 30% headroom.
- Inventory monitored items per subscription (typically 10–100 for FUXA dashboards); leave the 1000 default unless headroom demands more.
- max_message_size / max_chunk_count only matter if Read operations return very large arrays; default opcgw deployments expose scalar metrics and the defaults are oversized.
- Pair with max_connections: subscription clients consume one session each, so max_connections × max_subscriptions_per_session × max_monitored_items_per_sub is the upper bound on the publish pipeline’s work.
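As a quick worked figure for the product in the last item, the helper below (a hypothetical name) multiplies the three caps. With the documented defaults of 10 connections, 10 subscriptions per session, and 1000 monitored items per subscription, the bound is 100 000 monitored items.

```rust
/// Upper bound on monitored items the publish pipeline may have to serve:
/// sessions x subscriptions per session x monitored items per subscription.
/// u64 is used so the product cannot overflow even at the hard caps.
fn publish_upper_bound(max_connections: u64, subs_per_session: u64, items_per_sub: u64) -> u64 {
    max_connections * subs_per_session * items_per_sub
}
```

The bound grows multiplicatively, so raising any one knob by 10× raises the worst case by 10× as well.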
Subscription clients and the audit trail
Subscription-creating clients pass through the existing
OpcgwAuthManager (Story 7-2) and AtLimitAcceptLayer (Story 7-3)
identically to read-only clients. The event="opcua_auth_failed"
and event="opcua_session_count_at_limit" audit events from those
stories cover them. No new audit infrastructure was introduced by
Story 8-2 (NFR12 carry-forward acknowledgment). The regression
baseline is two existing tests in
tests/opcua_subscription_spike.rs:
test_subscription_client_rejected_by_auth_manager and
test_subscription_client_rejected_by_at_limit_layer.
The new event="opcua_limits_configured" is a diagnostic
startup-config event (same shape as Story 7-2’s
pki_dir_initialised), not an audit event.
What’s out of scope (subscription / message-size knobs)
- Per-source-IP subscription throttling. Tracked at issue #88.
- Upstream FR for rejection-time audit events in async-opcua (BadTooManySubscriptions / BadTooManyMonitoredItems are silent in 0.17.1): operator-pending follow-up.
- The five “advanced” subscription knobs surfaced by the spike report (max_pending_publish_requests, max_publish_requests_per_subscription, min_sampling_interval_ms, max_keep_alive_count, max_queued_notifications): deferred unless an operator’s --load-probe numbers (issue #95) reveal a back-pressure scenario the four mandatory knobs can’t shape.
OPC UA NodeId format (Issue #99 fix, 2026-05-02)
opcgw constructs OPC UA NodeIds in namespace ns=2 using stable
identifiers rather than human-readable display names:
| Node | NodeId identifier (string form) | Browse name + display name |
|---|---|---|
| Application folder | application_id (UUID from [[application]].application_id) | application_name |
| Device folder | device_id (DevEUI / chirpstack ID) | device_name |
| Metric variable | format!("{}/{}", device_id, metric_name) (e.g., "0000000000000001/Moisture") | metric_name |
| Gateway folder + members | hard-coded strings (e.g., "Gateway", "LastPollTimestamp") | same as NodeId |
The metric NodeId embeds device_id so two devices that share a
metric_name (e.g., both have a “Moisture” metric) resolve to two
distinct NodeIds — "device_a/Moisture" vs "device_b/Moisture" —
instead of colliding on a single "Moisture" node where the second
registration would silently overwrite the first.
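A minimal sketch of the identifier shape described above. The helper name is hypothetical; only the format!("{}/{}", device_id, metric_name) expression is taken from the table.

```rust
/// Builds the string identifier for a metric variable in `ns=2`.
/// Embedding `device_id` keeps same-named metrics on different devices apart.
/// (Hypothetical helper name; the format expression is the documented one.)
fn metric_node_identifier(device_id: &str, metric_name: &str) -> String {
    format!("{}/{}", device_id, metric_name)
}
```

Two devices that both expose a "Moisture" metric therefore produce two different identifiers, whereas keying on the metric name alone would map both to "Moisture".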
Anti-pattern: hard-coding NodeId strings in SCADA configurations
that bypass the browse step. A FUXA / Ignition project that hard-codes
"ns=2;s=Moisture" (the pre-fix shape) breaks after the fix; even
post-fix, hard-coded strings break when the operator changes
device_id in config.toml. Always use the browse path to
resolve NodeIds at SCADA project setup time, and re-resolve on
configuration changes.
Migration impact: existing SCADA configurations that browsed the address space and stored the resulting NodeIds will need to re-resolve after upgrading. The browse-name and display-name are unchanged, so the browse tree looks identical to operators — only the underlying NodeId identifier string is new.
Historical data access
Story 8-3 closes FR22 by exposing the metric_history SQLite table
(populated by the poller’s append-only write path, Story 2-3b) as OPC UA
HistoryRead results. A SCADA client (FUXA, Ignition, UaExpert) issues a
HistoryRead request for a metric NodeId and receives a list of
timestamped values that fit the requested time window. This unlocks the
“show me the past 7 days of soil moisture” use case without polling.
What it is
When a SCADA client issues an OPC UA HistoryRead request with
HistoryReadDetails::ReadRawModified, opcgw resolves the inbound NodeId
to the (device_id, chirpstack_metric_name) pair that the address-space
construction loop registered for that variable, queries
metric_history via the existing (device_id, timestamp) composite
index, and writes the typed values back to the wire as a HistoryData
extension object. The new code surface lives in
src/opc_ua_history.rs (a thin wrap around async-opcua’s
SimpleNodeManagerImpl) and src/storage/sqlite.rs::query_metric_history
(the storage method).
What you get on the wire is exactly what the poller stored, with one
caveat: rows whose value column doesn’t parse to the declared type
(e.g. "NaN" for a Float metric, "garbage" for a Bool metric) are
silently skipped with a trace! log. This is the partial-success
contract — a single bad row never terminates a 600k-row scan.
Known limitations of the historized record
- All historical rows are reported StatusCode::Good. The metric_history SQLite table has no status column, so the OpcgwHistoryNodeManagerImpl cannot reconstruct the per-row status that the live read path computes via the Story 5-2 stale-detection logic. A SCADA client reviewing a flaky sensor’s history will see “all green” even if the live reads for that period were Uncertain. Use the live Read service alongside HistoryRead if status interpretation matters for your workflow.
- Timestamps are microsecond-precise on the wire. The storage layer uses SecondsFormat::AutoSi RFC3339 (which caps at microsecond resolution), then OpcDateTime re-encodes as 100-nanosecond ticks since 1601. Sub-microsecond detail from SystemTime is lost; this is not a regression, it’s the same precision the poller writes.
[storage].retention_days and HistoryRead
The [storage].retention_days knob (and its env-var override
OPCGW_STORAGE__RETENTION_DAYS) governs both the prune loop’s
deletion horizon and the effective HistoryRead window. Story 8-3
extended this single field rather than adding a separate
history_retention_days — one source of truth, validated against the
FR22 floor of 7 days and the storage-cost hard cap of 365 days. The
field is written to the SQLite retention_config table at every
startup, overriding the migration default of 90 days.
Configuration
Two new knobs land in [storage] and [opcua]:
| Knob | TOML key | Default | Range | Env var |
|---|---|---|---|---|
| Retention period for metric_history | [storage].retention_days | 7 | 7-365 | OPCGW_STORAGE__RETENTION_DAYS |
| Per-call HistoryRead response cap | [opcua].max_history_data_results_per_node | 10000 | 1-1_000_000 | OPCGW_OPCUA__MAX_HISTORY_DATA_RESULTS_PER_NODE |
The 7-day floor on retention_days matches FR22 (“a minimum of 7 days
of historical data must be retained”). Values below 7 are rejected at
startup. The 365-day cap is a deployment review trigger — at 10s polling
× ~400 metric pairs × 365 days the table approaches 1.3 billion rows
and pruning + HistoryRead query latency need a separate look. Operators
that need longer retention should open a follow-up issue.
The 10000-row default for max_history_data_results_per_node is
roughly 28 hours of poll data at 10s polling — sufficient for typical
FUXA dashboard time-windows. SCADA clients that want longer windows
page manually (see Anti-patterns below).
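The sizing figures quoted in this section can be reproduced with a little arithmetic. The helpers below are a hypothetical sketch for capacity planning, not part of the gateway; the 10 s polling interval and ~400 metric pairs are the example values used in the text above.

```rust
const SECONDS_PER_DAY: u64 = 86_400;

/// Rows written for one metric over the retention window at a fixed poll interval.
fn rows_per_metric(poll_interval_secs: u64, retention_days: u64) -> u64 {
    retention_days * SECONDS_PER_DAY / poll_interval_secs
}

/// Hours of history for one metric that fit in a single capped HistoryRead response.
fn hours_covered(row_cap: u64, poll_interval_secs: u64) -> f64 {
    (row_cap * poll_interval_secs) as f64 / 3600.0
}
```

At 10 s polling, `rows_per_metric(10, 365) * 400` is 1,261,440,000 rows (the "approaches 1.3 billion" figure), and `hours_covered(10_000, 10)` is about 27.8 hours.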
[storage].retention_days is written into the SQLite retention_config
table at every startup via INSERT OR REPLACE, overriding the migration
default of 90 days that v001_initial.sql seeds at first boot. This
keeps the prune loop and the operator-config in sync.
What you’ll see in the logs
On a successful HistoryRead with rows returned:
DEBUG history_read_raw_modified: returning rows
node_id=ns=2;s=Moisture
device_id=0000000000000001
metric_name=moisture
row_count=42
On a HistoryRead for an unregistered NodeId (typo, or a node that’s not a metric variable):
TRACE history_read_raw_modified: NodeId not registered for HistoryRead
node_id=ns=2;s=DefinitelyNotARegisteredMetric
The wire-level surface for that case is BadNodeIdUnknown — the SCADA
client sees the correct error, the gateway logs at TRACE so a noisy
client doesn’t flood the log file.
On an inverted time range (end < start) — typically a SCADA bug:
(no log line — the rejection is silent on the gateway side)
The wire-level surface is BadInvalidArgument per OPC UA Part 11 §6.4.2.
Anti-patterns
- Don’t use the in-memory backend for historical data. InMemoryBackend is intentionally a lossy non-persistent backend. Its query_metric_history returns Ok(Vec::new()) for every window. The OPC UA client sees a Good-status empty response, so the client thinks “no data in range”, which is technically accurate but operationally misleading. Use SqliteBackend for any deployment where HistoryRead matters.
- Don’t expect continuation-point round-tripping. Story 8-3 does not implement OPC UA Part 11 §6.4.4 ByteString continuation points. Truncated responses surface as data_values.len() == max_history_data_results_per_node with Good status. SCADA clients that want more rows must page manually:

      // First call:
      HistoryRead(start = T0, end = T1, num_values_per_node = 10000)
      //   → 10000 rows back, status Good
      // Second call: bump start by 1µs past the last returned timestamp
      let next_start = last_returned_row.timestamp + 1µs;
      HistoryRead(start = next_start, end = T1, num_values_per_node = 10000)
      //   → next page, status Good
      // Loop until data_values.len() < max_history_data_results_per_node

  The 1-microsecond bump matches the storage layer’s microsecond-precision timestamp format (%Y-%m-%dT%H:%M:%S%.6fZ). Anything smaller would re-yield the last row of the previous page.
- Don’t issue HistoryRead with num_values_per_node = 0 unless you trust your time window. A zero num_values_per_node means “use the server default”, and if the server is configured with max_history_data_results_per_node = 1_000_000, a stray query for a 365-day range against a high-frequency metric could pull back up to a million rows and saturate the publish pipeline. The max_history_data_results_per_node cap is the safety net; SCADA clients should still set their own cap.
- Don’t rely on HistoryReadProcessed (aggregations). opcgw leaves async-opcua’s default BadHistoryOperationUnsupported for HistoryReadProcessed and HistoryReadAtTime. SCADA clients that need min/max/avg/sum over rolling buckets must compute them client-side from the raw rows this story returns. Tracked at GitHub issue #98.
- Don’t expect HistoryUpdate to work. opcgw is a read-only gateway from ChirpStack’s perspective; HistoryUpdate from the SCADA side doesn’t make sense and returns BadHistoryOperationUnsupported.
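The manual paging recipe from the continuation-point item above can be sketched as a loop. This is an illustration under stated assumptions: `Row`, `read_all_pages` and the `fetch` closure are hypothetical stand-ins for whatever HistoryRead call the SCADA client library offers, timestamps are modelled as integer microseconds, and the start bound is assumed to be inclusive.

```rust
/// One historical sample; timestamps are microseconds since an arbitrary epoch.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    timestamp_us: i64,
    value: f64,
}

/// Pages through a time window by re-issuing the read with the start bumped
/// 1 µs past the last returned row, until a short (or empty) page arrives.
/// `fetch(start_us, end_us, limit)` stands in for a single HistoryRead call.
fn read_all_pages<F>(mut fetch: F, start_us: i64, end_us: i64, page_size: usize) -> Vec<Row>
where
    F: FnMut(i64, i64, usize) -> Vec<Row>,
{
    let mut all = Vec::new();
    let mut next_start = start_us;
    loop {
        let page = fetch(next_start, end_us, page_size);
        let got = page.len();
        if let Some(last) = page.last() {
            // 1 µs matches the storage layer's timestamp precision.
            next_start = last.timestamp_us + 1;
        }
        all.extend(page);
        // A short or empty page means the window is exhausted.
        if got == 0 || got < page_size {
            break;
        }
    }
    all
}
```

When the row count is an exact multiple of the page size, the loop issues one extra call that returns an empty page; that is the cost of not having continuation points.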
Tuning checklist
For a 7-day retention deployment with FUXA dashboards:
- Set [storage].retention_days = 7 (the default).
- Leave [opcua].max_history_data_results_per_node = 10000 (the default) unless dashboard latency profiling reveals a need.
- Verify NFR15 by issuing a 7-day query during commissioning; the bench_history_read_7_day_full_retention benchmark in tests/opcua_history_bench.rs documents the contract.
- If query latency exceeds 2 s, run EXPLAIN QUERY PLAN against the underlying SQLite to confirm the idx_metric_history_device_timestamp index is hit; if not, add a covering index (device_id, metric_name, timestamp) and re-measure.
- Per-metric retention overrides (e.g. “moisture keeps 30 days, all others keep 7”) are out of scope for Story 8-3; tracked at GitHub issue #98.
Web UI authentication
Story 9-1 ships an embedded Axum web server gated by HTTP Basic auth.
The server is opt-in ([web].enabled = false by default) so existing
operators upgrading from Phase A see no behavioural change unless they
explicitly enable it.
What it is
A single Router mounted at the namespace root with one Layer enforcing
Basic auth on every request. Routes:
- GET /api/health: minimal smoke endpoint, returns {"status":"ok"}. Used by integration tests; not operator-facing.
- GET / (and any path under it): static files served from static/. Story 9-1 ships placeholder HTML; Stories 9-2 / 9-3 / 9-4 / 9-5 / 9-6 fill them in.
The auth path reuses Story 7-2’s HMAC-SHA-256 keyed credential digest
(extracted into src/security_hmac.rs). Submitted credentials are hashed
under a per-process random key, then constant-time compared against the
digests of the configured credentials. A direct content compare would
leak the credential length via the timing of the comparison; HMAC into
fixed-length digests closes that oracle.
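A minimal sketch of the fixed-length, constant-time comparison step. It assumes both inputs are already 32-byte HMAC-SHA-256 digests (the keyed hashing itself is not shown), and it is not the code in src/security_hmac.rs. An optimising compiler may in principle rewrite such a loop, so production code should rely on a vetted constant-time primitive rather than this illustration.

```rust
/// Compares two 32-byte digests without an early exit: every byte is visited
/// and differences are OR-ed into an accumulator, so the running time does not
/// depend on where (or whether) the inputs differ.
/// (Illustrative only; inputs are assumed to be HMAC-SHA-256 digests.)
fn digests_equal(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for i in 0..32 {
        diff |= a[i] ^ b[i];
    }
    diff == 0
}
```

Because both inputs have a fixed length, the comparison reveals nothing about the length of the submitted credential, which is the oracle the paragraph above describes.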
Credentials are shared with [opcua]. The web server reads
[opcua].user_name / [opcua].user_password directly — no separate
[web] user/password pair. Rationale: the threat model is symmetric (an
operator with LAN access; one credential rotation step covers both
surfaces; one less credential pair for operators to forget to rotate).
Required reading before enabling
The web UI binds an HTTP listener that any client on the configured
network can probe. Before flipping [web].enabled = true, confirm:
- You’re on a trusted LAN. Story 9-1 ships HTTP-only, so credentials transit in cleartext. If your gateway is reachable from the public internet, deploy a reverse proxy (nginx, Caddy, Traefik) with TLS termination + a deny-all firewall on the gateway port. The default bind_address = "0.0.0.0" listens on every interface; if a reverse proxy on the same host fronts the gateway, override to bind_address = "127.0.0.1" so the listener is loopback-only.
- You’ve rotated the placeholder password. The shipped config/config.toml has a placeholder [opcua].user_password value the gateway refuses to start with. The same protection extends to the web surface (since credentials are shared). Verify your OPCGW_OPCUA__USER_PASSWORD env var injection before flipping [web].enabled = true.
Deployment requirements
The web server’s static/ directory must be reachable from the
gateway’s working directory at runtime. Story 9-1 resolves
std::path::PathBuf::from("static") relative to the gateway’s CWD,
so static/ must live next to the binary or under
WorkingDirectory (systemd) / WORKDIR (Docker):
- Local development (cargo run from project root): the shipped static/index.html etc. are picked up automatically.
- Docker: the shipped Dockerfile copies static/ into /usr/local/bin/static next to the binary. If you customise the Dockerfile, preserve this COPY.
- systemd: set WorkingDirectory=/var/lib/opcgw (or wherever static/ lives) in the service unit; otherwise GET /index.html returns 404 even after auth succeeds.
Tracked as a Story 9-X follow-up: a [web].static_dir config knob
that lets operators specify the path explicitly. For now the
project root / binary location is the convention.
Configuration
[web]
enabled = true # default false — opt-in to expose
port = 8080 # default 8080; range 1024-65535
bind_address = "0.0.0.0" # default "0.0.0.0"; must parse as IpAddr
auth_realm = "opcgw" # default "opcgw"; max 64 chars, ASCII-only,
# no `"`, no `\`, no leading/trailing whitespace
Env-var overrides via figment’s nested-key convention:
| Knob | Env var |
|---|---|
| [web].enabled | OPCGW_WEB__ENABLED=true |
| [web].port | OPCGW_WEB__PORT=8080 |
| [web].bind_address | OPCGW_WEB__BIND_ADDRESS=127.0.0.1 |
| [web].auth_realm | OPCGW_WEB__AUTH_REALM=my-gateway |
AppConfig::validate rejects port=0 / port<1024, unparseable
bind_address, empty auth_realm, auth_realm containing ", and
auth_realm longer than 64 chars. All checks accumulate so a single
startup pass surfaces every violation.
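The accumulate-all-violations behaviour can be sketched as follows. This is not AppConfig::validate itself: the `WebConfig` struct, the function name and the message strings are hypothetical, and the rules are the ones listed in the configuration comments above.

```rust
use std::net::IpAddr;

/// Hypothetical mirror of the [web] section; field names follow the TOML keys.
struct WebConfig {
    port: u16,
    bind_address: String,
    auth_realm: String,
}

/// Collects every violation instead of returning on the first one,
/// so a single startup pass can report them all.
fn validate_web(cfg: &WebConfig) -> Vec<String> {
    let mut errors = Vec::new();
    if cfg.port < 1024 {
        errors.push(format!("web.port {} is below 1024", cfg.port));
    }
    if cfg.bind_address.parse::<IpAddr>().is_err() {
        errors.push(format!("web.bind_address {:?} is not an IP address", cfg.bind_address));
    }
    let realm = cfg.auth_realm.as_str();
    if realm.is_empty() {
        errors.push("web.auth_realm is empty".to_string());
    }
    if realm.chars().count() > 64 {
        errors.push("web.auth_realm is longer than 64 chars".to_string());
    }
    if !realm.is_ascii() {
        errors.push("web.auth_realm must be ASCII-only".to_string());
    }
    if realm.contains('"') || realm.contains('\\') {
        errors.push("web.auth_realm must not contain '\"' or '\\'".to_string());
    }
    if realm != realm.trim() {
        errors.push("web.auth_realm has leading/trailing whitespace".to_string());
    }
    errors
}
```

A configuration with port 80, a non-IP bind address and an empty realm produces three entries in one pass, rather than forcing three fix-and-restart cycles.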
What you’ll see in the logs
Successful startup (info-level diagnostic):
INFO event="web_server_started" bind_address=0.0.0.0 port=8080 realm="opcgw"
Disabled (plain info line — no event= field; the spec caps Story 9-1
at exactly two structured event names):
INFO [web].enabled = false; embedded web server not started (set OPCGW_WEB__ENABLED=true to enable)
Graceful shutdown (plain info line — same rationale):
INFO bind_address=0.0.0.0 port=8080 Embedded web server stopped (graceful shutdown)
Failed authentication (warn-level audit event — NFR12):
WARN event="web_auth_failed" source_ip=192.168.1.42 user=evil-user path="/index.html" reason="user_mismatch" "Web UI authentication failed"
The reason field discriminates the failure mode for triage:
| Reason | Meaning |
|---|---|
| missing | No Authorization header. |
| malformed_scheme | Header doesn’t start with "Basic " (the scheme name followed by a space). |
| malformed_base64 | Base64 decode failed (or non-UTF8 bytes). |
| missing_colon | Decoded blob has no ":" between user and pass. |
| user_mismatch | Submitted username doesn’t match the configured one. |
| password_mismatch | Username matched but password didn’t. |
The wire response is identical across all reasons (constant-time
401 + WWW-Authenticate: Basic realm="..."); the discrimination exists
only in the audit log for forensic purposes.
NFR12 source-IP — direct vs. correlated
Story 7-2’s OPC UA path needs two-event correlation because async-opcua’s
AuthManager doesn’t receive peer SocketAddr — operators correlate the
event="opcua_auth_failed" audit event against async-opcua’s own
info!-level “Accept new connection from {addr} (…)” line by timestamp.
Story 9-1’s web path gets the source IP directly via Axum’s
ConnectInfo<SocketAddr> extractor — the audit event carries
source_ip=... natively. No correlation step needed; the asymmetry is a
strict improvement over the OPC UA path.
The same NFR12 startup warn from Story 7-2 (event="nfr12_correlation_check")
applies to the web path: at log levels stricter than info async-opcua’s
accept event is filtered out, but the web’s source_ip field survives at
warn (the minimum level the audit event itself uses). Operators running
at error/off lose the audit trail entirely (their explicit choice).
Anti-patterns
- Don’t roll your own credential comparison. The HMAC-keyed digest + constant_time_eq shape exists to close two specific weaknesses (the length oracle of a direct compare; replay across instances). Phase-B carry-forward rule (epics.md:782).
- Don’t put symlinks in static/. tower-http = "0.6"’s ServeDir doesn’t expose a symlink-disable knob (verified against upstream source during Story 9-1 review iter-1). On Linux, tokio::fs::File::open follows symlinks by default. A symlink in static/ pointing outside the directory (e.g. to /etc/passwd) would let an authenticated user read it. Restrict static/ to plain files. Tracked as a follow-up: a custom tower::Service wrapper that canonicalises every request path against the canonical static/ root before dispatch would close this gap, but Story 9-1’s scope didn’t include it.
- Don’t introduce a separate [web] user/password pair without symmetric rotation procedures. Story 9-1’s single-source-of-truth shape (credentials live under [opcua]) means one rotation step covers both surfaces; splitting them creates a footgun where one surface gets rotated and the other is forgotten.
- Don’t add POST / PUT / DELETE routes without CSRF protection. Story 9-1 ships only GET routes, so it has no CSRF surface. Stories 9-4 / 9-5 / 9-6 will add mutating routes for application / device / command CRUD; those need either strict same-origin policy enforcement (CORS rejecting cross-origin requests) or a double-submit cookie / synchronizer-token pattern. Audit each before merging.
- Don’t enable the web server without rotating the placeholder password. The shipped config/config.toml has a placeholder [opcua].user_password value the gateway refuses to start with; the same protection extends to the web surface (since credentials are shared). Verify your OPCGW_OPCUA__USER_PASSWORD env var injection before flipping [web].enabled = true.
Tuning checklist
- Set [web].enabled = true (or OPCGW_WEB__ENABLED=true) only after verifying the operator’s LAN threat model.
- Pick [web].bind_address = "127.0.0.1" if a reverse proxy on the same host fronts the gateway; there is no need to listen on every interface.
- Pick [web].auth_realm per-deployment (e.g. "opcgw-prod-east") so browser credential prompts are distinguishable across environments.
- TLS / HTTPS hardening is out of scope for Story 9-1; tracked at GitHub issue #104. Until that lands, deploy an upstream reverse proxy if your environment requires TLS.
- Per-IP rate limiting (#88) becomes structurally relevant once the web auth surface is exposed; consider opening a follow-up issue if brute-force probing becomes a near-term operator concern.
API endpoints (Story 9-2+)
All /api/* endpoints (/api/health, /api/status, future
/api/applications, /api/devices, /api/commands) inherit the
same basic_auth_middleware that gates the static-file routes.
There is no anonymous probe surface — every route, including
/api/health, requires the same [opcua].user_name /
[opcua].user_password credentials. An unauthenticated request is
indistinguishable from any other unauthenticated request: same
401 Unauthorized + same WWW-Authenticate header + same
event="web_auth_failed" audit event.
Story 9-2 ships GET /api/status (gateway health summary read from
the gateway_status SQLite table); Story 9-3 ships GET /api/devices
(per-device live metric values read from the metric_values table,
joined against the configured [[application.device]] topology);
Stories 9-4 / 9-5 / 9-6 will add more endpoints. All future routes
inherit the auth middleware automatically via the route(...) →
fallback_service(...) → layer(...) ordering invariant in
src/web/mod.rs::build_router — no per-route auth wiring is needed
(and a contributor adding a new route that bypasses the middleware
would have to actively work around the layer composition).
Storage-layer failures on /api/status, /api/devices (and future
read-side endpoints) return 500 Internal Server Error with a generic
body ({"error":"internal server error"}). The inner error is logged
via event="api_status_storage_error" or
event="api_devices_storage_error" (warn) — operators see the
underlying cause in the gateway log, not in the HTTP response. This
mirrors the NFR7 invariant that error messages must not leak
internal state (SQLite paths, table names, etc.) to clients.
The /api/devices JSON contract returns server-side as_of plus the
two staleness thresholds (stale_threshold_secs, bad_threshold_secs)
so the dashboard JS computes per-row staleness client-side without
hard-coding either boundary. The stale_threshold_secs field reflects
[opcua].stale_threshold_seconds (default 120) — same staleness
contract Story 5-2 established for the OPC UA path. A configured-but-
not-yet-polled metric appears with value: null + timestamp: null
(rendered as a “missing” badge in the UI) rather than being omitted.
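The client-side staleness computation can be sketched as below. The dashboard performs this in JavaScript; the Rust version only illustrates the logic. The enum, the function name, the inclusive boundaries and the assumption that bad_threshold_secs marks the Uncertain → Bad boundary are all illustrative guesses, and timestamps are modelled as integer seconds.

```rust
/// Hypothetical per-row classification derived from the /api/devices payload.
#[derive(Debug, PartialEq)]
enum Staleness {
    Missing,   // configured but not yet polled (timestamp: null)
    Good,
    Uncertain, // age reached stale_threshold_secs
    Bad,       // age reached bad_threshold_secs (assumed meaning)
}

/// Classifies one row using the server-supplied `as_of` and both thresholds,
/// so the client never hard-codes either boundary.
fn classify(
    as_of_secs: i64,
    timestamp_secs: Option<i64>,
    stale_threshold_secs: i64,
    bad_threshold_secs: i64,
) -> Staleness {
    let ts = match timestamp_secs {
        Some(ts) => ts,
        None => return Staleness::Missing,
    };
    let age = as_of_secs - ts;
    if age >= bad_threshold_secs {
        Staleness::Bad
    } else if age >= stale_threshold_secs {
        Staleness::Uncertain
    } else {
        Staleness::Good
    }
}
```

Using the server’s as_of rather than the browser clock keeps the classification independent of client clock skew.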
Configuration hot-reload
Story 9-7 adds operator-driven configuration hot-reload via SIGHUP.
Sending SIGHUP to the gateway PID re-reads config/config.toml
through the same figment chain used at startup (TOML +
OPCGW_* env-var overlay), validates the candidate, classifies
which knobs changed, atomically swaps the live Arc<AppConfig>
into a tokio::sync::watch channel, and notifies the in-process
subscribers (poller, web AppState, OPC UA listener stub for
Story 9-8) to pick up the new values at their next safe checkpoint.
SIGHUP trigger surface
# Send SIGHUP to the running gateway. The PID is whatever the init
# system (systemd / Docker / supervisor) tracks for the opcgw
# process. systemd users: wire `ExecReload=` to this kill recipe.
kill -HUP "$(pgrep opcgw)"
There is no POST /api/config/reload endpoint in v1 — SIGHUP-only
minimises CSRF / auth-surface area until Stories 9-4 / 9-5 / 9-6
land web-based CRUD endpoints (which will trigger reloads
programmatically by calling the same routine). There is also no
filesystem watch (notify crate) — editor-save races and
dependency-surface expansion ruled it out.
Knob taxonomy
The reload routine classifies every knob into one of three buckets
(see src/config_reload.rs::classify_diff for the canonical list):
Hot-reload-safe — applied without restart. Changes here are picked up by subscribers at their next safe checkpoint:
- chirpstack.polling_frequency: next poll cycle
- chirpstack.retry, chirpstack.delay: next entry to the Story 4-4 recovery loop (read-at-entry semantics; in-flight recovery unaffected)
- chirpstack.list_page_size: next pagination call
- [opcua].stale_threshold_seconds: next web-dashboard request (v1 limitation: the OPC UA path captures the threshold into per-variable read-callback closures at startup, so this knob affects only the web dashboard’s “Good → Uncertain” boundary in v1; OPC UA reads continue using the startup value)
Restart-required — reloads that mutate any of these are
rejected with event="config_reload_failed" reason="restart_required"
and a changed_knob field naming the offending field. Operators
restart the gateway after applying the change:
- chirpstack.server_address, chirpstack.api_token, chirpstack.tenant_id: gRPC channel + interceptor are bound at startup
- [opcua].host_ip_address, [opcua].host_port: bound socket
- [opcua].application_name, [opcua].application_uri, [opcua].product_uri: embedded in OPC UA endpoint discovery responses cached by clients
- [opcua].pki_dir, [opcua].certificate_path, [opcua].private_key_path: server identity
- [opcua].max_connections, [opcua].max_subscriptions_per_session, [opcua].max_monitored_items_per_sub, [opcua].max_message_size, [opcua].max_chunk_count: fed into async-opcua ServerBuilder at startup
- [opcua].user_name, [opcua].user_password: v1 limitation. Rotating credentials at runtime would require modifying the WebAuthState and OpcgwAuthManager digests captured at startup by the auth middleware. The auth-middleware refactor is deferred to a future story; v1 classifies credential changes as restart-required so a hot-reload that bumps the password is rejected loudly rather than silently ignored
- web.port, web.bind_address, web.enabled: bound socket
- web.auth_realm: captured into WebAuthState.realm at startup
- storage.database_path, storage.retention_days: DB connection pool init; retention is read at startup by the pruner
Address-space-mutating (Story 9-8 territory) — adding /
removing applications, devices, or metrics from application_list.
Story 9-7 logs an info-level topology_change_detected event with
added_devices / removed_devices / modified_devices field
counts and updates the web dashboard, but does NOT call
address_space.write().add_variables(...) or delete(...).
Until Story 9-8 lands, an operator who hot-reloads a topology
change sees the dashboard update but the OPC UA address space
stays frozen at startup state. This intentional limitation is
the v1 scope split per epics.md:916-931.
Audit events
Three new structured events are emitted by the SIGHUP listener:
- event="config_reload_attempted" (info): every SIGHUP. Carries trigger="sighup". The next line is either succeeded or failed.
- event="config_reload_succeeded" (info): validate + classify + swap completed. Carries trigger, changed_section_count, includes_topology_change, duration_ms. changed_section_count = 0 means the candidate equalled the live config (no swap).
- event="config_reload_failed" (warn / audit): reload was rejected. Carries trigger, reason ∈ {validation, io, restart_required}, changed_knob (only for reason="restart_required"), and a sanitised error field. Per NFR7, the error field never includes secrets; the ReloadError::Display impl is curated to surface only the validation diagnostic, file path, or knob name.
The classifier rejects on the first restart-required violation it finds (so the operator gets a single actionable line rather than a wall of “this also changed” noise). Iterate by fixing each flagged knob and re-issuing SIGHUP until the reload succeeds.
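The first-violation behaviour can be sketched as below. This is not src/config_reload.rs::classify_diff: the enum, the function and the input shape (a flat list of changed knob names) are hypothetical, and the constant lists only a handful of the restart-required knobs named earlier.

```rust
/// Hypothetical subset of the restart-required knobs; the canonical list
/// lives in the gateway's classifier.
const RESTART_REQUIRED: &[&str] = &[
    "chirpstack.server_address",
    "opcua.host_port",
    "opcua.user_password",
    "web.port",
    "storage.retention_days",
];

#[derive(Debug, PartialEq)]
enum ReloadDecision {
    /// Candidate equals the live config; nothing to swap.
    NoChange,
    /// Only hot-reload-safe knobs changed.
    Apply { changed_knobs: usize },
    /// Reload rejected; names the first offending knob only.
    RestartRequired { changed_knob: String },
}

/// Stops at the first restart-required knob so the operator gets one
/// actionable line per SIGHUP.
fn decide(changed: &[&str]) -> ReloadDecision {
    if changed.is_empty() {
        return ReloadDecision::NoChange;
    }
    match changed.iter().find(|k| RESTART_REQUIRED.contains(*k)) {
        Some(knob) => ReloadDecision::RestartRequired {
            changed_knob: knob.to_string(),
        },
        None => ReloadDecision::Apply {
            changed_knobs: changed.len(),
        },
    }
}
```

If both web.port and opcua.host_port changed, only the first one encountered is reported; the second surfaces on the next SIGHUP once the first has been reverted.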
Limitations (Story 9-8 dependency + v1 scope)
- OPC UA address-space mutation is stubbed. Story 9-7 logs the topology diff; Story 9-8 implements the apply.
- Credential rotation requires restart in v1 (see “Restart-required” above).
- [opcua].stale_threshold_seconds hot-reload affects only the web dashboard in v1. The OPC UA path’s per-variable read-callback closures capture the threshold at startup.
- No HTTP trigger. SIGHUP-only; web-driven reload arrives with Stories 9-4 / 9-5 / 9-6.
- No filesystem watch. Editor-save races + notify-crate dependency-surface expansion ruled it out.
Configuration mutations
Story 9-4 ships the first state-changing routes on the embedded
web server: a CRUD surface for [[application]] blocks. This
section documents the trust model, the CSRF defence, the TOML
round-trip discipline, and v1 limitations.
CRUD endpoint surface
| Method | Path | Purpose |
|---|---|---|
| GET | /api/applications | List configured applications + per-application device counts. |
| GET | /api/applications/:application_id | Single application detail. 404 on miss. |
| POST | /api/applications | Create a new (initially empty) application. |
| PUT | /api/applications/:application_id | Rename an existing application (application_id is immutable). |
| DELETE | /api/applications/:application_id | Remove an application. Rejected with 409 if it still has devices, or if it is the only configured application. |
All five routes inherit Basic auth via the Story 9-1 layer-after-route invariant. State-changing methods (POST/PUT/DELETE) additionally pass through the Story 9-4 CSRF middleware.
CSRF defence (v1)
Story 9-1 deferred CSRF to “Stories 9-4/9-5/9-6 mutating routes”. Story 9-4 ships the canonical defence — a hybrid of two checks hardened by the Story 9-4 review:
- Origin same-origin enforcement. Every POST/PUT/DELETE/PATCH request MUST carry an Origin header whose scheme://host[:port] matches one of the configured [web].allowed_origins entries. The Referer header is NOT consulted (Story 9-4 review iter-1 D2-P): per OWASP, Referer is forgeable from non-browser callers and unreliable on HTTPS→HTTP downgrade, so trusting it as a fallback widens the threat model. Strict-Referrer-Policy clients that suppress Origin are explicitly rejected; operators who hit that case should configure their browser to send Origin on same-origin XHR/fetch.
- JSON-only Content-Type. The body content type must be exactly application/json (with optional RFC 7231 ; parameter suffix). The non-standard form of application/json followed by a space and garbage is rejected (iter-1 review P12). This rejects <form> POST CSRF.

  Body-less requests still require Content-Type: application/json. DELETE without a body is the common case, and uniform CT gating is intentional defence-in-depth (an attacker mounting a CSRF DELETE via <form method="post"> would forge a Content-Type of application/x-www-form-urlencoded or multipart/form-data, never application/json). Clients MUST send Content-Type: application/json on every state-changing method, including DELETE with no body. Behavior pinned by tests/web_device_crud.rs::delete_device_without_content_type_returns_415 (Story 9-5 iter-1 review D2). A relaxation must update both the middleware and that pinning test in lockstep.
Both checks are applied after Basic auth and before the
handler. Failures emit event="application_crud_rejected"
reason="csrf" warn logs.
Method handling uses a positive allow-list (iter-1 review P13): only GET, HEAD, and OPTIONS bypass CSRF. CONNECT, TRACE, PATCH, and any custom method are treated as state-changing and CSRF-checked.
Default-port equivalence: http://gateway.local:80 and
http://gateway.local compare equal; same for https/:443 (iter-1
review P10). Browsers omit the default port on standard scheme/port
pairs, so the allow-list normalisation must follow.
Multi-Origin header bypass attempts are rejected (iter-1 review
P11): a request with more than one Origin header is treated as
malformed and refused.
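The checks described in this section can be sketched as four small functions. This is an illustration, not the Story 9-4 middleware: all function names are hypothetical, and the case-insensitive handling of scheme, host and media type is an assumption of the sketch rather than documented behaviour.

```rust
/// Positive allow-list: only GET, HEAD and OPTIONS bypass CSRF checks.
fn is_csrf_exempt(method: &str) -> bool {
    matches!(method, "GET" | "HEAD" | "OPTIONS")
}

/// Accepts `application/json` with an optional `;`-introduced parameter suffix.
/// Anything else after the media type (e.g. a space and garbage) is rejected.
fn is_json_content_type(value: &str) -> bool {
    let media_type = value.split(';').next().unwrap_or("");
    media_type.trim().eq_ignore_ascii_case("application/json")
}

/// Normalises `scheme://host[:port]`, dropping the default port (80 for http,
/// 443 for https). Returns None for other schemes or any path/query/fragment.
/// Lower-casing is an assumption of this sketch.
fn normalise_origin(origin: &str) -> Option<String> {
    let (scheme, rest) = origin.split_once("://")?;
    let scheme = scheme.to_ascii_lowercase();
    let default_port = match scheme.as_str() {
        "http" => ":80",
        "https" => ":443",
        _ => return None,
    };
    if rest.contains(|c: char| matches!(c, '/' | '?' | '#')) {
        return None;
    }
    let rest = rest.to_ascii_lowercase();
    let host = rest.strip_suffix(default_port).unwrap_or(rest.as_str());
    if host.is_empty() {
        return None;
    }
    Some(format!("{}://{}", scheme, host))
}

/// Requires exactly one Origin header whose normalised form matches an
/// allow-list entry; zero or multiple headers are refused.
fn origin_allowed(origin_headers: &[&str], allowed_origins: &[&str]) -> bool {
    let candidate = match origin_headers {
        [single] => normalise_origin(single),
        _ => None,
    };
    match candidate {
        Some(origin) => allowed_origins
            .iter()
            .any(|a| normalise_origin(a).as_deref() == Some(origin.as_str())),
        None => false,
    }
}
```

Normalising both the incoming header and the allow-list entries means an operator can write either http://gateway.local or http://gateway.local:80 in [web].allowed_origins and get the same result.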
TLS prerequisite (operator action)
The CSRF Origin defence presumes the Origin header reaches the
gateway un-tampered. On plain HTTP over a hostile LAN (DNS
spoofing, captive-portal MITM, ARP poisoning), an attacker can
falsify the Origin header. Operators deploying opcgw on a
non-trusted network MUST front it with a TLS-terminating reverse
proxy (nginx, Caddy, Traefik); the reverse proxy must enforce TLS
client→proxy AND must NOT rewrite the Origin header before
forwarding. Story 9-1’s TLS-via-reverse-proxy stance (issue #104)
remains the canonical recipe.
[web].allowed_origins knob
Default (when the key is omitted) is
vec!["http://<bind_address>:<port>"]. Operators whose browser
hits the gateway via a different URL (hostname, VPN tunnel,
reverse proxy) must extend the list explicitly. Each entry must
parse as scheme://host[:port] with no path/query/fragment.
Hot-reload of this knob is restart-required in v1 (the CSRF
state is captured at router-build time; the live-borrow refactor
is tracked in GH #113).
TOML round-trip via toml_edit
CRUD writes go through src/web/config_writer.rs::ConfigWriter
which uses toml_edit::DocumentMut to preserve operator-edited
comments, key order, and whitespace on round-trip. The figment
chain (src/config.rs) remains the read side; toml_edit is the
write side.
Writes are atomic via tempfile::NamedTempFile::new_in(parent) followed by
persist(target) (POSIX-atomic rename on the same filesystem).
Lock acquire-order invariant
CRUD handlers MUST hold ConfigWriter::lock() across the entire
write + reload + (rollback) sequence so concurrent CRUD requests
cannot lose updates. Story 9-7’s reload mutex is independent and
acquired after the write_lock — no deadlock risk.
SIGHUP-vs-CRUD-snapshot race (operator action — Story 9-4 review iter-1 D4-P)
ConfigWriter::lock() only serialises CRUD-vs-CRUD requests. A
SIGHUP-triggered reload runs on the SIGHUP listener task (Story 9-7)
and does NOT contend on ConfigWriter::write_lock. Sequence at
risk:
1. CRUD handler acquires lock().
2. Operator sends SIGHUP → reload reads disk + swaps watch channel.
3. CRUD handler reads original_bytes for rollback snapshot; this captures post-SIGHUP bytes.
4. CRUD handler writes its delta.
5. CRUD handler calls reload() again; if it fails, rollback restores step-3 bytes (NOT pre-SIGHUP).
Pre-SIGHUP TOML state is lost on rollback in this window. Operator
mitigation: do not SIGHUP while a CRUD request is in flight; the
window is small (sub-millisecond on a healthy gateway) but
operationally distinguishable. A future hardening story will gate
SIGHUP on the ConfigWriter::lock() mutex; tracked alongside
issue #113.
Rollback discipline on reload failure
When ConfigReloadHandle::reload() returns Err(_) after a
successful TOML write, the handler rolls back the on-disk TOML to
the pre-write bytes (held in memory) via ConfigWriter::rollback.
The HTTP response maps:
- ReloadError::Validation(_) → 422 Unprocessable Entity (with the validation message).
- ReloadError::Io(_) / RestartRequired → 500 Internal Server Error (operator log carries the detailed error).
If the rollback itself fails, the gateway logs an
event="application_crud_rejected" reason="rollback_failed" warn
event and poisons the ConfigWriter (Story 9-4 review iter-1
D3-P). All subsequent CRUD requests short-circuit with HTTP 503 +
event="application_crud_rejected" reason="poisoned" warn until the
gateway is restarted; operators MUST manually restore the TOML from
a backup before restart, otherwise the next startup will fail
validation. The poisoning is process-local, so a fresh cargo run
or container restart clears it.
Story 9-4 review iter-1 P4: the atomic-write also fsyncs the
tempfile data BEFORE persist + fsyncs the parent directory AFTER
persist, so a power loss during the write cannot leave the file
zero-length or the rename lost. POSIX-atomic rename + fsync(file) +
fsync(parent_dir) covers the durability gap that tempfile::persist
alone leaves.
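The ordering of the three durability steps can be sketched with the standard library alone. This is not the ConfigWriter implementation: the gateway uses tempfile::NamedTempFile, whereas the fixed temp-file name below is a hypothetical simplification that is not collision-safe, and syncing a directory handle this way assumes a Unix-like platform.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Sketch of the write sequence using only the standard library.
/// Assumption: a fixed temp-file name stands in for tempfile::NamedTempFile.
fn atomic_write(target: &Path, bytes: &[u8]) -> io::Result<()> {
    // The temp file must live in the same directory (same filesystem)
    // as the target for the rename to be atomic.
    let parent = target
        .parent()
        .filter(|p| !p.as_os_str().is_empty())
        .unwrap_or(Path::new("."));
    let tmp_path = parent.join(".config.toml.tmp"); // hypothetical name

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(bytes)?;
    tmp.sync_all()?; // fsync(file) BEFORE the rename
    drop(tmp);

    fs::rename(&tmp_path, target)?; // atomic replace on POSIX

    // fsync(parent_dir) AFTER the rename so the new directory entry is
    // durable. Opening a directory like this works on Unix-like systems.
    File::open(parent)?.sync_all()?;
    Ok(())
}
```

Skipping the first sync risks a zero-length file after power loss; skipping the second risks losing the rename itself.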
Story 9-4 review iter-1 D1-P: when the post-write reload()
returns RestartRequired { knob }, the handler checks whether the
offending knob is in the just-written CRUD delta. If NOT (i.e., the
operator made an unrelated manual edit to the TOML between gateway
start and the CRUD POST), the handler returns 409 + reason=
"ambient_drift" and DOES NOT roll back — the operator’s manual
edit is preserved. If the offending knob IS in our delta (defence-
in-depth; should not happen for application_list mutations), the
standard 500 + rollback path applies.
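The decision table described in this section can be written as a single match. The enum below is a hypothetical mirror of the variants named above, and `Outcome` and `map_reload_error` are illustrative names, not the handler's real types; the sketch only encodes the status codes and rollback decisions stated in the text.

```rust
/// Hypothetical mirror of the error variants named in this section.
enum ReloadError {
    Validation(String),
    Io(String),
    RestartRequired { knob: String },
}

/// Illustrative result type: HTTP status plus whether the TOML is rolled back.
#[derive(Debug, PartialEq)]
struct Outcome {
    status: u16,
    rollback: bool,
}

/// `delta_knobs` lists the knobs touched by the just-written CRUD delta.
fn map_reload_error(err: &ReloadError, delta_knobs: &[&str]) -> Outcome {
    match err {
        ReloadError::Validation(_) => Outcome { status: 422, rollback: true },
        ReloadError::Io(_) => Outcome { status: 500, rollback: true },
        // The offending knob is part of our own delta: standard 500 path.
        ReloadError::RestartRequired { knob } if delta_knobs.contains(&knob.as_str()) => {
            Outcome { status: 500, rollback: true }
        }
        // Ambient drift: an unrelated manual edit tripped the reload, so
        // the operator's edit is preserved and no rollback happens.
        ReloadError::RestartRequired { .. } => Outcome { status: 409, rollback: false },
    }
}
```

The guard on the third arm is what separates the ambient-drift case from the defence-in-depth case.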
Audit events
Four new event names land with Story 9-4:
- event="application_created" (info): POST succeeded.
- event="application_updated" (info): PUT succeeded.
- event="application_deleted" (info): DELETE succeeded.
- event="application_crud_rejected" (warn / audit): request rejected at one of: handler-level validation, CSRF check, conflict (delete pre-conditions), reload failure, or rollback failure. Carries reason ∈ {validation, csrf, conflict, reload_failed, io, immutable_field, rollback_failed}.
The grep contract git grep -hoE 'event = "application_[a-z_]+"' src/
must return exactly 4 lines.
application_id semantics (case-sensitivity)
application_id matching is case-sensitive throughout the
gateway (Story 9-4 review iter-2 P37 — documented design call).
App-1 and app-1 are distinct identifiers in:
- The pre-write CRUD duplicate check (src/web/api.rs::create_application).
- AppConfig::validate’s cross-application uniqueness HashSet.
- The poller’s per-application bookkeeping.
- The OPC UA address-space NodeId generation.
This means an operator can create both App-1 and app-1 and the
gateway will treat them as separate applications. If a future
deployment needs case-insensitive matching, all four sites above
must change in lockstep + a TOML migration must merge any colliding
identifiers. Tracked as a possible future hardening if operators
report case-collision confusion.
validate() amendments (additive)
Story 9-4 makes three additive changes to AppConfig::validate
(src/config.rs:1374-1426):
- Cross-application application_id uniqueness is now enforced.
- Empty device_list per application is now a warn, not a hard error.
- Empty read_metric_list per device is now a warn, not a hard error.
The two warn-demotions allow POST /api/applications to create an
application that the operator subsequently fills in via Story 9-5
endpoints. Existing operator configs with at least one device per
app see no behavioural change.
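A minimal sketch of the three rules, assuming simplified stand-in types: the `Application` and `Device` structs and the free function `validate` below are hypothetical and much smaller than the real AppConfig. Hard errors are returned as `Err`, while the two demoted checks are collected as warnings.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the real config types.
struct Device {
    device_id: String,
    read_metric_list: Vec<String>,
}

struct Application {
    application_id: String,
    device_list: Vec<Device>,
}

/// Returns the collected warnings on success, or the first hard error.
fn validate(apps: &[Application]) -> Result<Vec<String>, String> {
    let mut seen = HashSet::new();
    let mut warnings = Vec::new();
    for app in apps {
        // Case-sensitive comparison: "App-1" and "app-1" are distinct keys.
        if !seen.insert(app.application_id.as_str()) {
            return Err(format!("duplicate application_id: {}", app.application_id));
        }
        // Demoted to a warning so an empty application can be created first.
        if app.device_list.is_empty() {
            warnings.push(format!("application {} has no devices", app.application_id));
        }
        for dev in &app.device_list {
            if dev.read_metric_list.is_empty() {
                warnings.push(format!("device {} has no metrics", dev.device_id));
            }
        }
    }
    Ok(warnings)
}
```

With this shape, a freshly created application with no devices passes validation with one warning, and two applications with the same identifier fail.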
Env-var-overrides-disk-edit gotcha
If an operator has set OPCGW_APPLICATION__N__APPLICATION_NAME="X"
as an environment variable, a CRUD edit to that same field via
PUT /api/applications/... writes the new value to config.toml
on disk — but the post-write reload re-runs the figment chain
(TOML + env-var overlay), and the env-var value silently
overrides the disk edit. Operator action: unset
OPCGW_APPLICATION__* env vars before using the web UI to edit
those fields.
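The effect can be reproduced with a toy layered merge. This sketch does not use figment; `Layer`, `resolve`, and the key string are hypothetical, and the only property modelled is that later layers win.

```rust
use std::collections::HashMap;

/// One configuration layer, keyed by a flattened field path (hypothetical).
type Layer = HashMap<String, String>;

/// Later layers win, mirroring the "highest priority last" order.
fn resolve(layers: &[&Layer]) -> Layer {
    let mut out = Layer::new();
    for layer in layers {
        for (key, value) in layer.iter() {
            out.insert(key.clone(), value.clone());
        }
    }
    out
}
```

If the TOML layer holds `application_name = "New"` after a CRUD edit but the env layer still holds `"X"` for the same key, `resolve(&[&toml, &env])` yields `"X"`. Only after the env entry is removed does the disk edit become visible.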
v1 limitations
- No SQLite-side persistence. TOML is the single source of truth.
- No cookie-based CSRF token. Origin/Referer + Content-Type defence is sufficient for the LAN single-operator threat model.
- No cascade-delete. Operators must remove devices via Story 9-5 endpoints before deleting the parent application.
- OPC UA address-space mutation stubbed. Inherited from Story 9-7 — without Story 9-8, a CRUD edit updates the dashboard but the OPC UA address space stays at startup state.
- Best-effort rollback. Manual operator action required if the rollback write itself fails.
- No ChirpStack-side existence check. v1 trusts the operator-supplied application_id.
Anti-patterns
- Do NOT roll a custom CSRF token implementation. When stronger CSRF is needed, the canonical upgrade is a double-submit pattern signed with the existing per-process hmac_key (Story 9-1 / 7-2 reuse).
- Do NOT switch the write side to toml::to_string. It loses comments + key order on round-trip.
- Do NOT bypass the ConfigWriter::lock() discipline.
- Do NOT use the same metric_name (or chirpstack_metric_name) twice within one device’s read_metric_list. The post-#99 NodeId construction (format!("{}/{}", device_id, metric_name)) collapses duplicates onto the same address-space slot via last-wins semantics, the same root-cause class as issue #99 itself. Story 9-5 hardens AppConfig::validate to reject these duplicates at config-load + post-write reload time. The CRUD layer also rejects duplicate metric_name / chirpstack_metric_name shapes pre-write where it can.
- Do NOT serialise a ChirpstackDevice back via toml::Value on PUT. It silently strips the [[application.device.command]] sub-table since UpdateDeviceRequest doesn’t carry commands. Story 9-5’s PUT mutation operates on toml_edit::DocumentMut at the table level, replacing only device_name and the read_metric sub-array, so the command sub-table is preserved byte-for-byte.
Device + metric mapping CRUD (Story 9-5)
Story 9-5 lands the second mutating CRUD surface — devices and their metric mappings, nested under the existing application surface.
Endpoint surface:
| Method | Path | Purpose |
|---|---|---|
| GET | /api/applications/:application_id/devices | List devices under an application + per-device metric counts. |
| GET | /api/applications/:application_id/devices/:device_id | Single device detail (full metric mapping list). |
| POST | /api/applications/:application_id/devices | Create a new device with its initial metric mappings. |
| PUT | /api/applications/:application_id/devices/:device_id | Replace device_name and the full read_metric_list (device_id is immutable). |
| DELETE | /api/applications/:application_id/devices/:device_id | Remove a device. v1 leaves orphaned metric_values / metric_history rows in storage; the pruning task eventually removes them via the retention window. |
All five routes inherit Basic auth + the Story 9-4 CSRF defence
(via path-aware audit dispatch — see below). PUT-replaces semantics
mean the operator must ship the full intended read_metric_list
(possibly empty) on every PUT; granular per-metric routes are
deferred to a future story.
Path-aware CSRF audit dispatch: the CSRF middleware emits
event="device_crud_rejected" reason="csrf" for rejections under
/api/applications/:application_id/devices* and
event="application_crud_rejected" reason="csrf" for the
/api/applications* surface. The defence layer itself (Origin
allow-list + JSON-only Content-Type) is byte-for-byte unchanged
from Story 9-4.
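The dispatch rule can be sketched as a pure function of the request path. The function name and the `None` result for paths outside the CRUD surface are assumptions made for illustration; the real middleware may structure this differently.

```rust
/// Picks the audit event name for a CSRF rejection from the request path.
/// Hypothetical helper: returns None for paths outside /api/applications.
fn csrf_rejection_event(path: &str) -> Option<&'static str> {
    let rest = path.strip_prefix("/api/applications")?;
    // Require a segment boundary so "/api/applicationsX" does not match.
    if !(rest.is_empty() || rest.starts_with('/')) {
        return None;
    }
    let mut segments = rest.split('/').filter(|s| !s.is_empty());
    let _application_id = segments.next();
    // The segment after the application id decides the surface.
    match segments.next() {
        Some("devices") => Some("device_crud_rejected"),
        _ => Some("application_crud_rejected"),
    }
}
```

Only the event name changes with the path; the Origin and Content-Type checks run identically for both surfaces.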
AppConfig::validate amendments: Story 9-5 extends the
validator with two additive rules (modelled on the existing
seen_device_ids HashSet pattern at src/config.rs:1568):
- Per-device metric_name uniqueness: two metrics with the same metric_name inside ONE device’s read_metric_list are rejected. Without this, the post-#99 NodeId construction would collapse them onto the same OPC UA address-space slot via last-wins semantics.
- Per-device chirpstack_metric_name uniqueness: same collision class on the reverse-lookup map keyed by (device_id, chirpstack_metric_name) at src/opc_ua.rs:1032.
Cross-device metric_name collisions are allowed — the
post-#99 NodeId fix at commit 9f823cc makes this a valid
scenario (two devices dev-A and dev-B can both expose
metric_name = "Moisture" with distinct address-space NodeIds
dev-A/Moisture and dev-B/Moisture).
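Both rules, together with the cross-device case, fit in a short sketch. The `Metric` struct and `check_device_metrics` are hypothetical simplifications; only the `format!("{}/{}", device_id, metric_name)` construction is taken from this page.

```rust
use std::collections::HashSet;

// Hypothetical, simplified metric mapping.
struct Metric {
    metric_name: String,
    chirpstack_metric_name: String,
}

/// NodeId construction as described for the post-#99 fix.
fn node_id(device_id: &str, metric_name: &str) -> String {
    format!("{}/{}", device_id, metric_name)
}

/// Rejects duplicates within ONE device; returns the device's NodeIds.
fn check_device_metrics(device_id: &str, metrics: &[Metric]) -> Result<Vec<String>, String> {
    let mut names = HashSet::new();
    let mut chirpstack_names = HashSet::new();
    let mut node_ids = Vec::new();
    for m in metrics {
        if !names.insert(m.metric_name.as_str()) {
            return Err(format!("duplicate metric_name: {}", m.metric_name));
        }
        if !chirpstack_names.insert(m.chirpstack_metric_name.as_str()) {
            return Err(format!(
                "duplicate chirpstack_metric_name: {}",
                m.chirpstack_metric_name
            ));
        }
        node_ids.push(node_id(device_id, &m.metric_name));
    }
    Ok(node_ids)
}
```

Because the sets are scoped to one call, the same metric_name on dev-A and dev-B passes and produces two distinct NodeIds.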
Audit events: four new event names, parallel to Story 9-4’s shape:
- event="device_created" (info): POST succeeded.
- event="device_updated" (info): PUT succeeded.
- event="device_deleted" (info): DELETE succeeded.
- event="device_crud_rejected" (warn / audit): request rejected. Reason set extends Story 9-4’s with application_not_found (POST/PUT/DELETE under a non-existent application_id) and device_not_found (PUT/DELETE on a non-existent device_id).
The grep contract git grep -hoE 'event = "device_[a-z_]+"' src/
must return exactly 4 lines.
v1 limitations specific to Story 9-5:
- No granular per-metric routes. v1 ships PUT-replaces-device with the full metric list. Editing one metric on a 50-metric device requires sending the full list back.
- device_id (DevEUI) is immutable. Renaming would orphan every storage row keyed on device_id (metric_values, metric_history, command_queue, gateway_status), the same Epic-A-scale change as application_id rename. Operator workaround: DELETE then POST.
- No cascade-delete of metric_values / metric_history on DELETE. v1 leaves orphaned rows in storage. The pruning task (Story 2-5a) eventually removes them via the retention window (default [storage].history_retention_days = 7).
- OPC UA address-space mutation deferred to Story 9-8. Inherited from Story 9-7 + Story 9-4. The dashboard reflects newly created devices immediately; SCADA clients connected via OPC UA must reconnect to see the new variables.
- No ChirpStack-side existence check on device_id. v1 trusts the operator-supplied DevEUI; the next poll cycle surfaces a “device list lookup failed” log if the DevEUI is invalid.
References
- Story 7-1 spec: _bmad-output/implementation-artifacts/7-1-credential-management-via-environment-variables.md
- Story 7-2 spec: _bmad-output/implementation-artifacts/7-2-opc-ua-security-endpoints-and-authentication.md
- Story 7-3 spec: _bmad-output/implementation-artifacts/7-3-connection-limiting.md
- Story 9-1 spec: _bmad-output/implementation-artifacts/9-1-axum-web-server-and-basic-authentication.md
- Story 9-7 spec: _bmad-output/implementation-artifacts/9-7-configuration-hot-reload.md
- PRD requirements: FR42 (env-var injection), NFR7 (no secrets in logs), NFR8 (no real credentials in default config), NFR24 (env override for all secrets), FR19 (multi-policy OPC UA endpoints), FR20 (OPC UA user auth), FR45 (PKI layout), NFR9 (private-key 0o600), NFR12 (failed-auth audit trail), FR44 (connection limiting), FR50 (web Basic auth), NFR11 (web auth before any change), FR41 (mobile-responsive web UI) in _bmad-output/planning-artifacts/prd.md
- Configuration reference: docs/configuration.md
- Deferred follow-ups: _bmad-output/implementation-artifacts/deferred-work.md