Create docs/specs/008-observability.md

d2dcf46113f9 jacobcole 2026-04-23 1 file
docs/specs/008-observability.md
new file mode 100644
index 0000000..8dd8834
@@ -0,0 +1,108 @@
+---
+visibility: public
+---
+
+# Spec 008 — Observability
+
+**Status:** Draft
+**Related:** [PRD FR-22..FR-25](../prd/001-picortex-v1.md#observability), Jacob's global [dev-patterns](https://github.com/tmad4000/jacob-computer-config-private)
+
+## Goal
+
+Make everything observable without standing up a paid SaaS. Logs should be useful both for the user ("what happened in chat X last Tuesday?") and for the developer ("why did the discriminator skip this message?").
+
+## Structured logs
+
+- **Logger:** `pino` with `pino-pretty` in dev, JSON in prod.
+- **Level:** default `info`; `debug` via `LOG_LEVEL=debug`.
+- **Location:** stdout, captured by journald when running under systemd without Docker; no separate log files.
+- **Fields every log carries:**
+  - `time` (ISO)
+  - `level` (`debug`/`info`/`warn`/`error`)
+  - `request_id` (see below)
+  - `chat_id` (if in chat context)
+  - `event_type` (e.g. `linq.inbound`, `tmux.turn.start`, `discriminator.decision`)
+  - `msg` (free text)
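+
+A minimal sketch of logger options that would yield the fields above, assuming pino's documented `timestamp` and `formatters.level` options. By default pino emits `time` as epoch milliseconds and `level` as a number, so both are overridden here. `buildLoggerOptions` and `LoggerOptions` are hypothetical names, and `pino` itself is deliberately not imported so the block stands alone.
+
+```ts
+// Hypothetical shape: the subset of pino options this spec relies on.
+type LoggerOptions = {
+  level: string;
+  timestamp: () => string;
+  formatters: { level: (label: string) => { level: string } };
+};
+
+// Builds an options object intended to be passed to `pino(options)`.
+function buildLoggerOptions(env: Record<string, string | undefined>): LoggerOptions {
+  return {
+    // Default to `info`; LOG_LEVEL=debug turns on debug output.
+    level: env.LOG_LEVEL ?? 'info',
+    // pino expects a partial-JSON string that starts with a comma.
+    timestamp: () => `,"time":"${new Date().toISOString()}"`,
+    // Emit the level label ("info") instead of pino's numeric default (30).
+    formatters: { level: (label) => ({ level: label }) },
+  };
+}
+```
+
+Per-event fields such as `request_id`, `chat_id`, and `event_type` would then go on a child logger or on the merge object of each call, for example `logger.info({ chat_id, event_type: 'linq.inbound' }, 'message received')`.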
+
+## Request IDs
+
+- Fastify middleware generates `X-Request-ID` (uuid v7) for every inbound HTTP request.
+- Response headers echo it.
+- Logs in that request's async context include it.
+- Child-process spawns inherit it via env (`PICORTEX_REQUEST_ID`).
+- Linq inbound events tag the request ID into the `events` SQLite row.
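+
+The propagation rules above could be sketched with Node's `AsyncLocalStorage`, which is one way (not necessarily the chosen one) to carry an ID through a request's async context. `uuidV7`, `withRequestId`, `currentRequestId`, and `childEnv` are hypothetical helper names; the UUID layout follows RFC 9562, and wiring into a Fastify hook is left out.
+
+```ts
+import { AsyncLocalStorage } from 'node:async_hooks';
+import { randomBytes } from 'node:crypto';
+
+// UUID v7: 48-bit Unix millisecond timestamp, version nibble, variant bits, random tail.
+function uuidV7(nowMs: number = Date.now()): string {
+  const bytes = randomBytes(16);
+  bytes.writeUIntBE(nowMs, 0, 6);      // bytes 0..5: big-endian timestamp
+  bytes[6] = (bytes[6] & 0x0f) | 0x70; // version 7
+  bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC variant (10xx)
+  const hex = bytes.toString('hex');
+  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20)].join('-');
+}
+
+const requestContext = new AsyncLocalStorage<{ requestId: string }>();
+
+// Runs `fn` with a request ID bound to its async context.
+function withRequestId<T>(fn: () => T, requestId: string = uuidV7()): T {
+  return requestContext.run({ requestId }, fn);
+}
+
+// Returns the ID of the request currently being handled, if any.
+function currentRequestId(): string | undefined {
+  return requestContext.getStore()?.requestId;
+}
+
+// Environment for a spawned child so it inherits the ID via PICORTEX_REQUEST_ID.
+function childEnv(base: Record<string, string | undefined>): Record<string, string | undefined> {
+  const id = currentRequestId();
+  return id ? { ...base, PICORTEX_REQUEST_ID: id } : { ...base };
+}
+```
+
+A logger mixin could call `currentRequestId()` to stamp `request_id` on every line, and a spawn call could pass `childEnv(process.env)` as its `env` option.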
+
+## `/api/frontend-log`
+
+Per Jacob's global rules. Client-side:
+
+```ts
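+// Report uncaught errors to the server.
+window.addEventListener('error', ev => {
+  fetch('/api/frontend-log', {
+    method: 'POST',
+    headers: { 'Content-Type': 'application/json' }, // so the server parses the body as JSON
+    keepalive: true, // let the request outlive a page unload
+    body: JSON.stringify({
+      level: 'error',
+      message: ev.message,
+      error: ev.error?.toString(),
+      stack: ev.error?.stack,
+      context: { url: location.href, ua: navigator.userAgent, build: __VERSION__ }
+    })
+  }).catch(() => {}) // avoid an unhandled rejection if the report itself fails
+})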
+```
+
+Server-side endpoint:
+
+- Accepts request bodies up to `FRONTEND_LOG_MAX_BYTES` (default 64 KB)
+- Rate-limited to 30/min per IP
+- Logs under `event_type: "frontend"` with the browser-supplied fields plus the request ID tying it to the current user session
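+
+The size cap and rate limit above might look like the following sketch. It assumes a fixed 60-second window per IP and assumes the endpoint answers 413 for oversized bodies, 429 when rate-limited, and 204 otherwise; none of those choices are fixed by this spec. `FixedWindowLimiter` and `checkFrontendLog` are hypothetical names.
+
+```ts
+// Fixed-window counter keyed by client IP. Entries are never pruned here;
+// a real implementation would evict stale windows.
+class FixedWindowLimiter {
+  private windows = new Map<string, { start: number; count: number }>();
+  private limit: number;
+  private windowMs: number;
+
+  constructor(limit: number = 30, windowMs: number = 60_000) {
+    this.limit = limit;
+    this.windowMs = windowMs;
+  }
+
+  // Returns true if this request is within the limit for its IP.
+  allow(ip: string, now: number = Date.now()): boolean {
+    const w = this.windows.get(ip);
+    if (!w || now - w.start >= this.windowMs) {
+      this.windows.set(ip, { start: now, count: 1 });
+      return true;
+    }
+    w.count += 1;
+    return w.count <= this.limit;
+  }
+}
+
+const DEFAULT_MAX_BYTES = 64 * 1024; // stands in for FRONTEND_LOG_MAX_BYTES
+
+// Returns the HTTP status the endpoint would answer with (assumed codes).
+function checkFrontendLog(
+  limiter: FixedWindowLimiter,
+  ip: string,
+  body: string,
+  maxBytes: number = DEFAULT_MAX_BYTES,
+  now: number = Date.now(),
+): 204 | 413 | 429 {
+  if (new TextEncoder().encode(body).length > maxBytes) return 413;
+  if (!limiter.allow(ip, now)) return 429;
+  return 204;
+}
+```
+
+Checking size before the limiter means oversized payloads do not consume the sender's quota; the opposite order is equally defensible if abuse is the main concern.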
+
+## Metrics
+
+No Prometheus in v1. Instead, lightweight counters live in a SQLite `metrics` table and are exposed by `/health`:
+
+```
+chats_total
+chats_active_7d
+turns_total
+turns_last_24h
+discriminator_skipped_24h
+errors_last_24h
+```
+
+`/health` returns:
+```json
+{
+  "status": "ok",
+  "version": "0.0.1",
+  "commit": "abcd123",
+  "uptime_seconds": 3412,
+  "db_ok": true,
+  "tmux_ok": true,
+  "metrics": { ... }
+}
+```
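+
+As an illustration of how that payload could be assembled, here is a small pure function. `buildHealth` and `HealthInput` are hypothetical names, and the `"degraded"` status for a failed check is an assumption; this spec only shows the `"ok"` case.
+
+```ts
+interface HealthInput {
+  version: string;
+  commit: string;
+  startedAtMs: number;
+  nowMs: number;
+  dbOk: boolean;
+  tmuxOk: boolean;
+  metrics: Record<string, number>; // counters read from the SQLite `metrics` table
+}
+
+// Assembles the JSON body returned by /health.
+function buildHealth(i: HealthInput) {
+  return {
+    // Assumption: any failed dependency check downgrades status to "degraded".
+    status: i.dbOk && i.tmuxOk ? 'ok' : 'degraded',
+    version: i.version,
+    commit: i.commit,
+    uptime_seconds: Math.floor((i.nowMs - i.startedAtMs) / 1000),
+    db_ok: i.dbOk,
+    tmux_ok: i.tmuxOk,
+    metrics: i.metrics,
+  };
+}
+```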
+
+## Network egress allowlist
+
+Claude Code chat users should only reach:
+- `api.anthropic.com`
+- `registry.npmjs.org` (for tooling, if used by Claude)
+- `pypi.org` (if Python is used)
+- `github.com`, `raw.githubusercontent.com`
+- Anything the user explicitly allowlists in `/etc/picortex/egress-allowlist.txt`
+
+Enforced via the iptables `owner` match on the chat user's UID. Each rejected connection logs an event, and Jacob gets an alert when a previously unseen host is attempted (learning mode).
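+
+One way to turn the allowlist into rules is to generate the `iptables` commands from the file, as sketched below. The file format (one host per line, `#` comments), the helper names, and the log prefix are assumptions. Note also that iptables resolves a hostname to IP addresses once, when the rule is added, and that DNS traffic for the same UID would need its own rule; neither is handled here.
+
+```ts
+// Hosts from this spec; npm and PyPI are conditional there but included here for brevity.
+const BASE_HOSTS = [
+  'api.anthropic.com',
+  'registry.npmjs.org',
+  'pypi.org',
+  'github.com',
+  'raw.githubusercontent.com',
+];
+
+// Parses allowlist text, dropping comments, blank lines, and anything that
+// does not look like a hostname (which also keeps shell metacharacters out).
+function parseAllowlist(text: string): string[] {
+  return text
+    .split('\n')
+    .map((line) => line.replace(/#.*$/, '').trim())
+    .filter((host) => /^[A-Za-z0-9.-]+$/.test(host));
+}
+
+// Builds the command lines for one chat user's UID: accept listed hosts,
+// log everything else, then reject it.
+function egressRules(uid: number, extraHosts: string[]): string[] {
+  const hosts = [...new Set([...BASE_HOSTS, ...extraHosts])];
+  const match = `-m owner --uid-owner ${uid}`;
+  return [
+    ...hosts.map((h) => `iptables -A OUTPUT ${match} -d ${h} -j ACCEPT`),
+    `iptables -A OUTPUT ${match} -j LOG --log-prefix "picortex-egress-reject "`,
+    `iptables -A OUTPUT ${match} -j REJECT`,
+  ];
+}
+```
+
+The `LOG` rule ahead of `REJECT` is what would produce the per-rejection event; turning those kernel log lines into alerts is outside this sketch.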
+
+## Sentry (optional, post-v0.1)
+
+If Jacob wants error aggregation: `@sentry/node` + `@sentry/browser`. Keep it off by default.
+
+## Testing
+
+- **Unit:** request-ID middleware; log shape sanity.
+- **Integration:** frontend-log roundtrip.
+- **Manual:** tail logs during E2E; verify every turn has a request ID.
+
+## Open questions
+
+- OQ1: Where are logs archived long-term? (Not in v1 — stdout + journald is fine.)
+- OQ2: Do we want Axiom or Loki integration? (Not for v1. Cortex uses Axiom.)
\ No newline at end of file