# High Availability and Disaster Recovery
## Architecture Overview

jitsudod is a stateless binary — all persistent state lives in PostgreSQL. This has an important implication: you can run multiple jitsudod instances behind a load balancer, sharing the same database, without any additional coordination infrastructure.
```
        Internal Load Balancer
   (private, not internet-facing)
              │
         ┌────┴────┐
         ▼         ▼
   jitsudod-1   jitsudod-2    (multiple instances, same image)
         │         │
         └────┬────┘
              ▼
         PostgreSQL
(single source of truth for all state)
```

**Expiry sweeper coordination** — A PostgreSQL session-level advisory lock (`pg_try_advisory_lock`) ensures that only one jitsudod instance runs the expiry sweeper at a time. Because `provider.Revoke()` is called before the database state transition, without this lock multiple instances could issue duplicate revoke calls for the same grant. The winning instance acquires the lock, runs the sweep, then releases it; other instances skip that tick.
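A minimal simulation of that coordination pattern (not jitsudod's actual code): a process-local lock stands in for PostgreSQL's advisory lock, two "instances" tick at the same moment, and only the lock winner sweeps.

```python
import threading

# Process-local stand-in for the advisory lock. In jitsudod this would be
# `SELECT pg_try_advisory_lock(key)` against the shared database; the key
# and SQL wiring are assumptions for illustration.
advisory_lock = threading.Lock()
tick_barrier = threading.Barrier(2)
swept_by = []

def sweeper_tick(instance):
    tick_barrier.wait()                          # both instances tick together
    won = advisory_lock.acquire(blocking=False)  # pg_try_advisory_lock analogue
    tick_barrier.wait()                          # hold until both have tried, so the demo is deterministic
    if not won:
        return                                   # lost the race: skip this tick
    try:
        swept_by.append(instance)                # ...revoke expired grants here...
    finally:
        advisory_lock.release()                  # pg_advisory_unlock analogue

threads = [threading.Thread(target=sweeper_tick, args=(name,))
           for name in ("jitsudod-1", "jitsudod-2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one instance ends up in swept_by.
```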
**Policy sync** — Each instance independently polls the database every 30 seconds and reloads its in-memory OPA query cache. This means policy changes applied via `ApplyPolicy` or `DeletePolicy` propagate to all replicas within one sync interval without any fan-out coordination.
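The polling loop reduces to a poll-and-swap cache. A sketch with illustrative names (jitsudod's real structures are not shown here):

```python
import threading

class PolicyCache:
    """Poll a policy store and atomically swap the in-memory cache.

    `load_policies` stands in for the database query each instance runs
    independently; a real daemon would call sync_once() every `interval`
    seconds, so every replica converges within one sync interval.
    """
    def __init__(self, load_policies, interval=30.0):
        self._load = load_policies
        self._interval = interval        # the 30 s sync interval
        self._policies = {}
        self._lock = threading.Lock()

    def sync_once(self):
        fresh = self._load()             # e.g. read the policies table
        with self._lock:
            self._policies = fresh       # atomic swap: readers never see a partial reload

    def get(self, name):
        with self._lock:
            return self._policies.get(name)

store = {"allow-prod-read": {"ttl": "1h"}}       # stand-in for the database
cache = PolicyCache(lambda: dict(store))
cache.sync_once()

store["allow-prod-read"] = {"ttl": "2h"}         # an ApplyPolicy write lands
cache.sync_once()                                # next tick picks it up
```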
## Failure Modes

### Control plane unavailable (all jitsudod instances down)

**Behavior: fail-closed for new requests.**
If all jitsudod instances are unavailable:
- Engineers cannot submit new elevation requests
- Pending requests cannot be approved or denied
- The `jitsudo` CLI will return connection errors
This is intentional. An unreachable access control system should not silently grant access.
Existing active grants are unaffected. Credentials already issued by the cloud provider (STS session tokens, Azure RBAC assignments, GCP IAM bindings, Kubernetes RBAC bindings) remain valid until their natural TTL expiry. The credentials are held by the cloud provider, not by jitsudod. A downed control plane does not immediately revoke active sessions.
Exception: the expiry sweeper stops. The background process that calls Revoke on expired grants will not run while jitsudod is down. Grants that expire during the outage will linger until the sweeper resumes. For providers with native TTL enforcement (GCP IAM conditions, Kubernetes TTL annotations), expiry is enforced by the provider regardless. For Azure RBAC, the sweeper is the enforcement mechanism — grants will overstay their TTL during a prolonged outage.
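The sweeper's selection step reduces to a time comparison. A sketch with hypothetical field names (jitsudod's real schema is not shown here):

```python
from datetime import datetime, timedelta, timezone

def expired_grants(grants, now):
    """Return active grants whose TTL has lapsed and still need a Revoke call.

    Providers with native TTL enforcement expire the credential themselves;
    for the rest (e.g. Azure RBAC) this selection is what feeds Revoke, so
    these grants linger while the sweeper is down.
    """
    return [g for g in grants if g["state"] == "active" and g["expires_at"] <= now]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
grants = [
    {"id": "g1", "state": "active",  "expires_at": now - timedelta(minutes=5)},  # overdue
    {"id": "g2", "state": "active",  "expires_at": now + timedelta(minutes=5)},  # still valid
    {"id": "g3", "state": "revoked", "expires_at": now - timedelta(hours=1)},    # already handled
]
overdue = expired_grants(grants, now)   # only g1 needs revocation
```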
### Database unavailable

If PostgreSQL is unavailable, jitsudod cannot process any requests (every operation requires database access). jitsudod instances will log errors and return 503 responses. Recovery is automatic once the database is restored.
### Single jitsudod instance failure

Behind a load balancer, traffic is routed around failed instances. Requests in flight may return errors, but clients can retry; the `jitsudo` CLI retries transient errors automatically.
## Emergency Access When Control Plane Is Down

Break-glass (`jitsudo request --break-glass`) requires a running jitsudod. If the control plane is truly unavailable:
- Use the cloud provider’s IAM console to grant the minimum required permissions directly
- Document the access: timestamp, user, resource, justification, incident ticket
- After jitsudod is restored, revoke the manual IAM change immediately
- File a post-incident review noting the out-of-band access
Out-of-band access events are audit gaps. Minimize them by monitoring jitsudod availability and maintaining runbooks for rapid recovery.
## Production Deployment Recommendations

### Run multiple instances

Use the bundled `values-ha.yaml` overlay to enable HA mode in a single command:
```bash
helm upgrade --install jitsudo ./helm/jitsudo \
  --namespace jitsudo \
  --create-namespace \
  -f helm/jitsudo/values-ha.yaml \
  --set config.auth.oidcIssuer=https://your-idp.example.com \
  --set config.auth.clientId=jitsudo-server
```

The HA overlay enables:
- 2 replicas by default (minimum for HA; HPA scales up from there)
- HPA — scales on CPU (70%) and memory (80%), up to 10 replicas
- PodDisruptionBudget — ensures at least 1 pod remains available during node drains
- Pod anti-affinity — prefers scheduling pods on different nodes
- PostgreSQL read replica — one streaming replica for the bundled subchart (see note below)
For full production deployments, also supply an external managed database and disable the bundled subchart:
```yaml
# In your environment-specific values file
postgresql:
  enabled: false

config:
  database:
    existingSecret: "jitsudo-db"  # Secret with DATABASE_URL key
```

### Use a managed PostgreSQL service

The bundled PostgreSQL subchart (`values-ha.yaml` adds one read replica via streaming replication) is suitable for testing HA configuration. It does not provide automatic failover — if the primary crashes, manual intervention is required.
For production, use a managed service with built-in automatic failover:
| Cloud | Managed PostgreSQL |
|---|---|
| AWS | RDS Multi-AZ (automatic failover ~30–60s) |
| Azure | Azure Database for PostgreSQL - Flexible Server (HA mode) |
| GCP | Cloud SQL for PostgreSQL (HA with failover replica) |
| On-prem | Patroni + etcd, or Crunchy Data PGO |
All managed services above provide automatic failover, point-in-time recovery (PITR), and automated backups.
### Configure connection pooling

PostgreSQL has a hard limit on concurrent connections. Use PgBouncer (or Pgpool-II) between jitsudod and PostgreSQL for connection efficiency, especially during rolling restarts:

```yaml
# In jitsudod config — point at PgBouncer, not PostgreSQL directly
database:
  url: "postgres://jitsudo_app:${DB_PASSWORD}@pgbouncer:5432/jitsudo?sslmode=require"
```

### Health checks

jitsudod exposes two health endpoints:
```
GET /healthz  →  200 OK if the server is healthy
GET /readyz   →  200 OK if the server is ready to serve traffic
```

Configure your load balancer to use `/readyz` for routing decisions. The readiness check includes a database connectivity check.
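A sketch of how such a pair of handlers separates liveness from readiness; the real jitsudod handlers are not shown here, and `db_reachable` is a stand-in for its database connectivity check.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

db_ok = True                    # stand-in for actual database connectivity

def db_reachable():
    return db_ok

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self._reply(200)                             # process is up
        elif self.path == "/readyz":
            self._reply(200 if db_reachable() else 503)  # ready only if the DB answers
        else:
            self._reply(404)

    def _reply(self, code):
        self.send_response(code)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

health_status = urllib.request.urlopen(base + "/healthz").status
ready_status = urllib.request.urlopen(base + "/readyz").status

db_ok = False                   # database goes away: /readyz starts failing
try:
    urllib.request.urlopen(base + "/readyz")
    not_ready_status = None
except urllib.error.HTTPError as e:
    not_ready_status = e.code   # 503, so the load balancer stops routing here

server.shutdown()
```

This is why the load balancer should probe `/readyz`: an instance that is alive but has lost its database is removed from rotation instead of returning errors to clients.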
## Backup and Restore

### Backup schedule

Take daily automated backups of the PostgreSQL database. Managed services (RDS, Cloud SQL, Azure Database) provide this by default.
For self-managed PostgreSQL:
```bash
# Daily pg_dump to S3 (example)
pg_dump -U jitsudo_app jitsudo \
  | gzip \
  | aws s3 cp - s3://your-backup-bucket/jitsudo/$(date +%Y-%m-%d).sql.gz
```

### Restore procedure

```bash
# 1. Stop jitsudod instances to prevent writes during restore
kubectl scale deployment jitsudod --replicas=0

# 2. Restore from backup
gunzip -c backup.sql.gz | psql -U postgres jitsudo

# 3. Verify audit log hash chain integrity
jitsudo audit verify

# 4. Restart jitsudod
kubectl scale deployment jitsudod --replicas=2
```

### Audit log verification after restore

After any restore, verify the audit log hash chain:
```bash
jitsudo audit verify
```

If the chain breaks, entries were modified or inserted out-of-band between the backup point and the restore point. Investigate before allowing the restored instance to serve traffic. See Audit Log for the chain format and verification script.
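For intuition, hash-chain verification of this general kind looks like the following. The exact field layout and hash construction of jitsudo's audit log are defined in the Audit Log doc; the construction below (SHA-256 over the previous hash plus a canonical JSON payload) is an assumption for illustration only.

```python
import hashlib
import json

def entry_hash(prev_hash: str, payload: dict) -> str:
    # Assumed construction: sha256(prev_hash || canonical JSON payload).
    data = prev_hash.encode() + json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()

def verify_chain(entries) -> bool:
    """Each entry stores a hash covering its predecessor; recompute and compare."""
    prev = "0" * 64                      # genesis value (assumed)
    for e in entries:
        if e["hash"] != entry_hash(prev, e["payload"]):
            return False                 # chain broken: modified or inserted out-of-band
        prev = e["hash"]
    return True

# Build a tiny valid chain, then tamper with an entry.
payloads = [{"action": "grant", "id": 1}, {"action": "revoke", "id": 1}]
entries, prev = [], "0" * 64
for p in payloads:
    h = entry_hash(prev, p)
    entries.append({"payload": p, "hash": h})
    prev = h

ok_before = verify_chain(entries)        # intact chain verifies
entries[0]["payload"]["id"] = 999        # out-of-band modification
ok_after = verify_chain(entries)         # verification now fails
```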