# High Availability and Disaster Recovery
## Architecture Overview

jitsudod is a stateless binary — all persistent state lives in PostgreSQL. This has an important implication: you can run multiple jitsudod instances behind a load balancer, sharing the same database, without any additional coordination infrastructure.
```
        Internal Load Balancer
   (private, not internet-facing)
              │
         ┌────┴────┐
         ▼         ▼
   jitsudod-1   jitsudod-2    (multiple instances, same image)
         │         │
         └────┬────┘
              ▼
         PostgreSQL
(single source of truth for all state)
```

**Expiry sweeper coordination** — A PostgreSQL session-level advisory lock (`pg_try_advisory_lock`) ensures that only one jitsudod instance runs the expiry sweeper at a time. Because `provider.Revoke()` is called before the database state transition, without this lock multiple instances could issue duplicate revoke calls for the same grant. The winning instance acquires the lock, runs the sweep, then releases it; other instances skip that tick.
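A minimal simulation of that coordination pattern (not jitsudod's actual code): a process-local lock stands in for PostgreSQL's advisory lock, two "instances" tick at the same moment, and only the lock winner sweeps.

```python
import threading

# Process-local stand-in for the advisory lock. In jitsudod this would be
# `SELECT pg_try_advisory_lock(key)` against the shared database; the key
# and SQL wiring are assumptions for illustration.
advisory_lock = threading.Lock()
tick_barrier = threading.Barrier(2)
swept_by = []

def sweeper_tick(instance):
    tick_barrier.wait()                          # both instances tick together
    won = advisory_lock.acquire(blocking=False)  # pg_try_advisory_lock analogue
    tick_barrier.wait()                          # hold until both have tried, so the demo is deterministic
    if not won:
        return                                   # lost the race: skip this tick
    try:
        swept_by.append(instance)                # ...revoke expired grants here...
    finally:
        advisory_lock.release()                  # pg_advisory_unlock analogue

threads = [threading.Thread(target=sweeper_tick, args=(name,))
           for name in ("jitsudod-1", "jitsudod-2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one instance ends up in swept_by.
```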
**Policy sync** — Each instance independently polls the database every 30 seconds and reloads its in-memory OPA query cache. This means policy changes applied via `ApplyPolicy` or `DeletePolicy` propagate to all replicas within one sync interval without any fan-out coordination.
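The polling loop reduces to a poll-and-swap cache. A sketch with illustrative names (jitsudod's real structures are not shown here):

```python
import threading

class PolicyCache:
    """Poll a policy store and atomically swap the in-memory cache.

    `load_policies` stands in for the database query each instance runs
    independently; a real daemon would call sync_once() every `interval`
    seconds, so every replica converges within one sync interval.
    """
    def __init__(self, load_policies, interval=30.0):
        self._load = load_policies
        self._interval = interval        # the 30 s sync interval
        self._policies = {}
        self._lock = threading.Lock()

    def sync_once(self):
        fresh = self._load()             # e.g. read the policies table
        with self._lock:
            self._policies = fresh       # atomic swap: readers never see a partial reload

    def get(self, name):
        with self._lock:
            return self._policies.get(name)

store = {"allow-prod-read": {"ttl": "1h"}}       # stand-in for the database
cache = PolicyCache(lambda: dict(store))
cache.sync_once()

store["allow-prod-read"] = {"ttl": "2h"}         # an ApplyPolicy write lands
cache.sync_once()                                # next tick picks it up
```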
## Failure Modes

### Control plane unavailable (all jitsudod instances down)

**Behavior: fail-closed for new requests.**
If all jitsudod instances are unavailable:
- Engineers cannot submit new elevation requests
- Pending requests cannot be approved or denied
- The `jitsudo` CLI will return connection errors
This is intentional. An unreachable access control system should not silently grant access.
Existing active grants are unaffected. Credentials already issued by the cloud provider (STS session tokens, Azure RBAC assignments, GCP IAM bindings, Kubernetes RBAC bindings) remain valid until their natural TTL expiry. The credentials are held by the cloud provider, not by jitsudod. A downed control plane does not immediately revoke active sessions.
Exception: the expiry sweeper stops. The background process that calls Revoke on expired grants will not run while jitsudod is down. Grants that expire during the outage will linger until the sweeper resumes. For providers with native TTL enforcement (GCP IAM conditions, Kubernetes TTL annotations), expiry is enforced by the provider regardless. For Azure RBAC, the sweeper is the enforcement mechanism — grants will overstay their TTL during a prolonged outage.
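The sweeper's selection step reduces to a time comparison. A sketch with hypothetical field names (jitsudod's real schema is not shown here):

```python
from datetime import datetime, timedelta, timezone

def expired_grants(grants, now):
    """Return active grants whose TTL has lapsed and still need a Revoke call.

    Providers with native TTL enforcement expire the credential themselves;
    for the rest (e.g. Azure RBAC) this selection is what feeds Revoke, so
    these grants linger while the sweeper is down.
    """
    return [g for g in grants if g["state"] == "active" and g["expires_at"] <= now]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
grants = [
    {"id": "g1", "state": "active",  "expires_at": now - timedelta(minutes=5)},  # overdue
    {"id": "g2", "state": "active",  "expires_at": now + timedelta(minutes=5)},  # still valid
    {"id": "g3", "state": "revoked", "expires_at": now - timedelta(hours=1)},    # already handled
]
overdue = expired_grants(grants, now)   # only g1 needs revocation
```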
### Database unavailable

If PostgreSQL is unavailable, jitsudod cannot process any requests (every operation requires database access). jitsudod instances will log errors and return 503 responses. Recovery is automatic once the database is restored.
### Single jitsudod instance failure

Behind a load balancer, traffic is routed around failed instances. Requests in flight may return errors, but clients can retry; the `jitsudo` CLI retries transient errors automatically.
## Emergency Access When Control Plane Is Down

Break-glass (`jitsudo request --break-glass`) requires a running jitsudod. If the control plane is truly unavailable:
- Use the cloud provider’s IAM console to grant the minimum required permissions directly
- Document the access: timestamp, user, resource, justification, incident ticket
- After jitsudod is restored, revoke the manual IAM change immediately
- File a post-incident review noting the out-of-band access
Out-of-band access events are audit gaps. Minimize them by monitoring jitsudod availability and maintaining runbooks for rapid recovery.
## Production Deployment Recommendations

### Run multiple instances

Use the bundled `values-ha.yaml` overlay to enable HA mode in a single command:
```bash
helm upgrade --install jitsudo ./helm/jitsudo \
  --namespace jitsudo \
  --create-namespace \
  -f helm/jitsudo/values-ha.yaml \
  --set config.auth.oidcIssuer=https://your-idp.example.com \
  --set config.auth.clientId=jitsudo-server
```

The HA overlay enables:
- 2 replicas by default (minimum for HA; HPA scales up from there)
- HPA — scales on CPU (70%) and memory (80%), up to 10 replicas
- PodDisruptionBudget — ensures at least 1 pod remains available during node drains
- Pod anti-affinity — prefers scheduling pods on different nodes
- PostgreSQL read replica — one streaming replica for the bundled subchart (see note below)
For full production deployments, also supply an external managed database and disable the bundled subchart:
```yaml
# In your environment-specific values file
postgresql:
  enabled: false

config:
  database:
    existingSecret: "jitsudo-db"  # Secret with DATABASE_URL key
```

### Use a managed PostgreSQL service

The bundled PostgreSQL subchart (`values-ha.yaml` adds one read replica via streaming replication) is suitable for testing HA configuration. It does not provide automatic failover — if the primary crashes, manual intervention is required.
For production, use a managed service with built-in automatic failover:
| Cloud | Managed PostgreSQL |
|---|---|
| AWS | RDS Multi-AZ (automatic failover ~30–60s) |
| Azure | Azure Database for PostgreSQL - Flexible Server (HA mode) |
| GCP | Cloud SQL for PostgreSQL (HA with failover replica) |
| On-prem | Patroni + etcd, or Crunchy Data PGO |
All managed services above provide automatic failover, point-in-time recovery (PITR), and automated backups.
### Configure connection pooling

PostgreSQL has a hard limit on concurrent connections. Use PgBouncer (or Pgpool-II) between jitsudod and PostgreSQL for connection efficiency, especially during rolling restarts:

```yaml
# In jitsudod config — point at PgBouncer, not PostgreSQL directly
database:
  url: "postgres://jitsudo_app:${DB_PASSWORD}@pgbouncer:5432/jitsudo?sslmode=require"
```

### Health checks

jitsudod exposes two health endpoints:
```
GET /healthz  →  200 OK if the server is healthy
GET /readyz   →  200 OK if the server is ready to serve traffic
```

Configure your load balancer to use `/readyz` for routing decisions. The readiness check includes a database connectivity check.
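A sketch of how such a pair of handlers separates liveness from readiness; the real jitsudod handlers are not shown here, and `db_reachable` is a stand-in for its database connectivity check.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

db_ok = True                    # stand-in for actual database connectivity

def db_reachable():
    return db_ok

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self._reply(200)                             # process is up
        elif self.path == "/readyz":
            self._reply(200 if db_reachable() else 503)  # ready only if the DB answers
        else:
            self._reply(404)

    def _reply(self, code):
        self.send_response(code)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

health_status = urllib.request.urlopen(base + "/healthz").status
ready_status = urllib.request.urlopen(base + "/readyz").status

db_ok = False                   # database goes away: /readyz starts failing
try:
    urllib.request.urlopen(base + "/readyz")
    not_ready_status = None
except urllib.error.HTTPError as e:
    not_ready_status = e.code   # 503, so the load balancer stops routing here

server.shutdown()
```

This is why the load balancer should probe `/readyz`: an instance that is alive but has lost its database is removed from rotation instead of returning errors to clients.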
## Backup and Restore

### Backup schedule

Take daily automated backups of the PostgreSQL database. Managed services (RDS, Cloud SQL, Azure Database) provide this by default.
For self-managed PostgreSQL:
```bash
# Daily pg_dump to S3 (example)
pg_dump -U jitsudo_app jitsudo \
  | gzip \
  | aws s3 cp - s3://your-backup-bucket/jitsudo/$(date +%Y-%m-%d).sql.gz
```

### Restore procedure

```bash
# 1. Stop jitsudod instances to prevent writes during restore
kubectl scale deployment jitsudod --replicas=0

# 2. Restore from backup
gunzip -c backup.sql.gz | psql -U postgres jitsudo

# 3. Verify audit log hash chain integrity
jitsudo audit verify

# 4. Restart jitsudod
kubectl scale deployment jitsudod --replicas=2
```

### Audit log verification after restore

After any restore, verify the audit log hash chain:
```bash
jitsudo audit verify
```

If the chain breaks, entries were modified or inserted out-of-band between the backup point and the restore point. Investigate before allowing the restored instance to serve traffic. See Audit Log for the chain format and verification script.
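For intuition, hash-chain verification of this general kind looks like the following. The exact field layout and hash construction of jitsudo's audit log are defined in the Audit Log doc; the construction below (SHA-256 over the previous hash plus a canonical JSON payload) is an assumption for illustration only.

```python
import hashlib
import json

def entry_hash(prev_hash: str, payload: dict) -> str:
    # Assumed construction: sha256(prev_hash || canonical JSON payload).
    data = prev_hash.encode() + json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()

def verify_chain(entries) -> bool:
    """Each entry stores a hash covering its predecessor; recompute and compare."""
    prev = "0" * 64                      # genesis value (assumed)
    for e in entries:
        if e["hash"] != entry_hash(prev, e["payload"]):
            return False                 # chain broken: modified or inserted out-of-band
        prev = e["hash"]
    return True

# Build a tiny valid chain, then tamper with an entry.
payloads = [{"action": "grant", "id": 1}, {"action": "revoke", "id": 1}]
entries, prev = [], "0" * 64
for p in payloads:
    h = entry_hash(prev, p)
    entries.append({"payload": p, "hash": h})
    prev = h

ok_before = verify_chain(entries)        # intact chain verifies
entries[0]["payload"]["id"] = 999        # out-of-band modification
ok_after = verify_chain(entries)         # verification now fails
```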