Quick answer
A production Mirth Connect disaster recovery architecture requires four components: (1) database replication with cross-region copies, (2) automated nightly channel configuration exports to S3 or equivalent versioned storage, (3) infrastructure-as-code so the entire deployment can be rebuilt from version-controlled templates, and (4) quarterly tested restore procedures. Target RTO under 1 hour and RPO under 5 minutes for most healthcare integration workloads. A backup you have never restored is not a backup.
This guide walks through the HA-vs-DR distinction, the RTO/RPO targets that drive the architecture, the four backup layers every Mirth deployment has, three reference architectures with their cost-vs-recovery-time tradeoffs, the runbook template, the testing cadence, and the ten mistakes we see in nearly every Mirth DR plan we audit. Written by the engineers who deliver Mirth Connect support for US healthcare organizations.
HA vs DR — they solve different problems
Before architecting anything, get the terminology right. These two get conflated constantly.
High Availability (HA) protects against component-level failures within a region or data center. A single EC2 instance dies, but the load balancer routes traffic to a healthy instance. A database primary fails, but RDS automatically promotes the standby. HA failures are common and HA recovery is automatic.
Disaster Recovery (DR) protects against site-level or region-level failures. An entire AWS region becomes unavailable. A data center loses power for an extended period. A ransomware attack encrypts production infrastructure. DR failures are rare but catastrophic, and DR recovery involves human decision-making and procedural execution.
Production Mirth Connect deployments need both. HA gives you near-zero downtime for the failure modes you'll encounter regularly. DR gives you survivability for the rare events that would otherwise be existential.
| Concern | HA solves it | DR solves it |
|---|---|---|
| Single EC2 instance failure | ✓ | — |
| Single AZ outage | ✓ | — |
| Database primary failure | ✓ | — |
| Region-wide AWS outage | — | ✓ |
| Ransomware encrypts production | — | ✓ |
| Data center fire | — | ✓ |
| Accidental destruction of production | — | ✓ |
| Compromise of all in-region backups | — | ✓ (with cross-region backup) |
Setting RTO and RPO targets
Two numbers drive every DR architecture decision.
RTO (Recovery Time Objective) — how long can the system be down before business impact becomes unacceptable. Measured in minutes, hours, or days.
RPO (Recovery Point Objective) — how much data loss is acceptable. Measured in minutes, hours, or days of data.
These are business decisions, not technical decisions. Set them before designing infrastructure.
Typical RTO/RPO targets for healthcare integration workloads:
| Workload type | RTO | RPO |
|---|---|---|
| Critical clinical (ADT, real-time orders) | Under 15 min | Near-zero |
| Standard clinical integration | Under 1 hour | Under 5 min |
| Lab results, scheduling | Under 4 hours | Under 30 min |
| Analytics / reporting feeds | Under 24 hours | Under 4 hours |
| Archival / historical | Under 7 days | Under 24 hours |
The tighter the targets, the more expensive the architecture. A 15-minute RTO with near-zero RPO requires at least active-passive multi-region with continuous replication, which costs meaningfully more than an architecture built for a 4-hour RTO and 30-minute RPO. Match the architecture to the actual business requirement, not to aspiration.
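The targets in the table above can be captured as data so that drill results are checked against them mechanically. This is a minimal sketch: the `Target` type, the dictionary keys, and the choice to model "near-zero" RPO as 0 minutes are illustrative assumptions, not part of any Mirth Connect tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable window of lost data


# Values transcribed from the table; "near-zero" is modeled as 0 (assumption).
TARGETS = {
    "critical_clinical": Target(15, 0),
    "standard_clinical": Target(60, 5),
    "lab_scheduling": Target(4 * 60, 30),
    "analytics": Target(24 * 60, 4 * 60),
    "archival": Target(7 * 24 * 60, 24 * 60),
}


def meets_target(workload: str, measured_rto: float, measured_rpo: float) -> bool:
    """Return True if a drill's measured downtime and data loss are within target."""
    target = TARGETS[workload]
    return measured_rto <= target.rto_minutes and measured_rpo <= target.rpo_minutes
```

A drill that recovered a standard clinical feed in 45 minutes with 3 minutes of lost data would pass `meets_target("standard_clinical", 45, 3)`; the same result would fail against the critical clinical tier.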
The four backup layers in a Mirth Connect deployment
Mirth Connect deployments have four distinct things that need backup, each with different mechanisms.
Layer 1 — The Mirth Connect database
The database holds channel definitions, channel statistics, message history, audit logs, and user accounts. Loss of this database means loss of operational history and visibility, even if channels can be reconstructed from exports.
Backup approach:
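- RDS automated backups with 30-day retention (a common baseline; HIPAA requires a data backup plan but does not prescribe a retention period, and RDS caps automated backup retention at 35 days)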
- Multi-AZ for synchronous replication within a region
- Cross-region read replica for multi-region DR (one-way async replication)
- Periodic manual snapshots before major changes
- Snapshot copying to a separate AWS account for ransomware protection
Restore considerations:
- Point-in-time recovery within the retention window
- Cross-region snapshot copies enable region-failure recovery
- Test restore quarterly to a non-production environment
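The cross-region snapshot copy in this layer can be scripted. The sketch below assumes a boto3 RDS client created in the destination region (for example `boto3.client("rds", region_name="us-west-2")`) is passed in; the function names, the `dr-` naming scheme, and the key alias in the usage note are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def dr_snapshot_name(source_snapshot_arn: str, now: Optional[datetime] = None) -> str:
    """Build a target identifier such as 'dr-mirth-prod-20260115'."""
    now = now or datetime.now(timezone.utc)
    # The snapshot name is the last colon-separated field of the ARN.
    name = source_snapshot_arn.rsplit(":", 1)[-1]
    return f"dr-{name}-{now:%Y%m%d}"


def copy_snapshot_to_dr(rds_client, source_snapshot_arn: str, source_region: str,
                        kms_key_id: str, now: Optional[datetime] = None) -> str:
    """Copy one snapshot into the region the client is bound to; return the new name."""
    target = dr_snapshot_name(source_snapshot_arn, now)
    rds_client.copy_db_snapshot(
        SourceDBSnapshotIdentifier=source_snapshot_arn,
        TargetDBSnapshotIdentifier=target,
        KmsKeyId=kms_key_id,         # encryption key in the destination region
        SourceRegion=source_region,  # lets boto3 build the pre-signed URL
        CopyTags=True,
    )
    return target
```

Because the client is injected, the same function can be exercised with a stub in tests and with a real client from a scheduled job. Copying into a separate AWS account additionally requires sharing the snapshot and its key with that account, which this sketch does not cover.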
For the broader database configuration choices that underpin this layer, see Mirth Connect database configuration.
Layer 2 — Channel configurations (the infrastructure-as-code layer)
Channel configurations are your most important asset for fast recovery. If you have current channel exports, you can rebuild a Mirth Connect deployment in hours. Without them, rebuilding from scratch is a multi-week project.
Backup approach:
- Scheduled nightly export of all channels to S3
- S3 bucket with versioning enabled (point-in-time recovery)
- S3 bucket in a separate AWS account or different region
- Channel exports committed to a Git repository for change history
Implementation: the Mirth Connect API provides export endpoints. A scheduled Lambda function or cron job calls these endpoints daily, writes the resulting XML to S3 with timestamp-based key naming, and triggers a backup-verified alarm if the export count drops.
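The bookkeeping around that export job might look like the following sketch. It covers only the parts that do not touch the network: counting channels in the exported XML, building a timestamp-based S3 key, and deciding whether to raise the alarm. It assumes the export is a root element containing one `<channel>` element per channel; verify that against the output of your Mirth Connect version. All function names and the key layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def count_channels(export_xml: str) -> int:
    """Count <channel> elements directly under the root of an export document."""
    root = ET.fromstring(export_xml)
    return len(root.findall("channel"))


def export_key(prefix: str, now: datetime) -> str:
    """Build a timestamp-based S3 key, partitioned by date for easy browsing."""
    return f"{prefix}/{now:%Y/%m/%d}/channels-{now:%Y%m%dT%H%M%SZ}.xml"


def export_count_dropped(previous: int, current: int) -> bool:
    """True when tonight's export holds fewer channels than the previous one."""
    return current < previous
```

A drop in channel count is not always an error, since channels are sometimes retired on purpose, so the alarm is better treated as a prompt for a human to confirm than as an automatic failure.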
Layer 3 — Message store
The message store contains the actual HL7/FHIR messages processed by Mirth. For HIPAA-covered organizations, retention requirements typically range from 6-7 years. The message store can grow very large.
Backup approach:
- Database backups cover recent messages (within the retention window)
- For long-term retention, archive completed messages to S3 with lifecycle policies
- S3 lifecycle: standard for 90 days → Glacier Instant Retrieval → Glacier Deep Archive for 7+ year retention
- Use SSE-KMS with customer-managed keys for PHI
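The lifecycle progression above can be expressed as the configuration document that S3 accepts. This sketch builds it as a plain dictionary; the rule ID, the key prefix, and the 365-day move to Deep Archive are assumptions, since the right timing depends on how often older messages are actually retrieved. No expiration rule is included because the retention goal is seven years or more.

```python
def message_archive_lifecycle(prefix: str = "mirth/messages/",
                              to_glacier_ir_days: int = 90,
                              to_deep_archive_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration for archived Mirth messages."""
    if not 0 < to_glacier_ir_days < to_deep_archive_days:
        raise ValueError("transitions must be positive and in ascending order")
    return {
        "Rules": [
            {
                "ID": "mirth-message-archive",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # S3 Standard for the first 90 days, then progressively colder tiers.
                    {"Days": to_glacier_ir_days, "StorageClass": "GLACIER_IR"},
                    {"Days": to_deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }
```

With boto3, the result can be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.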
Recovery consideration: message store recovery is typically the slowest part of a full DR scenario. If RTO requires fast restoration, prioritize recent message data and accept that older archives may take hours-to-days longer to make queryable.
Layer 4 — Attachments and large objects
If your channels handle large attachments (DICOM images, PDF documents, large HL7 messages), these are typically stored in S3 (recommended) or in the Mirth database (not recommended at scale).
Backup approach:
- S3 with cross-region replication for HIPAA-eligible buckets
- Versioning enabled to protect against accidental deletion
- Object Lock for compliance-critical attachments that must not be deleted
Holding large attachments in the Mirth database is also one of the most common causes of the heap-space error covered in our Mirth Connect Java heap space error post — another reason to keep attachments in S3.
Three common DR architectures
The right architecture depends on your RTO/RPO targets and budget.
Architecture A — Single-region with backups (basic DR)
For workloads tolerating RTO of 4-24 hours and RPO of 4 hours.
Region us-east-1:
- Mirth Connect (Multi-AZ EC2 or ECS)
- RDS Multi-AZ
- S3 with versioning (channels, attachments)
- Daily database snapshots
Cross-region:
- Snapshots copied to us-west-2 (manual or automated)
- S3 cross-region replication
DR scenario: a region fails. Engineering team builds new infrastructure in us-west-2 from infrastructure-as-code, restores database from cross-region snapshot, imports channels from S3. Total recovery time: 4-24 hours.
Cost: Lowest of the three patterns. No standby infrastructure running in the second region.
Architecture B — Active-passive multi-region (standard DR)
For workloads requiring RTO of 1 hour and RPO of 5 minutes or less.
Region us-east-1 (primary):
- Mirth Connect (Multi-AZ, actively processing)
- RDS Multi-AZ
- S3 with versioning
Region us-west-2 (passive):
- Mirth Connect infrastructure deployed but not processing (warm standby)
- RDS cross-region read replica
- S3 cross-region replication
- Route 53 health checks ready to fail over DNS
DR scenario: primary region fails. Health checks detect failure. DNS fails over to us-west-2. Read replica promoted to primary. Mirth instances start processing. Total recovery time: under 1 hour.
Cost: Standby infrastructure runs 24/7, typically at 30-50% of primary cost.
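The DNS half of Architecture B comes down to a pair of Route 53 failover records. The sketch below builds the change entries as plain dictionaries in the shape Route 53 expects; the hostnames, set identifiers, health check ID, and 60-second TTL in the usage note are placeholders.

```python
from typing import Optional


def failover_record(name: str, target_dns: str, role: str, set_id: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one UPSERT change for a Route 53 failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,  # distinguishes the two records sharing one name
        "Failover": role,
        "TTL": ttl,               # keep short so clients re-resolve quickly
        "ResourceRecords": [{"Value": target_dns}],
    }
    if health_check_id:
        # Route 53 shifts traffic only when this health check reports unhealthy.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}
```

A primary and a secondary entry, for example `failover_record("mirth.example.com", "east-lb.example.com", "PRIMARY", "east", "hc-placeholder")` and its `SECONDARY` counterpart, go into the `Changes` list of the `ChangeBatch` passed to `change_resource_record_sets`. DNS failover moves traffic only; promoting the read replica remains a separate runbook step.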
Architecture C — Active-active multi-region (advanced DR)
For workloads requiring an RTO of minutes that cannot tolerate any meaningful downtime.
Region us-east-1 (active):
- Mirth Connect actively processing channels for east customers
- RDS primary with cross-region replication
Region us-west-2 (active):
- Mirth Connect actively processing channels for west customers
- RDS primary with cross-region replication
Both regions:
- GeoDNS routes clients to nearest region
- Channel state and message stores replicated bidirectionally (complex)
DR scenario: one region fails. Traffic automatically routes to the surviving region. Total recovery time: seconds-to-minutes.
Cost: Highest. Both regions run full production capacity. Operational complexity is significantly higher due to bidirectional replication.
Warning: Active-active is hard to do well. Most healthcare organizations achieve their actual business RTO/RPO targets more reliably with Architecture B. Don't pick active-active for prestige reasons.
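As a rough first pass, the three patterns can be mapped from the targets with a few comparisons. The cut-offs below are read off the descriptions of Architectures A and B and are assumptions, not hard limits; the function name is hypothetical.

```python
def suggest_architecture(rto_minutes: float, rpo_minutes: float) -> str:
    """Suggest a starting architecture (A, B, or C) from RTO/RPO targets in minutes."""
    # Architecture A: the business tolerates at least 4 hours of downtime
    # and at least 4 hours of data loss.
    if rto_minutes >= 240 and rpo_minutes >= 240:
        return "A"
    # Architecture B: about 1 hour of downtime and about 5 minutes of data loss.
    if rto_minutes >= 60 and rpo_minutes >= 5:
        return "B"
    # Anything tighter points toward active-active. Given the warning above,
    # treat this as a prompt to re-validate the targets with the business first.
    return "C"
```

For example, a lab results feed with a 4-hour RTO and 30-minute RPO lands on B, because its RPO is too tight for snapshot-based recovery even though its RTO would allow it.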
For the underlying AWS deployment patterns these architectures sit on top of, see Mirth Connect on AWS Deployment Guide.
Infrastructure-as-code is non-negotiable
A DR plan that relies on human memory or wiki documentation will fail at exactly the wrong moment. Express your entire Mirth deployment in code.
What to put in code:
- VPC, subnets, security groups, network ACLs
- EC2 instances or ECS task definitions
- RDS instances and configuration
- S3 buckets and lifecycle policies
- IAM roles and policies
- CloudWatch alarms and log groups
- Route 53 records
- Load balancers and target groups
- Secrets Manager entries (encrypted)
What NOT to put in code:
- Sensitive values themselves (use Secrets Manager references)
- Manually generated certificates (use ACM)
- Customer-specific configuration that changes frequently (use Systems Manager Parameter Store)
Tools: CloudFormation, Terraform, AWS CDK, Pulumi — all work. Pick one and standardize.
Storage: the infrastructure-as-code repository must be backed up like any other critical asset. Keep it in a private Git repository with branch protection and access logging.
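Templates that only work in the primary region are a common way for infrastructure-as-code to fail during recovery. A minimal sketch of a scanner that flags region literals in template files follows; the regular expression covers standard commercial region names only (not GovCloud or China), and the file suffixes are assumptions to adjust for your tooling.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Matches names such as us-east-1 or ap-southeast-2, including the region
# portion of availability zone names such as us-west-2a.
REGION_PATTERN = re.compile(
    r"\b(?:us|eu|ap|sa|ca|me|af|il)-"
    r"(?:northeast|northwest|southeast|southwest|north|south|east|west|central)-\d"
)


def find_hardcoded_regions(text: str, allowed: Iterable[str] = ()) -> List[Tuple[int, str]]:
    """Return (line number, region) pairs for every region literal not in `allowed`."""
    allowed_set = set(allowed)
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in REGION_PATTERN.finditer(line):
            if match.group(0) not in allowed_set:
                hits.append((lineno, match.group(0)))
    return hits


def scan_tree(root: str, suffixes=(".tf", ".yaml", ".yml", ".json")) -> Dict[str, list]:
    """Scan template files under `root`; map each offending file path to its hits."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_hardcoded_regions(path.read_text(encoding="utf-8", errors="replace"))
            if hits:
                results[str(path)] = hits
    return results
```

Running `scan_tree` in CI and failing the build on any hit outside a small allow-list keeps the region a parameter rather than a literal.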
The DR runbook — what it actually contains
A runbook is a step-by-step procedure for executing recovery. It exists because the people executing DR are often not the people who built the system, and they're operating under stress at 3am.
Minimum runbook contents:
- DR triggers and decision authority. Who declares a DR event. What conditions justify declaration. Who has authority to begin failover.
- Communication plan. Who gets notified, in what order, by what mechanism. Status update cadence during recovery.
- Pre-failover verification. Confirm primary is actually down (not a false alarm). Confirm DR target is healthy.
- Infrastructure deployment commands. Exact commands to deploy infrastructure-as-code to the DR region.
- Database restore procedure. Exact commands or console steps to restore from snapshot or promote replica.
- Channel import procedure. Commands to import latest channel exports from S3 into the new Mirth instance.
- Verification procedure. Specific test messages or queries to verify channels are processing correctly.
- DNS failover. Steps to route traffic to the recovered environment.
- Stakeholder communication. Template messages to send when recovery is complete.
- Post-incident review schedule. When and how to debrief.
Critical: the runbook must be tested. Reading it during a real DR event is too late.
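One way to keep the runbook honest between drills is to record an estimated duration for each step and compare the total against the RTO. This is a sketch only: the step names follow the list above, the durations are invented placeholders to be replaced with timings measured in your own drills, and the `Step` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    estimated_minutes: int  # replace with timings measured during drills


# Placeholder durations for illustration only.
EXAMPLE_STEPS = [
    Step("Pre-failover verification", 5),
    Step("Infrastructure deployment", 15),
    Step("Database restore or replica promotion", 15),
    Step("Channel import", 5),
    Step("Verification with test messages", 10),
    Step("DNS failover", 5),
]


def rto_budget_remaining(steps: List[Step], rto_minutes: int) -> int:
    """Return minutes left in the RTO budget; a negative value means the plan cannot meet it."""
    return rto_minutes - sum(step.estimated_minutes for step in steps)
```

With the placeholder numbers, `rto_budget_remaining(EXAMPLE_STEPS, 60)` leaves 5 minutes of slack against a 1-hour RTO, which is little margin for the decision-making and communication steps that are not timed here.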
Testing the plan
A backup you have never restored is not a backup. A DR plan you have never executed is not a plan.
Recommended testing cadence:
| Test type | Frequency | Scope |
|---|---|---|
| Database restore | Monthly | Restore latest backup to dev environment, verify queryable |
| Channel restore | Monthly | Import latest channel export to dev, verify channels start |
| Full DR drill | Quarterly | Build entire stack from code, restore data, verify message flow |
| Region failure simulation | Annually | Full failover to DR region with full clinical workflow validation |
Most organizations discover problems with their DR plan during the first quarterly drill. That's the point — discover problems in drills, not in incidents.
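The cadence in the table is easy to let slip, so it helps to track it in code. A minimal sketch, assuming the date of the last successful run of each test is recorded somewhere; the test names and the exact day counts chosen for "monthly", "quarterly", and "annually" are illustrative.

```python
from datetime import date
from typing import Dict, List

# Maximum days allowed between runs, approximating the cadence table.
CADENCE_DAYS = {
    "database_restore": 31,
    "channel_restore": 31,
    "full_dr_drill": 92,
    "region_failure_simulation": 366,
}


def overdue_tests(last_run: Dict[str, date], today: date) -> List[str]:
    """Return the tests that have never run or whose last run is older than the cadence allows."""
    overdue = []
    for test, max_age in CADENCE_DAYS.items():
        ran = last_run.get(test)
        if ran is None or (today - ran).days > max_age:
            overdue.append(test)
    return overdue
```

A test with no recorded run counts as overdue, so a newly added test type shows up immediately instead of being silently skipped.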
Common issues found during DR drills:
- IAM role permissions missing in the DR region
- Secrets Manager secrets not replicated
- DNS TTLs too high for fast failover
- Channel exports older than expected
- Infrastructure-as-code references hardcoded to the primary region
- Cross-region replication lag higher than expected RPO
- Manual snapshot copy schedule broken
- Test message routes don't exist in the DR environment
Each found issue is a win. Each issue not found is a future incident.
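Several of the drill findings listed above can be checked before a drill starts. The sketch below covers three of them with plain comparisons. The thresholds are assumptions: the DNS TTL is allowed to consume at most 10% of the RTO, and a nightly export is considered stale after 26 hours. How the inputs are gathered (for example, from DNS records, S3 object timestamps, and the RDS `ReplicaLag` metric) is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def drill_findings(dns_ttl_seconds: int, rto_minutes: int,
                   newest_export: datetime, replica_lag_seconds: float,
                   rpo_minutes: int, now: datetime) -> List[str]:
    """Return a list of pre-drill findings; an empty list means these checks passed."""
    findings = []
    # Cached DNS answers can keep clients on the failed region for a full TTL.
    if dns_ttl_seconds > rto_minutes * 60 * 0.1:
        findings.append("DNS TTL too high for fast failover")
    # Nightly export plus two hours of slack (assumption).
    if now - newest_export > timedelta(hours=26):
        findings.append("Channel exports older than expected")
    # Replica lag is data that would be lost if the primary failed right now.
    if replica_lag_seconds > rpo_minutes * 60:
        findings.append("Cross-region replication lag higher than expected RPO")
    return findings
```

Running these checks on a schedule, not only before drills, turns some of the quarterly surprises into routine alerts.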
Common mistakes in Mirth Connect DR planning
Ten failure modes we see in nearly every Mirth Connect DR plan we audit:
- Mistake 1 — Confusing HA with DR. Multi-AZ RDS is HA, not DR. It will not save you from a region-wide AWS event or a ransomware attack. Both are needed.
- Mistake 2 — Backups in the same account as production. If an attacker compromises your AWS account, they can delete backups in that account too. Cross-account backup replication is the defense.
- Mistake 3 — No infrastructure-as-code. Rebuilding from scratch in a real DR scenario takes days. With IaC, it takes hours. The investment in IaC pays for itself the first time you need it.
- Mistake 4 — Never testing restore. Backups can fail silently. Snapshot processes can break. Channel export schedules can stop. The only way to know your backups work is to restore them periodically.
- Mistake 5 — RTO/RPO targets without business validation. Engineering decides “we'll target 1 hour RTO” without confirming whether the business can tolerate 1 hour. Either commit to less, or accept honestly that the business can tolerate more. Misaligned targets lead to over-investment or under-investment.
- Mistake 6 — Channel exports stored only on the production Mirth instance. When the production instance is gone, so are the exports. Channel exports must live outside the Mirth deployment they describe.
- Mistake 7 — Forgetting about external dependencies. Mirth doesn't operate alone. EHRs, downstream consumers, identity providers, and partner systems all need consideration in a DR scenario. A Mirth instance that's recovered but can't reach its partners isn't actually recovered.
- Mistake 8 — Underestimating recovery time for the message store. Restoring 6 years of message history takes meaningfully longer than restoring channel configurations. Architectures should make the recent message store available first and the historical archive available second.
- Mistake 9 — Treating DR as a one-time project. A DR plan from 2023 may not match production in 2026. The DR architecture needs the same maintenance discipline as production itself.
- Mistake 10 — No documented decision authority for declaring DR. A real DR scenario is high-pressure. Confusion over “are we actually doing this” wastes hours that the RTO budget doesn't have.
What good DR looks like — a checklist
A production-ready Mirth Connect deployment with proper DR has all of the following:
- RTO and RPO targets defined and documented with business sign-off
- HA architecture covers single-instance and single-AZ failures
- Database has Multi-AZ and cross-region replication
- Database backups have 30-day minimum retention and tested point-in-time recovery
- Channel configurations exported nightly to versioned S3 storage
- Channel exports stored in a separate AWS account or region
- Entire infrastructure expressed in CloudFormation, Terraform, or equivalent
- Infrastructure-as-code repository backed up with access controls and change history
- Secrets Manager entries replicated to DR region
- Route 53 health checks and failover routing configured
- DR runbook documented with step-by-step procedures
- Quarterly DR drills scheduled and tracked
- Annual region-failure simulation exercise completed
- Post-drill issues tracked to resolution
- DR plan reviewed and updated whenever production architecture changes
- Stakeholders trained on DR procedures
- DR documentation accessible without production access (e.g., not stored only in Confluence on the production network)
If any items are unchecked, those are your priorities.
When to get help
Mirth Connect DR architecture sits at the intersection of cloud infrastructure, database engineering, healthcare compliance, and operational discipline. Getting it right requires experience across all four. Most teams that build their first DR plan discover gaps during their first drill, regardless of how carefully they planned.
Our free Mirth Connect health check explicitly covers DR posture as one of the audit points. If you're not sure whether your current DR plan would actually survive a regional event, the audit will tell you.
For related operational context, see our Mirth Connect on AWS Deployment Guide, our Mirth Connect Performance Tuning post, and our Mirth Connect Security and HIPAA Checklist.
To estimate the cost of building a production-ready DR architecture, run our pricing calculator — select your full scope including the DR region.
Prefer email? info@tactionsoft.com — we reply within 4 business hours.