Disaster recovery
Disaster recovery creates a replica of your primary server.
You can have only one replica at a time, and you can add disaster recovery to an installation with or without compilers.
- If your primary server fails, the replica takes over the handling of Puppet Server and PuppetDB traffic, allowing existing agents to remain operational and Puppet runs to continue without interruption. By configuring nodes to automatically fail over to the replica when the primary is unreachable, you can ensure that they still receive catalogs and enforce your desired state.
- If your primary server can’t be repaired, you can promote the replica to primary server. Promotion establishes the replica as the new, permanent primary server.
Disaster recovery architecture
The replica is not an exact copy of the primary server. Rather, the replica duplicates specific infrastructure components and services. By default, Hiera data and other custom configurations are not replicated. However, if you store Hiera data in the control repository, as recommended, the data is replicated through Code Manager.
Replication can be read-write, meaning that data can be written to the service or component on either the primary server or the replica, and the data is synced to both nodes. Alternatively, replication can be read-only, where data is written only to the primary server and synced to the replica. Some components and services, like Puppet Server and the console service UI, are not replicated because they contain no native data.
Some components and services are activated immediately when you enable a replica; others aren't active until you promote a replica.
Component or service | Type of replication | Activated when replica is... |
---|---|---|
Puppet Server | none | enabled |
File sync client | read-only | enabled |
PuppetDB | read-write | enabled |
Certificate authority | read-only | promoted |
RBAC service | read-only | enabled |
Node classifier service | read-only | enabled |
Activity service | read-only | enabled |
Orchestration service | read-only | promoted |
Console service UI | none | promoted |
Agentless Catalog Executor (ACE) service | none | promoted |
Bolt service | none | promoted |
Host Action Collector service | read-only | promoted |
The following services performed by the primary server are unavailable on a replica until the replica is promoted:
- Certificate authority: The replica cannot provision new agents.
- Orchestration: Tasks, plans, and Puppet runs cannot be initiated from the replica. This includes running operations via the Agentless Catalog Executor.
- Console: The console is not available on the replica, and classification changes cannot be made from the replica.
In a standard installation, when a Puppet run fails over, agents communicate with the replica instead of the primary server. In a large or extra-large installation with compilers, agents communicate with load balancers or compilers, which communicate with the primary server or replica.
What happens during failovers
Failover occurs when the replica takes over services usually performed by the primary server.
Failover is automatic — you don’t have to take action to activate the replica. With disaster recovery enabled, Puppet runs are directed first to the primary server. If the primary server is either fully or partially unreachable, runs are directed to the replica.
In partial failovers, Puppet runs can use the server, node classifier, or PuppetDB on the replica if those services aren’t reachable on the primary server. For example, if the primary server’s node classifier fails, but its Puppet Server is still running, agent runs use the Puppet Server on the primary server but fail over to the replica’s node classifier.
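For a standard agent, this failover order is reflected in the `server_list` setting in the agent's `puppet.conf`. A minimal sketch, with placeholder hostnames:

```ini
# /etc/puppetlabs/puppet/puppet.conf on an agent node
[main]
# Agents try these servers in order; the replica is contacted
# only when the primary server is unreachable.
server_list = primary.example.com:8140,replica.example.com:8140
```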
What works during failovers:
- Scheduled Puppet runs
- Catalog compilation
- Viewing classification data using the node classifier API
- Reporting and queries based on PuppetDB data
What doesn’t work during failovers:
- Deploying new Puppet code
- Editing node classifier data
- Using the console
- Certificate functionality, including provisioning new agents, revoking certificates, or running the `puppet certificate` command
- Most CLI tools
- Running Puppet tasks or plans through the orchestrator
System and software requirements for disaster recovery
Your Puppet infrastructure must meet specific requirements in order to configure disaster recovery.
Component | Requirement |
---|---|
Operating system | All supported PE primary server platforms. |
Software | |
Replica | |
Firewall | Your replica must comply with the same port requirements as your primary server to ensure that the replica can operate as the primary server during failover. For details, see the firewall configuration requirements for your installation type. |
Node names | You must use resolvable domain names when specifying node names for the primary server and replica. |
RBAC tokens | You must have an admin RBAC token when running some `puppet infrastructure` commands, including `provision`, `enable`, and `forget`. You can generate a token using the `puppet-access` command. However, an RBAC token isn't required to promote a replica or to run the `enable_ha_failover` command. |
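As a sketch of the token workflow (the lifetime value is illustrative; check `puppet-access help` for the flags available in your PE version):

```shell
# Generate an admin RBAC token; this prompts for console credentials.
puppet-access login --lifetime 1d

# The token is then picked up automatically by commands that require it, e.g.:
puppet infrastructure provision replica <REPLICA_NODE_NAME>
```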
Classification changes in disaster recovery installations
When you provision and enable a replica, the system makes a number of classification changes in order to manage disaster recovery.
Two infrastructure node groups are added in installations with disaster recovery. The PE HA Master node group includes your primary server and inherits from the PE Master node group. The PE HA Replica node group includes your replica and inherits from the PE Infrastructure node group.
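To confirm the new groups exist, one option is the node classifier API. A hedged sketch, assuming a default PE certificate layout and that you run it on the primary server, whose certificate is permitted to query the API:

```shell
# List all node groups, including PE HA Master and PE HA Replica,
# via the node classifier API on port 4433.
curl --cert "/etc/puppetlabs/puppet/ssl/certs/$(puppet config print certname).pem" \
     --key  "/etc/puppetlabs/puppet/ssl/private_keys/$(puppet config print certname).pem" \
     --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem \
     "https://$(puppet config print server):4433/classifier-api/v1/groups"
```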
Additional disaster recovery configuration is managed with these parameters. With a few exceptions (`manage_puppet_conf`, `server_list`, `pcp_broker_list`, and `primary_uris`), all of these are system parameters that should not be manually modified. The PE Agent and PE Infrastructure Agent parameters are automatically updated based on the values you specify when you provision and enable a replica.

`classifier_client_certname`
- Purpose: Specifies the name on the certificate used by the classifier.
- Node group: PE Master
- Class: `puppet_enterprise::profile::master`
- DR-only parameter: No
- Example with enabled replica: `["<PRIMARY_CERTNAME>","<REPLICA_CERTNAME>"]`
- Notes: Replica values are appended to the end of the parameter when a replica is enabled.
`classifier_host`
- Purpose: Specifies the certname of the node running the classifier service.
- Node group: PE Master
- Class: `puppet_enterprise::profile::master`
- DR-only parameter: No
- Example with enabled replica: `["<PRIMARY_CERTNAME>","<REPLICA_CERTNAME>"]`
- Notes: Replica values are appended to the end of the parameter when a replica is enabled.
`classifier_port`
- Purpose: Specifies the port used for communicating with the classifier service. Always `4433`.
- Node group: PE Master
- Class: `puppet_enterprise::profile::master`
- DR-only parameter: No
- Example with enabled replica: `[4433,4433]`
- Notes: Replica values are appended to the end of the parameter when a replica is enabled.
`ha_enabled_replicas`
- Purpose: Tracks replica nodes that are failover ready.
- Node group: PE Infrastructure
- Class: `puppet_enterprise`
- DR-only parameter: Yes
- Example with enabled replica: `["<REPLICA_CERTNAME>"]`
- Notes: Updated when you enable a replica.
`manage_puppet_conf`
- Purpose: When `true`, specifies that the `server_list` setting is managed in `puppet.conf`.
- Node group: PE Agent, PE Infrastructure Agent
- Class: `puppet_enterprise::profile::agent`
- DR-only parameter: No
- Example with enabled replica: `true`
`pcp_broker_list`
- Purpose: Specifies the list of Puppet Communications Protocol brokers that Puppet Execution Protocol agents contact, in order.
- Node group: PE Agent, PE Infrastructure Agent
- Class: `puppet_enterprise::profile::agent`
- DR-only parameter: No
- Example with enabled replica:
  - PE Agent: `["<PRIMARY_CERTNAME>:8142","<REPLICA_CERTNAME>:8142"]`, or in a large installation, `["<LOAD_BALANCER>:8142"]`
  - PE Infrastructure Agent: `["<PRIMARY_CERTNAME>:8142","<REPLICA_CERTNAME>:8142"]`
- Notes:
  - Infrastructure nodes must be configured to communicate directly with the primary in the PE Infrastructure Agent node group, or in a DR configuration, the primary and then the replica. In large installations with compilers, agents must be configured to communicate with the load balancers or compilers in the PE Agent node group.
  - When a replica is enabled, the replica is appended to the end of the list in the PE Infrastructure Agent group, and when not using a load balancer, it's appended to the list in PE Agent.
  - Some `puppet infrastructure` commands refer to this parameter as `agent-server-urls`, but those commands nonetheless manage the `server_list` parameter.

Important: Setting agents to communicate directly with the replica in order to use the replica as a compiler is not supported.
`primary_uris`
- Purpose: Specifies the list of Puppet Server nodes hosting task files for download that Puppet Execution Protocol agents contact, in order.
- Node group: PE Agent, PE Infrastructure Agent
- Class: `puppet_enterprise::profile::agent`
- DR-only parameter: No
- Example with enabled replica:
  - PE Agent: `["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"]`, or in a large installation, `["<LOAD_BALANCER>:8140"]`
  - PE Infrastructure Agent: `["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"]`
- Notes:
  - Infrastructure nodes must be configured to communicate directly with the primary in the PE Infrastructure Agent node group, or in a DR configuration, the primary and then the replica. In large installations with compilers, agents must be configured to communicate with the load balancers or compilers in the PE Agent node group.
  - When a replica is enabled, the replica is appended to the end of the list in the PE Infrastructure Agent group, and when not using a load balancer, it's appended to the list in PE Agent.
  - Some `puppet infrastructure` commands refer to this parameter as `agent-server-urls`, but those commands nonetheless manage the `server_list` parameter.

Important: Setting agents to communicate directly with the replica in order to use the replica as a compiler is not supported.
`provisioned_replicas`
- Purpose: Specifies the certname of the replica to give access to the ca-data file sync repo.
- Node group: PE HA Master
- Class: `puppet_enterprise::profile::master`
- DR-only parameter: Yes
- Example with enabled replica: `["<REPLICA_CERTNAME>"]`
`puppetdb_host`
- Purpose: Specifies the certname of the node running the PuppetDB service.
- Node group: PE Master
- Class: `puppet_enterprise::profile::master`
- DR-only parameter: No
- Example with enabled replica: `["<PRIMARY_CERTNAME>","<REPLICA_CERTNAME>"]`
- Notes: Replica values are appended to the end of the parameter when a replica is enabled.
`puppetdb_port`
- Purpose: Specifies the port used for communicating with the PuppetDB service. Always `8081`.
- Node group: PE Master
- Class: `puppet_enterprise::profile::master`
- DR-only parameter: No
- Example with enabled replica: `[8081,8081]`
- Notes: Replica values are appended to the end of the parameter when a replica is enabled.
`replica_hostnames`
- Purpose: Specifies the certname of the replica to set up pglogical replication for non-PuppetDB databases.
- Node group: PE HA Master
- Class: `puppet_enterprise::profile::database`
- DR-only parameter: Yes
- Example with enabled replica: `["<REPLICA_CERTNAME>"]`
`replicating`
- Purpose: Specifies whether databases other than PuppetDB replicate data.
- Node group: PE Infrastructure
- Class: `puppet_enterprise`
- DR-only parameter: Yes
- Example with enabled replica: `true`
- Notes: Used when provisioning a new replica.
`replication_mode`
- Purpose: Sets replication type and direction on primary servers and replicas.
- Node group: PE Master (none), PE HA Master (source)
- Class: `puppet_enterprise::profile::master`, `puppet_enterprise::profile::database`, `puppet_enterprise::profile::console`
- DR-only parameter: Yes (although `"none"` by default)
- Example with enabled replica:
  - PE Master: `"none"` (Present only in the master profile.)
  - PE HA Master: `"source"` (Set automatically in the replica profile; no setting in the classifier in PE HA Replica.)
`server_list`
- Purpose: Specifies the list of servers that agents contact, in order.
- Node group: PE Agent, PE Infrastructure Agent
- Class: `puppet_enterprise::profile::agent`
- DR-only parameter: No
- Example with enabled replica:
  - PE Agent: `["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"]`, or in a large installation, `["<LOAD_BALANCER>:8140"]`
  - PE Infrastructure Agent: `["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"]`
- Notes:
  - Infrastructure nodes must be configured to communicate directly with the primary in the PE Infrastructure Agent node group, or in a DR configuration, the primary and then the replica. In large installations with compilers, agents must be configured to communicate with the load balancers or compilers in the PE Agent node group.
  - When a replica is enabled, the replica is appended to the end of the list in the PE Infrastructure Agent group, and when not using a load balancer, it's appended to the list in PE Agent.
  - Some `puppet infrastructure` commands refer to this parameter as `agent-server-urls`, but those commands nonetheless manage the `server_list` parameter.

Important: Setting agents to communicate directly with the replica in order to use the replica as a compiler is not supported.
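Taken together, with a replica enabled and no load balancer, the agent profile parameters in the PE Agent group might look like the following sketch (certnames are placeholders, not defaults):

```json
{
  "puppet_enterprise::profile::agent": {
    "manage_puppet_conf": true,
    "server_list": ["primary.example.com:8140", "replica.example.com:8140"],
    "pcp_broker_list": ["primary.example.com:8142", "replica.example.com:8142"],
    "primary_uris": ["primary.example.com:8140", "replica.example.com:8140"]
  }
}
```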
`sync_allowlist`
- Purpose: Specifies a list of nodes that the primary PuppetDB syncs with.
- Node group: PE HA Master
- Class: `puppet_enterprise::profile::puppetdb`
- DR-only parameter: Yes
- Example with enabled replica: `["<REPLICA_CERTNAME>"]`. During upgrade, when the primary has been upgraded but the replica hasn't, `[]` to prevent syncing until the upgrade is complete.
`sync_peers`
- Purpose: Specifies a list of hashes that contain configuration data for syncing with a remote PuppetDB node. Includes the host, port, and sync interval.
- Node group: PE HA Master
- Class: `puppet_enterprise::profile::puppetdb`
- DR-only parameter: Yes
- Example with enabled replica: `[{"host":"<REPLICA_CERTNAME>","port":8081,"sync_interval_minutes":<X>}]`. During upgrade, when the primary has been upgraded but the replica hasn't, `[]` to prevent syncing until the upgrade is complete.
- Notes: Updated when you enable a replica.
Load balancer timeout in disaster recovery installations
Disaster recovery configuration uses timeouts to determine when to fail over to the replica. If the load balancer timeout is shorter than the server and agent timeout, connections from agents might be terminated during failover.
To avoid timeouts, set the timeout option for load balancers to four minutes or longer. This duration allows compilers enough time for required queries to PuppetDB and the node classifier service. You can set the load balancer timeout option using parameters in the haproxy or f5 modules.
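As a sketch using the puppetlabs-haproxy module (parameter names and value formats may vary by module version; verify against the module's documentation):

```puppet
# Raise HAProxy's client and server timeouts to four minutes so that
# agent connections are not terminated mid-failover.
class { 'haproxy':
  defaults_options => {
    'timeout' => [
      'connect 10s',
      'client  4m',
      'server  4m',
    ],
  },
}
```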