Disaster recovery

Disaster recovery creates a replica of your primary server.

You can have only one replica at a time, and you can add disaster recovery to an installation with or without compilers.

There are two main advantages to enabling disaster recovery:
  • If your primary server fails, the replica takes over the handling of Puppet Server and PuppetDB traffic, allowing existing agents to remain operational and Puppet runs to continue without interruption. By configuring nodes to automatically fail over to the replica when the primary is unreachable, you can ensure that they still receive catalogs and enforce your desired state.
  • If your primary server can’t be repaired, you can promote the replica to primary server. Promotion establishes the replica as the new, permanent primary server.

Disaster recovery architecture

The replica is not an exact copy of the primary server. Rather, the replica duplicates specific infrastructure components and services. By default, Hiera data and other custom configurations are not replicated. However, if you store Hiera data in the control repository, as recommended, the data is replicated through Code Manager.

Replication can be read-write, meaning that data can be written to the service or component on either the primary server or the replica, and the data is synced to both nodes. Alternatively, replication can be read-only, where data is written only to the primary server and synced to the replica. Some components and services, like Puppet Server and the console service UI, are not replicated because they contain no native data.

Some components and services are activated immediately when you enable a replica; others aren't active until you promote a replica.

Component or service                      Type of replication   Activated when replica is...
Puppet Server                             none                  enabled
File sync client                          read-only             enabled
PuppetDB                                  read-write            enabled
Certificate authority                     read-only             promoted
RBAC service                              read-only             enabled
Node classifier service                   read-only             enabled
Activity service                          read-only             enabled
Orchestration service                     read-only             promoted
Console service UI                        none                  promoted
Agentless Catalog Executor (ACE) service  none                  promoted
Bolt service                              none                  promoted
Host Action Collector service             read-only             promoted
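
For scripting or quick reference, the table above can be captured as a small lookup. The following Python sketch is illustrative only: the names SERVICES and active_services are hypothetical and are not part of any PE API. The data is copied from the table.

```python
# Replication type and activation point for each component, copied from the table.
# SERVICES and active_services are illustrative names, not part of PE.
SERVICES = {
    "Puppet Server": ("none", "enabled"),
    "File sync client": ("read-only", "enabled"),
    "PuppetDB": ("read-write", "enabled"),
    "Certificate authority": ("read-only", "promoted"),
    "RBAC service": ("read-only", "enabled"),
    "Node classifier service": ("read-only", "enabled"),
    "Activity service": ("read-only", "enabled"),
    "Orchestration service": ("read-only", "promoted"),
    "Console service UI": ("none", "promoted"),
    "Agentless Catalog Executor (ACE) service": ("none", "promoted"),
    "Bolt service": ("none", "promoted"),
    "Host Action Collector service": ("read-only", "promoted"),
}


def active_services(promoted: bool) -> list[str]:
    """Return the services active on a replica that is enabled or promoted."""
    return sorted(
        name
        for name, (_replication, activated_when) in SERVICES.items()
        if promoted or activated_when == "enabled"
    )


if __name__ == "__main__":
    # Services that are already active on an enabled, not yet promoted, replica.
    for name in active_services(promoted=False):
        print(name)
```

Calling active_services(promoted=True) returns every row, because promotion activates the remaining services.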

The following services performed by the primary server are unavailable on a replica until the replica is promoted:

  • Certificate authority: The replica cannot provision new agents.

  • Orchestration: Tasks, plans, and Puppet runs cannot be initiated from the replica. This includes running operations via the Agentless Catalog Executor.

  • Console: The console is not available on the replica, and classification changes cannot be made from the replica.

In a standard installation, when a Puppet run fails over, agents communicate with the replica instead of the primary server. In a large or extra-large installation with compilers, agents communicate with load balancers or compilers, which communicate with the primary server or replica.

What happens during failovers

Failover occurs when the replica takes over services usually performed by the primary server.

Failover is automatic — you don’t have to take action to activate the replica. With disaster recovery enabled, Puppet runs are directed first to the primary server. If the primary server is either fully or partially unreachable, runs are directed to the replica.

In partial failovers, Puppet runs can use the server, node classifier, or PuppetDB on the replica if those services aren’t reachable on the primary server. For example, if the primary server’s node classifier fails, but its Puppet Server is still running, agent runs use the Puppet Server on the primary server but fail over to the replica’s node classifier.
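
The per-service behavior described above can be modeled as "try the primary first, then the replica" for each service independently. The Python sketch below is a simplified model, not how the agent or Puppet Server implements failover. The certnames, service keys, and the health map are all hypothetical.

```python
from typing import Optional

# Hypothetical certnames used only for this illustration.
PRIMARY = "primary.example.com"
REPLICA = "replica.example.com"

# The three services that can fail over individually, per the text above.
SERVICES = ("puppet_server", "node_classifier", "puppetdb")


def route_services(health: dict[tuple[str, str], bool]) -> dict[str, Optional[str]]:
    """Pick a host for each service: the primary if reachable, else the replica.

    `health` maps (service, host) to True when that service responds on that host.
    A service maps to None when it is unreachable on both hosts.
    """
    routes: dict[str, Optional[str]] = {}
    for service in SERVICES:
        routes[service] = next(
            (host for host in (PRIMARY, REPLICA) if health.get((service, host), False)),
            None,
        )
    return routes


if __name__ == "__main__":
    # Everything is healthy except the node classifier on the primary server.
    health = {(s, h): True for s in SERVICES for h in (PRIMARY, REPLICA)}
    health[("node_classifier", PRIMARY)] = False
    print(route_services(health))
```

In this model, the run keeps using Puppet Server and PuppetDB on the primary server and uses only the node classifier on the replica, which matches the example in the preceding paragraph.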

What works during failovers:

  • Scheduled Puppet runs
  • Catalog compilation
  • Viewing classification data using the node classifier API
  • Reporting and queries based on PuppetDB data

What doesn’t work during failovers:

  • Deploying new Puppet code
  • Editing node classifier data
  • Using the console
  • Certificate functionality, including provisioning new agents, revoking certificates, or running the puppet certificate command
  • Most CLI tools
  • Running Puppet tasks or plans through the orchestrator

System and software requirements for disaster recovery

Your Puppet infrastructure must meet specific requirements before you can configure disaster recovery.

Component
Requirement

Operating system
All supported PE primary server platforms.

Software
  • You must use Code Manager so that code is deployed to both the primary server and the replica after you enable a replica. Code Manager also replicates the certificate authority state, as well as PE configuration files. Even if you have an alternate method for syncing your code across nodes, Code Manager must still be enabled.
  • You must use the default PE node classifier so that disaster recovery classification can be applied to nodes.
  • Orchestrator must be enabled so that it can perform PE maintenance and upgrade actions.

Replica
  • Must be an agent node that doesn’t already have a specific function. You can decommission a node, uninstall all Puppet packages, and recommission the node as a replica. However, a node cannot perform two functions at once, for example, acting as both a compiler and a replica.
  • Must have the same hardware specifications and capabilities as your primary server.
  • Must use the same operating system type and version as your primary server.
  • Must have the same agent version as your primary server.

Firewall
Your replica must comply with the same port requirements as your primary server to ensure that the replica can operate as the primary server during failover. For details, see the firewall configuration requirements for your installation type.

Node names
You must use resolvable domain names when specifying node names for the primary server and replica.

RBAC tokens
You must have an admin RBAC token when running some puppet infrastructure commands, including provision, enable, and forget. You can generate a token using the puppet-access command. However, an RBAC token isn't required to promote a replica or to run the enable_ha_failover command.
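
The "same as your primary server" requirements for the replica lend themselves to a simple pre-flight comparison. The Python sketch below assumes you have already collected a few facts about each node by some means of your own. The NodeFacts fields, the use of CPU and memory as a stand-in for "hardware specifications", and all values shown are hypothetical and are not taken from PE.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NodeFacts:
    """A hypothetical subset of node facts relevant to the replica requirements."""

    os_name: str
    os_version: str
    agent_version: str
    cpu_cores: int
    memory_gb: int


def replica_mismatches(primary: NodeFacts, replica: NodeFacts) -> list[str]:
    """Return the names of the fields where the replica differs from the primary."""
    return [
        field.name
        for field in fields(NodeFacts)
        if getattr(primary, field.name) != getattr(replica, field.name)
    ]


if __name__ == "__main__":
    # Made-up values for illustration only.
    primary = NodeFacts("RedHat", "8", "8.1.0", 8, 32)
    replica = NodeFacts("RedHat", "8", "8.0.0", 8, 32)
    print(replica_mismatches(primary, replica))
```

An empty result means the candidate replica matches the primary server on every field that the sketch checks. It does not cover the firewall, node name, or RBAC token requirements.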

Classification changes in disaster recovery installations

When you provision and enable a replica, the system makes a number of classification changes in order to manage disaster recovery.

Two infrastructure node groups are added in installations with disaster recovery. The PE HA Master node group includes your primary server and inherits from the PE Master node group. The PE HA Replica node group includes your replica and inherits from the PE Infrastructure node group.

Additional disaster recovery configuration is managed with these parameters:

Note: Apart from the parameters in the PE Agent and PE Infrastructure Agent node groups (manage_puppet_conf, server_list, pcp_broker_list, and primary_uris), all of these are system parameters that should not be manually modified. The PE Agent and PE Infrastructure Agent parameters are automatically updated based on the values you specify when you provision and enable a replica.

classifier_client_certname
Purpose
Specifies the name on the certificate used by the classifier.
Node group
PE Master
Class

puppet_enterprise::profile::master

DR-only parameter
No
Example with enabled replica
["<PRIMARY_CERTNAME>","<REPLICA_CERTNAME>"]
Notes
Replica values are appended to the end of the parameter when a replica is enabled.

classifier_host
Purpose
Specifies the certname of the node running the classifier service.
Node group
PE Master
Class

puppet_enterprise::profile::master

DR-only parameter
No
Example with enabled replica
["<PRIMARY_CERTNAME>","<REPLICA_CERTNAME>"]
Notes
Replica values are appended to the end of the parameter when a replica is enabled.

classifier_port
Purpose
Specifies the port used for communicating with the classifier service. Always 4433.
Node group
PE Master
Class

puppet_enterprise::profile::master

DR-only parameter
No
Example with enabled replica
[4433,4433]
Notes
Replica values are appended to the end of the parameter when a replica is enabled.

ha_enabled_replicas
Purpose
Tracks replica nodes that are failover ready.
Node group
PE Infrastructure
Class

puppet_enterprise

DR-only parameter
Yes
Example with enabled replica
["<REPLICA_CERTNAME>"]
Notes
Updated when you enable a replica.

manage_puppet_conf
Purpose
When true, specifies that the server_list setting is managed in puppet.conf.
Node group
PE Agent, PE Infrastructure Agent
Class

puppet_enterprise::profile::agent

DR-only parameter
No
Example with enabled replica
true

pcp_broker_list
Purpose
Specifies the list of Puppet Communications Protocol brokers that Puppet Execution Protocol agents contact, in order.
Node group
PE Agent, PE Infrastructure Agent
Class

puppet_enterprise::profile::agent

DR-only parameter
No
Example with enabled replica
PE Agent: ["<PRIMARY_CERTNAME>:8142","<REPLICA_CERTNAME>:8142"] or in a large installation, ["<LOAD_BALANCER>:8142"]

PE Infrastructure Agent: ["<PRIMARY_CERTNAME>:8142","<REPLICA_CERTNAME>:8142"]

Notes
  • Infrastructure nodes must be configured to communicate directly with the primary in the PE Infrastructure Agent node group, or in a DR configuration, the primary and then the replica. In large installations with compilers, agents must be configured to communicate with the load balancers or compilers in the PE Agent node group.
  • When a replica is enabled, the replica is appended to the end of the list in the PE Infrastructure Agent group, and when not using a load balancer, it's appended to the list in PE Agent.
  • Some puppet infrastructure commands refer to this parameter as agent-server-urls, but those commands nonetheless manage the server_list parameter.
Important: Setting agents to communicate directly with the replica in order to use the replica as a compiler is not supported.

primary_uris
Purpose
Specifies the list of Puppet Server nodes hosting task files for download that Puppet Execution Protocol agents contact, in order.
Node group
PE Agent, PE Infrastructure Agent
Class

puppet_enterprise::profile::agent

DR-only parameter
No
Example with enabled replica
PE Agent: ["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"] or in a large installation, ["<LOAD_BALANCER>:8140"]

PE Infrastructure Agent: ["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"]

Notes
  • Infrastructure nodes must be configured to communicate directly with the primary in the PE Infrastructure Agent node group, or in a DR configuration, the primary and then the replica. In large installations with compilers, agents must be configured to communicate with the load balancers or compilers in the PE Agent node group.
  • When a replica is enabled, the replica is appended to the end of the list in the PE Infrastructure Agent group, and when not using a load balancer, it's appended to the list in PE Agent.
  • Some puppet infrastructure commands refer to this parameter as agent-server-urls, but those commands nonetheless manage the server_list parameter.
Important: Setting agents to communicate directly with the replica in order to use the replica as a compiler is not supported.

provisioned_replicas
Purpose
Specifies the certname of the replica that is given access to the ca-data file sync repo.
Node group
PE HA Master
Class

puppet_enterprise::profile::master

DR-only parameter
Yes
Example with enabled replica
["<REPLICA_CERTNAME>"]

puppetdb_host
Purpose
Specifies the certname of the node running the PuppetDB service.
Node group
PE Master
Class

puppet_enterprise::profile::master

DR-only parameter
No
Example with enabled replica
["<PRIMARY_CERTNAME>","<REPLICA_CERTNAME>"]
Notes
Replica values are appended to the end of the parameter when a replica is enabled.

puppetdb_port
Purpose
Specifies the port used for communicating with the PuppetDB service. Always 8081.
Node group
PE Master
Class

puppet_enterprise::profile::master

DR-only parameter
No
Example with enabled replica
[8081,8081]
Notes
Replica values are appended to the end of the parameter when a replica is enabled.
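
The five PE Master parameters whose notes say that replica values are appended (classifier_client_certname, classifier_host, classifier_port, puppetdb_host, and puppetdb_port) follow the same pattern. The Python sketch below illustrates the resulting values only. PE makes these changes itself when you enable a replica, and the function name and certnames here are hypothetical.

```python
# Values appended for the replica, per the parameter descriptions above.
# The ports are the fixed values given in the descriptions: 4433 and 8081.
def append_replica_values(params: dict[str, list], replica_certname: str) -> dict[str, list]:
    """Return a copy of the PE Master parameters with the replica's values appended."""
    additions = {
        "classifier_client_certname": replica_certname,
        "classifier_host": replica_certname,
        "classifier_port": 4433,
        "puppetdb_host": replica_certname,
        "puppetdb_port": 8081,
    }
    # Copy each list so that the caller's data is left untouched.
    updated = {name: list(values) for name, values in params.items()}
    for name, value in additions.items():
        updated.setdefault(name, []).append(value)
    return updated


if __name__ == "__main__":
    # Hypothetical certnames for illustration.
    before = {
        "classifier_client_certname": ["primary.example.com"],
        "classifier_host": ["primary.example.com"],
        "classifier_port": [4433],
        "puppetdb_host": ["primary.example.com"],
        "puppetdb_port": [8081],
    }
    for name, values in append_replica_values(before, "replica.example.com").items():
        print(name, values)
```

Because the host and port lists are extended together, the entry at each position in a host list lines up with the entry at the same position in the matching port list.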

replica_hostnames
Purpose
Specifies the certname of the replica to set up pglogical replication for non-PuppetDB databases.
Node group
PE HA Master
Class

puppet_enterprise::profile::database

DR-only parameter
Yes
Example with enabled replica
["<REPLICA_CERTNAME>"]

replicating
Purpose
Specifies whether databases other than PuppetDB replicate data.
Node group
PE Infrastructure
Class

puppet_enterprise

DR-only parameter
Yes
Example with enabled replica
true
Notes
Used when provisioning a new replica.

replication_mode
Purpose
Sets replication type and direction on primary servers and replicas.
Node group
PE Master (none), PE HA Master (source)
Class
puppet_enterprise::profile::master

puppet_enterprise::profile::database

puppet_enterprise::profile::console

DR-only parameter
Yes (although "none" by default)
Example with enabled replica
PE Master — "none" (Present only in master profile.)

PE HA Master — "source" (Set automatically in the replica profile; no setting in the classifier in PE HA Replica.)

server_list
Purpose
Specifies the list of servers that agents contact, in order.
Node group
PE Agent, PE Infrastructure Agent
Class

puppet_enterprise::profile::agent

DR-only parameter
No
Example with enabled replica
PE Agent: ["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"] or in a large installation, ["<LOAD_BALANCER>:8140"]

PE Infrastructure Agent: ["<PRIMARY_CERTNAME>:8140","<REPLICA_CERTNAME>:8140"]

Notes
  • Infrastructure nodes must be configured to communicate directly with the primary in the PE Infrastructure Agent node group, or in a DR configuration, the primary and then the replica. In large installations with compilers, agents must be configured to communicate with the load balancers or compilers in the PE Agent node group.
  • When a replica is enabled, the replica is appended to the end of the list in the PE Infrastructure Agent group, and when not using a load balancer, it's appended to the list in PE Agent.
  • Some puppet infrastructure commands refer to this parameter as agent-server-urls, but those commands nonetheless manage the server_list parameter.
Important: Setting agents to communicate directly with the replica in order to use the replica as a compiler is not supported.
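
The example values for server_list, pcp_broker_list, and primary_uris differ only in port and in whether a load balancer fronts the agents. The Python sketch below derives the three lists for both node groups from the examples above. It is a reading aid rather than a PE tool, and the function name and hostnames are hypothetical.

```python
from typing import Optional

# Port used by each parameter, taken from the example values above.
PORTS = {"server_list": 8140, "pcp_broker_list": 8142, "primary_uris": 8140}


def agent_lists(
    primary: str, replica: str, load_balancer: Optional[str] = None
) -> dict[str, dict[str, list[str]]]:
    """Build the three agent parameters for the PE Agent and PE Infrastructure Agent groups."""
    # Infrastructure nodes always talk to the primary first, then the replica.
    infra_hosts = [primary, replica]
    # Regular agents talk to the load balancer when one is in use.
    agent_hosts = [load_balancer] if load_balancer else infra_hosts

    def with_port(hosts: list[str], port: int) -> list[str]:
        return [f"{host}:{port}" for host in hosts]

    return {
        "PE Agent": {name: with_port(agent_hosts, port) for name, port in PORTS.items()},
        "PE Infrastructure Agent": {
            name: with_port(infra_hosts, port) for name, port in PORTS.items()
        },
    }


if __name__ == "__main__":
    # Hypothetical hostnames for a large installation with a load balancer.
    lists = agent_lists("primary.example.com", "replica.example.com", "lb.example.com")
    for group, params in lists.items():
        print(group, params)
```

Note that the replica never appears ahead of the primary server in any list, which is consistent with the restriction on using the replica as a compiler.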

sync_allowlist
Purpose
Specifies a list of nodes that the primary PuppetDB syncs with.
Node group
PE HA Master
Class

puppet_enterprise::profile::puppetdb

DR-only parameter
Yes
Example with enabled replica
["<REPLICA_CERTNAME>"]

During an upgrade, when the primary server has been upgraded but the replica has not, the value is [] to prevent syncing until the upgrade is complete.

sync_peers
Purpose
Specifies a list of hashes that contain configuration data for syncing with a remote PuppetDB node. Includes the host, port, and sync interval.
Node group
PE HA Master
Class

puppet_enterprise::profile::puppetdb

DR-only parameter
Yes
Example with enabled replica
[{"host":"<REPLICA_CERTNAME>","port":8081,"sync_interval_minutes":<X>}]

During an upgrade, when the primary server has been upgraded but the replica has not, the value is [] to prevent syncing until the upgrade is complete.

Notes
Updated when you enable a replica.
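
The sync_allowlist and sync_peers parameters change together: both name the replica in normal operation, and both are emptied while the replica lags behind the primary server during an upgrade. The Python sketch below shows that relationship. The function name is hypothetical, and the sync interval is left as an argument because this page does not state its value.

```python
def puppetdb_sync_settings(
    replica_certname: str, sync_interval_minutes: int, replica_upgraded: bool = True
) -> dict[str, list]:
    """Return illustrative sync_allowlist and sync_peers values for the PE HA Master group."""
    if not replica_upgraded:
        # Mid-upgrade: both lists are empty so that PuppetDB does not sync.
        return {"sync_allowlist": [], "sync_peers": []}
    return {
        "sync_allowlist": [replica_certname],
        "sync_peers": [
            {
                "host": replica_certname,
                "port": 8081,
                "sync_interval_minutes": sync_interval_minutes,
            }
        ],
    }


if __name__ == "__main__":
    # The certname and the interval of 2 are placeholders, not documented defaults.
    print(puppetdb_sync_settings("replica.example.com", 2))
    print(puppetdb_sync_settings("replica.example.com", 2, replica_upgraded=False))
```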

Load balancer timeout in disaster recovery installations

Disaster recovery configuration uses timeouts to determine when to fail over to the replica. If the load balancer timeout is shorter than the server and agent timeout, connections from agents might be terminated during failover.

To avoid timeouts, set the timeout option for load balancers to four minutes or longer. This duration allows compilers enough time for required queries to PuppetDB and the node classifier service. You can set the load balancer timeout option using parameters in the haproxy or f5 modules.
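
The two conditions above, a four-minute minimum and a load balancer timeout that is not shorter than the server and agent timeout, can be checked before you roll out a load balancer change. The Python sketch below is a hypothetical helper. It does not read haproxy or f5 configuration, and the parameter names are assumptions.

```python
from typing import Optional

# Four minutes, the minimum recommended above, expressed in seconds.
MIN_LB_TIMEOUT_SECONDS = 4 * 60


def lb_timeout_problems(
    lb_timeout_seconds: int, server_agent_timeout_seconds: Optional[int] = None
) -> list[str]:
    """Return a list of problems with a proposed load balancer timeout."""
    problems = []
    if lb_timeout_seconds < MIN_LB_TIMEOUT_SECONDS:
        problems.append(
            f"load balancer timeout {lb_timeout_seconds}s is below the "
            f"recommended minimum of {MIN_LB_TIMEOUT_SECONDS}s"
        )
    if (
        server_agent_timeout_seconds is not None
        and lb_timeout_seconds < server_agent_timeout_seconds
    ):
        problems.append(
            f"load balancer timeout {lb_timeout_seconds}s is shorter than the "
            f"server and agent timeout of {server_agent_timeout_seconds}s"
        )
    return problems


if __name__ == "__main__":
    # The 60 and 120 second values are made up to trigger both checks.
    for problem in lb_timeout_problems(60, server_agent_timeout_seconds=120):
        print(problem)
```

An empty list means the proposed timeout satisfies both conditions as modeled here.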