Configure disaster recovery
To configure disaster recovery, you must provision a replica to serve as backup during failovers. If your primary server is permanently disabled, you can then promote a replica.
- Ensure your installation meets the disaster recovery system and software requirements.
- Ensure you have an admin RBAC token that is valid for at least an hour, so that it does not expire during the provisioning process. You can delete the token after provisioning is complete.
- Ensure Code Manager is enabled and configured on your primary server.
- Move any tuning parameters that you set for your primary server using the console to Hiera. Using Hiera ensures configuration is applied to both your primary server and replica.
- If you're using an r10k private key for code management, set puppet_enterprise::profile::master::r10k_private_key in pe.conf. This ensures that the r10k private key is synced to your replica. (A pe.conf sketch follows this list.)
- Back up your classifier hierarchy, because enabling a replica alters classification.
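A minimal pe.conf sketch for the r10k private key setting; the key file path is a placeholder, so point it at wherever your control repository deploy key actually lives:
# /etc/puppetlabs/enterprise/conf.d/pe.conf
"puppet_enterprise::profile::master::r10k_private_key": "/etc/puppetlabs/puppetserver/ssh/id-control_repo.rsa"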
The puppet infrastructure commands used to configure and manage disaster recovery require a valid admin RBAC token, and all commands must be run from a root session. Running with elevated privileges via sudo puppet infrastructure is not sufficient. Instead, start a root session by running sudo su -, and then run the puppet infrastructure command. For details about these commands, run puppet infrastructure help <ACTION>. For example: puppet infrastructure help provision.
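If you need to generate a suitably long-lived token, a minimal sketch using the puppet-access CLI (the one-hour lifetime value is an example; adjust as needed):
puppet access login --lifetime 1h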
Provision and enable a replica
Provisioning a replica duplicates specific components and services from the primary server to the replica. Enabling a replica activates most of its duplicated services and components, and instructs agents and infrastructure nodes how to communicate in a failover scenario.
- Ensure you have completed the steps outlined in the Configure disaster recovery section.
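The basic provisioning command, run from a root session on the primary server; the replica hostname is a placeholder, and --enable enables the replica in the same run:
puppet infrastructure provision replica <REPLICA_HOSTNAME> --enable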
Managing agent communication in multi-region installations
Typically, when you enable a replica by using puppet infrastructure enable replica or puppet infrastructure provision replica --enable, the configuration tool automatically sets the same communication parameters for all agents. In multi-region installations, with load balancers or compilers in multiple locations, you must manually configure agent communication settings so that agents fail over to the appropriate load balancer or compiler. To prevent the tool from setting these parameters automatically, use the --skip-agent-config flag when you provision and enable a replica, for example:
puppet infrastructure provision replica example.puppet.com --enable --skip-agent-config
To manually configure which load balancer or compiler agents communicate with, use one of these options:
- CSR attributes (see the sketch after this list)
  - For each node, include a CSR attribute that identifies the location of the node, for example pp_region or pp_datacenter.
  - Create child groups off of the PE Agent node group for each location.
  - In each child node group, include the puppet_enterprise::profile::agent class and set the server_list parameter to the appropriate load balancer or compiler hostname.
  - In each child node group, add a rule that uses the trusted fact created from the CSR attribute.
- Hiera (see the sketch after this list)
  - For each node or group of nodes, create a key/value pair that sets the puppet_enterprise::profile::agent::server_list parameter to be used by the PE Agent node group.
- A custom method that sets the server_list parameter in puppet.conf.
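A minimal sketch of the CSR attribute and Hiera pieces, assuming a hypothetical pp_region value of us-west and placeholder load balancer hostnames; the Hiera file path depends on how your own hierarchy is laid out:
# On each agent, before its certificate is signed: /etc/puppetlabs/puppet/csr_attributes.yaml
extension_requests:
  pp_region: us-west

# Hiera data matched to that region, for example data/region/us-west.yaml
puppet_enterprise::profile::agent::server_list:
  - lb-us-west.example.com
  - lb-us-east.example.com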
Provision and enable a replica for a PEADM installation
The process to provision and enable a replica differs for standard, large, and extra-large PEADM-configured installations.
- A standard or large PEADM installation must use the peadm::add_replica plan to expand to a standard disaster recovery or large disaster recovery installation.
- An extra-large PEADM installation must first use the peadm::add_database plan to prepare a replica PostgreSQL node before using the peadm::add_replica plan to complete expansion to an extra-large disaster recovery installation. When running the peadm::add_replica plan, you must also set the replica_postgresql_host parameter to the database host you just added with the peadm::add_database plan.
- In a standard disaster recovery configuration, or an extra-large disaster recovery configuration without compilers, you must manually reset the PE Agent node group's puppet_enterprise::profile::agent parameters with the new primary and replica addresses.
For more information, see https://github.com/puppetlabs/puppetlabs-peadm/blob/main/documentation/expanding.md. (A sketch of the plan invocations follows.)
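A hedged sketch of the plan invocations described above. The hostnames are placeholders, and the primary_host, replica_host, and targets parameter names are assumptions to verify against the linked PEADM documentation for your version; replica_postgresql_host is the parameter named in the text above:
# Standard or large PEADM installation: add a replica (hostnames are placeholders)
bolt plan run peadm::add_replica primary_host=pe-primary.example.com replica_host=pe-replica.example.com

# Extra-large PEADM installation: add a replica PostgreSQL node first, then the replica
bolt plan run peadm::add_database primary_host=pe-primary.example.com targets=pe-psql-replica.example.com
bolt plan run peadm::add_replica primary_host=pe-primary.example.com replica_host=pe-replica.example.com replica_postgresql_host=pe-psql-replica.example.com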
Promote a replica
If your primary server can’t be restored, you can promote the replica to establish it as the new, permanent primary server.
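A minimal sketch of the promotion step, run on the replica as the root user; confirm the exact subcommand against your PE version's documentation:
puppet infrastructure promote replica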
Enable a new replica using a failed primary server
After promoting a replica, you can use your old primary server as a new replica, effectively swapping the roles of your failed primary server and promoted replica.
The puppet infrastructure run enable_ha_failover command detailed here leverages a built-in Bolt plan. To use this command, you must be able to connect using SSH from your current primary server to the failed primary server. You can establish an SSH connection using key forwarding, a local key file, or by specifying keys in .ssh/config on your primary server. Additionally, the tasks used by the plan must run as root, so specify the --run-as root flag with the command, as well as --sudo-password if necessary. For more information, see Bolt OpenSSH configuration options.
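For the local-key-file approach, a minimal ~/.ssh/config sketch on the current primary server, with a placeholder hostname and key path:
Host failed-primary.example.com
    IdentityFile /root/.ssh/id_rsa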
By default, the enable_ha_failover plan uses its own RBAC user to perform the provision and enable commands. If you want to use a specific user instead, specify the RBAC parameters to the command. To view all available parameters, use the --help flag. The logs for this and all puppet infrastructure run Bolt plans are located at /var/log/puppetlabs/installer/bolt_info.log.
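For example, to follow the plan's progress from another root session (the log path is the one given above):
tail -f /var/log/puppetlabs/installer/bolt_info.log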
If you use the uninstall workflow (uninstall_workflow=true), the plan moves /var/log/puppetlabs to /var/log/puppetlabs_<timestamp>.
To repurpose a failed primary server as a new replica, run puppet infrastructure run enable_ha_failover, specifying these parameters:
- host — Hostname of the failed primary server. This node becomes your new replica.
- topology — The architecture used in your environment, either mono (for a standard installation) or mono-with-compile (for a large installation). For mono-with-compile, you must specify either skip_agent_config, or both agent_server_urls and pcp_brokers.
- skip_agent_config — Optional. Specifying this parameter with topology=mono-with-compile skips configuring puppet.conf on non-infrastructure agent nodes. This parameter is ignored when topology=mono.
- agent_server_urls — Optional. Used with topology=mono-with-compile to specify the server_list parameter in puppet.conf on all agent nodes. This parameter is ignored when topology=mono.
- pcp_brokers — Optional. Used with topology=mono-with-compile to list the PCP brokers for the PXP agent's configuration file. This parameter is ignored when topology=mono.
- dns_alt_names — Optional. A comma-separated list of DNS alt names to add to the host's certificate.
- rbac_account — Optional. The RBAC account you want to use to run the provision and enable commands, instead of the built-in enterprise_tasks user.
- rbac_password — Optional. The password for the RBAC account you want to use to run the provision and enable commands.
- replication_timeout_secs — Optional. The number of seconds allowed to complete provisioning and enabling of the new replica before the command fails.
- uninstall_workflow — Optional. Use the uninstall/reinstall workflow instead of the default workflow.
- force — Skip some checks when running the plan.
For example:
puppet infrastructure run enable_ha_failover host=<FAILED_PRIMARY_HOSTNAME> topology=mono
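A hedged sketch for a large (mono-with-compile) installation, combining the plan parameters above with the Bolt option mentioned earlier; the angle-bracket values are placeholders, and the exact value formats for agent_server_urls and pcp_brokers should be checked against your PE version's documentation:
puppet infrastructure run enable_ha_failover host=<FAILED_PRIMARY_HOSTNAME> topology=mono-with-compile agent_server_urls=<LOAD_BALANCER_OR_COMPILER_URLS> pcp_brokers=<PCP_BROKER_ADDRESSES> --run-as root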
Enable a new replica using a failed primary server for a PEADM installation
To reuse an old primary server as a new replica on a PEADM-configured installation, see https://github.com/puppetlabs/puppetlabs-peadm/blob/main/documentation/automated_recovery.md.
Forget a replica
Forgetting a replica removes the replica from classification and database state and purges the node.
Ensure you have a valid admin RBAC token and that the replica you want to remove is permanently offline.
Run the forget command whenever a replica node is destroyed, even if you plan to replace it with a replica of the same name.
You can also follow this process if your replica is offline for an extended period. When the replica is offline, PostgreSQL Write-Ahead Log (WAL) files build up on the primary server, potentially consuming excessive disk space. To avoid this, you can run the forget command and then reprovision the replica.
- On the primary server, as the root user, run puppet infrastructure forget <REPLICA NODE NAME>.
- If the replica node still exists, run puppet-enterprise-uninstaller -y -p -d to completely remove Puppet Enterprise from the node. This action helps to avoid security risks associated with leaving sensitive information in the PostgreSQL database and secret keys on a replica.
Reinitialize a replica
If puppet infrastructure status shows errors on your replica after provisioning, you can reinitialize the replica. Reinitializing destroys and re-creates replica databases, except for PuppetDB. This process is usually quick because non-PuppetDB databases are relatively small.
Reinitialization is not intended to fix slow queries or intermittent failures. Reinitialize your replica only if it’s not operational or if you encounter replication errors on non-PuppetDB databases.
- On the replica, as the root user, run puppet infrastructure reinitialize replica.
- Follow prompts to complete the reinitialization. You can use the -y flag to bypass the prompts.
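For example, to reinitialize without prompts, using the flag described above:
puppet infrastructure reinitialize replica -y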