Configure disaster recovery

To configure disaster recovery, you must provision a replica to serve as backup during failovers. If your primary server is permanently disabled, you can then promote a replica.

Before you begin
  • Review and apply the disaster recovery system and software requirements.
  • Ensure you have an admin RBAC token that is valid for at least an hour so that it does not expire during the provisioning process. You can delete the token after provisioning is complete.
  • Ensure Code Manager is enabled and configured on your primary server.
  • Move any tuning parameters that you set for your primary server using the console to Hiera. Using Hiera ensures configuration is applied to both your primary server and replica.
  • If you're using an r10k private key for code management, set puppet_enterprise::profile::master::r10k_private_key in pe.conf. This ensures that the r10k private key is synced to your primary server replica (see the example after this list).
  • Back up your classifier hierarchy, because enabling a replica alters classification.
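For example, a minimal pe.conf entry might look like the following sketch; the key path is a placeholder, so substitute the path to your own r10k private key:
"puppet_enterprise::profile::master::r10k_private_key": "/etc/puppetlabs/puppetserver/ssh/id-control_repo.rsa"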
Tip: Some of the puppet infrastructure commands used to configure and manage disaster recovery require a valid admin RBAC token, and all of these commands must be run from a root session. Running commands with elevated privileges via sudo puppet infrastructure is not sufficient. Instead, start a root session by running sudo su -, and then run the puppet infrastructure command. For details about these commands, run puppet infrastructure help <ACTION>, as shown in the following example.
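To start a root session and then view help for the provision action:
sudo su -
puppet infrastructure help provision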

Provision and enable a replica

Provisioning a replica duplicates specific components and services from the primary server to the replica. Enabling a replica activates most of its duplicated services and components, and instructs agents and infrastructure nodes how to communicate in a failover scenario.

Before you begin
Important: The process outlined here isn't suitable if your installation is configured by the Puppet Enterprise Administration Module (PEADM). See Provision and enable a replica for a PEADM installation.
  1. Configure infrastructure agents to connect orchestration agents to the primary server.
    1. In the console, click Node groups, and in the PE Infrastructure group, select the PE Agent > PE Infrastructure Agent group.
    2. If you manage your load balancers with agents, on the Rules tab, pin load balancers to the group.
      Pinning load balancers to the PE Infrastructure Agent group ensures that they communicate directly with the primary server.
    3. On the Classes tab, find the puppet_enterprise::profile::agent class and specify these parameters:
      Parameter Value
      manage_puppet_conf Specify true to ensure that your setting for server_list is configured in the expected location and persists through Puppet runs. This is the default value.
      pcp_broker_list Hostname for your primary server. Hostnames must include port 8142, for example ["PRIMARY.EXAMPLE.COM:8142"].
      primary_uris Hostname for your primary server, for example ["PRIMARY.EXAMPLE.COM"]. This setting assumes port 8140 unless you specify otherwise with host:port.
      server_list Hostname for your primary server, for example ["PRIMARY.EXAMPLE.COM"]. This setting assumes port 8140 unless you specify otherwise with host:port.
    4. Remove any values set for pcp_broker_ws_uris.
    5. Commit changes.
    6. Run Puppet on all agents classified into the PE Infrastructure Agent group.
  2. On the primary server, as the root user, run puppet infrastructure provision replica <REPLICA NODE NAME> --enable
    Tip:

    The default replica --enable command adds the replica to the server list of the PE Agent node group, which causes puppet.conf on all agents to include the new server. However, if you include the --skip-agent-config flag, the replica is added to the server list of the PE Infrastructure Agent node group (a child of the PE Agent node group), which affects only puppet.conf on your infrastructure nodes (including compilers).

    In installations with compilers, use the --skip-agent-config flag with --enable if you want to:
    • Upgrade a replica without needing to run Puppet on all agents.
    • Add disaster recovery to an installation without modifying the configuration of existing load balancers.
    • Manually configure which load balancer agents communicate with in multi-region installations. See Managing agent communication in multi-region installations.
  3. Copy your secret key files from the primary server to the replica (a consolidated sketch of steps 2 and 3 follows this procedure).
    The secret key files are located at:
    • /etc/puppetlabs/orchestration-services/conf.d/secrets/keys.json
    • /etc/puppetlabs/orchestration-services/conf.d/secrets/orchestrator-encryption-keys.json
    • /etc/puppetlabs/console-services/conf.d/secrets/keys.json
    Important: If you do not copy your secret key files onto your replica, the replica generates new secret key files when you promote it. This prevents you from accessing LDAP, and prevents services from accessing encrypted information in PE databases.
  4. Verify that the contents of the global layer Hiera file on the new replica, located at /etc/puppetlabs/puppet/hiera.yaml, match the contents of the global layer Hiera file on the primary server.
    • If necessary, update hiera.yaml on the replica to match hiera.yaml on the primary server.
    • If you use code to manage the contents of hiera.yaml on the primary server, ensure that the new replica is also classified to manage the contents of its own hiera.yaml file.
  5. Optional: Verify that all services running on the primary server are also running on the replica:
    1. From the primary server, run puppet infrastructure status --verbose to verify that the replica is available.
    2. From any managed node, run puppet agent -t --noop --server_list=<REPLICA HOSTNAME>. If the replica is correctly configured, the Puppet run succeeds and shows no changed resources.
  6. Optional: Deploy updated configuration to agents by running Puppet, or wait for the next scheduled Puppet run.

    If you used the --skip-agent-config option, you can skip this step.

    Note: If you use the direct Puppet workflow, where agents use cached catalogs, you must manually deploy the new configuration by running:
    puppet job run --no-enforce-environment --query 'nodes {deactivated is null and expired is null}'
  7. Optional: Perform any tests you feel are necessary to verify that Puppet runs continue to work during failover. For example, to simulate an outage on the primary server:
    1. Prevent the replica and a test node from contacting the primary server. For example, you might temporarily shut down the primary server or use iptables with drop mode.
    2. Run puppet agent -t on the test node. If the replica is correctly configured, the Puppet run succeeds and shows no changed resources. Runs might take longer than normal when in failover mode.
    3. Reconnect the replica and test node.
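
As a consolidated, hedged sketch of steps 2 and 3, you might run commands like the following on the primary server. This assumes root SSH access from the primary server to the replica; REPLICA.EXAMPLE.COM is a placeholder for your replica's hostname.
puppet infrastructure provision replica REPLICA.EXAMPLE.COM --enable
scp /etc/puppetlabs/orchestration-services/conf.d/secrets/keys.json REPLICA.EXAMPLE.COM:/etc/puppetlabs/orchestration-services/conf.d/secrets/
scp /etc/puppetlabs/orchestration-services/conf.d/secrets/orchestrator-encryption-keys.json REPLICA.EXAMPLE.COM:/etc/puppetlabs/orchestration-services/conf.d/secrets/
scp /etc/puppetlabs/console-services/conf.d/secrets/keys.json REPLICA.EXAMPLE.COM:/etc/puppetlabs/console-services/conf.d/secrets/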

Managing agent communication in multi-region installations

Typically, when you enable a replica by using puppet infrastructure enable replica or puppet infrastructure provision replica --enable, the configuration tool automatically sets the same communication parameters for all agents. In multi-region installations, with load balancers or compilers in multiple locations, you must manually configure agent communication settings so that agents fail over to the appropriate load balancer or compiler.

To skip automatically configuring which Puppet servers and PCP brokers agents communicate with, use the --skip-agent-config flag when you provision and enable a replica, for example:
puppet infrastructure provision replica example.puppet.com --enable --skip-agent-config

To manually configure which load balancer or compiler agents communicate with, use one of these options:

  • CSR attributes
    1. For each node, include a CSR attribute that identifies the location of the node, for example pp_region or pp_datacenter.

    2. Create child groups off of the PE Agent node group for each location.

    3. In each child node group, include the puppet_enterprise::profile::agent class and set the server_list parameter to the appropriate load balancer or compiler hostname.

    4. In each child node group, add a rule that uses the trusted fact created from the CSR attribute.

  • Hiera

    For each node or group of nodes, create a key/value pair that sets the puppet_enterprise::profile::agent::server_list parameter used by the PE Agent node group (see the example after this list).

  • Custom method that sets the server_list parameter in puppet.conf.
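
For example, a Hiera data entry for a group of nodes in one location might look like the following sketch; the hostnames, and whether you list one server or several, are placeholders for your own environment:
puppet_enterprise::profile::agent::server_list:
  - "LOADBALANCER-WEST.EXAMPLE.COM"
  - "LOADBALANCER-WEST-REPLICA.EXAMPLE.COM"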

Provision and enable a replica for a PEADM installation

Before you begin
Ensure you have completed the steps outlined in the Configure disaster recovery section.

The process to provision and enable a replica varies for standard, large, and extra-large PEADM-configured installations:

  • A standard or large PEADM installation must use the peadm::add_replica plan to expand to a standard or large disaster recovery installation (see the example after this list).
  • An extra-large PEADM installation must first use the peadm::add_database plan to prepare a replica PostgreSQL node, and then use the peadm::add_replica plan to complete the expansion to an extra-large disaster recovery installation. When running the peadm::add_replica plan, you must also set the replica_postgresql_host parameter to the database host you just added with the peadm::add_database plan.
  • In a standard disaster recovery configuration, or in an extra-large disaster recovery configuration without compilers, you must manually reset the PE Agent node group's puppet_enterprise::profile::agent parameters to the new primary and replica addresses.
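
For example, a hedged peadm::add_replica invocation from your PEADM jump host might look like this; the hostnames are placeholders, and the full parameter list is described in the PEADM documentation linked below:
bolt plan run peadm::add_replica primary_host=PRIMARY.EXAMPLE.COM replica_host=REPLICA.EXAMPLE.COM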
What to do next

For more information, see:

https://github.com/puppetlabs/puppetlabs-peadm/blob/main/documentation/expanding.md.

Promote a replica

If your primary server can’t be restored, you can promote the replica to establish it as the new, permanent primary server.

  1. Verify that the primary server is permanently offline.
    If the primary server comes back online during promotion, your agents can get confused trying to connect to two active primary servers, and replication between the primary server and replica could cause additional issues with the promotion process. If you still have access to the primary server, stop all PE services by running systemctl stop puppet and then systemctl stop pe-*.
  2. On the replica, as the root user, run puppet infrastructure promote replica (a brief sketch of steps 1 and 2 follows this procedure).
    Promotion can take up to the amount of time it took to install PE initially. Don’t make code or classification changes during promotion.
  3. When promotion is complete, update any systems or settings that refer to the old primary server, such as PE client tool configurations, Code Manager hooks, and CNAME records.
  4. Deploy updated configuration to nodes by running Puppet or waiting for the next scheduled run.
    Note: In case of a failover, scheduled Puppet and task runs are rescheduled based on the last execution time.
  5. If you have a SAML identity provider (IdP) configured for single sign-on access in PE, specify your replica's new URLs and certificate in your IdP's configuration.
    After promotion, view the replica's URLs and certificate in the console on the Access control page, on the SSO tab, under Show configuration information. Because your SAML IdP isn't connected to your replica yet, you'll need to log into the console using a local PE or LDAP account to get the URLs and certificate.
  6. Optional: Provision a new replica in order to maintain disaster recovery.
    Note: Agent configuration must be updated before provisioning a new replica. If you re-use your old primary server’s node name for the new replica, agents with outdated configuration might use the new replica as a primary server before it’s fully provisioned.
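
As a brief sketch of steps 1 and 2, assuming you still have shell access to the failed primary server:
# On the old primary server, if it is still reachable:
systemctl stop puppet
systemctl stop pe-*
# On the replica, as the root user:
puppet infrastructure promote replica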

Enable a new replica using a failed primary server

After promoting a replica, you can use your old primary server as a new replica, effectively swapping the roles of your failed primary server and promoted replica.

Before you begin

The puppet infrastructure run enable_ha_failover command detailed here leverages a built-in Bolt plan. To use this command, you must be able to connect using SSH from your current primary server to the failed primary server. You can establish an SSH connection using key forwarding, a local key file, or keys specified in .ssh/config on your primary server. Additionally, the tasks used by the plan must run as root, so specify the --run-as root flag with the command, as well as --sudo-password if necessary. For more information, see Bolt OpenSSH configuration options.
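
For example, a hedged .ssh/config entry on the current primary server might look like this; the hostname and key path are placeholders:
Host FAILED-PRIMARY.EXAMPLE.COM
    IdentityFile /root/.ssh/id_rsa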

Important: The process outlined here isn't suitable if your installation is configured by the Puppet Enterprise Administration Module (PEADM). See Enable a new replica using a failed primary server for a PEADM installation.

By default, the enable_ha_failover plan uses its own RBAC user to perform the provision and enable commands. If you want to use a specific user instead, specify the rbac_account and rbac_password parameters with the command.

To view all available parameters, use the --help flag. The logs for this and all puppet infrastructure run Bolt plans are located at /var/log/puppetlabs/installer/bolt_info.log.

To minimize the time required to enable a new replica from a failed primary server, the default plan attempts to provision the failed server as a replica by retaining the existing PuppetDB database. However, if the failed server has been offline for an extended period, the backlog of data may cause synchronization issues with the current primary, especially if the rate of new data generation is higher than PuppetDB's sync capacity. In such cases, consider using the alternative workflow to completely reinstall Puppet Enterprise and re-provision the replica. To activate this workflow, you can run the plan with uninstall_workflow=true.
Note: The alternative workflow takes longer as it involves copying the entire PuppetDB database to the node. This workflow also backs up the node's log history by saving the contents of /var/log/puppetlabs to /var/log/puppetlabs_<timestamp>.

To repurpose a failed primary server as a new replica, run the enable_ha_failover plan as follows:

On your promoted replica, as the root user, run puppet infrastructure run enable_ha_failover, specifying these parameters:
  • host — Hostname of the failed primary server. This node becomes your new replica.

  • topology — The architecture used in your environment, either mono (for a standard installation) or mono-with-compile (for a large installation). For mono-with-compile, you must specify either skip_agent_config, or both agent_server_urls and pcp_brokers.

    • skip_agent_config — Optional. Specifying this parameter with topology=mono-with-compile skips configuring puppet.conf on non-infrastructure agent nodes. This parameter is ignored when topology=mono.
    • agent_server_urls — Optional. Used with topology=mono-with-compile to specify the server_list parameter in puppet.conf on all agent nodes. This parameter is ignored when topology=mono.
    • pcp_brokers — Optional. Used with topology=mono-with-compile to list the PCP brokers for the PXP agent's configuration file. This parameter is ignored when topology=mono.
  • dns_alt_names — Optional. A comma-separated list of DNS alt names to add to the host’s certificate.
  • rbac_account — Optional. The RBAC account you want to use to run the provision and enable commands, instead of the built-in enterprise_tasks user.
  • rbac_password — Optional. The password for the RBAC account you want to use to run the provision and enable commands.
  • replication_timeout_secs — Optional. The number of seconds allowed to complete provisioning and enabling of the new replica before the command fails.

  • uninstall_workflow — Optional. Use the uninstall/reinstall workflow instead of the default workflow.
  • force — Skip some checks when running the plan.
For example:
puppet infrastructure run enable_ha_failover host=<FAILED_PRIMARY_HOSTNAME> topology=mono
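In installations with compilers, an invocation might instead specify the mono-with-compile topology. The following sketch assumes the same key=value syntax shown above also applies to the skip_agent_config parameter:
puppet infrastructure run enable_ha_failover host=<FAILED_PRIMARY_HOSTNAME> topology=mono-with-compile skip_agent_config=true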
Results
The failed primary server is repurposed as a new replica.

Enable a new replica using a failed primary server for a PEADM installation

To reuse an old primary server as a new replica on a PEADM-configured installation:

  1. Uninstall PE on the old primary server: puppet-enterprise-uninstaller -ypd.
  2. From your PEADM jump host, run the peadm::add_replica plan with primary_host set to your newly promoted primary server and replica_host set to the old primary server that you just uninstalled.

    When performing this in an extra-large disaster recovery environment, you must also supply the replica_postgresql_host parameter (see the example after these steps).

  3. In a standard disaster recovery configuration, or in an extra-large disaster recovery configuration without compilers, you must manually reset the PE Agent node group's puppet_enterprise::profile::agent parameters to the new primary and replica addresses.
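
For example, a hedged invocation for an extra-large disaster recovery environment might look like this; the hostnames are placeholders:
bolt plan run peadm::add_replica primary_host=NEW-PRIMARY.EXAMPLE.COM replica_host=OLD-PRIMARY.EXAMPLE.COM replica_postgresql_host=REPLICA-POSTGRESQL.EXAMPLE.COM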
What to do next

For more information, see:

https://github.com/puppetlabs/puppetlabs-peadm/blob/main/documentation/automated_recovery.md

Forget a replica

Forgetting a replica removes the replica from classification and database state and purges the node.

Before you begin

Ensure you have a valid admin RBAC token and the replica you want to remove is permanently offline.

Run the forget command whenever a replica node is destroyed, even if you plan to replace it with a new replica of the same name.

You can also follow this process if your replica is offline for an extended period. When the replica is offline, PostgreSQL Write-Ahead Log (WAL) files build up on the primary server, potentially consuming excessive disk space. To avoid this, you can run the forget command and then reprovision the replica.

  1. On the primary server, as the root user, run puppet infrastructure forget <REPLICA NODE NAME>
  2. If the replica node still exists, run puppet-enterprise-uninstaller -y -p -d to completely remove Puppet Enterprise from the node. This helps you avoid the security risks associated with leaving sensitive information in the PostgreSQL database and secret keys on a replica.
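For example, with a placeholder replica hostname:
puppet infrastructure forget REPLICA.EXAMPLE.COM
# Then, on the replica node itself, if it still exists:
puppet-enterprise-uninstaller -y -p -d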
Results
The replica is decommissioned, the node is purged as an agent, secret key information is deleted, and a Puppet run is completed on the primary server.

Reinitialize a replica

If puppet infrastructure status shows errors on your replica after provisioning, you can reinitialize the replica. Reinitializing destroys and re-creates replica databases, except for PuppetDB. This process is usually quick because non-PuppetDB databases are relatively small.

Before you begin
Your primary server must be fully functional and the replica must be able to communicate with the primary server.
CAUTION: If you reinitialize a functional enabled replica, the replica is unavailable to serve as backup in a failover during reinitialization.

Reinitialization is not intended to fix slow queries or intermittent failures. Reinitialize your replica only if it’s not operational or if you encounter replication errors on non-PuppetDB databases.

  1. On the replica, as the root user, run puppet infrastructure reinitialize replica.
    1. Optionally, you can reinitialize a single database with puppet infrastructure reinitialize replica --db <DATABASE>, replacing <DATABASE> with one of the following:
      • pe-activity
      • pe-classifier
      • pe-orchestrator
      • pe-inventory
      • pe-rbac
  2. Follow the prompts to complete the reinitialization. You can use the -y flag to bypass the prompts.
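For example, to reinitialize only the pe-rbac database and bypass the prompts:
puppet infrastructure reinitialize replica --db pe-rbac -y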