Reading and resolving Puppet errors
How to understand errors
When your Puppet run shows up red, don't panic! You need to figure out what's causing the error so you can tell whether it's something you can fix or an actual bug.
Generally, the first error in a Puppet run is the one you need to concern yourself with. Once the first thing fails, other resources that run after it or depend on it succeeding can also fail, and that can produce a LOT of red text! So scroll to the first red error in your Puppet run. If you're using the Report Viewer in Puppet Enterprise and sort by level, it will be the first red block you see.
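If you're reading the console output of a run (for example, from puppet agent -t) instead of the Report Viewer, the same rule applies. The Python sketch below is one way to jump straight to that first error. It assumes each message line starts with its log level, such as Error:, Warning:, or Notice:, and it strips ANSI color codes in case the output was captured with color on. The function name and the sample log are made up for illustration.

```python
import re

# ANSI color escape sequences, e.g. the codes that make errors show up red.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def first_error(log_text):
    """Return the first line that starts with 'Error:', or None if there isn't one."""
    for raw_line in log_text.splitlines():
        line = ANSI_ESCAPE.sub("", raw_line).strip()
        if line.startswith("Error:"):
            return line
    return None


# Hypothetical run output: one real failure, then a dependent resource that is skipped.
sample_log = """\
Info: Applying configuration version '1234567890'
Error: /Stage[main]/Example/Exec[first_failure]/returns: change from 'notrun' to ['0'] failed
Warning: /Stage[main]/Example/Exec[depends_on_it]: Skipping because of failed dependencies
Notice: Applied catalog in 12.34 seconds
"""

print(first_error(sample_log))
```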
I'm going to use a mix of complex resource examples and some simpler ones, just to show that the general thought process is the same.
Now, take an error like the one below:
That's a lot of words! Stay calm and read it through. You can see that a DSC resource called MSFT_SqlWaitForAG failed. The name gives you a hint: it failed while waiting for an AG (Availability Group). There are some parts about Set-TargetResource, and then it mentions another error message: Cluster group cluster_example0111 not found after 30 attempts with 20 sec interval. That seems pretty self-explanatory; it failed because it couldn't find the cluster_example0111 group! So that's where troubleshooting should start. Figure out what creates that Availability Group, and see if you can identify why it isn't there.
Since the above example is from the mssqlagcluster module, we can cheat a bit, because we know that only node_n creates the group. So you'd want to look at the node that is supposed to create the AG for that cluster and check whether it has created the group yet. This starts diving deeper into SQL clustering knowledge, which we won't cover in this post.
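It can help to picture what a "wait for" resource does, because that explains why the message reads the way it does. The Python sketch below is a generic poll-and-retry loop, not the actual MSFT_SqlWaitForAG code (which is PowerShell); the wait_for name, its parameters, and the way the check is passed in are all illustrative assumptions. The point is that the waiting resource did its job: it checked repeatedly and then gave up. The real question is why the thing it was waiting for never appeared.

```python
import time


def wait_for(check, attempts=30, interval_sec=20, sleep=time.sleep):
    """Call check() until it returns True; raise TimeoutError after `attempts` tries.

    Hypothetical helper illustrating a generic wait-with-retries loop.
    Returns the attempt number on which check() first succeeded.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(interval_sec)  # wait before the next attempt
    raise TimeoutError(
        f"not found after {attempts} attempts with {interval_sec} sec interval"
    )


# Hypothetical check: the cluster group never shows up, so the wait times out.
# A no-op sleep keeps the demo fast instead of really waiting 30 x 20 seconds.
cluster_groups = set()
try:
    wait_for(lambda: "cluster_example0111" in cluster_groups, sleep=lambda _: None)
except TimeoutError as err:
    # prints: Cluster group cluster_example0111 not found after 30 attempts with 20 sec interval
    print(f"Cluster group cluster_example0111 {err}")
```

Injecting the sleep function is just a convenience so the loop can be exercised without waiting; a real resource would sleep for the full interval between checks.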
Alright, here's another one:
Exec resources have a default state of 'notrun', rather than an exit code, until they get evaluated on the node. So when you see "change from 'notrun'", that's common lingo for a resource that makes an exec call somewhere. It just means that when Puppet ran the command, it didn't get an exit code of 0 from the last command; it got something else (0 is the default for the exec resource's returns attribute, which lists the exit codes that count as success). The exec resource is only as smart as what we tell it to do, so it simply looks at the result of the command in the most computer-y way possible and relies on the command itself to report its exit state. For a deeper dive into the exec resource and how it works, particularly with PowerShell, check out this blog post.
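The idea that exec judges success purely by exit code is easy to see in miniature. This Python sketch is a simplified model of that decision, not Puppet's implementation: it runs a command and treats it as successful only if the exit code is in an allowed list, defaulting to 0 the way exec's returns attribute does. The run_like_exec name and the sample command are hypothetical; sys.executable is used so the example runs anywhere Python does.

```python
import subprocess
import sys


def run_like_exec(command, returns=(0,)):
    """Run command; succeed only if its exit code is in `returns`.

    Simplified model of how an exec resource judges success: the output is
    captured for reading, but only the exit code decides the outcome.
    Returns (succeeded, exit_code, combined_output).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode in returns, result.returncode, result.stdout + result.stderr


# Hypothetical script whose last step exits with 1, e.g. because a database is busy.
busy = [sys.executable, "-c",
        "import sys; print('Database is in the middle of a restore'); sys.exit(1)"]
ok, code, output = run_like_exec(busy)
print(ok, code, output.strip())  # prints: False 1 Database is in the middle of a restore
```

Passing returns=(0, 1) would make the same command count as a success, which is why the command itself needs to exit non-zero only when something really went wrong.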
Keep reading, and you'll see the helpful phrase "error message" again: THROW CAUGHT: Database 'Sample' cannot be opened. It is in the middle of a restore. Perfect! This tells us exactly why it couldn't do the steps it needed to do. The database is busy restoring itself, or at least that's the response Puppet is getting!
Everything else in the error message tells you where to find the Puppet code that produced the error. This is handy if it's an error you can fix, because it shows the file AND line number to look at! Nice!
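When the message is long, you can pull that location out programmatically. This Python sketch assumes the location appears in the "(file: <path>, line: <number>" form that Puppet commonly appends to errors; if your output uses a different layout, adjust the pattern. The code_location name, the sample error line, and its path are hypothetical.

```python
import re

# Assumed location format: "(file: <path>, line: <number>", optionally followed by a column.
LOCATION = re.compile(r"\(file: (?P<file>[^,]+), line: (?P<line>\d+)")


def code_location(error_line):
    """Return (file, line) from an error line, or None if no location is present."""
    match = LOCATION.search(error_line)
    if match is None:
        return None
    return match.group("file"), int(match.group("line"))


# Hypothetical error line; the message, path, and line number are made up.
example = ("Error: <message from the failing resource> "
           "(file: /etc/puppetlabs/code/environments/production/modules/example/manifests/init.pp, "
           "line: 12)")
print(code_location(example))
```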
Now here's a beast of an error message:
Holy words, Batman! Scan it quickly for something that says “error message.” Nope, nothing! How about “failed”? There, near the end!
Failed to set permissions for 'CONTOSO\Monkeys': User or users do not exist.
That makes the cause pretty straightforward: the account doesn't exist. Unfortunately, the code reference only points you to the code for the resource that is generating the error, so you have the task of searching the code and figuring out what should be creating that user. It's also possible that the user isn't being created anywhere in the Puppet code, and it should be, to ensure idempotency!
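A quick way to start that search is to look through your manifests for the account name. The Python sketch below is a plain-text, case-insensitive search limited to .pp files, roughly what a recursive grep would do; the function names are hypothetical. Searching for the bare name (Monkeys) rather than CONTOSO\Monkeys avoids missing matches where the backslash is escaped differently in the code. If it finds nothing, that's a strong hint the user isn't managed by Puppet at all.

```python
from pathlib import Path


def matching_lines(text, needle):
    """Yield (line_number, line) for lines containing needle, ignoring case."""
    for number, line in enumerate(text.splitlines(), start=1):
        if needle.lower() in line.lower():
            yield number, line.strip()


def find_in_manifests(root, needle):
    """Search every .pp file under root and return (path, line_number, line) hits."""
    root = Path(root)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    hits = []
    for path in sorted(root.rglob("*.pp")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in matching_lines(text, needle):
            hits.append((str(path), number, line))
    return hits
```

For example, find_in_manifests("/etc/puppetlabs/code/environments/production", "Monkeys") would list every .pp line that mentions the account; that path is the default production environment on a Linux Puppet server and is only an assumption, so point it at wherever your control repo or modules live.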
And lastly, a really weird one:
This is one of those errors worth putting into Google, and you'll get a lot of results. In cases like this, run Puppet again, or check whether the subsequent report had the same error. If it's still spitting out the same error run after run, it's time to check your Puppet code, see whether you can modify the resource it mentions, and iron it out. It's also possible there's actually a bug of some sort, and THAT is when it's time to use the --debug flag and dig deep. In this particular case, we found out that the DSC resource for Windows features made some bold assumptions and silently failed when certain services weren't running! That's a good case for using native Puppet resources, since there's already a tried-and-true module for managing Windows features.
In summary, read through the error message. Puppet isn't magic, but it will generally tell you in the error message why the error occurred, so you can track the issue down quickly. Chances are it's not actually Puppet at fault, but something preventing Puppet from doing the job you've asked it to do!
If you're not sure, throw that error message into Google; you'll likely find an answer!
Not all errors mean something is broken, either. An error just means Puppet was told something didn't succeed, and perhaps you need to change the order of resources or rethink what Puppet is controlling. In more complicated builds, you may have to wait for other nodes to create or do something; if you check a later Puppet run, you should see that resource succeed!
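The quick scan from the permissions example (look for "error message" first, then fall back to "failed") is easy to script when you're wading through a long message. This Python sketch is a simple keyword search under that assumption; the find_clues name, the marker list, and the context width are arbitrary choices, and the sample text is shortened and partly made up.

```python
import re


def find_clues(error_text, markers=("error message", "failed"), width=80):
    """Return snippets starting at each match of the first marker that appears.

    Markers are tried in order, so "error message" wins over "failed".
    Matching ignores case; each snippet runs from the match to `width`
    characters past it. Returns [] if no marker appears at all.
    """
    for marker in markers:
        matches = re.finditer(re.escape(marker), error_text, flags=re.IGNORECASE)
        snippets = [error_text[m.start():m.end() + width].strip() for m in matches]
        if snippets:
            return snippets
    return []


# Shortened, partly hypothetical version of the permissions error discussed earlier.
beast = ("Error: <long DSC output omitted> "
         r"Failed to set permissions for 'CONTOSO\Monkeys': User or users do not exist.")
for clue in find_clues(beast):
    print(clue)
```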
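When you do check later runs, comparing them makes it clear which errors keep coming back and which ones cleared up once, say, another node finished its part. This Python sketch assumes you've already collected the error messages from each run (oldest first) and that a repeated error has identical wording every time; real messages sometimes include timings or other changing details, so you may need to normalize them first. The classify_errors name and the sample messages are hypothetical.

```python
def classify_errors(runs):
    """Return (persistent, cleared) error sets for a list of runs, oldest first.

    persistent: errors present in every run, including the latest one.
    cleared:    errors seen in an earlier run but absent from the latest one.
    """
    if not runs:
        return set(), set()
    earlier = [set(run) for run in runs[:-1]]
    latest = set(runs[-1])
    persistent = latest.intersection(*earlier)
    cleared = set().union(*earlier) - latest
    return persistent, cleared


# Hypothetical error messages from three consecutive runs.
runs = [
    {"Error: Cluster group not found", "Error: Failed to set permissions"},
    {"Error: Failed to set permissions"},
    {"Error: Failed to set permissions"},
]
persistent, cleared = classify_errors(runs)
print("Keeps failing:", sorted(persistent))
print("Cleared up:", sorted(cleared))
```

An error in the cleared set is the kind that simply needed another node to finish first; one that stays in the persistent set is the one to take back to your Puppet code, or to --debug.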
Puppet's knowledge base
Sometimes, it's worth checking out the knowledge base Puppet keeps specifically for all of our Enterprise customers. We've seen a lot, so we've most likely seen what you're experiencing.
Take a look at Troubleshooting potential issues in Puppet Enterprise on the KB for some other tips, or simply check out How to Open a Support Ticket if you're an Enterprise customer. If not (or even if you are!), our Puppet Community Slack is full of helpful and knowledgeable people who have likely been in the same situation. Check it out!