Help us help you with content usage telemetry
Let's rip off the bandaid and get the bad news out there first: we're rolling out telemetry for Puppet content. Read on to find out why I think that's actually good news for you, how you can see exactly what data it collects, and how to make sure it never runs if your corporate policy doesn't allow it. And maybe a free beanie if you choose to opt in?
Why we decided to roll out a telemetry client
I'll start with the problems we're trying to solve. You've probably seen me in our Slack bugging people about which modules they use and how they choose between modules on the Forge. My hardest job right now is deciding which modules to focus development efforts on, and let me tell you, it currently feels a lot like reading tea leaves! We have support tickets filed by customers, JIRA tickets and pull requests filed by users, and ongoing efforts from opinionated people on Slack. This means that our feedback mechanisms are heavily weighted towards the people willing to make the most noise, so to speak. You shouldn't have to do that for your needs to be heard.
That sounds like a me problem, and it totally is. But if I can't effectively marshal development time for the modules you actually use, then it becomes a you problem too.
The other problem we are addressing is how users like yourself choose among modules on the Forge. Right now the most relevant ranking factor is download count. But that number is heavily skewed by mirroring robots and CI pipelines and even simply the age of the module itself. In the end, new modules—even if they're high quality—end up being drowned out by the existing heavy hitters, and that makes our ecosystem appear to stagnate. Three modules on the first page of the "most relevant modules" list haven't had a release in more than five years! Ultimately, we understand that it's harder for you to find high-quality modules than it should be.
We're addressing these issues with a telemetry client bundled with Puppet Server as of version 7.5.0. If you opt in, then once a week it will send us some information about the public Forge content that you're using in your infrastructure.
I'm sure the biggest question on your mind right now is: what data are we collecting, and how identifiable is it? That's perfectly reasonable; we've seen story after story in the news about companies being quite unscrupulous with their data collection and retention.
I want to assure you of three things:
- You can see all the data collected before you even choose to enable the system.
- You are in full control of which data your Puppet Servers do or do not submit.
- The data we collect is not associated with individuals and is fully aggregated before anyone can access it.
As a side benefit of the design, you can even use the client to gather useful information about the content usage of your infrastructure—even if you choose not to share it with us.
How it works
When the client gathers information, it cross-references against content published on the Forge and ignores any modules you've developed internally. The only thing it inspects in your own code is which public Forge modules are used by your profiles or internal modules. But don't take our word for it; if you feel so inclined, you can see exactly what is collected by reading through the source for each metric it runs.
If you trust the tool to tell you what it will collect, you can actually ask it directly. Running the command puppetserver dropsonde list in the terminal will describe all the loaded metrics plugins and what they do. Note that depending on your $PATH, you may have to use the full path, just like any other time you run puppetserver commands. That would make the full command /opt/puppetlabs/bin/puppetserver dropsonde list, and other commands in this post will follow that pattern.
Then, to see exactly what data is collected, run puppetserver dropsonde preview. The report is rendered in human-readable form, but every bit of data that makes it to our telemetry pipeline is represented there. (Hint: use --format=json if you want to use this data in your own tooling.)
If you'd like to omit any of the metrics listed in the report, you can use the Puppet module to add them to the disable list.
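For illustration, a sketch of what that classification might look like in a Puppet manifest. The class name and the disabled_metrics parameter here are assumptions made for the example, so check the module's README for its actual interface:

```puppet
# Hypothetical sketch: classify your primary server with the Dropsonde
# module and list the metrics you do not want reported. The parameter
# name below is an assumption; consult the module documentation.
class { 'dropsonde':
  disabled_metrics => ['dependencies', 'platforms'],
}
```

After the next agent run on the primary server, the disabled metrics should no longer appear in the preview output.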
Public data aggregation
As we were designing and building this system, it occurred to us that we weren't the only ones with prioritization challenges. Surely other developers of Puppet modules would want to know how their code was being used, or which platforms it was being used on. So we designed an aggregation pipeline such that all the usage data available to our own systems and engineers could be made fully public so that you could use it too. All you need is a Google Cloud account to access our public BigQuery dataset and start writing tools that use that information.
Submitting usage data
Today, the telemetry is opt-in, but I hope that I've convinced you to submit at least one content usage report. As a bonus freebie, we'll be sending fabulous Puppet beanies to some of the first people to submit reports, so don't delay. Start by double-checking the data it will send with puppetserver dropsonde preview in your terminal. If you're okay with the data it collects, then enable reporting, either by installing the module and classifying your primary server, or manually by editing the /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf file on your primary server and adding or updating the following clause. Don't forget to restart the puppetserver service afterward so the change takes effect.
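As a minimal sketch, the clause would look something like the following HOCON fragment; the dropsonde setting name and enabled key are assumptions for this example, so verify them against the Dropsonde documentation:

```hocon
# Hypothetical opt-in clause for puppetserver.conf (HOCON format).
# The "dropsonde" setting name is an assumption; confirm it in the
# Dropsonde documentation before relying on it.
dropsonde: {
    enabled: true
}
```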
Then finally, validate your work by submitting a report with puppetserver dropsonde submit. If it prints out a URL, go claim your beanie!
We'll be switching the model to informed opt-out in Puppet 8. In the meantime, I hope you'll help us out by sharing your public content usage data.
Ben is the product manager for Forge and Ecosystem at Puppet.