How We Automated Our Ebook Builds With Pandoc and KindleGen
Puppet Labs is all about automation, so when we published our new continuous delivery ebook we wanted to create a workflow that could convert a Markdown file into fully featured EPUB and MobiPocket ebooks with as little manual intervention as possible.
There are just a few simple metadata files you'll need to manually edit, but in the end Pandoc does a lot of the heavy lifting to automatically generate internal navigation files and a table of contents based upon the structure of your HTML/Markdown document.
I'll start with an overview of the workflow, show what the final commands look like, and then talk about some of the files that you'll have to set up in order to add all the proper metadata. Finally, I'll talk about some helpful tools you can use to validate and test your ebook files.
Why Creating EPUB and MobiPocket Files Is Hard
The prospect of creating MobiPocket or EPUB versions of your ebook can be overwhelming. Check out Amazon's Kindle Publishing Guidelines or read through this epic ebook creation tutorial from BB ebooks, which provides a comprehensive list of all the things you need to do.
The basic workflow they suggest is that you need to convert your content to HTML. Sounds simple enough, but then you have to do a bunch of additional things including:
- Adding anchor tags to the HTML for the chapter markers and subchapter headers
- Adding tags within your markup to indicate page breaks
- Handcrafting an internal NCX navigation file
- Creating a very similar HTML version of the table of contents
- Creating a cover image & title page
- Creating a content.opf XML file that defines all sorts of additional publishing metadata, the manifest of chapters, the spine ncx navigation, and the HTML table of contents guide file.
That's a lot of grunt work and manual steps that computers should be able to help out with. Fortunately, it turns out that there's a much better and easier way to create an ebook.
First off, manually adding HTML tags to content is a pain. That's why John Gruber created Markdown: so writers could use a lightweight syntax that would convey meaning without adding opening and closing HTML tags, and without needing to see the fully-rendered presentation of the formatting. Starting from Markdown was one of our requirements.
I then started crafting some HTML from Markdown that would be converted to a Kindle-ready MobiPocket file with Amazon's KindleGen program. However, the problem with starting from a MobiPocket file is that the many additional manual steps listed above had no clear path toward automation.
I wasn't able to find a satisfying tool for creating a MobiPocket file directly. The consensus seems to be that it's better to start with an EPUB version, and then use that to generate your MobiPocket file with KindleGen.
There are a lot of automated tools out there for converting Markdown to EPUB, but the biggest issue I ran into is that many either completely ignore the process of generating an HTML table of contents and ncx navigation file, or make the process of adding these and other metadata too manual and unwieldy.
This is where Pandoc steps in to save the day.
Pandoc is emerging as the Swiss army knife of document conversions. It's really taking off in the academic and scientific community since it takes Markdown (or about a dozen other input formats) and converts it to LaTeX, HTML, EPUB, PDF and about 30 other formats.
This is about the simplest pandoc command you can run.
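As a minimal sketch, reusing the my-ebook.md and my-ebook.epub file names that appear later in this post, the call can be wrapped in a small shell function. The function name md_to_epub is hypothetical and not part of Pandoc.

```shell
# Hypothetical helper: convert one Markdown file into one EPUB file.
# Pandoc infers the output format from the .epub extension of the -o argument.
md_to_epub() {
  input="$1"    # e.g. my-ebook.md
  output="$2"   # e.g. my-ebook.epub
  pandoc -o "$output" "$input"
}

# Usage (requires pandoc on your PATH):
#   md_to_epub my-ebook.md my-ebook.epub
```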
That command converts a Markdown file to an EPUB file. As you might guess, Pandoc looks at the extensions to figure out what type of conversion you want to do.
You can also concatenate input files by just adding a space and the file name. This command combines three Markdown files into one EPUB file.
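A sketch of the multi-file form, assuming three chapter files with made-up names (chapter1.md, chapter2.md, chapter3.md) and a hypothetical wrapper function. Pandoc reads the inputs in the order given and treats them as one document.

```shell
# Hypothetical helper: the first argument is the output file, every remaining
# argument is an input file, concatenated in the order given.
combine_to_epub() {
  output="$1"
  shift
  pandoc -o "$output" "$@"
}

# Usage (the chapter file names are placeholders):
#   combine_to_epub my-ebook.epub chapter1.md chapter2.md chapter3.md
```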
The best thing about Pandoc is that it uses H1 tags to determine the break points for chapters. This means you don't have to manually split your content into individual HTML chapter files. You can also designate H2 or H3 tags as the break points with the --epub-chapter-level option. It also automatically generates the toc.ncx navigation files and HTML table of contents and adds all of the appropriate anchor tags so you don't have to manually craft these files or add any page breaks. You get all of these navigation features for free based upon the inherent HTML structure of your document.
If you want to have second- or third-level navigation within these navigation files, you can pass in the table of contents depth, and it'll look down to your H3 tags, automatically add anchor tags to your HTML, and add those headers to the table of contents and navigation.
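The two knobs described above can be sketched together. This assumes the --epub-chapter-level option named earlier and Pandoc's --toc-depth option. The wrapper function and its defaults (split on H1, table of contents depth of 3) are illustrative, and option names have changed between Pandoc releases, so check pandoc --help for your version.

```shell
# Hypothetical helper: choose where chapters split and how deep the
# table of contents goes.
build_with_levels() {
  input="$1"
  output="$2"
  chapter_level="${3:-1}"   # 1 = split on H1, 2 = split on H2, and so on
  toc_depth="${4:-3}"       # deepest header level listed in the navigation
  pandoc -o "$output" "$input" \
    --epub-chapter-level="$chapter_level" \
    --toc-depth="$toc_depth"
}

# Usage: split chapters on H2 and list headers down to H2
#   build_with_levels my-ebook.md my-ebook.epub 2 2
```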
Pandoc also has a very lightweight way to add all of the additional metadata, and even add a customized CSS file if you'd like. You can read Pandoc's comprehensive README file on GitHub to learn all the various options.
In the end, Pandoc proved to be the Holy Grail of ebook generation, so let's install Pandoc and KindleGen and take a look at the commands I ended up using.
Installing Pandoc & KindleGen
Pandoc has a lot of dependencies, so you should try to install it using a package management system or as a prebuilt binary rather than trying to compile it from source.
You can download Amazon's KindleGen tarball from here, unpack it, and place the KindleGen command somewhere where it's available in your shell's path. I placed it in
Now let's take a look at the final commands and unpack what's happening.
Pandoc & KindleGen Commands for Ebook Conversion
John MacFarlane -- author of Pandoc -- wrote up a simple blog post for how to generate an ebook. This was a great beginner's guide, but I needed to add some additional flags and a bit more information in order to create fully functional EPUB and MobiPocket files.
Here are the commands that I ran in order to generate an EPUB and MobiPocket file from Markdown.
First get an EPUB version with Pandoc:
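A sketch of what such an invocation can look like, pieced together from the arguments explained further down. The file name cover.jpg and the function name build_epub are assumptions. The --epub-stylesheet spelling belongs to the Pandoc 1.x series, and newer releases use --css for the same purpose.

```shell
# Hypothetical wrapper around the EPUB build. Arguments, in order:
#   -o my-ebook.epub     output file; the extension selects EPUB
#   title.txt            title and author lines for the title page
#   my-ebook.md          the book content, split into chapters on H1
#   --epub-cover-image   cover image (cover.jpg is a placeholder name)
#   --epub-metadata      extra Dublin Core metadata
#   --toc, --toc-depth   HTML table of contents and how deep it goes
#   --epub-stylesheet    custom CSS (Pandoc 1.x spelling)
build_epub() {
  pandoc -o my-ebook.epub \
    title.txt my-ebook.md \
    --epub-cover-image=cover.jpg \
    --epub-metadata=metadata.xml \
    --toc \
    --toc-depth=2 \
    --epub-stylesheet=stylesheet.css
}

# Usage, from the directory that holds the input files:
#   build_epub
```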
Then you can feed that into KindleGen to generate a MobiPocket file:
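A minimal sketch of that step, assuming the kindlegen binary is on your PATH. The wrapper name is hypothetical. KindleGen writes the .mobi file next to the input, using the same base name.

```shell
# Hypothetical helper: hand an EPUB file to KindleGen.
# For my-ebook.epub the result is my-ebook.mobi in the same directory.
epub_to_mobi() {
  kindlegen "$1"
}

# Usage:
#   epub_to_mobi my-ebook.epub
```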
That's it! Done! You should now have both an EPUB file and a my-ebook.mobi file to start testing.
Let's unpack what's happening with each argument we used:
- The -o my-ebook.epub output option tells pandoc to generate an EPUB file
- The title.txt file creates a title page and some title and author metadata. You can optionally put this information at the beginning of your first input document instead.
- The my-ebook.md file is the actual ebook content in Markdown format. It contains multiple H1 tags, which will get split into different chapters. You can concatenate more files here if you need to add chapters. You can also pass HTML as input or add URLs here.
- The cover image option creates a cover image file and embeds the proper metadata
- The metadata.xml file lets you define additional Dublin Core tags about the file, including published date, author, publisher, rights, and language. Some of this is pulled from the title page, such as title, author & published date.
- The table of contents option tells Pandoc to add an HTML table of contents to the beginning of the EPUB file. The machine navigation and toc.ncx file are generated automatically, but I found that the HTML version is not always included.
- The table of contents depth option tells Pandoc to look at the H2 tags and add them as secondary navigation links within both the navigation file and the HTML index. By default, it will look at a depth of 3.
- The stylesheet option is one you can include if you want to make any CSS changes or tweaks
Metadata Files for Pandoc
Let's look at what's in a couple of these metadata files that we passed into Pandoc.
The title.txt file is just two lines containing the book title and the author. This generates the title page as well as some additional metadata:
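As an illustration, the file can be written from the shell. The title and author values are placeholders. The leading percent signs follow Pandoc's title block convention, with the title on the first line and the author on the second.

```shell
# Write a two-line Pandoc title block (placeholder title and author).
cat > title.txt <<'EOF'
% My Ebook Title
% Author Name
EOF
```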
The metadata.xml file adds additional metadata to the file according to the Dublin Core Metadata Element Set:
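A sketch of such a file, written from the shell. Every value below is a placeholder to replace with your own. The file is a plain sequence of Dublin Core elements with no wrapping root element.

```shell
# Write placeholder Dublin Core metadata for Pandoc's EPUB metadata option.
cat > metadata.xml <<'EOF'
<dc:rights>Copyright Example Publisher</dc:rights>
<dc:language>en-US</dc:language>
<dc:publisher>Example Publisher</dc:publisher>
<dc:subject>Example subject</dc:subject>
EOF
```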
Finally, you can make additional design tweaks by altering the default EPUB CSS stylesheet. What I did was create a simple prototype EPUB document with pandoc -o my-ebook.epub my-ebook.md.
To look at the guts of an EPUB file, you can change the .epub extension to .zip. An EPUB file is technically just a compressed folder containing a number of HTML files, images, and a number of other XML manifest and metadata files.
I ended up needing to use the B1FreeArchiver to extract the zip file, since the default Mac Archive Utility got into a recursive loop of unpacking it to a *.zip.cpgz and then back into a *.zip file. Other alternatives to solving this are here.
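If the command-line unzip tool is available, it offers another route that skips the rename entirely. This is a sketch under that assumption. The function name and the default epub-src directory are made up for illustration.

```shell
# Hypothetical helper: unpack an EPUB into a directory for inspection.
# -o overwrites existing files, -q keeps output quiet, -d sets the target.
unpack_epub() {
  epub="$1"
  dest="${2:-epub-src}"
  mkdir -p "$dest"
  unzip -o -q "$epub" -d "$dest"
}

# Usage:
#   unpack_epub my-ebook.epub epub-src
```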
You'll see a stylesheet.css in the unzipped folder that you can copy back into the base directory where you're executing the conversion. Feel free to add any CSS styling changes here and be sure to set it with the
So after you've generated some initial EPUB and MobiPocket files, you're going to want to test them and iterate on formatting and design.
A Sample Markdown File
If you want a fun sample Markdown file to convert into an EPUB and MobiPocket format, then download Pandoc's README & user guide to a file named README.md.
You can run the following command to generate an EPUB file.
You'll notice that the ten H1 headers were converted to chapters and that the default TOC depth is three. Also note that the title page is generated from the first three lines instead of requiring a separate title file.
Previewing Your EPUB & MobiPocket Files
Essential tools for previewing your ebook files include:
- Adobe's free Digital Editions EPUB reader
- Amazon's Kindle Previewer, which allows you to preview what the MobiPocket file looks like on all of the variants of their e-ink devices, Kindle Fires and Kindle for iOS readers.
Note that these are just emulators: It's good to take your new files for a spin on the actual devices, since emulators don't always render the ebook true to how it would appear on the emulated device.
A good tool I used to confirm that all of the metadata was properly added was Calibre. Calibre is an ebook conversion program with a GUI and some great features to confirm that the metadata you entered into metadata.xml was saved to the file properly.
Validating & Debugging your EPUB & MobiPocket files
It's also good to run your EPUB file through a validator to be sure you don't have any broken links or other warnings or errors. You can upload your EPUB file to the International Digital Publishing Forum website, and it'll run the EpubCheck validator on it.
If you'd prefer to run it locally, then you can download EpubCheck from their releases and run the following command after unzipping it:
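A sketch of the local run, assuming Java is installed. EpubCheck ships as a jar file whose name includes its version number, so the jar path is passed in rather than hard-coded. The function name is hypothetical.

```shell
# Hypothetical helper: run EpubCheck against an EPUB file.
validate_epub() {
  jar="$1"    # path to the epubcheck jar inside the unzipped release
  epub="$2"   # e.g. my-ebook.epub
  java -jar "$jar" "$epub"
}

# Usage (substitute the real jar path from your download):
#   validate_epub path/to/epubcheck.jar my-ebook.epub
```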
If you're curious about what the source HTML looks like in an EPUB, then you can simply change the *.epub extension to *.zip and unarchive the folder as mentioned above.
You can navigate to the Mobi_Unpack_v047 folder from the terminal, and then run
Select your MobiPocket input file and an appropriate output directory, and then hit "Start." It'll unpack your MobiPocket file into a mobi7 folder with the source files for e-ink readers and a mobi8 folder with the KF8 files used for the Kindle Fire.
If you're getting some validation errors, then BB Ebooks has an excellent troubleshooting section in their how-to write up.
Learning Pandoc's Markup
There are many different flavors and implementations of Markdown syntax. Pandoc uses its own special flavor of Markdown.
The Pandoc author wrote a neat online tool called Babelmark 2 that allows you to compare and contrast how different markup syntaxes will render into HTML.
For example, I wanted to find which Markdown variant provided the easiest table syntax. I ended up settling on this pipe table syntax.
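For illustration, here is a small pipe table saved to a scratch file. The file name table-sample.md and the cell contents are invented for this example. Colons in the separator row set column alignment: left, centered, and right.

```shell
# Write a sample Pandoc pipe table to a scratch Markdown file.
cat > table-sample.md <<'EOF'
| Format     | Extension | Built with |
|:-----------|:---------:|-----------:|
| EPUB       | .epub     | Pandoc     |
| MobiPocket | .mobi     | KindleGen  |
EOF

# To preview it as HTML (requires pandoc):
#   pandoc -o table-sample.html table-sample.md
```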
For more details on the nuances of the Pandoc syntax, take a look at the "Pandoc's markdown" section of Pandoc's README.
You can also test out conversion snippets on John MacFarlane's Try pandoc! page.
I enjoyed using the Mou markdown editor in order to get bootstrapped on Markdown because it has a nice live preview. However, it doesn't use Pandoc's Markdown syntax, and can get slow if the file starts to get too big.
There is a nice browser-based, markdown live editor that is compliant with Pandoc's syntax called Markx. They have a hosted version at http://markx.herokuapp.com that is helpful for getting a little bit more interactive experience for how the Markdown will render to HTML.
There was also an alpha-quality app called Kalam that uses node-webkit to create a live preview of Pandoc syntax. I found it to be a bit buggy, and limited as an interactive editor, but helpful as another Pandoc live preview option.
Images, Design & CSS Media Queries
Pandoc and KindleGen are powerful tools that get you 80-90 percent of the way there in creating professional looking ebooks. The last 10-20 percent is a lot of iteration on formatting, styling and special CSS rules, and even customized fonts with the
One issue that I ran into is making sure that images were the best size on all platforms. What may have looked okay on an e-ink reader looked way too small on the Kindle Fire. One way that Amazon suggests dealing with this is to use media queries.
Chapter 8 of the Amazon Kindle Publishing Guide talks about how to target the Kindle Fire KF8 or older e-ink MobiPocket files.
For KF8 CSS styles, use the media query @media amzn-kf8. This is only applied for the KF8 format.
For Mobi CSS styles, use the media query @media amzn-mobi. This is only applied for the Mobi format.
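A sketch of how the two queries might be used for the image sizing problem. The file name kindle-tweaks.css, the img.figure selector, and the width values are all assumptions to tune on real devices. Which CSS properties each format honors should be checked against Amazon's guidelines.

```shell
# Write placeholder Kindle-specific image rules to a scratch stylesheet.
cat > kindle-tweaks.css <<'EOF'
/* Applied only when the KF8 format is built (Kindle Fire) */
@media amzn-kf8 {
  img.figure { width: 80%; }
}

/* Applied only when the older Mobi format is built (e-ink readers) */
@media amzn-mobi {
  img.figure { width: 100%; }
}
EOF
```

The rules can then be copied into the stylesheet.css file that you pass to Pandoc.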
There are also some sample files listed on the sidebar that you can download with different Kindle-specific formatting tips. The KF8Sample is particularly instructive for all of the formatting options that are available in the new KF8 format.
We're excited about this first iteration of an ebook generation workflow. If you'd like to see how our conversions came out, then be sure to download our free Continuous Delivery ebook