How do .epubs work?

There are several file formats that e-books are distributed in. The one I most commonly see is EPUB, which has the .epub file extension. Let's take a look at the internals of this format.

An example EPUB

In this article, we'll be looking at a copy of Moby Dick, published by the website Planet Ebook. You can download the EPUB from this page. N.B.: check that this book is in the public domain and out of copyright in your country before downloading.

EPUBs are zip files

We can see their contents by unzipping them:

$ unzip moby-dick.epub
$ rm moby-dick.epub  # We don't need this anymore
$ tree
.
├── META-INF
│   ├── com.apple.ibooks.display-options.xml
│   ├── container.xml
│   └── encryption.xml
├── OEBPS
│   ├── Moby-Dick.xhtml
│   ├── content.opf
│   ├── cover.xhtml
│   ├── css
│   │   └── idGeneratedStyles.css
│   ├── font
│   │   ├── MinionPro-BoldDisp.otf
│   │   ├── MinionPro-CnIt.otf
│   │   ├── MinionPro-MediumDisp.otf
│   │   └── MinionPro-Regular.otf
│   ├── image
│   │   ├── 1.png
│   │   └── Moby-Dick.jpg
│   └── toc.ncx
└── mimetype
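
If you'd rather not unpack the archive, the same listing can be read programmatically. Here is a minimal sketch using Python's standard zipfile module; the function name list_epub_entries is my own, not part of any EPUB tooling:

```python
import zipfile


def list_epub_entries(epub_path):
    """Return the names of all files stored inside the EPUB, in archive order."""
    # An EPUB is an ordinary zip archive, so zipfile can open it directly.
    with zipfile.ZipFile(epub_path) as archive:
        return archive.namelist()
```

Calling list_epub_entries("moby-dick.epub") on the file from this article should return the same paths that tree printed above, such as mimetype and META-INF/container.xml.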

EPUBs are similar to websites

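We can see several file types we'd expect to see in a website: XHTML pages, a CSS stylesheet, fonts and images. The remaining files are specific to EPUBs, so let's go through them in turn.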

MIMEtype

$ cat mimetype
application/epub+zip

This file provides a more reliable way of telling that the archive is actually an EPUB than the .epub file extension. All EPUBs must have an application/epub+zip MIMEtype.
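
A reader can check this without unpacking anything. The sketch below assumes the rules from the EPUB container specification: the mimetype file must be the first entry in the zip, must be stored uncompressed, and must contain exactly application/epub+zip. The helper name has_valid_mimetype is hypothetical:

```python
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def has_valid_mimetype(epub_path):
    """Check that the archive starts with a well-formed mimetype entry."""
    with zipfile.ZipFile(epub_path) as archive:
        entries = archive.infolist()
        if not entries:
            return False
        first = entries[0]
        return (
            first.filename == "mimetype"                       # must come first
            and first.compress_type == zipfile.ZIP_STORED      # must be uncompressed
            and archive.read(first) == EPUB_MIMETYPE           # exact content
        )
```

The comparison is deliberately strict, so a mimetype file with a trailing newline is rejected. Real reading software may well be more forgiving.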

META-INF

The META-INF directory contains metadata about the book. The only required file here is container.xml. This file points to the file which contains the contents of the book:

<!-- META-INF/container.xml -->
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml" />
  </rootfiles>
</container>

The rootfile element tells us that the book's package document, which describes its contents, is stored at OEBPS/content.opf.
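
Reading software has to do this lookup itself. A rough sketch with Python's standard xml.etree.ElementTree follows; find_package_path is a name I made up, and the input is assumed to be the raw bytes of META-INF/container.xml:

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <container> element.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_package_path(container_xml):
    """Return the full-path attribute of the first <rootfile> in container.xml."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]
```

For the container.xml above, this returns OEBPS/content.opf. The path is relative to the root of the zip archive, not to the META-INF directory.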

Document layout

OEBPS/content.opf is an Open Packaging Format file. It's an XML document with a particular format used to define the contents of EPUBs:

<!-- OEBPS/content.opf -->
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <meta name="generator" content="Adobe InDesign 12.1" />
    <meta name="cover" content="x1.png" />
    <dc:title>Moby-Dick</dc:title>
    <dc:date>2018-02-20T05:18:46Z</dc:date>
    <dc:language>en-US</dc:language>
    <dc:identifier id="bookid">urn:uuid:29d919dd-24f5-4384-be78-b447c9dc299b</dc:identifier>
  </metadata>
  <manifest>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml" />
    <item id="Moby-Dick" href="Moby-Dick.xhtml" media-type="application/xhtml+xml" />
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml" />
    <item id="idGeneratedStyles.css" href="css/idGeneratedStyles.css" media-type="text/css" />
    <item id="Moby-Dick.jpg" href="image/Moby-Dick.jpg" media-type="image/jpeg" />
    <item id="x1.png" href="image/1.png" media-type="image/png" />
    <item id="MinionPro-Regular.otf" href="font/MinionPro-Regular.otf"
      media-type="application/vnd.ms-opentype" />
    <!-- other fonts -->
  </manifest>
  <spine toc="ncx">
    <itemref idref="cover" linear="no" />
    <itemref idref="Moby-Dick" />
  </spine>
  <guide>
    <reference type="cover" href="cover.xhtml" title="Cover" />
  </guide>
</package>

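Let's look at each of the sections:

metadata holds information about the book, such as its title, language and unique identifier.

manifest lists every file in the book. Each item has an id, a path relative to content.opf, and a media type.

spine defines the reading order by referring to the ids in the manifest. Its toc attribute names the manifest item that holds the table of contents.

guide points to notable parts of the book, in this case the cover.

When reading the book, you'll see the cover, defined in OEBPS/cover.xhtml, then the main content Moby-Dick, defined in OEBPS/Moby-Dick.xhtml.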
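
To make the link between the spine and the manifest concrete, here is a small sketch that resolves each spine itemref to a file path inside the archive. The function name reading_order and its two parameters are my own choices; it takes the bytes of the OPF file and the path the OPF file was found at, and it ignores the linear attribute for simplicity:

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def reading_order(opf_xml, opf_path):
    """Return archive paths of the spine documents, in reading order."""
    root = ET.fromstring(opf_xml)
    # Map each manifest id to its href (hrefs are relative to the OPF file).
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    base_dir = posixpath.dirname(opf_path)
    # Each itemref's idref points at a manifest id.
    return [
        posixpath.join(base_dir, manifest[ref.attrib["idref"]])
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
    ]
```

For the file above, with opf_path set to OEBPS/content.opf, this gives OEBPS/cover.xhtml followed by OEBPS/Moby-Dick.xhtml.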

Table of contents

The table of contents is defined in the NCX (Navigation Control file for XML) file OEBPS/toc.ncx:

<!-- OEBPS/toc.ncx -->
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN"
  "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta name="dtb:uid"
      content="urn:uuid:29d919dd-24f5-4384-be78-b447c9dc299b" />
    <meta name="dtb:depth" content="3" />
    <meta name="dtb:totalPageCount" content="0" />
    <meta name="dtb:maxPageNumber" content="0" />
  </head>
  <docTitle>
    <text>Moby-Dick</text>
  </docTitle>
  <navMap>
    <navPoint id="navpoint1" playOrder="1">
      <navLabel><text>Moby Dick </text></navLabel>
      <content src="Moby-Dick.xhtml#_idParaDest-1" />
      <navPoint id="navpoint2" playOrder="2">
        <navLabel><text>ETYMOLOGY.</text></navLabel>
        <content src="Moby-Dick.xhtml#_idParaDest-2" />
        <navPoint id="navpoint3" playOrder="3">
          <navLabel><text>Chapter 1 Loomings.</text></navLabel>
          <content src="Moby-Dick.xhtml#_idParaDest-3" />
        </navPoint>
        <!-- other chapters -->
      </navPoint>
    </navPoint>
  </navMap>
</ncx>

The content tags in this file link to the relevant places in the main text.
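
The nesting of navPoint elements is what gives the table of contents its hierarchy. As an illustration, this sketch flattens the tree into (depth, label, target) tuples; toc_entries is a hypothetical helper, and it expects the bytes of toc.ncx:

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def toc_entries(ncx_xml):
    """Return (depth, label, target) tuples for every navPoint, in document order."""
    root = ET.fromstring(ncx_xml)
    entries = []

    def walk(parent, depth):
        # Only direct children, so nested navPoints get a deeper depth value.
        for point in parent.findall("ncx:navPoint", NCX_NS):
            label = point.findtext(
                "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
            )
            target = point.find("ncx:content", NCX_NS).attrib["src"]
            entries.append((depth, label.strip(), target))
            walk(point, depth + 1)

    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is not None:
        walk(nav_map, 0)
    return entries
```

Each target is a path relative to toc.ncx, followed by a # and the id of the element to jump to.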

Cover

This is just an XHTML file which displays a full-page image:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>cover</title>
  </head>
  <body>
    <div style="text-align:center;">
      <img src="image/1.png" alt="1.png" style="max-width:100%;" />
    </div>
  </body>
</html>

Content

This is another XHTML file, which contains the actual text of the book. Chapter headings are marked with ids that match the URL fragments in OEBPS/toc.ncx.
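
One way to confirm that the table of contents and the text agree is to collect every id in the content file and look for fragments that have no match. The following is only a sketch: IdCollector and missing_fragments are names I invented, and it uses Python's lenient html.parser rather than a strict XML parser, so it won't fail on HTML entities:

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def missing_fragments(xhtml_text, fragments):
    """Return the fragments that do not match any id in the document."""
    collector = IdCollector()
    collector.feed(xhtml_text)
    collector.close()
    return [fragment for fragment in fragments if fragment not in collector.ids]
```

Here fragments would be the part after the # in each content src from toc.ncx, for example _idParaDest-3. An empty result means every entry in the table of contents has somewhere to land.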