Digitizing books to DjVu with free tools
July 23, 2009
While I love books (enough that I’ve written one myself), they’re often cumbersome to work with: finding things without a good index is very difficult, you can rarely take more than a few with you at a time, and if it’s a particularly nice/expensive/rare one, you’d rather leave it on the shelf altogether.
The answer is, of course, to create a digital copy. One possible format for that would be PDF. The problem is that for its image data PDF has to resort to conventional compression algorithms, which means that scanned documents can turn out to be quite large. A file format that’s much better suited for this is DjVu. One of its tricks is a lossy algorithm that recognizes recurring shapes such as characters. As a result, DjVu-encoded books are typically a quarter of the size of PDF-encoded books.
Given the need to digitize a couple of books at work, I investigated whether it’s possible to create high-quality digital copies using freely available tools.
It’s not as easy as it sounds
If you already have a high-quality PDF document or a series of scanned images, there are a number of ways for you to end up with a decent DjVu document. The manual one involves calling the cjb2 command line tool from the DjVuLibre project; a more automated one would be through the pdf2djvu tool (sketched below). Problem is, when you scan a book, you rarely have high-quality scans to begin with. You typically have something like this:
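As an aside, the pdf2djvu route mentioned above boils down to a single call (the file names here are just placeholders):

pdf2djvu -o book.djvu clean-scan.pdf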
Fortunately there’s an excellent tool that can help here. It’s called unpaper and what it does is, among other things, remove the ugly black borders and other noise, rotate the pages and split double pages in half. It works with PNM-type images, so if your scanning program spits out a PDF, simply use ImageMagick to do the conversion. On a large document it’s most memory-efficient to make individual calls to the convert program, one per page:
for i in `seq 1 $NUMBER_OF_PAGES`; do
    convert -density 600 scan.pdf[`expr $i - 1`] pages`printf %03d $i`.pbm
done
This converts page N of the PDF to
pages00N.pbm. Now unpaper can be invoked with the necessary options:
unpaper -v --layout double --pre-rotate -90 --output-pages 2 \
    pages%03d.pbm singlepages%03d.pbm
The result is separate images called singlepages00N.pbm that are nicely cleaned up:
Extracting the text
At this point you might think that we’re done, given that
cjb2 can easily convert the resulting pages to DjVu and
djvm can create a multi-page document from them. However, the result wouldn’t be searchable for text, which is one of the reasons one would want to digitize in the first place.
The solution here obviously is to apply some OCR technology. There are several free OCR tools available: tesseract, GOCR and ocropus. They all work more or less well, but ocropus has a trick up its sleeve: it can not only extract the text from an image but also annotate the text with pixel coordinates. This means that a text search in a DjVu viewer will not only navigate to the right page but also to the right line (unfortunately, ocropus can’t resolve individual words, just lines). Installing ocropus on OS X is a bit of a pain in the neck, but if you follow these instructions to the letter, it works. The following commands will then perform the OCR analysis:
ocropus book2pages outdir singlepages*.pbm
ocropus pages2lines outdir
ocropus lines2fsts outdir
ocropus fsts2text outdir
ocropus buildhtml outdir > hocr.html
As you can see, the result is an HTML file in the hOCR format. It contains the text gathered by ocropus in
<span> elements, annotated with pixel information. In order to apply this information to DjVu documents, it needs to be transformed into a format that the DjVuLibre tools, specifically the
djvused tool, understand. To do that, I hacked a little Python script together:
import sys
import os.path
from elementtree import ElementTree
from PIL import Image

hocrfile = sys.argv[1]
imgfiles = sys.argv[2:]

et = ElementTree.parse(hocrfile)
for page in et.getiterator('div'):
    if page.get('class') != 'ocr_page':
        continue
    if not imgfiles:
        continue
    imgfile = imgfiles.pop(0)
    txtfile = os.path.splitext(imgfile)[0] + '.txt'
    out = open(txtfile, 'w')
    image = Image.open(imgfile)
    print >>out, "(page 0 0 %s %s" % image.size
    for line in page:
        linetitle = line.get('title')
        if not linetitle or not linetitle.startswith('bbox '):
            continue
        x0, y0, x1, y1 = [int(x) for x in linetitle[5:].split()]
        # hOCR counts pixels from the top, DjVu from the bottom
        imgheight = image.size[1]
        y0 = imgheight - y0
        y1 = imgheight - y1
        text = line.text.strip().replace('"', '\\"')
        print >>out, ' (line %s %s %s %s "%s")' % (x0, y0, x1, y1, text)
    print >>out, ")"
    out.close()
It’s evidently very crude and makes lots of assumptions specific to the ocropus output. For it to work you need the optional but fairly standard PIL and ElementTree packages installed. The script is invoked like so:
python hocr2djvu.py hocr.html singlepages*.pbm
It will spit out a
singlepages00N.txt file for every page it finds text information for.
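For illustration, such a file contains the hidden-text s-expressions that djvused expects, roughly along these lines (page size, coordinates and text are made up):

(page 0 0 2480 3508
  (line 210 3263 2270 3308 "Chapter 1")
  (line 210 3153 2270 3198 "It was a dark and stormy night.")
)

The coordinates are in pixels with the origin in the lower-left corner, which is why the script flips the y values it gets from hOCR.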
Putting it all together
Finally the image files for the individual pages can be converted to individual DjVu files:
for i in singlepages*pbm; do
    cjb2 -clean $i `basename $i pbm`djvu
done
Before combining the pages to a compound document, the
djvused tool can then be used to apply the text annotations:
for i in singlepages*txt; do
    djvused `basename $i txt`djvu -e "select 1; set-txt $i" -s
done
Lastly, the following command creates the resulting book file:
djvm -c book.djvu singlepages*.djvu
And here’s what the result looks like:
So, digitizing books with free tools is certainly possible. Some rough edges remain, however. For instance, the unpaper program isn’t completely reliable. I haven’t fiddled with its settings yet, though, so perhaps the output can be improved. The same goes for the OCR machinery, which still produces lots of erroneous words. Also, it’d be nice if the pixel annotations worked for individual words, too (like on Google Book Search). Perhaps a linear approximation could do the trick; it certainly seems feasible for monospace fonts.
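To sketch what that linear approximation might look like, here is a small function (not part of the pipeline above, just a guess at how per-word boxes could be derived from a line box by splitting it in proportion to character counts):

# Rough sketch: split a line bounding box into per-word boxes in proportion
# to the number of characters in each word, counting one character for each
# inter-word gap. This is guesswork for proportional fonts, but should be
# fairly accurate for monospace ones.
def split_line_bbox(x0, y0, x1, y1, text):
    words = text.split()
    total_chars = sum(len(w) for w in words) + max(len(words) - 1, 0)
    if not words or total_chars == 0:
        return []
    char_width = float(x1 - x0) / total_chars
    boxes = []
    pos = 0
    for word in words:
        wx0 = x0 + int(pos * char_width)
        wx1 = x0 + int((pos + len(word)) * char_width)
        boxes.append((wx0, y0, wx1, y1, word))
        pos += len(word) + 1  # skip past the word and one space
    return boxes

The resulting tuples could then be written out as word entries nested inside each line entry of the djvused text.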