Digitizing books to DjVu with free tools

July 23, 2009

While I love books (enough that I’ve written one myself), they’re often cumbersome to work with: finding things without a good index is very difficult, you can rarely take more than a few with you at a time, and if it’s a particularly nice, expensive or rare one, you’d rather leave it on the shelf altogether.

The answer is, of course, to create a digital copy. One possible format for that would be PDF. The problem is that for its image data it has to resort to conventional compression algorithms, so scanned documents can turn out to be quite large. A file format that’s much better suited for this is DjVu. One of its tricks is a lossy algorithm that recognizes recurring shapes such as characters. As a result, DjVu-encoded books are typically a quarter of the size of PDF-encoded books.

Given the need to digitize a couple of books at work, I investigated whether it’s possible to create high-quality digital copies using freely available tools.

It’s not as easy as it sounds

If you already have a high-quality PDF document or a series of scanned images, there are a number of ways for you to end up with a decent DjVu document. The manual one involves calling the cjb2 command line tool from the DjVuLibre project; a more automated one goes through the pdf2djvu tool. The problem is that when you scan a book, you rarely have high-quality scans to begin with. You typically have something like this:

Raw scan output

Fortunately there’s an excellent tool that can help here. It’s called unpaper and what it does is, among other things, remove the ugly black borders and other noise, rotate the pages and split double pages in half. It works with PNM images, so if your scanning program spits out a PDF, simply use ImageMagick to do the conversion. On a large document it’s most memory-efficient to make individual calls to the convert program, one per page:

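# ImageMagick counts PDF pages from 0, hence the expr.
# The quotes keep the shell from treating [..] as a glob pattern.
for i in `seq 1 $NUMBER_OF_PAGES`; do
 convert -density 600 "scan.pdf[`expr $i - 1`]" pages`printf %03d $i`.pbm
done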

This converts page N of the PDF to pages00N.pbm. Now unpaper can be invoked with the necessary options:

unpaper -v --layout double --pre-rotate -90 --output-pages 2 \
 pages%03d.pbm singlepages%03d.pbm

The result is a set of separate images called singlepages00N.pbm that are nicely cleaned up:

Single page (left), single page (right)
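
The file naming across these two steps can be sketched in Python. This is only an illustration of the bookkeeping, not part of the original workflow: the helper names convert_command and single_pages_for are made up here, and the assumption that unpaper numbers its output pages consecutively from 1 (so that double page N yields single pages 2N-1 and 2N) should be checked against your unpaper version.

```python
def convert_command(pdf, page, density=600):
    """Build the ImageMagick call that extracts one PDF page as PBM.

    `page` is 1-based, like the loop counter in the shell snippet;
    ImageMagick's [index] suffix is 0-based, hence the `page - 1`.
    """
    return [
        "convert", "-density", str(density),
        "%s[%d]" % (pdf, page - 1),
        "pages%03d.pbm" % page,
    ]


def single_pages_for(sheet):
    """Names of the two single pages expected from double page `sheet`.

    Assumption: unpaper counts its output pages consecutively from 1.
    """
    left = 2 * sheet - 1
    return ["singlepages%03d.pbm" % left, "singlepages%03d.pbm" % (left + 1)]


if __name__ == "__main__":
    print(" ".join(convert_command("scan.pdf", 1)))
    print(single_pages_for(3))
```

Passing the argument list to subprocess instead of a shell string also sidesteps any quoting issues with the square brackets.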

Extracting the text

At this point you might think that we’re done, given that cjb2 can easily convert the resulting pages to DjVu and djvm can create a multi-page document from them. However, the result wouldn’t be searchable for text, which is one of the main reasons to digitize in the first place.

The obvious solution is to apply some OCR technology. There are several free OCR tools available: tesseract, GOCR and ocropus. They all work more or less well, but ocropus has a trick up its sleeve: it can not only extract the text from an image but also annotate the text with pixel coordinates. This means that a text search in a DjVu viewer will not only navigate to the right page but also to the right line (unfortunately, ocropus can’t resolve individual words, just lines). Installing ocropus on OS X is a bit of a pain in the neck, but if you follow these instructions to the letter, it works. The following commands will then perform the OCR analysis:

ocropus book2pages outdir singlepages*.pbm
ocropus pages2lines outdir
ocropus lines2fsts outdir
ocropus fsts2text outdir
ocropus buildhtml outdir > hocr.html

As you can see, the result is an HTML file in the hOCR format. It contains the text gathered by ocropus in <span> elements, annotated with pixel information. In order to apply this information to DjVu documents, it needs to be transformed into a format that the DjVuLibre tools, specifically the djvused tool, understand. To do that, I hacked a little Python script together:

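import os.path
import sys
from xml.etree import ElementTree

from PIL import Image

hocrfile = sys.argv[1]
imgfiles = sys.argv[2:]

et = ElementTree.parse(hocrfile)
for page in et.iter('div'):
    # Only <div class="ocr_page"> elements describe a page.
    if page.get('class') != 'ocr_page':
        continue

    # Pages are matched to the image files purely by order.
    if not imgfiles:
        continue
    imgfile = imgfiles.pop(0)

    # The page dimensions are taken from the image itself.
    with Image.open(imgfile) as image:
        imgwidth, imgheight = image.size

    txtfile = os.path.splitext(imgfile)[0] + '.txt'
    with open(txtfile, 'w', encoding='utf-8') as out:
        print("(page 0 0 %s %s" % (imgwidth, imgheight), file=out)

        for line in page:
            # Skip child elements that carry no bounding box.
            linetitle = line.get('title') or ''
            if not linetitle.startswith('bbox '):
                continue
            x0, y0, x1, y1 = [int(x) for x in linetitle[5:].split()]
            # hOCR measures y from the top edge, DjVu from the bottom edge,
            # so flip the values and swap them to keep ymin below ymax.
            ymin = imgheight - y1
            ymax = imgheight - y0

            # Escape backslashes and double quotes for djvused.
            text = (line.text or '').strip()
            text = text.replace('\\', '\\\\').replace('"', '\\"')
            print('  (line %s %s %s %s "%s")' % (x0, ymin, x1, ymax, text),
                  file=out)

        print(")", file=out)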

It’s evidently very crude and makes lots of assumptions specific to the ocropus output. For it to work you need PIL (nowadays distributed as Pillow) installed, along with ElementTree, which ships with Python 2.5 and later as xml.etree.ElementTree. The script is invoked like so:

python hocl2djvu.py hocr.html singlepages*.pbm

It will spit out a singlepages00N.txt file for every page it finds text information for.
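
The coordinate conversion at the heart of the script can be isolated as follows. This is a sketch under two assumptions: that hOCR bounding boxes are measured from the top left corner of the page image, and that djvused expects text zones as xmin ymin xmax ymax measured from the bottom left corner, as its documentation describes. The function names hocr_bbox_to_djvu and djvu_line are hypothetical.

```python
def hocr_bbox_to_djvu(bbox, page_height):
    """Convert an hOCR bbox (origin top left) to a DjVu zone box.

    hOCR gives (x0, y0, x1, y1) with y growing downwards; DjVu text
    zones are (xmin, ymin, xmax, ymax) with y growing upwards.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def djvu_line(bbox, page_height, text):
    """Format one (line ...) entry for a djvused text file."""
    # Backslashes first, then quotes, so the escapes are not doubled.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    box = hocr_bbox_to_djvu(bbox, page_height)
    return '  (line %d %d %d %d "%s")' % (box + (escaped,))


if __name__ == "__main__":
    # A line near the top of a 3000 pixel high page.
    print(djvu_line((100, 200, 900, 250), 3000, 'He said "hi"'))
```

Keeping the flip in one small function makes it easy to verify against a viewer: if search highlights appear mirrored vertically, this is the place to look.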

Putting it all together

Finally the image files for the individual pages can be converted to individual DjVu files:

for i in singlepages*pbm; do
    cjb2 -clean $i `basename $i pbm`djvu
done

Before combining the pages into a compound document, the djvused tool can be used to apply the text annotations:

for i in singlepages*txt; do
    djvused `basename $i txt`djvu -e "select 1; set-txt $i" -s
done

Lastly, the following command creates the resulting book file:

djvm -c book.djvu singlepages*.djvu

And here’s what the result looks like:

DjVu text search
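
For larger jobs, the three steps of this section could be driven from a single Python script. The following is only a sketch: build_commands and its has_text parameter are hypothetical names, and it uses nothing beyond the cjb2, djvused and djvm invocations already shown above. It also assumes file names without spaces, since the text file name is embedded in the djvused command string.

```python
import glob
import os.path
import subprocess


def build_commands(page_images, book="book.djvu", has_text=os.path.exists):
    """Return the cjb2, djvused and djvm calls as argument lists.

    `has_text` decides whether a page has a matching .txt file; it
    defaults to a check on disk and can be replaced for testing.
    """
    commands = []
    djvu_files = []
    for image in page_images:
        stem = os.path.splitext(image)[0]
        djvu = stem + ".djvu"
        txt = stem + ".txt"
        djvu_files.append(djvu)
        # Encode the bitonal page image.
        commands.append(["cjb2", "-clean", image, djvu])
        # Attach the text layer only where the OCR step produced one.
        if has_text(txt):
            commands.append(
                ["djvused", djvu, "-e", "select 1; set-txt %s" % txt, "-s"])
    # Bundle all pages into the final multi-page document.
    commands.append(["djvm", "-c", book] + djvu_files)
    return commands


if __name__ == "__main__":
    pages = sorted(glob.glob("singlepages*.pbm"))
    if pages:
        for command in build_commands(pages):
            subprocess.check_call(command)
```

Building the commands separately from running them makes it possible to print and inspect them before anything touches the files.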

Conclusion

Digitizing books with free tools is entirely possible. Some rough edges remain, however. For instance, the unpaper program isn’t completely reliable. I haven’t fiddled with the settings yet, though, so perhaps the output can be improved. The same goes for the OCR machinery, which still produces lots of erroneous words. Also, it’d be nice if the pixel annotations worked for individual words, too (like on Google Book Search). Perhaps a linear approximation could work; it certainly seems feasible for monospace fonts.
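
To make the linear approximation idea concrete, here is a rough, untested-on-real-scans sketch. It assumes that every character in a line, spaces included, is equally wide, so it can only be expected to work for monospace fonts. The function name approximate_word_boxes is made up for this illustration.

```python
def approximate_word_boxes(text, bbox):
    """Split a line's bounding box into per-word boxes.

    Each word gets a horizontal slice of the line box proportional to
    its character offsets; the vertical extent is kept unchanged.
    """
    x0, y0, x1, y1 = bbox
    if not text:
        return []
    char_width = (x1 - x0) / float(len(text))
    boxes = []
    pos = 0  # character offset of the current word within the line
    for word in text.split(" "):
        if word:
            start = x0 + int(round(pos * char_width))
            end = x0 + int(round((pos + len(word)) * char_width))
            boxes.append((word, (start, y0, end, y1)))
        # Advance past the word and the single space that followed it.
        pos += len(word) + 1
    return boxes


if __name__ == "__main__":
    for word, box in approximate_word_boxes("free DjVu tools", (100, 40, 400, 60)):
        print(word, box)
```

The resulting boxes could then be written out as nested word entries inside each line, provided the coordinates are converted to DjVu’s bottom-left origin first.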


4 Responses to “Digitizing books to DjVu with free tools”


  1. [...] Digitizing books to DjVu with free tools « philiKON – a journal [...]

  2. bob Says:

    There is also Cuneiform, which is a good free OCR. If you want PDF instead of DjVu, check out exactcode’s hocr to PDF converter.

  3. Perico Says:

    Converting scanned images to PDF and then back to images is pointless and will reduce quality a lot. A free (though not open source) alternative to unpaper is the Win32 program ScanKromsator. I’ve never used unpaper, so I cannot compare, but ScanKromsator can do almost anything to clean, split, crop and prepare text and images.

    This document doesn’t cover color or mixed-content books with text and image data, which will make things much more complicated.

    Finally, cjb2 is a JB2 compressor which does wonderful work in lossless mode, but it is not very efficient when it comes to lossy compression. Apart from that, it won’t be able to produce multiple page compression using a shared shape dictionary, which is the main advantage of the DjVu technologies. Alternatively, the open source minidjvu will do just that: encode multiple pages with a shared dictionary, using lossy compression. The compression ratios will be near to real formatted text, and you’ll be able to encode a 60 MB PDF in less than 2 MB. Typically, a 300 page b/w text book will be slightly over 1 MB. Isn’t this wonderful?

