Jamel Ostwald's
Gateway to the Early Modern World

Using Optical Character Recognition (OCR) Software

In theory you should of course transcribe primary documents as faithfully and as completely as possible, but I certainly don't have enough time or patience to copy in its entirety every document that I consult in the archives (I'll leave that to librarians, archivists and editors). However, you can speed up data entry AND copy the entire text of printed sources by using Optical Character Recognition (OCR) software. OCR software scans in a page of text as an image and then converts that image into actual text, which can be edited and, more importantly, searched in text-based software. This allows you to search for text that you haven't already coded, as is explained in #4 on the Database page.
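To make the point about searching concrete, here is a minimal sketch of a keyword search over recognized text. It assumes each OCR'd page has been saved as plain text and loaded into a dictionary keyed by a page label; the labels, the sample sentences and the `search_pages` helper are all hypothetical and are not part of FineReader, Word or Access.

```python
import re


def search_pages(pages, term, context=30):
    """Return (page_label, snippet) pairs for every case-insensitive hit."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for label, text in pages.items():
        for match in pattern.finditer(text):
            # Keep a little text on either side so the hit can be judged in context.
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            hits.append((label, text[start:end].strip()))
    return hits


# Hypothetical OCR output, one entry per scanned page.
pages = {
    "vol1-p12": "The garrison surrendered the citadel after a siege of six weeks.",
    "vol1-p13": "Provisions ran short, and the Siege was lifted in October.",
}

for label, snippet in search_pages(pages, "siege"):
    print(label, "->", snippet)
```

A term that was never assigned a code or keyword during note-taking can still be found this way, because the whole text of the page is available rather than only the parts that were transcribed by hand.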

I use Abbyy's FineReader 4.0 (recommended by a number of librarians, archivists and researchers on the history-digitization listserv), which can read over 50 different languages thanks to its dictionaries and is very accurate, depending on the quality of the original source and how you customize it. My informal comparisons with OmniPage Pro indicate that FineReader is much, much better at recognizing text, and it has won numerous awards around the world. I have over 50 volumes of printed primary sources that I am slowly scanning in; the resulting documents are then imported into Word and from there into Access.

There are a number of tricks to speed up the process: always place the page in the same location on the flatbed, skip the preview of each image if possible, and scan two pages at a time if they fit on the scanner (a legal-size scanning bed might be useful here). Using these tricks, my old 300 MHz, 64 MB RAM PC with an inexpensive Visioneer PaperPort OneTouch 5300 flatbed scanner on a slow parallel connection (I would NOT recommend this scanner!) takes one minute to scan each page of text at 300 dpi grayscale, one to two minutes for the OCR software to convert the scanned page image into editable text (my 475 MHz, 64 MB laptop quadruples the speed of recognition to about 15 seconds per page), and then several more minutes to correct the text as needed before exporting it to Word and then Access (or whatever other software you use to store your notes).

With my new system (Athlon 900 MHz, 256 MB RAM and a Microtek ScanMaker V6UPL with a USB connection) I can scan in documents 2.7 times as fast (i.e. ~135 two-page scans per hour, or about 20 seconds per image), and it takes about 40 minutes to recognize these pages.

Unless you have a very complete dictionary in Word that includes common misspellings, I'd suggest proofing the text in FineReader: it color-codes words that it thinks may be incorrect, and it also magnifies the corresponding word from the original image in a separate window, so you don't have to search for the word on the printed page. As an example, after several hours of practice proofing, it took my old system and bad scanner approximately two-and-a-half hours to scan in, recognize and correct 41 pages from a German printed text (an estimated 12,300 words, which would take a 40-wpm typist over 5 hours to type non-stop), with FineReader making perhaps several real errors per page on average. With my new setup, I can do 270 pages (135 two-page scans, approximately 81,000 words) in one hour, a rate (excluding corrections to be made) of 1350 wpm!

FineReader has a batch feature that lets you scan in multiple pages consecutively without having to wait for each page to be recognized; you can then have the computer recognize the scanned images in the background, recognize them at some later point, or even send them immediately to another networked computer for recognition. The correction stage usually takes the longest. You can "train" FineReader to recognize difficult letters and add words to its dictionaries. I'd also suggest that you not correct every "error" flagged by the software, as most of the flagged words are in fact correct "guesses": just skim through the recognized text looking for obvious color-coded errors. After a while with the same source, you'd be surprised at how quickly you can skim the proofed text.
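The throughput figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: every input is a number taken from the text, and it assumes (as the 81,000-word estimate implicitly does) that the words-per-page density of the 41-page German sample carries over to other sources. The variable names are my own.

```python
# Figures quoted in the text.
SAMPLE_PAGES = 41        # pages in the German sample
SAMPLE_WORDS = 12_300    # estimated words in that sample
TYPIST_WPM = 40          # typing speed used for comparison
SCANS_PER_HOUR = 135     # new setup
PAGES_PER_SCAN = 2       # two facing pages per scan

# Word density implied by the sample.
words_per_page = SAMPLE_WORDS / SAMPLE_PAGES

# Time a typist would need for the sample, typing non-stop.
typing_hours = SAMPLE_WORDS / TYPIST_WPM / 60

# Scanning throughput of the new setup, before any corrections.
pages_per_hour = SCANS_PER_HOUR * PAGES_PER_SCAN
words_per_hour = pages_per_hour * words_per_page
scan_wpm = words_per_hour / 60

print(f"words per page:      {words_per_page:.0f}")
print(f"typist, 41 pages:    {typing_hours:.2f} hours")
print(f"pages scanned/hour:  {pages_per_hour}")
print(f"words scanned/hour:  {words_per_hour:.0f}")
print(f"scanning rate:       {scan_wpm:.0f} wpm")
```

The 1350 wpm figure covers scanning only; proofing time comes on top of it, so the practical rate depends on how clean the source is and how much of the flagged text you choose to correct.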

I much prefer this process to manually typing in long letters, in which case I wouldn't enter the entire letter anyway, and I can do other things at my desk while the scanning and recognizing are being done. OCR software is also particularly helpful for foreign languages that you may have difficulty typing quickly (and it offers the opportunity of importing the text into translation software if you need to). Compare the time and effort involved: with OCR you select the document, place it on the scanner, press a few buttons, and then skim over the text looking for the occasional error, whereas if you enter the document manually, you'll probably spend just as much time, with less accuracy and with strained eyes and sore wrists to boot. So for long texts, for difficult-to-type texts, or just for a change of pace, I'd suggest you at least try out an OCR program or two: many of them have demos on their websites that you can download.
