Converting text from analogue to digital using OCR

Optical Character Recognition (OCR) is used in the process of digitizing paper-based text.

For those trying to keep up with the digital age, there are many ways to go paperless, or at least to get more use out of existing resources. Converting static analogue texts to digital documents allows these resources to be manipulated, altered, stored and distributed in a variety of new ways (though, of course, we must be careful not to infringe on anyone’s copyright).

OCR technology has been around for a while. There are many options for those with big budgets for this sort of thing. Thankfully there are also some options for those without corporate financing.

Capturing the Text Image

Using a smartphone as a makeshift scanner is one option for capturing the analogue image. One nice thing about this is that it lets you work (or attempt to work) from anywhere. You can also access your images from other devices using cloud storage and the like. However, this technique has enough challenges to make it frustrating: getting the image square enough to minimize distortion, capturing a sharp image, getting the lighting right, and keeping the text large enough in the image to be read accurately by the OCR software you will use in the next step. So, it is not the ideal method.

The civilized solution is to use a scanner. Scanner beds keep the paper flat and the light even. The image resolution can be set fine enough to capture a good image; I usually use the standard 300 dpi. Scanners are relatively cheap nowadays. Currently, I am using a Canon MG3130, which acts as a scanner, copy machine, and printer. I think it was around 7,000 yen at Costco Japan.
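
To get a feel for what 300 dpi means in practice, here is a small Python sketch of the arithmetic. The A4 page size and the 10-point font are assumptions chosen for illustration, not details of my own scans.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # typographic points per inch


def pixels_at_dpi(width_mm, height_mm, dpi=300):
    """Return the (width, height) in pixels of a page scanned at the given dpi."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def font_height_px(points, dpi=300):
    """Return the nominal height in pixels of a font of the given point size."""
    return points / POINTS_PER_INCH * dpi


# Assumed example: an A4 page (210 mm x 297 mm) scanned at 300 dpi.
print(pixels_at_dpi(210, 297))      # (2480, 3508)
# Assumed example: 10-point body text at 300 dpi.
print(round(font_height_px(10)))    # 42
```

The same 10-point text in a phone photo may cover far fewer pixels, depending on how far away the page is held, which is one reason the smartphone route is harder to get right.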

Converting the Image Text

One OCR solution I have used in the past is OnlineOCR.net. As the name and link suggest, it is online. It is also free to use in a limited capacity. That is good enough for smaller tasks, and good for launching an initial trial balloon of the process with no commitment necessary (you can use it without creating an account).

Another option is to use desktop software. Microsoft offers this within OneNote (Office 2013). There is also a OneNote mobile app which can streamline collecting images if you are going to attempt to use your smartphone. Once you have your image in place, it is relatively fast and somewhat reliable. The process involves right-clicking on the image and selecting Copy Text from Image from the drop-down menu (you may first have to right-click on the image and select Make Text in Image Searchable).

One complaint I have with OneNote is that it carries the line breaks of the printed page over into the digital text. This is a potential formatting nightmare if you later have to turn every line break back into wrapped text. If anyone out there knows a solution to this, please let me know.
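
In the meantime, one possible workaround is to clean up the copied text with a short script. The following is a minimal Python sketch, not a OneNote feature. It assumes that paragraphs in the copied text are separated by a blank line, and that a line ending in a hyphen is a word split across two lines. Neither assumption will hold for every text, and the function name `unwrap_lines` is my own invention.

```python
import re


def unwrap_lines(text, join_hyphens=True):
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a paragraph break.
    Assumption: a trailing hyphen marks a word split across lines
    (this will wrongly merge genuinely hyphenated words such as "well-" / "known").
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())

    result = []
    for para in paragraphs:
        joined = ""
        for line in para.split("\n"):
            line = line.strip()
            if not line:
                continue
            if not joined:
                joined = line
            elif join_hyphens and joined.endswith("-") and not joined.endswith(" -"):
                # Drop the hyphen and glue the two halves of the word together.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        result.append(joined)
    return "\n\n".join(result)


sample = "This is a line\nthat was wrapped\nby the page.\n\nSecond para-\ngraph here."
print(unwrap_lines(sample))
```

Paste the OCR output into `sample` (or read it from a file) and the result should be one line per paragraph. Passing `join_hyphens=False` leaves the hyphens alone, which is safer if the text has many hyphenated compounds.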

Google Drive

Of course, my fellow Googlites out there probably already know the quickest and easiest solution. Upload your image to Google Drive, right-click the image and select open in Docs. Done.

One nice thing about this solution is that if you are still dead set on trying to do this with your smartphone (you James Bond spy, you), you can set up your phone to send your images directly to Drive as you take them, and then just open them up later in Docs when you get around to it. Could anything possibly be easier? By the way, Google also keeps your formatting nice, with wrapped text and paragraph breaks matching the original text. Seriously, why can’t everything be as easy as Google?

OCR Errors

There is, of course, a need to proofread the result. The number of errors depends primarily on the text font, the image quality, and the skill of the programming team who designed the OCR software in the first place. Yet, even with error correction, the entire process of going from analogue to digital is less time consuming using OCR software than, say, typing out an entire book by hand. I have personally just spent 70+ hours converting three small books from analogue to digital. I can’t imagine how long it would have taken had I tried typing out everything on my own. Actually, I can. It would have taken 0 hours, because I would never have attempted it in the first place.
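
A script cannot replace proofreading, but it can point you at places worth a second look. Here is a rough Python sketch that flags words mixing letters with digits, or containing symbols that rarely appear in ordinary prose. Both rules are my own guesses at typical OCR slips (such as a 1 read in place of an l), not an established method, and they will produce false alarms on things like "3rd".

```python
# Hypothetical list of symbols that seldom occur in ordinary prose.
ODD_CHARS = set("|~^`\\{}<>")
# Punctuation stripped from the ends of each word before checking it.
EDGE_PUNCTUATION = ".,;:!?\"'()[]"


def flag_suspects(text):
    """Return (line_number, word) pairs that look like possible OCR errors."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            word = token.strip(EDGE_PUNCTUATION)
            if not word:
                continue
            has_letter = any(c.isalpha() for c in word)
            has_digit = any(c.isdigit() for c in word)
            has_odd = any(c in ODD_CHARS for c in word)
            if (has_letter and has_digit) or has_odd:
                suspects.append((line_no, word))
    return suspects


sample = "The qu1ck brown fox\njumped over the |azy dog."
for line_no, word in flag_suspects(sample):
    print(f"line {line_no}: {word}")
```

Errors that happen to form real words (for example "modem" in place of "modern") will slip straight past a check like this, so reading the text through is still necessary.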

So, there you have it. Some technology to help you streamline your transition to paperless teaching, or for whatever other purpose you might find it useful. Comments and suggestions welcome.
