Google’s “Quick View” PDF Also Does OCR Conversion For Many Languages

I was searching for a free program to OCR some of my PDF files. I have tried TopOCR, FreeOCR, tesseract-ocr and many others. The only one I found useful is the OCR offered by PDF-XChange Viewer. I downloaded the portable free version from their website and tried the OCR feature. It worked like a charm for me. It uses OCR (optical character recognition) to extract the text, even if that text is contained inside an image, which is common for PDFs produced by a scan-to-PDF function. Language packs are available for over 20 languages, including French, German, Italian, and Spanish.

With that utility installed, I was cooking: I could convert any file (in particular PDF and TIFF) into a bitmap, and then extract the text out of the bitmap. The only remaining consideration was to treat PDF files that already contain text differently. After all, OCR is computationally intensive and somewhat error-prone even with perfect image quality and resolution. So another quick search, and I had PDFTOTEXT. For performance reasons, only the first page of the PDF/TIFF file is OCR-ed. There are additional ImageMagick utilities to combine multiple images together before OCR-ing if you want to OCR the whole document.
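The text-first, OCR-second flow described above can be sketched in Python. This is only a rough illustration under stated assumptions: the helper names (`has_usable_text`, `first_page_commands`, `first_page_text`), the 20-character threshold, and the 300 dpi density are all hypothetical choices, and the commands assume that `pdftotext`, ImageMagick's `convert`, and `tesseract` are installed and on the PATH (newer ImageMagick releases may expose the tool as `magick` instead).

```python
import subprocess
from pathlib import Path


def has_usable_text(text: str, min_chars: int = 20) -> bool:
    """True if the extracted text has enough non-whitespace characters to skip OCR.

    The 20-character threshold is an arbitrary assumption, not a tuned value.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def first_page_commands(pdf: Path, workdir: Path) -> dict:
    """Build the three command lines for the first page of a PDF."""
    bmp = workdir / "page1.bmp"
    out_base = workdir / "page1"  # tesseract appends .txt to this base name
    return {
        # Extract the existing text layer of page 1 to stdout ("-").
        "extract": ["pdftotext", "-f", "1", "-l", "1", str(pdf), "-"],
        # Rasterize only page 1 (index 0) to a bitmap.
        "rasterize": ["convert", "-density", "300", f"{pdf}[0]", str(bmp)],
        # OCR the bitmap into page1.txt.
        "ocr": ["tesseract", str(bmp), str(out_base)],
    }


def first_page_text(pdf: Path, workdir: Path) -> str:
    """Try the cheap text extraction first and fall back to OCR only if needed."""
    cmds = first_page_commands(pdf, workdir)
    extracted = subprocess.run(
        cmds["extract"], capture_output=True, text=True, check=True
    ).stdout
    if has_usable_text(extracted):
        return extracted
    subprocess.run(cmds["rasterize"], check=True)
    subprocess.run(cmds["ocr"], check=True)
    return (workdir / "page1.txt").read_text()
```

Calling `first_page_text(Path("scan.pdf"), Path("."))` would run the external tools; the two smaller helpers can be checked on their own without any of them installed.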

For repurposing, OCR typically converts a printed table into an Excel spreadsheet, or an old book into either a PDF with searchable text hidden under the page images or a word-processing document that you can edit and reuse. High-powered OCR software can also convert printed text into HTML files that anyone can view in a browser. ReadIRIS Pro is an affordable PDF OCR option for business and home users. It provides an accurate recognition rate at a low cost while keeping some of the advanced features found in higher-priced professional PDF OCR server products. Its main limitation is that the Pro version only handles documents under 50 pages.

Tesseract is an orphaned brainchild of HP, which worked on it from 1985 to 1995. It was then released as open source, and now, if I understand it correctly, Google is working on it. With credentials like that, it’s no wonder that Tesseract scores among the highest marks for OCR accuracy. After downloading it and struggling just a bit, I got Tesseract to work. The struggle was that the home page claims its base input format is a TIFF file. Maybe my TIFFs were bad, but I was able to get it to work only with BMP files.

Click the gear icon on the selected PDF file. You can then customize the page range and output format. MS Office Word and Rich Text Format are available; here we choose “.doc”. For poor-quality source documents, the best approach is to scan in grayscale. Grayscale mode can extract more information from the scanned documents, and the program will automatically set the optimal brightness value when performing the process. Google Docs did a pretty good job here. It struggled to understand the web addresses, but so did all these tools.

Since scanned PDFs are nothing but images, don’t be surprised if Google adds a “search by text” function to their Image Search engine, similar to OneNote or Evernote. That will surely be huge. I am particularly interested in OCR tools that can accept a scanned PDF file as input and produce as output another PDF file that looks the same as the input but with its text copyable. Even better, just as I resigned myself to a painful process of opening a PDF, running OCR, doing a Save As command, and closing the PDF (repeat ad nauseam), I discovered a freely downloadable AppleScript droplet for batch OCRing.
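For readers without that droplet, a similar batch flow can be approximated with the command-line tools mentioned earlier. The sketch below is not the AppleScript approach; it swaps in ImageMagick and Tesseract, and it only builds the command lines rather than running them. The function names and the `_ocr` suffix are hypothetical. It assumes `convert` can rasterize PDFs on your system (this usually needs Ghostscript) and that your Tesseract build supports the `pdf` output config, which writes the page image with a text layer behind it.

```python
from pathlib import Path


def searchable_pdf_commands(scanned_pdf: Path) -> list[list[str]]:
    """Commands that turn one scanned PDF into a PDF with a copyable text layer."""
    tiff = scanned_pdf.with_suffix(".tif")
    out_base = scanned_pdf.with_name(scanned_pdf.stem + "_ocr")  # tesseract adds .pdf
    return [
        # Rasterize every page into one multi-page TIFF.
        ["convert", "-density", "300", str(scanned_pdf), str(tiff)],
        # OCR the TIFF and write <name>_ocr.pdf.
        ["tesseract", str(tiff), str(out_base), "pdf"],
    ]


def batch_commands(folder: Path) -> list[list[str]]:
    """Collect the commands for every PDF in a folder, skipping earlier outputs."""
    commands = []
    for pdf in sorted(folder.glob("*.pdf")):
        if pdf.stem.endswith("_ocr"):
            continue  # already produced by a previous run
        commands.extend(searchable_pdf_commands(pdf))
    return commands
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` once you have confirmed the tools behave as assumed on a single test file.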

Smartsoft Invoices reads the data from the invoice image based on pre-defined templates. Defining a new invoice template is easy: the software locates all text regions and identifies them. From then on, for each new invoice coming from the same vendor, the software extracts the data automatically. OmniPage Ultimate is an all-in-one solution for professional document management. It is built on optical character recognition technology and comes jam-packed with dozens of helpful and time-saving features.