How to extract text from images: a comparison of 10 free OCR tools

Text is printed to paper every day; occasionally, however, the reverse is needed: getting the original text back from a scanned image or photograph for further editing and use.

This conversion is called Optical Character Recognition, or OCR for short. It can convert scanned books and documents into editable text, get editable text from PDFs created via scanning, or even get text from screenshots and images.

There are a variety of tools available for character recognition and some of them are free to use. This article will help you find and choose between several free OCR tools.


Online OCR services vs. desktop OCR software

Selecting the right OCR tool depends on your specific needs. Generally, OCR tools can be divided into two groups, online services and desktop software, and both have their positive and negative sides.

Online services require you to upload your files over the internet to their servers, so there may be privacy concerns, as well as time and bandwidth concerns if your document is big. Most limit the file size and the number of pages they will process for free per day or week; for bigger jobs you have to buy extra processing capacity. On the flip side, many of these services are really good at the OCR itself.

With desktop software you don’t need to worry about uploading sensitive information to foreign servers, or whether your file will take too long to upload. Desktop programs generally give better text review options, and some offer integration with scanner software.

A note on comparing OCR software: OCR programs are not mainstream applications, so only a limited number of freeware titles are available, unlike, for example, media players or file managers. In this article we aimed to provide the complete list of titles we found and evaluated at the time of writing. Keep in mind that OCR results tend to vary: the accuracy of different OCR solutions depends on the quality, file format and fonts of the source documents. For instance, some programs do better with typewriter fonts and worse with screen fonts, whereas other programs do exactly the opposite.

We therefore shied away from a head-to-head comparison of OCR accuracy in this article, as the rating could be unfair for the specific files you might need to process. There is some general information about getting good OCR results at the end of the article.

We reviewed the following online OCR services and desktop OCR programs, all of which are either FREE or have a free component.

Online OCR services:
  1. Google Docs
  2. Free Online OCR
  3. i2OCR
  4. OCRonline
  5. Online OCR

Desktop software:
  1. Cuneiform OpenOCR
  2. FreeOCR
  3. gImageReader
  4. Puma.NET
  5. SimpleOCR


Part 1: Online OCR software

Online OCR software is available through the web browser, so you don’t have to install new software on your computer. All you need to do is get the image file using a scanner or a digital camera, upload it through the online OCR web page and wait for the processed file to download.

1. Google Docs

If you have a Gmail or other Google account you might try Google Docs first. Google Docs is not a dedicated OCR tool but it provides the OCR power Google uses to digitize books and process PDFs for their search engine.

To get text from image or PDF files, you first need to upload the files and convert them to Google Docs format. Then you can edit the result online and/or download it as PDF, DOC, TXT etc.

To upload files in Google Docs, click the Upload button, select Settings from the menu, check ‘Convert uploaded files to Google docs format’ and ‘Convert text from uploaded PDF and images files’, and then click Upload/Files. Another way is to check ‘Confirm settings before each upload’ after clicking Upload/Settings, so that each time you upload a file you are asked whether to convert it or leave it intact. This dialog also lets you select which language dictionary is used in the text recognition process. The file is then converted to a Google Docs document containing both the original image(s) and the converted text. You can review the text and delete the original images afterwards.

Google Docs conversion works pretty well, especially with English texts. Over 30 different languages can be selected, but if your language is not in the list the conversion may give an error and the file will not be processed. And if you don’t have a Google account, you can create one at any time.

  • Input image file types: most bitmap formats
  • Input PDF files: yes
  • Output file types: ODT, PDF, TXT, RTF, DOC, HTML
  • Languages: 30+
Google Docs / PROS:
  • Unlimited processing capacity
CONS:
  • Text in some minor languages may not be recognized

2. Free Online OCR

The Free Online OCR web page is reviewed more thoroughly elsewhere on freewaregenius.com.

  • Input image file types: GIF, BMP, JPEG, TIFF, PNG
  • Input PDF files: yes
  • Output file types: DOC, PDF, RTF, TXT
  • Languages: English dictionary only
Free Online OCR / PROS:
  • No capacity limits for processing
  • Keeps original formatting and layout
CONS:
  • Only an English dictionary is supported. Text in other languages may not be recognized

3. i2OCR

  • Input image file types: TIF, JPEG, PNG, BMP, GIF, PBM, PGM, PPM
  • Input PDF files: no
  • Output file types: TXT
  • Languages: 30+
i2OCR / PROS:
  • No limits for uploading
  • Has a review option after character recognition: the original image and the resulting text are shown side by side on screen.
CONS:
  • Only text output, so all the original formatting is lost, though at least multi-column pages are handled correctly.
  • Creates “hard” linebreaks at the end of each line.
  • Does not process PDF files.
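
The “hard” linebreaks mentioned in the list above can usually be removed after the fact. Here is a minimal sketch in Python, assuming that a blank line separates paragraphs in the OCR output. The function name is made up for illustration, and words hyphenated across lines are not rejoined.

```python
import re


def unwrap_hard_linebreaks(text: str) -> str:
    """Join the lines of each paragraph; keep blank lines as paragraph breaks."""
    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for paragraph in paragraphs:
        # Drop the line-end breaks inside the paragraph and rejoin with spaces.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        unwrapped.append(" ".join(lines))
    return "\n\n".join(unwrapped)
```

Running the plain-text result through a function like this before pasting it into a word processor lets the text reflow normally.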

4. OCRonline

  • Input image file types: JPG, TIFF, PNG, GIF
  • Input PDF files: yes
  • Output file types: TXT, PDF, RTF, DOC
  • Languages: 150+
OCRonline / PROS:
  • Excellent recognition quality
  • Rebuilds original formatting
  • Impressive list of 150 language dictionaries
CONS:
  • Limited upload capacity: 5 pages per week, file size up to 10 MB. Extra pages have to be paid for.

5. Online OCR

  • Input image file types: JPG, JPEG, BMP, TIFF, GIF
  • Input PDF files: only for registered users
  • Output file types: DOC, XLS, TXT (+ PDF for registered users)
  • Languages: 30+

Note: The site has a registered mode and a guest mode. In guest mode, 15 images per hour can be processed and the maximum file size is 4 MB. Registered mode adds some extra possibilities, like uploading larger images, ZIP archives and multi-page PDFs. Registering gives an initial credit for converting 20 pages.

Online OCR / PROS:
  • Supports some languages that other servers do not support.
CONS:
  • Limited upload capacity. Extra capacity may be purchased or earned by bonus program.

Our Recommendation: The last word on online OCR services

Of the online OCR solutions reviewed above, OCRonline provided good and stable OCR accuracy with a number of different fonts and texts. Unfortunately the free service is limited to 5 pages per week. If you need more capacity, try the other providers, as they may also give good results depending on your source text.


Part 2: Desktop OCR software

Desktop software has to be downloaded and installed on your computer, and it usually has more configurable options than online tools. Some programs can acquire an image directly from a scanner, so you don’t need to use other programs to do that.

The following OCR software will be reviewed: Cuneiform OpenOCR, FreeOCR, gImageReader, Puma.NET and SimpleOCR. There are some more free tools available, mainly meant for more specific tasks. JOCR gets text from screenshots, requires Microsoft Office 2003 or later to be installed, and has been reviewed here previously. There is also Nuance PDF Reader, which can upload scanned PDFs to its online service for character recognition; it has also been reviewed here previously. And finally there is MyMorph, a program intended for converting document archive files from one format to another (TIFF, PDF, RTF etc.). MyMorph can convert image files to editable text files.


6. Cuneiform OpenOCR

OpenOCR is based on the commercial product Cuneiform, which was released as freeware in 2007.

  • License: freeware
  • Input image: most bitmap file formats
  • Input PDF: no
  • Scanner input: yes
  • Output: TXT, RTF, HTML + output to Word/Excel
  • Dictionary languages: 20+
Cuneiform OpenOCR / PROS:
  • Includes both single-file and batch processing modes.
CONS:
  • The installation program creates invalid Start Menu shortcuts like NewFolder1

7. FreeOCR

This is one of the programs that use the open source Tesseract OCR engine. Tesseract was originally developed by HP and is currently sponsored by Google.

  • License: freeware
  • Requires: Microsoft .NET
  • Input image: TIFF, multi-page TIFF
  • Input PDF: yes
  • Scanner input: yes
  • Output: TXT
  • Dictionary languages: 9
FreeOCR / PROS:
  • Tesseract OCR engine has good accuracy.
CONS:
  • Only text output, no formatting recognition
  • No multi-column support (must crop the image manually to one column)

8. gImageReader

gImageReader is one of the front-ends for the free Tesseract OCR engine. You need to download and install Tesseract separately. The Tesseract engine uses OpenOffice dictionaries and spellcheckers, which are also a separate download.

  • License: freeware (GNU)
  • Requires: Tesseract, need to download separately
  • Input image: JPEG, GIF, PNG, TIFF
  • Input PDF: yes
  • Scanner input: yes
  • Output: TXT
  • Dictionary languages: many, uses freely downloadable OpenOffice spellcheckers
gImageReader / PROS:
  • Tesseract OCR engine has good accuracy
  • OCR area(s) can be manually selected
CONS:
  • Only text output, no formatting recognition

9. Puma.NET

Puma.NET is actually not an end-user solution but a development kit based on the CuneiForm OCR engine, though it contains a sample program with a front-end.

After installing there will be no launch icon in the Start Menu, but you can find the program Puma.Net.Sample.exe deep in the C:\Program Files\Puma.NET\Sample\bin\x86\Debug\ folder.

  • License: freeware (BSD)
  • Requires: Microsoft .NET
  • Input image: BMP, GIF, EXIF, JPG, PNG and TIFF
  • Input PDF: no
  • Scanner input: no
  • Output: TXT, RTF, HTML
  • Dictionary languages: 27
Puma.NET / PROS:
  • Font and formatting detection
CONS:
  • You have to create the shortcut to start the program by yourself
  • Leaves “hard” linebreaks

10. SimpleOCR

SimpleOCR uses its own OCR engine that is capable of learning the fonts in a particular document.

  • License: free for all non-commercial purposes
  • Input image: TIFF, JPG, BMP
  • Input PDF: no
  • Scanner input: yes
  • Output: DOC, TXT
  • Dictionary languages: 3

Note: SimpleOCR seems to give better results from color JPEGs, not grayscale.

SimpleOCR / PROS:
  • Word-by-word text revision
  • Ability to train the engine to use specific fonts
  • Includes both single-file and batch processing modes
CONS:
  • Only 3 dictionary languages.
  • No font and format detection

Our Recommendation: The last word on desktop OCR software

Of the desktop OCR software reviewed above, Cuneiform OpenOCR provided good accuracy with different fonts, including artistic ones. Having said that, most of the programs also performed well when processing text in simple fonts.

 

About OCR and how to get better results

OCR is used to turn printed books and documents back into text. OCR tools analyze the image, recognize the characters and words, and output them as an editable text file. Character recognition is never perfect. According to some studies, the accuracy of commercial OCR products varies from 70% to 98%, and total accuracy can be achieved only with the help of human review.
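
If you want to put a number on accuracy for your own documents, one rough approach is to compare the OCR output with a hand-corrected version of the same page. The sketch below uses Python's standard difflib module. The function name and the choice of metric are assumptions made for illustration, not the method used by the studies mentioned above.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text: str, reference_text: str) -> float:
    """Percentage of reference characters that also appear, in order, in the OCR text."""
    if not reference_text:
        return 100.0 if not ocr_text else 0.0
    # autojunk=False keeps difflib from ignoring very frequent characters in long texts.
    matcher = SequenceMatcher(None, reference_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(reference_text)
```

Checking one or two representative pages this way is a quick means of comparing tools or scan settings on your own material.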

To improve accuracy, most OCR tools also use dictionaries. Instead of recognizing individual characters, they try to recognize whole words that exist in the selected dictionary. Some OCR software cannot detect fonts and formatting and can only give plain text as output; you then need to reapply all the formatting manually. Other OCR engines detect font styles such as bold and italic, and some also detect paragraph formatting, multiple columns, tables and images inside the text, so they can use this information to replicate the document in an editable format like DOC or HTML.
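
The dictionary idea can be illustrated with a small post-processing sketch in Python. Real OCR engines apply their dictionaries during recognition, so this is only an approximation of the principle. The function name, the tiny word list and the 0.75 similarity cutoff are all assumptions chosen for the example.

```python
from difflib import get_close_matches


def snap_to_dictionary(word: str, dictionary: list[str], cutoff: float = 0.75) -> str:
    """Replace a recognized word with its closest dictionary entry, if one is close enough."""
    lowered = word.lower()
    if lowered in dictionary:
        return word  # already a known word, leave it alone
    # get_close_matches returns the best candidates above the similarity cutoff.
    candidates = get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else word


# Hypothetical three-word dictionary, just for demonstration.
words = ["optical", "character", "recognition"]
print(snap_to_dictionary("rec0gnition", words))
```

A word that resembles nothing in the dictionary is returned unchanged, which is also why text in an unsupported language tends to come out worse.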

The source for character recognition can be an image obtained from a scanner, digital camera or screenshot. If you use a scanner and have a lot of pages, you might use OCR software that has scanner support built in. The program then suggests the settings that give the best results for OCR. Usually this means 300 dpi resolution (200 dpi minimum) and a grayscale JPG or TIFF image. Some software handles color images better than grayscale, though. So if you do not get good results, try several settings, such as 300 dpi color JPG and 300 dpi grayscale JPG, or TIFF instead of JPG.

Getting decent OCR results from images taken with a digital camera is quite difficult. Good light, no flash, straight paper, macro mode etc. help to get better results, as described for instance in this article. It is also possible to get text from screenshot files, but this needs some extra measures. Usually the resolution of a screenshot is 72 dpi, but OCR needs at least 200 dpi. Some OCR programs can automatically adjust the resolution of the image file, but for others you need to use an image manipulation program to convert the resolution to 200 dpi.
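
To see what converting a screenshot from 72 dpi to 200 dpi means in pixels, here is a small arithmetic sketch in Python. The function name is hypothetical, and the defaults simply restate the 72 dpi and 200 dpi figures from the paragraph above.

```python
def upscaled_size(width_px: int, height_px: int,
                  source_dpi: int = 72, target_dpi: int = 200) -> tuple[int, int]:
    """Pixel dimensions an image needs so that it covers the same area at target_dpi."""
    # The scale factor is target_dpi / source_dpi (about 2.78 for 72 -> 200 dpi).
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height


print(upscaled_size(1024, 768))
```

The resulting size can then be used when resampling the screenshot in an image editor, or in a library such as Pillow via `Image.resize`, which is an assumption about your toolchain rather than something the reviewed programs require.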

For screenshots you can also use special programs like JOCR.

OCR is often used to process PDF files. A PDF usually consists of images that are shown on screen and also the source text that you can select for copy-paste. But some PDFs contain only images, like scanned PDF files. The usual “convert-PDF-to-Word” type of software cannot process these files. To extract text from PDF files that contain only images, you need OCR software that accepts PDF files as input.


 
 
 
November 1, 2011
Priit
  • http://raldaz.wordpress.com Raul Diaz

    You left out Evernote…

  • ermete

    Sorry: could you tell me where you downloaded Cuneiform OpenOCR?
    If I follow the download links I end up on a Russian page; if, on that page, I click the English version,
    I get to another page where you can download a demo.
    Did I do something wrong, or is this a treasure hunt?

  • http://portablefreeware.com webfork

    @Raul
    Evernote doesn’t export the text it recognizes to a file (it is online only), or I would have switched over to it some time ago. Please post if this has changed.

    @Priit
    This is definitely a topic that needed review — thanks for covering it.

    Additional note: gImageReader *appears* to use a more recent version of the Tesseract engine than FreeOCR. It’s also open source (GPL) and doesn’t require dotNET.

  • rodocop

    ermete,
    don’t click to english page – just scroll down and you’ll find this directlink:
    http://cognitiveforms.ru/downloads/setup_openocr_cuneiform_en.exe

  • ermete

    Thanks rodocop

  • Doug

    I know it’s not freeware, but I’ve really grown to like the Abbyy line of PDF tools. If you get on their mailing list, they do have occasional good sales. The OCR capability is overall very good. Lite versions of their software are sometimes bundled with hardware scanners.

  • Hamstermoon

    I use http://www.free-ocr.com/. It seems to allow more uploads (10 images an hour) than the others you mention.

  • Ron

    Although it isn’t free, another common tool is OneNote in Office 2007 and 2010. It incorporates OCR. Paste an image into OneNote, right click, and select the “Extract Text” option. It only captures text, not formatting.

    OneNote replaces MODI as the OCR tool, although there are techniques for getting MODI into 2007 or 2010.

    On the HowToGeek site, referring to this page, there is a comment about Word 2007 / 2010 providing OCR. I would really like to hear more about that. I, and several other “experts” I’m in contact with, have never mentioned OCR in Word.

  • Kent Dyer

    Here is a desktop app that does pretty simple captures of dialogs, Windows Explorer, etc. You do have to have Office 2007 (Imaging components) or earlier installed for it to do OCR, however. You have to run it as a standalone (there is no installer available), and it captures screens and sections too.

    http://home.megapass.co.kr/~woosjung/Product_JOCR.html

    I have used this for probably 3-5 years now and I make sure it is part of my toolset.

    HTH,

    Kent

  • http://www.helengrives.nl Hèlen

    Hi,

    Thanx for the info. I tried CuneiForm for a Dutch text and it worked like a charm. I will test your other recommendations later on.

  • http://www.MyFreeOCR.com Hakon

    http://www.MyFreeOCR.com is a free online ocr software.

  • http://purebible.blogspot.com/ Steven Avery

    Hi, while it is not free, Abbyy Screenshot Reader is only about $10, it works superbly. I use it even on google books and archive.org frequently, also PDFs without an easy copy facility. (All done for fair use extracts, without having to type in many paragraphs.) The quality of the OCR of course varies with strange fonts, small text, etc … but for normal text it is superb, since Abbyy knows their stuff and has lots of more sophisticated products for other purposes (which I have not needed at all). About 3 years ago I bought it and it has been the best $10 purchase, by far.

    Steven

  • http://portablefreeware.com webfork

    PDF-XChange — a PDF viewer and basic editor FWG highlighted back in ’07 — just added OCR capabilities to the freeware and portable versions. I’ve so far been very impressed by the accuracy.

  • sheraz

    This is a really informative article. I came across a nice Java OCR component. I hope you guys are going to like it. Here are some details:
    Aspose.OCR for Java is a character recognition component that allows developers to add OCR functionality in their Java web applications, web services and Windows applications. It provides a simple set of classes for controlling character recognition tasks. It helps developers to work with image files from within their Java applications. It allows developers to extract text from images, Read font, style information quickly, saving time & effort involved in developing an OCR solution from scratch.

  • Leoncio Tritón

    It goes without saying that Cuneiform seems to me an excellent program among OCR programs. Since the download link has already been given, I would only add that the page to be scanned can come from a camera JPG file at 2 or 3, covering two pages at once, and that we have to use the threshold function of a retouching program such as Fotografix to increase the image contrast until the characters in the image are legible.

  • vion77

    Try Wondershare PDF Converter Pro (for Windows or Mac):
    http://www.programmiperpc.com/convertire-pdf-scannerizzato.html

  • http://link2how.blogspot.in rick

    thanks,
    check out this interclue click less know more

  • Dave

    GT Text is very good one!
    At least free, fast and accurate
    http://gttext.googlecode.com

  • Richard

    Google Docs will only accept uploads of up to 2MB

  • Aware

    Really helpful. Thanks.

  • Francis

    Can any of these programs also do a quality check, i.e. after conversion, identify the (likely) errors in the converted text document?