Tesseract is an optical character recognition (OCR) engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005, and development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.
The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England and Greeley, Colorado, starting in the 1980s, with more changes made in 1996 to port it to Windows and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then, all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Google has sponsored Tesseract development since 2006.
Tesseract was in the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X. However, due to limited resources, it is rigorously tested by developers only under Windows and Ubuntu.
Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. These early versions did not include layout analysis, so inputting multi-columned text, images, or equations produced garbled output. Since version 3.00, Tesseract has supported output text formatting, hOCR positional information and page-layout analysis.
Tesseract can recognize Afrikaans, Albanian, Arabic, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Czech, Cherokee, Croatian, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Maltese, Malay, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian and Vietnamese (more can be added using the included training files).
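As a rough illustration of how the hOCR output available since version 3.00 and the language selection are typically requested, the sketch below assembles a command line for the tesseract executable. It is a minimal sketch under stated assumptions, not an official interface: the helper name build_tesseract_command and the file names page.tif and page are hypothetical, and it assumes a Tesseract 3.x executable named tesseract with the relevant language data installed. Languages are passed as three-letter codes such as eng or fra.

```python
def build_tesseract_command(image_path, output_base, lang="eng", hocr=False):
    """Return the argument list for one tesseract command-line run.

    Hypothetical helper for illustration; it only builds the list
    and does not run Tesseract itself.
    """
    # Basic form: tesseract <image> <output base> -l <language code>
    command = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        # Naming the "hocr" config after the other arguments asks for
        # hOCR output (text plus positional information).
        command.append("hocr")
    return command


if __name__ == "__main__":
    # Hypothetical input file and output base name.
    cmd = build_tesseract_command("page.tif", "page", lang="fra", hocr=True)
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where Tesseract is installed. Building the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.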