OCR stands for “Optical Character Recognition”. Tesseract is an open-source OCR engine licensed under the Apache License v2.0. A system with Tesseract installed can convert images containing text into plain-text files.

This “How to install” tutorial is meant as a practical guide; it does not cover the theory of OCR or the algorithms used in Tesseract. Those topics are covered in many other documents on the web.

Installing Tesseract on Ubuntu is straightforward because it is available through apt-get, but on CentOS it takes some effort and the correct versions to build from source.

Follow the steps below to install Tesseract on CentOS:

1. OS update using yum.

Set up CentOS 6.8 and update it using "yum update".

2. Preparation

Tesseract-ocr cannot be installed conveniently with yum, because yum reports that no package exists for tesseract or leptonica. We therefore need to download the source and build both Tesseract-ocr and Leptonica.

3. Run the commands below to install the "Development tools" group and the dependent libraries.

sudo yum groupinstall "Development tools"

sudo yum -y install automake autoconf libtool zlib-devel libjpeg-devel giflib libtiff-devel libwebp libwebp-devel libicu-devel openjpeg-devel cairo-devel 
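Before moving on to the builds, it can help to confirm that the packages really were installed. The following is a minimal sketch, not part of the original procedure: missing_packages is a hypothetical helper name, and it assumes rpm is available, as it is on CentOS.

```shell
#!/bin/sh
# Print every package from the argument list that rpm does not report as installed.
# missing_packages is a hypothetical helper name used only for illustration.
missing_packages() {
    for pkg in "$@"; do
        rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Example call, using some of the package names from the yum command above:
# missing_packages automake autoconf libtool zlib-devel libjpeg-devel libtiff-devel
```

If the function prints nothing, every package in the list is installed.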

4. Download the Leptonica version below and install it:

Download site: http://www.leptonica.com/download.html

tar xzvf leptonica-1.73.tar.gz

cd leptonica-1.73

./configure

sudo make 

sudo make install

5. Download the Tesseract version below and install it:

Download the latest version from https://github.com/tesseract-ocr/tesseract/releases

or

wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.02.tar.gz

tar xzvf tesseract-ocr-3.02.02.tar.gz

cd tesseract-ocr

./autogen.sh

./configure

sudo make

sudo make install 

sudo ldconfig
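The ldconfig call refreshes the linker cache so that programs can find the newly installed libraries. A quick way to check the result is to search the cache listing printed by "ldconfig -p". This is only a sketch; lib_registered is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the linker cache lists a library whose entry contains the given name.
# lib_registered is a hypothetical helper name used only for illustration.
lib_registered() {
    ldconfig -p | grep -qF "$1"
}

# Example calls:
# lib_registered libtesseract.so || echo "libtesseract is not in the linker cache"
# lib_registered liblept.so      || echo "leptonica is not in the linker cache"
```

If a library is reported as missing, one possible cause (an assumption, not something verified for this setup) is that /usr/local/lib is not in the linker search path; adding it to a file under /etc/ld.so.conf.d and running "sudo ldconfig" again is a common remedy.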

6. Download the Tesseract English trained data:

Download from either of the following sites:

wget https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.eng.tar.gz

or

wget http://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-3.02.eng.tar.gz/

tar xzvf tesseract-ocr-3.02.eng.tar.gz

sudo cp tesseract-ocr/tessdata/* /usr/local/share/tessdata

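export TESSDATA_PREFIX=/usr/local/share/

Note: Tesseract 3.x expects TESSDATA_PREFIX to point to the parent directory of tessdata, not to the tessdata directory itself.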
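A missing language file is a common reason for Tesseract failing at run time, so it is worth checking that the copy step worked. The sketch below assumes the English data file is named eng.traineddata; has_language is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if <tessdata directory>/<language>.traineddata exists.
# has_language is a hypothetical helper name used only for illustration.
has_language() {
    [ -f "$1/$2.traineddata" ]
}

# Example call, using the directory the files were copied to above:
# has_language /usr/local/share/tessdata eng || echo "eng.traineddata is missing"
```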

7. Install Ghostscript, which is used to convert PDF files to PNG images:

Download sites: https://github.com/ArtifexSoftware/ghostpdl-downloads/releases

https://sourceforge.net/directory/os:linux/?q=ghostscript-9.16.tar.gz

tar xzvf 

cd ghostscript-9.09

./autogen.sh

./configure

sudo make

sudo make install 
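Once Ghostscript is installed, a PDF can be rendered to PNG pages that Tesseract can then read. The following is a sketch under assumptions: the png16m device and the 300 dpi resolution are illustrative choices, not settings taken from this tutorial, and pdf_to_png is a hypothetical helper name.

```shell
#!/bin/sh
# Render every page of a PDF to PNG files named <prefix>-001.png, <prefix>-002.png, ...
# pdf_to_png is a hypothetical helper; device and resolution are assumed values.
pdf_to_png() {
    gs -dNOPAUSE -dBATCH -dQUIET -sDEVICE=png16m -r300 \
       -sOutputFile="$2-%03d.png" "$1"
}

# Example call: writes page-001.png, page-002.png, ... for scan.pdf
# pdf_to_png scan.pdf page
```

Each resulting PNG can then be passed to the tesseract command like any other image.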

Lessons learned:

I ran into an issue with the versions of Tesseract and Leptonica shown below. The errors were difficult and time-consuming to correct.

[root@mycluster ~]# tesseract -v

tesseract 3.02.02

leptonica-1.69

Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path

Resolution: There appears to be a bug in leptonica-1.69:

https://bugs.mageia.org/show_bug.cgi?id=10403

I therefore downloaded the next version, leptonica-1.73.tar.gz, from http://www.leptonica.com/download.html and installed it.

Use the command below to verify the installation:

[root@example SampleFiles]# tesseract -v

tesseract 3.04.01

 leptonica-1.73

 libjpeg 6b (libjpeg-turbo 1.2.1) : libpng 1.2.49 : libtiff 3.9.4 : zlib 1.2.3
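Because the Leptonica version mattered so much here, it can be useful to pull it out of the version output automatically, for example in a setup script. The sketch below parses output in the format shown above; leptonica_version is a hypothetical helper name.

```shell
#!/bin/sh
# Read "tesseract -v" output on stdin and print only the Leptonica version number.
# leptonica_version is a hypothetical helper name used only for illustration.
leptonica_version() {
    sed -n 's/^[[:space:]]*leptonica-\([0-9][0-9.]*\).*$/\1/p'
}

# Example call; 2>&1 is used so the text is captured whether it is
# printed on stdout or stderr:
# tesseract -v 2>&1 | leptonica_version
```

A script could compare the printed value against 1.69 and stop early if the problematic version is still installed.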

You can use the command below to extract text from an image:

$ tesseract sample.tiff sample

The output file sample.txt contains the extracted text.
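The same command can be applied to a whole directory of images. This is a minimal sketch that assumes the images use the .tiff extension, as in the example above; ocr_all is a hypothetical helper name.

```shell
#!/bin/sh
# Run tesseract on every .tiff file in a directory.
# For each image NAME.tiff the text is written to NAME.txt in the same directory.
# ocr_all is a hypothetical helper name used only for illustration.
ocr_all() {
    for img in "$1"/*.tiff; do
        [ -e "$img" ] || continue          # skip when the glob matches nothing
        tesseract "$img" "${img%.tiff}"    # tesseract appends .txt to the output base
    done
}

# Example call:
# ocr_all /path/to/SampleFiles
```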

Please visit the link below to see a tutorial on how to install and run Tesseract:

http://ammozon.co.in/gif/ocr.gif
