Tesseract

Converting PDF to Text using Tesseract…

Tesseract cannot handle PDF files directly, so the files are first converted to TIFF using Ghostscript before being passed to Tesseract. In addition, Tesseract cannot process multi-page TIFFs (images), so Ghostscript is used alongside it to complete the task. I am using the command below to process multiple TIFF files: for i in *.tiff ; do tesseract $i $i; done; When we run Ghostscript on a PDF file, it generates a separate TIFF file for each page of the PDF. Run the command below to process a PDF file using Ghostscript: gs -dNOPAUSE […]
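The per-page OCR loop from the excerpt above could be written a little more defensively. The following is only a sketch: the function name ocr_tiffs and the OCR_CMD variable are hypothetical names made up for this example, and it assumes the per-page files end in .tiff. Unlike the loop in the excerpt, it quotes file names (so names with spaces work) and strips the .tiff suffix from the output base, so page1.tiff produces page1.txt instead of page1.tiff.txt.

```shell
#!/bin/sh
# Sketch: OCR every per-page TIFF in a directory with Tesseract.
# ocr_tiffs and OCR_CMD are hypothetical names used only in this example.
OCR_CMD="${OCR_CMD:-tesseract}"

ocr_tiffs() {
    dir="${1:-.}"
    for page in "$dir"/*.tiff; do
        # Skip the unexpanded pattern when no .tiff files are present.
        [ -e "$page" ] || continue
        # Usage is: tesseract <image> <output base>
        # Tesseract appends .txt to the output base itself.
        "$OCR_CMD" "$page" "${page%.tiff}"
    done
}
```

Calling ocr_tiffs ./pages would process every .tiff file in a (hypothetical) ./pages directory; with no argument it works on the current directory.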

Analytics, Tesseract

OCR – “Optical Character Recognition”, Set up Tesseract OCR on CentOS 6.8…

OCR means “Optical Character Recognition”, and Tesseract is an OCR engine licensed under the Apache License v2.0. A system configured with Tesseract OCR can convert images with embedded text into text files. This “how to install” tutorial is meant as a practical guide; it does not cover the theoretical background of OCR or the algorithms used in Tesseract. Those are covered in many other documents on the web. Tesseract installs smoothly on Ubuntu (thanks to apt-get), but on CentOS it takes some effort and the correct versions to build. Please follow the steps below to install Tesseract on CentOS: 1. Update the OS using yum. Set up CentOS 6.8 […]