tesseractrun

Tesseract cannot read PDF files directly, so each PDF is first converted to TIFF images with Ghostscript, and those images are then passed to Tesseract. In this setup Tesseract also cannot process multi-page TIFFs, so Ghostscript is used to produce one single-page TIFF per PDF page.

I use the command below to process multiple TIFF files:-

for i in *.tif ; do tesseract "$i" "$i"; done;

When Ghostscript processes a PDF file, it generates one TIFF file for each page of the PDF.

Run the command below to process a PDF file with Ghostscript:-

gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=scan_%d.tif 1500_13578937.pdf

GPL Ghostscript 9.20 (2016-09-26)

Copyright (C) 2016 Artifex Software, Inc. All rights reserved.

This software comes with NO WARRANTY: see the file PUBLIC for details.

Processing pages 1 through 4.

Page 1

Loading StandardSymbolsPS font from %rom%Resource/Font/StandardSymbolsPS… 4486484 2846586 2226420 871182 3 done.

Page 2

Loading NimbusMonoPS-Regular font from %rom%Resource/Font/NimbusMonoPS-Regular… 4559292 3175740 2247716 886073 3 done.

Page 3

Loading NimbusMonoPS-Bold font from %rom%Resource/Font/NimbusMonoPS-Bold… 4869340 3529503 2394484 1023982 3 done.

Page 4

As you can see, when I pass 1500_13578937.pdf (a four-page PDF file) as input to Ghostscript, four .tif files are created:-

-rw-r--r-- 1 root root 47365 Oct 12 06:24 scan_1.tif

-rw-r--r-- 1 root root 11702 Oct 12 06:24 scan_2.tif

-rw-r--r-- 1 root root 15917 Oct 12 06:24 scan_3.tif

-rw-r--r-- 1 root root  7199 Oct 12 06:24 scan_4.tif

In this example, -sOutputFile sets the name pattern for the output files. The %d in the pattern is replaced by the page number, so each page is written to its own sequentially numbered file. 1500_13578937.pdf is my source, a multi-page PDF file.
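The %d placeholder follows printf-style formatting, so the names a pattern will produce can be previewed with the shell's own printf before running Ghostscript. The sketch below is only an illustration: preview_names is a hypothetical helper, and the zero-padded pattern scan_%03d.tif mentioned in the comment is an assumption that does not appear in the command above.

```shell
#!/bin/sh
# preview_names is a hypothetical helper, not part of Ghostscript.
# It prints the file names a %d-style pattern yields for pages 1..N.
preview_names() {
    pattern=$1   # e.g. scan_%d.tif, or scan_%03d.tif for zero padding
    pages=$2     # number of pages to preview
    n=1
    while [ "$n" -le "$pages" ]; do
        # The pattern is deliberately used as the printf format string.
        printf "$pattern\n" "$n"
        n=$((n + 1))
    done
}

# Preview the four names produced for the four-page PDF in this post.
preview_names 'scan_%d.tif' 4
```

A zero-padded pattern keeps files in page order when they are listed alphabetically, which starts to matter once a PDF has ten or more pages.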

Now let us perform OCR to convert the image files to text:-

Each file must be converted to text independently. This can be done with the following command:

tesseract scan_1.tif scan_1

Tesseract will automatically append .txt to the file name, so the result of the above command would be a file named scan_1.txt containing the text from scan_1.tif.
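Because Tesseract appends .txt to whatever base name it is given, passing the image name itself as the base would yield scan_1.tif.txt. A minimal sketch of deriving the base name by stripping the extension with shell parameter expansion follows; ocr_base is a hypothetical helper name introduced only for this illustration.

```shell
#!/bin/sh
# ocr_base is a hypothetical helper: it prints the input name with its
# last extension removed, suitable as tesseract's output base name.
ocr_base() {
    printf '%s\n' "${1%.*}"
}

# Usage sketch (requires tesseract on PATH, so it is left commented out):
#   for f in scan_*.tif; do tesseract "$f" "$(ocr_base "$f")"; done

ocr_base scan_1.tif
```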

Combine the text files into one:-

We can combine them all into one result file by doing the following:

$ cat scan_1.txt > result.txt 

$ cat scan_2.txt >> result.txt 

$ cat scan_3.txt >> result.txt

$ cat scan_4.txt >> result.txt
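Typing one line per page does not scale, and a plain glob such as scan_*.txt sorts scan_10.txt before scan_2.txt. One possible approach, sketched here under the assumption that the pages are numbered from 1 without gaps, is to walk the numbers in order. combine_pages is a hypothetical helper name.

```shell
#!/bin/sh
# combine_pages is a hypothetical helper: it concatenates PREFIX_1.txt,
# PREFIX_2.txt, ... into OUT in numeric order, stopping at the first
# missing number.
combine_pages() {
    prefix=$1
    out=$2
    : > "$out"    # create the output file, or truncate an existing one
    n=1
    while [ -e "${prefix}_${n}.txt" ]; do
        cat "${prefix}_${n}.txt" >> "$out"
        n=$((n + 1))
    done
}

# Usage sketch: combine_pages scan result.txt
```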

Combine it all together into a script:-

Below is a simple script to automate this task. The script takes the input file as its command-line argument and produces result.txt, overwriting any existing result.txt file.

[root@example SampleFiles]# cat combine_pdf_run.sh

# $1 is the first argument: the input pdf

# remove any previous result.txt (-f: no error if it does not exist)

rm -f result.txt

# convert the pdf to a group of tiffs, one per page

gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=scan_%d.tif "$1"

i=1

# process scan_1.tif, scan_2.tif, ... until the next numbered file is missing

while [ -e "scan_$i.tif" ]

do

tesseract "scan_$i.tif" "scan_$i"

# add the text to the result.txt file

cat "scan_$i.txt" >> result.txt

# clean up the per-page intermediate files

rm "scan_$i.txt" "scan_$i.tif"

i=$(( i + 1 ))

done

This script generates a single .txt file containing all of the text from the PDF.
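Since the script depends on its one command-line argument, it may be worth checking that argument before Ghostscript runs. The following is a sketch rather than part of the original script; check_input is a hypothetical helper and its exit codes are arbitrary choices.

```shell
#!/bin/sh
# check_input is a hypothetical helper: it returns 0 only when exactly
# one argument is given and that argument names a readable file.
check_input() {
    if [ "$#" -ne 1 ]; then
        echo "usage: combine_pdf_run.sh input.pdf" >&2
        return 2
    fi
    if [ ! -r "$1" ]; then
        echo "cannot read: $1" >&2
        return 1
    fi
    return 0
}

# Usage sketch, placed near the top of the script:
#   check_input "$@" || exit $?
```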

You can see a running demo of the above code at

http://ammozon.co.in/gif/ConvertUsingOCR.gif
