How to Read the Text Content of an Image or Scanned PDF Using Tesseract OCR


In my current project I had a requirement to read the text content of a scanned PDF page saved as a JPEG image. I searched Google for something that could read the text of a scanned image, but my efforts were in vain. My client then suggested the Tesseract OCR DLL, which can fetch the text content of a scanned image. That gave me the clue I needed: use tessnet2_32.dll. Once I had that DLL, fetching the text content of the image was easy.

In this article I will show how I used it in my project to extract the text of a scanned PDF page that is in JPEG format.


First, I added a reference to tessnet2_32.dll in my project.

[Figure: the referenced DLL]

After adding the DLL successfully, I used the Tesseract class in my aspx.cs code-behind. The image is loaded with the Bitmap class, whose constructor takes the path to the JPEG of the scanned PDF page.


// Place these using directives at the top of the code-behind file.
using System.Drawing;
using tessnet2;

// Load the JPEG of the scanned PDF page.
var image = new Bitmap(@"C:\Users\enterpi\Desktop\adobe4.jpg");
var ocr = new Tesseract();
// Restrict recognition to letters and digits.
ocr.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
// C:\OCRTest\tessdata must contain the language package; without it Init crashes and the app breaks.
ocr.Init(@"C:\OCRTest\tessdata", "eng", false);
string results = string.Empty;
// Passing Rectangle.Empty runs OCR on the whole image.
var result = ocr.DoOCR(image, Rectangle.Empty);
foreach (Word word in result)
{
    // Join the recognized words with spaces.
    results = results + " " + word.Text;
}



The SetVariable function of the Tesseract object configures what the engine should recognize. For example, if you want to FETCH ONLY DIGITS, restrict the character whitelist to digits, and the SetVariable call looks like this:


ocr.SetVariable("tessedit_char_whitelist", "0123456789");

The Init function initializes the DLL. Its first argument is the folder that holds the language package; the package must be available at that path, otherwise your app will crash and yield no result. The second argument is the language: pass "eng" if you need to fetch English content, or "fra" if you need French. In every case the matching package must be available at the path you give.

[Figure: the English language package]
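
Because a missing language package makes Init crash, it can help to check the folder before calling it. The following is only a minimal sketch under stated assumptions: TessdataCheck and HasLanguageData are hypothetical names, not part of tessnet2, and the check assumes the package files are named after the language code (for example, files beginning with "eng."). It does not verify that the package is complete.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of tessnet2.
public static class TessdataCheck
{
    // Returns true when tessdataDir exists and holds at least one file
    // whose name starts with "<lang>." (assumed naming of package files).
    public static bool HasLanguageData(string tessdataDir, string lang)
    {
        if (string.IsNullOrEmpty(tessdataDir) || string.IsNullOrEmpty(lang))
        {
            return false;
        }
        if (!Directory.Exists(tessdataDir))
        {
            return false;
        }
        // Only proves that something is there, not that the package is complete.
        return Directory.GetFiles(tessdataDir, lang + ".*").Length > 0;
    }
}
```

One possible use is to call TessdataCheck.HasLanguageData(@"C:\OCRTest\tessdata", "eng") before ocr.Init and show a clear error message when it returns false, instead of letting the app crash.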

The following config files should also exist in your AppData folder, which is a hidden folder on your computer, so type AppData into the Explorer address bar to open it.

[Figure: package list]

Now for the output. This is the scanned image of the PDF used as input:

[Figure: scanned image]

Now look at the output below:

[Figure: result pdf]


Article by srirama