Scanned PDF to OCR (Text-searchable PDF)


Are you looking for a way to convert a scanned PDF into a text-searchable PDF? This article explains how to convert a scanned PDF to OCR (a text-searchable PDF) using C# and a few add-on tools.




Introduction


We often need to scan documents and reuse them. A scanned page, however, is stored as a picture, so we cannot copy or search its content, which makes it of little use. We need a technique that converts the scanned image into a text-searchable document whose content can be copied easily.

In such cases we need OCR to convert the image into text. Optical Character Recognition, or OCR, is a technology that enables you to convert different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera, into editable and searchable data.




Are you looking for code that will convert a scanned PDF to OCR? This article will help you accomplish that task.


Let's start cooking


To create a tool that converts a scanned PDF to OCR, we need the following things.

Things we need to collect


1. Ghostscript
2. iTextSharp
3. tesseract-ocr
4. C#/ASP.NET (.NET Framework 4 and above), Visual Studio


Ghostscript : An interpreter for the PostScript language and for PDF. Ghostscript consists of a PostScript interpreter layer and a graphics library. The graphics library is sometimes, confusingly, also referred to simply as Ghostscript, and people sometimes say Ghostscript when they really mean GhostPDL. Ghostscript can be downloaded from here : http://ghostscript.com/download/gsdnld.html

iTextSharp : The .NET port of iText, a PDF library that allows you to create, adapt, inspect and maintain documents in the Portable Document Format (PDF). It can be downloaded from here : http://sourceforge.net/projects/itextsharp/

Tesseract : Tesseract is probably the most accurate open source OCR engine available. Combined with an image processing library, it can read a wide variety of image formats and convert them to text in over 60 languages. You can download it from here : http://code.google.com/p/tesseract-ocr/

With the help of all the above components we can convert a scanned PDF into a text-searchable PDF.

Digging the code


The code flows in the following direction:

Input scanned PDF -> get an image of each page of the scanned PDF using Ghostscript -> run Tesseract in hOCR mode on each extracted image to create a .hocr file -> save the output file as HTML -> convert the HTML to PDF using the iTextSharp PDF writer

First we take the scanned PDF as input and run Ghostscript on it to extract the scanned page images from the PDF and write each one to a separate file.


See the code snippet below for how to get an image from the scanned file (page by page).


// Requires: using System.Diagnostics; using System.IO;
public string ConvertPDFToBitmap(string PDF, int StartPageNum, int EndPageNum)
{
    string OutPut = getOutPutFileName(".bmp");
    PDF = "\"" + PDF + "\""; // quote the path so that spaces do not split the argument
    // Ghostscript arguments: render the requested pages at 300 dpi as 24-bit BMP
    string command = String.Concat("-dNOPAUSE -q -r300 -sDEVICE=bmp16m -dBATCH -dFirstPage=", StartPageNum.ToString(), " -dLastPage=", EndPageNum.ToString(), " -sOutputFile=" + OutPut + " " + PDF + " -c quit");
    string os = "C:\\Program files\\gs\\gs9.14\\bin\\gswin32c.exe"; // change your Ghostscript installation path here
    ProcessStartInfo s = new ProcessStartInfo(os, command);
    // Output is not redirected: a redirected stream that is never read can block the child process.
    s.CreateNoWindow = true;
    s.UseShellExecute = false;
    using (Process p = new Process())
    {
        p.StartInfo = s;
        p.Start();
        p.WaitForExit();
    }
    return new FileInfo(OutPut.Replace('"', ' ').Trim()).FullName;
}
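
As a variation on the call above, Ghostscript can also write one image per page in a single run when the output file name contains a %d pattern (for example page-%03d.bmp). The following is only a minimal sketch of building such an argument string. GhostscriptArgs and Build are hypothetical names and are not part of the attached source; the switches are the same ones used in ConvertPDFToBitmap.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of the attached source): builds the
// Ghostscript argument string for rendering a page range to BMP files.
public static class GhostscriptArgs
{
    public static string Build(string pdfPath, string outputPattern,
                               int firstPage, int lastPage, int dpi)
    {
        if (firstPage < 1 || lastPage < firstPage)
            throw new ArgumentOutOfRangeException("firstPage",
                "Page range must satisfy 1 <= firstPage <= lastPage.");
        if (dpi <= 0)
            throw new ArgumentOutOfRangeException("dpi", "Resolution must be positive.");

        // Both paths are quoted so that spaces do not split the arguments.
        return string.Format(CultureInfo.InvariantCulture,
            "-dNOPAUSE -dBATCH -q -r{0} -sDEVICE=bmp16m -dFirstPage={1} -dLastPage={2} " +
            "-sOutputFile=\"{3}\" \"{4}\"",
            dpi, firstPage, lastPage, outputPattern, pdfPath);
    }
}
```

The returned string can be passed to ProcessStartInfo in the same way as the command variable above, for example GhostscriptArgs.Build(@"C:\scans\in.pdf", @"C:\scans\page-%03d.bmp", 1, 3, 300).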


We can convert the image to a .hocr.html file using Tesseract or Cuneiform. Here we have used Tesseract to create the .hocr.html file. See the code snippet below for how to convert an image to a .hocr.html file.


// Requires: using System.Diagnostics; using System.IO;
public static string CreateHOCR(OcrMode Mode, string Language, string imagePath)
{
    // ChangeExtension only touches the extension; string.Replace would also alter matching text elsewhere in the path.
    string outputFile = Path.ChangeExtension(imagePath, ".hocr");
    string inputFile = string.Concat('"', imagePath, '"');
    string processName = Mode == OcrMode.Tesseract || Mode == OcrMode.TesseractDigitsOnly ? "tesseract" : Mode == OcrMode.Cuneiform ? "cuneiform" : "ocropus-hocr";

    if (Mode == OcrMode.Tesseract)
    {
        string oArg = '"' + outputFile + '"';
        // Command line: tesseract <image> <output base> -l <language> -psm 1 hocr
        string commandArgs = String.Concat(inputFile, " ", oArg, " -l " + Language + " -psm 1 hocr ");
        ProcessStartInfo s = new ProcessStartInfo(processName, commandArgs);
        s.WindowStyle = ProcessWindowStyle.Hidden;
        s.CreateNoWindow = true;
        s.UseShellExecute = true;
        // With UseShellExecute = true, WorkingDirectory is where the executable is looked up.
        s.WorkingDirectory = @"C:\Program Files\Tesseract-OCR\"; // change your Tesseract installation path here
        using (Process p = new Process())
        {
            p.StartInfo = s;
            p.Start();
            p.WaitForExit();
        }
    }

    // This assumes Tesseract writes its hOCR output to <output base>.html.
    return outputFile + ".html";
}
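
The hOCR output is HTML in which each recognised element carries its position in a title attribute of the form "bbox x0 y0 x1 y1", given in pixels and measured from the top-left corner of the image. The following is a minimal sketch of reading those four numbers. HocrBBox and Parse are hypothetical names and are not taken from the attached source, which may parse the file differently.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the attached source): extracts the
// bounding box from the title attribute of an hOCR element.
public static class HocrBBox
{
    private static readonly Regex BBoxPattern =
        new Regex(@"bbox\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)");

    // Returns { left, top, right, bottom } in pixels,
    // or null when the title contains no bbox entry.
    public static int[] Parse(string title)
    {
        if (string.IsNullOrEmpty(title))
            return null;

        Match m = BBoxPattern.Match(title);
        if (!m.Success)
            return null;

        int[] box = new int[4];
        for (int i = 0; i < 4; i++)
            box[i] = int.Parse(m.Groups[i + 1].Value, CultureInfo.InvariantCulture);
        return box;
    }
}
```

Because the pattern is searched for anywhere in the string, a title that carries further properties after the box, such as "bbox 36 92 618 184; x_wconf 95", is handled in the same way as a bare "bbox 36 92 618 184".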



Finally, we need to convert the .hocr.html file back to PDF (which is our final output). We use the iTextSharp PDF writer to write the content of the .hocr.html file to the PDF.
See the snippet below for how to write the PDF file from the .hocr.html file.


// Requires: using System; using System.IO; using iTextSharp.text.pdf;
private void WriteUnderlayContent(hPage page)
{
    // The content layer and the font are the same for every word, so set them up once.
    PdfContentByte cb = writer.DirectContentUnder;
    BaseFont base_font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.WINANSI, false);
    if (!string.IsNullOrEmpty(PDFSettings.FontName))
    {
        // Use the configured font from the Windows Fonts folder and embed it.
        var fontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), PDFSettings.FontName);
        base_font = BaseFont.CreateFont(fontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
    }

    foreach (hParagraph para in page.Paragraphs)
    {
        foreach (hLine line in para.Lines)
        {
            if (PDFSettings.WriteTextMode == WriteTextMode.Word)
            {
                line.AlignTops();

                foreach (hWord c in line.Words)
                {
                    c.CleanText();
                    BBox b = BBox.ConvertBBoxToPoints(c.BBox, PDFSettings.Dpi);

                    // Skip boxes taller than 28 points.
                    if (b.Height > 28)
                        continue;

                    cb.BeginText();
                    cb.SetFontAndSize(base_font, b.Height > 0 ? b.Height : 2);
                    // hOCR measures from the top of the page, PDF from the bottom, so flip the y coordinate.
                    cb.SetTextMatrix(b.Left, doc.PageSize.Height - b.Top - b.Height + 2);
                    cb.SetWordSpacing(PdfWriter.SPACE);
                    cb.ShowText(c.Text.Trim() + " ");
                    cb.EndText();
                }
            }
        }
    }
}
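
The snippet above relies on BBox.ConvertBBoxToPoints, which is not shown in this article. The following is a minimal sketch of the arithmetic such a conversion usually involves, assuming it performs the standard pixel-to-point scaling (a PDF point is 1/72 inch). PointConverter and its methods are hypothetical names and are not part of the attached source.

```csharp
using System;

// Hypothetical helper (not part of the attached source): converts pixel
// measurements from the scanned image into PDF points.
public static class PointConverter
{
    public const float PointsPerInch = 72f;

    // pixels / dpi gives inches; inches * 72 gives points.
    public static float PixelsToPoints(float pixels, int dpi)
    {
        if (dpi <= 0)
            throw new ArgumentOutOfRangeException("dpi", "Resolution must be positive.");
        return pixels * PointsPerInch / dpi;
    }

    // hOCR boxes are measured from the top of the page, while PDF
    // coordinates start at the bottom, so the y value has to be flipped.
    // All arguments are already in points.
    public static float ToPdfY(float pageHeight, float top, float height)
    {
        return pageHeight - top - height;
    }
}
```

For example, a page of 2480 x 3508 pixels scanned at 300 dpi comes out as 595.2 x 841.92 points, which is approximately A4.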



How to run





Prerequisite : Download Ghostscript and Tesseract from the given paths and install them, then run the EXE or the source code.

1. Double-click the output EXE to open the UI.
2. Click Browse and select the input folder (a folder containing the scanned files).
3. Select the 'Override the Files' checkbox if you want to replace the original source files (your source PDF files will be replaced by the output OCR files).
4. Click the 'Convert to OCR' button to start the process.
5. Click Cancel to terminate the process.
6. The tool creates a 'Conversion Report.html' file as a summary report.
7. You can check the output files in the 'Ocr_ScanFile' directory, in the same location as the EXE.

Special thanks and references


https://hocrtopdf.codeplex.com/
http://code.google.com/p/tesseract-ocr/
http://sourceforge.net/projects/itextsharp/
http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe
http://htmlagilitypack.codeplex.com/
http://itextpdf.com/
http://www.ghostscript.com/download/gsdnld.html

I have attached source code and EXE with this article
Click here to Download PDF_OCR_EXE
Click here to download Source code

Summing Up


With the help of Ghostscript, Tesseract and iTextSharp, we can convert a scanned PDF into a text-searchable PDF. A lot more can be done with the iTextSharp DLLs, which we will see in upcoming articles.

Suggestions and queries are always welcome.

Thanks
koolprasad2003


Comments

Author: AcRaB 17 Apr 2015 Member Level: Bronze   Points : 0

could you upload the source again the link of source is broken

Author: Guna 28 Apr 2015 Member Level: Bronze   Points : 2

Hi

I would like to try with your source code but can't able to download its says File Not found. Tried with exe but my output pdf is not searchable pdf. Can you provide the link to download the source.
Kindly do the needful.
My email: mail2vguna@gmail.com
Skype: talk2guanv

Looking forward your reply.

Thanks

Author: Prasad kulkarni 29 Apr 2015 Member Level: Gold   Points : 0

Dear all
I have changed the EXE and Source code download path, now you can easily download it, Please try again and let me know if it works for you

Author: srirama 04 May 2015 Member Level: Gold   Points : 0

Dear Prasad ,



http://www.dotnetspider.com/resources/45646-How-to-read-the-Content-from-Image-Scanned-Copy-Using-Tessaract-OCR.aspx

Author: AcRaB 27 May 2015 Member Level: Bronze   Points : 3

thank you sir for sharing the source but i try it on pdf and the output pdf is broken , i try both the source and compiled exe the result is the same

i try many pdf files the result is the same i use even your pdf file which is in
@path hocrtopdf-SourceCode\hocrtopdf-SourceCode\hocrtopdf\bin\Debug\A4_portrait.pdf

and i upload the result for you

https://drive.google.com/file/d/0Bz3i4PbtxMQ0YW9zRW5VaUVJOUE/view?usp=sharing

Author: srirama 27 May 2015 Member Level: Gold   Points : 0

Had you tried this below Link with real time images...

http://www.dotnetspider.com/resources/45646-How-to-read-the-Content-from-Image-Scanned-Copy-Using-Tessaract-OCR.aspx

Author: AcRaB 28 May 2015 Member Level: Bronze   Points : 0

@srirama

i dont want to ocr images i already succeed in that i want pdf as input


