Convert PDF files to text files using PyTesseract

Cpak
2 min read · Aug 11, 2020

Many Natural Language Processing (NLP) tasks start with unstructured data in the form of scanned PDF files. I used this simple Python class, which leverages the Tesseract Optical Character Recognition (OCR) engine, to convert a folder of scanned PDF files into one text file per PDF.
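To make the idea concrete, here is a minimal sketch of what such a folder-to-text converter could look like. It is not necessarily the class I used: the names PdfFolderToText and ocr_pdf_with_tesseract are hypothetical, and the OCR step assumes the wand, Pillow, and pytesseract Python packages are installed on top of the system dependencies.

```python
from pathlib import Path
from typing import Callable, Optional


def ocr_pdf_with_tesseract(pdf_path: Path, resolution: int = 300) -> str:
    """OCR every page of one PDF (assumes wand, Pillow and pytesseract)."""
    # Imported lazily so the folder logic below works without the OCR stack.
    import io

    import pytesseract
    from PIL import Image as PILImage
    from wand.image import Image as WandImage

    page_texts = []
    # Wand rasterizes PDFs through ImageMagick, which delegates to Ghostscript.
    with WandImage(filename=str(pdf_path), resolution=resolution) as pdf:
        for page in pdf.sequence:
            with WandImage(image=page) as page_image:
                png_bytes = page_image.make_blob("png")
            page_texts.append(
                pytesseract.image_to_string(PILImage.open(io.BytesIO(png_bytes)))
            )
    return "\n".join(page_texts)


class PdfFolderToText:
    """Hypothetical helper: writes <name>.txt for every <name>.pdf in a folder."""

    def __init__(self, ocr: Optional[Callable[[Path], str]] = None):
        # Any callable mapping a PDF path to text can replace the default OCR.
        self.ocr = ocr or ocr_pdf_with_tesseract

    def convert(self, input_dir, output_dir):
        input_dir, output_dir = Path(input_dir), Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        written = []
        for pdf_path in sorted(input_dir.glob("*.pdf")):
            text_path = output_dir / (pdf_path.stem + ".txt")
            text_path.write_text(self.ocr(pdf_path), encoding="utf-8")
            written.append(text_path)
        return written
```

Calling PdfFolderToText().convert("scans", "texts") would then write one .txt file per PDF; both folder names are placeholders. Passing the OCR step in as a callable keeps the file handling easy to try out without Tesseract installed.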

Let’s first install the dependencies. We will use apt to install Tesseract and ImageMagick, and pkg to install Ghostscript.

apt-get update          # refreshes the package index
apt-get install tesseract
apt-get install tesseract-static
tesseract --version…
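Before running any OCR, it can help to confirm from Python that the command-line tools are actually on the PATH. The sketch below is only an illustration: missing_tools is a hypothetical helper, and the executable names are assumptions that vary by platform (Ghostscript is commonly gs, while ImageMagick may be convert or magick).

```python
import shutil


def missing_tools(names=("tesseract", "gs", "convert"), which=shutil.which):
    """Return the executables from `names` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if which(name) is None]
```

An empty list from missing_tools() suggests all three tools are reachable; anything it returns still needs to be installed or added to the PATH.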
