Many Natural Language Processing (NLP) tasks start with unstructured data in the form of scanned PDF files. I used this simple Python class, which leverages the Tesseract Optical Character Recognition (OCR) engine, to convert a folder of scanned PDF files into corresponding text files.
Let’s first install the dependencies. We will use apt to install Tesseract and ImageMagick, and pkg to install Ghostscript:
apt-get update # refreshes the package lists (it does not upgrade apt itself)
apt-get install tesseract
apt-get install tesseract-static
tesseract --version…
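Before running any OCR, it can help to confirm from Python that the installed tools are actually reachable. The following is a minimal sketch, not part of the original class: the function names are hypothetical, and the executable names `convert` (ImageMagick) and `gs` (Ghostscript) are assumptions that may differ on your system (newer ImageMagick releases, for example, ship a `magick` binary).

```python
import shutil
import subprocess

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ("tesseract", "convert", "gs")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


def tesseract_version():
    """Return the first line printed by `tesseract --version`, or None if unavailable."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Depending on the release, the version text may arrive on stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All tools found;", tesseract_version())
```

Running the script prints either the list of tools that still need installing or the version line reported by Tesseract.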