In this blog post, I will walk you through deploying a Streamlit app that uses OCR (Optical Character Recognition) on Heroku.
Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes you can build and deploy powerful data apps. Find out more about Streamlit at https://docs.streamlit.io/en/stable/
Heroku, on the other hand, is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud.
Context
To give a little context on why I used these services: I was building a project that involved translating non-editable PDFs written in Dutch to English and performing sentiment analysis on the text.
Non-editable PDFs are essentially images that have been saved as PDFs, so the text in them cannot be selected or copy-pasted. Passing these documents through Google Translate does not help either, because there is no text layer to translate. The only way to extract text from such PDFs is to perform Optical Character Recognition (OCR).
This is an example of a non-editable PDF in Dutch, from a Dutch newspaper, https://github.com/Namyalg/Dutch-To-English/blob/main/newpaper.pdf
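To make the OCR step concrete, here is a minimal sketch of how text could be pulled out of a scanned PDF like the one above. It assumes the pdf2image and pytesseract packages, along with the Tesseract and Poppler binaries and the Dutch language data ("nld"), are installed. The function names are hypothetical and this is not necessarily how the app in the repository does it.

```python
def ocr_pages(pages, ocr, lang="nld"):
    """Run an OCR callable over each page and join the results.

    `pages` is any iterable of page images, and `ocr` is a callable
    such as pytesseract.image_to_string that accepts a `lang` keyword.
    """
    return "\n\n".join(ocr(page, lang=lang).strip() for page in pages)


def extract_text_from_pdf(pdf_path, lang="nld"):
    """Convert every page of a scanned PDF to an image and OCR it.

    The imports are local so that this sketch can be loaded even when
    the OCR dependencies are not installed.
    """
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary installed

    pages = convert_from_path(pdf_path)  # one PIL image per page
    return ocr_pages(pages, pytesseract.image_to_string, lang=lang)
```

Calling extract_text_from_pdf("newpaper.pdf") on a local copy of the file would return the recognised Dutch text as a single string, one block per page.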
The application was developed using Streamlit and deployed on Heroku.
Hosting and Deployment
This is a link to the repository which has been hosted
Support for all languages can be found here: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
The languages supported by the google-trans-new API are listed at: https://github.com/lushan88a/google_trans_new/blob/main/constant.py
Using the Google Translation API
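As a rough illustration, the translation step with google-trans-new might look like the sketch below. It assumes the package exposes google_translator with a translate(text, lang_src=..., lang_tgt=...) method, as shown in its README. The helper names and the 4000-character chunk size are my own assumptions, chosen to keep each request comfortably small.

```python
def split_into_chunks(text, max_chars=4000):
    """Group lines of text into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for line in text.split("\n"):
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks


def translate_dutch_to_english(text, translate=None):
    """Translate Dutch text to English, one chunk at a time.

    `translate` defaults to google_translator().translate; passing a
    different callable makes the function usable without network access.
    """
    if translate is None:
        from google_trans_new import google_translator  # assumed API

        translate = google_translator().translate
    translated = [
        translate(chunk, lang_src="nl", lang_tgt="en")
        for chunk in split_into_chunks(text)
    ]
    return "\n".join(translated)
```

The text returned by the OCR step can be passed straight into translate_dutch_to_english, and the result can then be handed to the sentiment analysis step.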
The most important step in getting OCR to work is ensuring that the right buildpacks are included in the Heroku setup.
The ones that I have used (after a lot of research) are:
heroku/python
In the Settings tab of the Heroku application, there is an option to add buildpacks.
The Python buildpack can be chosen from the list there; for the others, the URLs must be typed in and saved.