Technology
Published 10/29/2021

Read scanned documents (images/PDFs) with AWS Textract in a Serverless Express Application

4 Minutes

Diya Mahendru


In today’s day and age, data has become a crucial part of our lives. Every day we encounter situations where we need to extract information from images and PDFs and work with that information in the hope of making our lives simpler.

Amazon Web Services (AWS) has a service that does it all with minimal effort: AWS Textract. The service is based on machine learning, and more precisely on OCR models, to extract all kinds of information from a scanned document. Textract can extract everything from printed text to handwriting, and even form data and tables.

In this blog, we’ll see just how to do that with Node.js in a Serverless Application.

The prerequisites for this project are:

  • AWS account with the required permissions granted.
  • Serverless Express project set up on your local machine.
  • Required npm packages installed, such as aws-sdk.
  • AWS configured with Serverless. The tutorial for this is out of scope for this blog and will be covered in our next blog.

Textract provides functionality for both images (JPEG or PNG) and PDFs. We’ll start with images first.

Let’s suppose we pick a restaurant menu image and wish to extract all the data from it line-wise.

The pseudo code for text extraction from an image is as follows:

  1. Get the image buffer from the local directory, or upload the image to the S3 bucket and get that S3 object’s buffer with S3.getObject().
  2. Pass the image buffer to Textract.detectDocumentText().
  3. Parse the response to get meaningful data from it.


We start by creating the AWS instance, and then the S3 and Textract instances. While creating the S3 instance, you can provide your AWS account access key, secret access key, and the bucket name where the image is stored.
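A minimal sketch of that setup might look like the following; the region and bucket name are placeholder assumptions, and in practice credentials should come from the environment or the deployment role rather than being hard-coded:

```javascript
// Setup sketch: create the AWS, S3, and Textract instances.
// Region and bucket name below are placeholder assumptions.
const AWS = require("aws-sdk");

// Credentials are normally picked up from the environment or the
// serverless deployment role; avoid hard-coding keys in production.
AWS.config.update({ region: "us-east-1" });

const s3 = new AWS.S3();
const textract = new AWS.Textract();

const BUCKET_NAME = "my-textract-bucket"; // hypothetical bucket name

module.exports = { s3, textract, BUCKET_NAME };
```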


Then, inside the handler lambda function, we call textract.detectDocumentText() and pass the S3 details of the image, such as the bucket name and the file name. Remember that the file name should include the complete file path (including the folder names). Note that textract.detectDocumentText() returns a promise, so we should use async/await as well.
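A sketch of that handler logic, where `textract` is the AWS.Textract instance created earlier and the bucket and file names are placeholder assumptions:

```javascript
// Build the request for detectDocumentText. Name must be the full
// key inside the bucket, folder names included.
function buildDetectParams(bucket, key) {
  return { Document: { S3Object: { Bucket: bucket, Name: key } } };
}

// Handler sketch: `textract` is an AWS.Textract instance.
// Bucket and key here are hypothetical values for illustration.
async function detectMenuText(textract) {
  const params = buildDetectParams("my-textract-bucket", "menus/menu.jpg");
  // detectDocumentText returns an AWS.Request; .promise() lets us await it
  return textract.detectDocumentText(params).promise();
}
```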


Finally, the result from Textract is parsed so that we can read the data line-wise.
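The parsing step can be sketched as a small pure helper; the response shape (a `Blocks` array whose entries carry `BlockType` and `Text`) is what `detectDocumentText` returns:

```javascript
// Pull line-level text out of a Textract response: each LINE block
// carries one detected line; PAGE and WORD blocks are skipped.
function getLines(textractResponse) {
  return (textractResponse.Blocks || [])
    .filter((block) => block.BlockType === "LINE")
    .map((block) => block.Text);
}
```

Feeding the object returned by detectDocumentText through this helper yields the menu entries line by line.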

Once you deploy this code, the output can be viewed in AWS CloudWatch for that lambda function.


So far we have been successful in extracting text data from the menu image that we provided.

Now moving on to PDFs. For this blog, we’ll be using a one-page PDF.

For the detection of text from PDFs, there are two lambda functions that need to be deployed.

The first lambda function starts the document analysis. Once the analysis is complete, the result, i.e. the JSON file of the response, is uploaded to the S3 bucket at the path provided, and a notification is sent to the next lambda via the Simple Notification Service (SNS).

The second lambda retrieves the Textract response uploaded to the S3 bucket, reads it, and parses it to get meaningful data. For parsing, there are a couple of npm packages that can easily parse the response and provide form data, table data, and raw data.

The pseudo code is as follows:

  1. In the first lambda, initialize the textract parameters which include the S3 object location, the S3 location where the response will be saved, the notification channel which includes the SNS topic ARN and feature types that need to be extracted.
  2. Call textract.startDocumentAnalysis() and await the returned promise.
  3. In the second lambda, extract the document location and Job Id from the event.
  4. Get the JSON file data from the S3 bucket.
  5. Parse the JSON data and generate the text data.

Now let’s start building.

Again we first import the necessary packages and create instances.

Then we create the Textract parameters. DocumentLocation should contain the location of the PDF that needs to be passed through Textract. FeatureTypes contains an array of the types of data that need to be detected. NotificationChannel contains the SNS topic ARN as well as a role ARN that gives the lambda access to SNS. And lastly, OutputConfig contains the details of the S3 bucket where the final JSON files will be uploaded. One thing to note is that the prefix provided here should not end with a “/”, because Textract adds it automatically while processing.
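Put together, the parameter object might look like the sketch below; every ARN, bucket, and key here is a placeholder assumption. The resulting object is what gets passed to `textract.startDocumentAnalysis(params).promise()`:

```javascript
// Sketch of the startDocumentAnalysis parameters described above.
// All ARNs, bucket names, and keys are hypothetical placeholders.
function buildAnalysisParams() {
  return {
    DocumentLocation: {
      S3Object: { Bucket: "my-textract-bucket", Name: "docs/sample.pdf" },
    },
    // The kinds of data to detect in addition to raw text.
    FeatureTypes: ["TABLES", "FORMS"],
    NotificationChannel: {
      SNSTopicArn: "arn:aws:sns:us-east-1:123456789012:textract-done",
      // Role that lets Textract publish to the SNS topic.
      RoleArn: "arn:aws:iam::123456789012:role/textract-sns-role",
    },
    OutputConfig: {
      S3Bucket: "my-textract-bucket",
      // Note: no trailing slash -- Textract appends it itself.
      S3Prefix: "textract-output",
    },
  };
}
```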

By default, Textract saves the files in a folder named after the Job Id. Inside that folder there is a JSON file named “1”. That file contains the complete Textract response and should be used while parsing.

In the second lambda, we start by extracting the information from the SNS event.

Then, if the status equals SUCCEEDED, the buffer of the JSON response is obtained from the S3 bucket. This buffer is then parsed to read the data LINE-wise.
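The extraction and parsing logic of the second lambda can be sketched with small helpers; the field names follow Textract’s SNS completion notification, and the S3 fetch takes the `s3` client as an argument so the helpers stay self-contained:

```javascript
// Extract the job details from the SNS event: Textract puts its
// completion notification, as a JSON string, in the Message field.
function parseSnsNotification(event) {
  const message = JSON.parse(event.Records[0].Sns.Message);
  return {
    jobId: message.JobId,
    status: message.Status, // e.g. "SUCCEEDED"
    documentLocation: message.DocumentLocation,
  };
}

// Fetch and parse the JSON output Textract wrote to S3.
// `s3` is an AWS.S3 instance; the key follows Textract's default
// layout of <prefix>/<JobId>/1 described above.
async function fetchTextractOutput(s3, bucket, prefix, jobId) {
  const obj = await s3
    .getObject({ Bucket: bucket, Key: `${prefix}/${jobId}/1` })
    .promise();
  return JSON.parse(obj.Body.toString("utf8"));
}
```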

One thing to note here is that in order to run the second lambda, an SNS trigger needs to be set up on it in the lambda console.

In this example, we used a PDF that contained only text. If your PDFs contain table data or form data, that can be extracted from the Blocks property using a couple of Textract response parsers available on npm. They can be easily installed with their npm install command.

That’s all about text detection from images and PDFs using AWS Textract in a Serverless Express Application. Thank you for reading.
