May 9, 2021

Parsing Amazon Textract responses for various document layouts

I have a question regarding parsing Amazon Textract responses.

I'm going to receive a variety of invoices from many different companies. An invoice will typically contain a table with a few pieces of information for each line item, such as item description, quantity, unit price, and amount. However, these columns could be named in different ways and could appear in any order.

My current idea is to keep an array of words for each column that I want to identify and maintain that array over time as I see new examples. When processing the response, I would search the JSON for each of the words in the array and identify the column that way.
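
To make the idea concrete, here's a rough sketch. It assumes the header cell text has already been pulled out of the Textract response into a plain list of strings, so it doesn't touch the response JSON at all. The names `COLUMN_SYNONYMS` and `map_headers`, and the synonym lists themselves, are placeholders made up for illustration; which word belongs to which column is a guess.

```python
# Hypothetical synonym lists, one per column I care about.
# The groupings are guesses and would grow as new invoices show up.
COLUMN_SYNONYMS = {
    "description": ["description", "item", "service", "services"],
    "quantity": ["qty", "quantity", "usage"],
    "unit_price": ["unit price", "price", "charges"],
    "amount": ["amount", "total"],
}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def map_headers(headers):
    """Return {canonical_name: column_index} for headers matching a known synonym.

    Headers that match nothing are skipped. If two headers map to the same
    canonical name, the first one wins.
    """
    lookup = {
        normalize(synonym): canonical
        for canonical, synonyms in COLUMN_SYNONYMS.items()
        for synonym in synonyms
    }
    mapping = {}
    for index, header in enumerate(headers):
        canonical = lookup.get(normalize(header))
        if canonical is not None and canonical not in mapping:
            mapping[canonical] = index
    return mapping


print(map_headers(["Date", "Description", "Qty", "Price", "Total"]))
```

With the header row above, "Date" is ignored and the other four columns are mapped by position. A header row like "Item, Service, ..." hits two synonyms for the same column, and the sketch just keeps the first.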

The downside is that I'd have to maintain this array over time, which seems brittle.

Here are a few examples of invoice table columns to illustrate what I'm talking about:

Date, Description, Qty, Price, Total

Item, Service, Charges, Usage, Total

Services, qty, unit price, amount

I'm looking for recommendations on how to handle this. Am I on the right track, or is there some other method I'm not thinking of? This is all being done in Python.

submitted by /u/Hyphen_81