Build A Receipt And Invoice Processing Pipeline With Amazon Textract

Republished by Plato

In today’s business landscape, organizations are constantly seeking ways to optimize their financial processes, enhance efficiency, and drive cost savings. One area that holds significant potential for improvement is accounts payable. At a high level, the accounts payable process includes receiving and scanning invoices, extracting the relevant data from the scanned invoices, validation, approval, and archival. The second step (extraction) can be complex. Every invoice and receipt looks different. The labels are imperfect and inconsistent. The most important pieces of information, such as price, vendor name, vendor address, and payment terms, are often not explicitly labeled and have to be interpreted based on context. The traditional approach of using human reviewers to extract the data is time-consuming, error-prone, and not scalable.

In this post, we show how to automate the accounts payable process using Amazon Textract for data extraction. We also provide a reference architecture to build an invoice automation pipeline that enables extraction, verification, archival, and intelligent search.

Solution overview

The following architecture diagram shows the stages of a receipt and invoice processing workflow. It starts with a document capture stage to securely collect and store scanned invoices and receipts. The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due, amount paid, and so on. In the next stage, you use predefined expense rules to determine whether to automatically approve or reject the receipt. Approved and rejected documents go to their respective folders within the Amazon Simple Storage Service (Amazon S3) bucket. For approved documents, you can search all the extracted fields and values using Amazon OpenSearch Service. You can visualize the indexed metadata using OpenSearch Dashboards. Approved documents are also set up to be moved to Amazon S3 Intelligent-Tiering for long-term retention and archival using S3 lifecycle policies.

Solution architecture

The following sections take you through the process of creating the solution.

Prerequisites

To deploy this solution, you must have the following:

An AWS account.
An AWS Cloud9 environment. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal.

To create the AWS Cloud9 environment, provide a name and description. Keep everything else as default. Choose the IDE link on the AWS Cloud9 console to navigate to the IDE. You’re now ready to use the AWS Cloud9 environment.

Deploy the solution

To set up the solution, you use the AWS Cloud Development Kit (AWS CDK) to deploy an AWS CloudFormation stack.

In your AWS Cloud9 IDE terminal, clone the GitHub repository and install the dependencies. Run the following commands to deploy the InvoiceProcessor stack:

git clone https://github.com/aws-samples/amazon-textract-invoice-processor.git
cd amazon-textract-invoice-processor
pip install -r requirements.txt
cdk bootstrap
cdk deploy

The deployment takes around 25 minutes with the default configuration settings from the GitHub repo. Additional output information is also available on the AWS CloudFormation console.

After the AWS CDK deployment is complete, create expense validation rules in an Amazon DynamoDB table. You can use the same AWS Cloud9 terminal to run the following commands:

aws dynamodb execute-statement --statement "INSERT INTO "$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)" VALUE {'ruleId': 1, 'type': 'regex', 'field': 'INVOICE_RECEIPT_ID', 'check': '(?i)[0-9]{3}[a-z]{3}[0-9]{3}$', 'errorTxt': 'Receipt number is not valid. It is of the format: 123ABC456'}"
aws dynamodb execute-statement --statement "INSERT INTO "$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)" VALUE {'ruleId': 2, 'type': 'regex', 'field': 'PO_NUMBER', 'check': '(?i)[a-z0-9]+$', 'errorTxt': 'PO number is not present'}"
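The same rule could also be written from Python instead of the AWS CLI. The following is a minimal sketch, not the repository's code: `to_dynamodb_item` is a hypothetical helper that converts a flat rule into the low-level attribute-value format that the boto3 DynamoDB client expects. The actual write is left as a comment because it needs credentials and the table name from the stack export.

```python
def to_dynamodb_item(rule: dict) -> dict:
    """Convert a flat rule dict into DynamoDB's low-level attribute-value format."""
    item = {}
    for key, value in rule.items():
        if isinstance(value, bool):
            item[key] = {"BOOL": value}
        elif isinstance(value, (int, float)):
            # DynamoDB transports numbers as strings.
            item[key] = {"N": str(value)}
        else:
            item[key] = {"S": str(value)}
    return item


# Rule 1 from the CLI command above.
rule = {
    "ruleId": 1,
    "type": "regex",
    "field": "INVOICE_RECEIPT_ID",
    "check": "(?i)[0-9]{3}[a-z]{3}[0-9]{3}$",
    "errorTxt": "Receipt number is not valid. It is of the format: 123ABC456",
}
item = to_dynamodb_item(rule)

# With credentials configured, the item could be written with boto3, for example:
#   boto3.client("dynamodb").put_item(TableName=rules_table_name, Item=item)
# where rules_table_name (an assumed variable) holds the value of the
# InvoiceProcessorWorkflow-RulesTableName CloudFormation export.
```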

In the S3 bucket that starts with invoiceprocessorworkflow-invoiceprocessorbucketf1-*, create an uploads folder.

In Amazon Cognito, you should already have an existing user pool called OpenSearchResourcesCognitoUserPool*. We use this user pool to create a new user.

On the Amazon Cognito console, navigate to the user pool OpenSearchResourcesCognitoUserPool*.
Create a new Amazon Cognito user.
Provide a user name and password of your choice and note them for later use.
Upload the documents random_invoice1 and random_invoice2 to the S3 uploads folder to start the workflows.

Now let’s dive into each of the document processing steps.

Document capture

Customers handle invoices and receipts in a multitude of formats from different vendors. These documents are received through channels like hard copies, scanned copies uploaded to file storage, or shared storage devices. In the document capture stage, you store all scanned copies of receipts and invoices in highly scalable storage such as an S3 bucket.

Upload sample invoices

Extraction

The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as Vendor Name, Invoice Receipt Date, Order Date, Amount Due/Paid, etc.

AnalyzeExpense is an API dedicated to processing invoice and receipt documents. It is available as both a synchronous and an asynchronous API. The synchronous API allows you to send images in bytes format, and the asynchronous API allows you to send files in JPG, PNG, TIFF, and PDF formats. The AnalyzeExpense API response consists of three distinct sections:

Summary fields – This section includes both normalized keys and the explicitly mentioned keys along with their values. AnalyzeExpense normalizes the keys for contact-related information such as vendor name and vendor address, tax ID-related keys such as tax payer ID, payment-related keys such as amount due and discount, and general keys such as invoice ID, delivery date, and account number. Keys that are not normalized still appear in the summary fields as key-value pairs. For a complete list of supported expense fields, refer to Analyzing Invoices and Receipts.
Line items – This section includes normalized line item keys such as item description, unit price, quantity, and product code.
OCR block – The block contains the raw text extract from the invoice page. The raw text extract can be used for postprocessing and identifying information that is not covered as part of the summary and line item fields.
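To make the response layout more concrete, the following sketch walks the summary fields and line items of an AnalyzeExpense-style response. The `summarize_expense` helper and the tiny `sample_response` are illustrative assumptions (the sample values are made up); only the key names follow the structure that the API returns.

```python
def summarize_expense(response: dict) -> dict:
    """Collect summary fields and line items from an AnalyzeExpense-style response."""
    summary = {}
    line_items = []
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text", "OTHER")
            value = field.get("ValueDetection", {}).get("Text", "")
            if field_type == "OTHER":
                # Non-normalized keys: fall back to the label printed on the document.
                field_type = field.get("LabelDetection", {}).get("Text", "OTHER")
            # Keep the first value seen for each key.
            summary.setdefault(field_type, value)
        for group in doc.get("LineItemGroups", []):
            for line in group.get("LineItems", []):
                row = {}
                for item_field in line.get("LineItemExpenseFields", []):
                    key = item_field.get("Type", {}).get("Text", "OTHER")
                    row[key] = item_field.get("ValueDetection", {}).get("Text", "")
                line_items.append(row)
    return {"summary": summary, "line_items": line_items}


# Hand-written sample with made-up values, shaped like an AnalyzeExpense response.
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Example Corp"}},
                {"Type": {"Text": "INVOICE_RECEIPT_ID"}, "ValueDetection": {"Text": "123ABC456"}},
                {
                    "Type": {"Text": "OTHER"},
                    "LabelDetection": {"Text": "Terms"},
                    "ValueDetection": {"Text": "Net 30"},
                },
            ],
            "LineItemGroups": [
                {
                    "LineItems": [
                        {
                            "LineItemExpenseFields": [
                                {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Widget"}},
                                {"Type": {"Text": "QUANTITY"}, "ValueDetection": {"Text": "2"}},
                                {"Type": {"Text": "UNIT_PRICE"}, "ValueDetection": {"Text": "9.99"}},
                            ]
                        }
                    ]
                }
            ],
        }
    ]
}

result = summarize_expense(sample_response)
```

Here normalized keys such as VENDOR_NAME are used directly, and the non-normalized "Terms" field is stored under its printed label.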

This post uses Amazon Textract IDP CDK constructs (AWS CDK components to define infrastructure for intelligent document processing (IDP) workflows), which allow you to build use case-specific, customizable IDP workflows. The constructs and samples are a collection of components that enable the definition of IDP processes on AWS and are published to GitHub. The main concepts used are the AWS CDK constructs, the actual AWS CDK stacks, and AWS Step Functions.

The following figure shows the Step Functions workflow.

Step function workflow

The extraction workflow includes the following steps:

InvoiceProcessor-Decider – An AWS Lambda function that verifies if the input document format is supported by Amazon Textract. For more details about supported formats, refer to Input Documents.
DocumentSplitter – A Lambda function that splits documents into chunks of up to 2,500 pages so that large multi-page documents can be processed.
Map State – A Lambda function that processes each chunk in parallel.
TextractAsync – This task calls Amazon Textract using the asynchronous API following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and uses OutputConfig to store the Amazon Textract JSON output to the S3 bucket you created earlier. It consists of two Lambda functions: one to submit the document for processing and one that is triggered on the SNS notification.
TextractAsyncToJSON2 – Because the TextractAsync task can produce multiple paginated output files, the TextractAsyncToJSON2 process combines them into one JSON file.
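The format check, splitting, and asynchronous submission steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of the CDK constructs: the three helper functions, the extension list, and the example ARNs and bucket name are all hypothetical. The request dictionary uses the DocumentLocation, NotificationChannel, and OutputConfig parameters of the Amazon Textract StartExpenseAnalysis operation.

```python
import os

# Assumed extension list, based on the JPG, PNG, TIFF, and PDF formats named earlier.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".pdf"}
MAX_PAGES_PER_CHUNK = 2500


def is_supported(key: str) -> bool:
    """Decider-style check: is the file extension one that Amazon Textract accepts?"""
    return os.path.splitext(key)[1].lower() in SUPPORTED_EXTENSIONS


def chunk_page_ranges(total_pages: int, max_pages: int = MAX_PAGES_PER_CHUNK) -> list:
    """Splitter-style helper: 1-based inclusive page ranges of at most max_pages."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


def build_start_request(bucket, key, sns_topic_arn, role_arn, output_prefix) -> dict:
    """Keyword arguments for an asynchronous expense analysis job."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
        "OutputConfig": {"S3Bucket": bucket, "S3Prefix": output_prefix},
    }


ranges = chunk_page_ranges(6000)
request = build_start_request(
    bucket="example-bucket",  # placeholder values, not real resources
    key="uploads/random_invoice1.pdf",
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:example-topic",
    role_arn="arn:aws:iam::111122223333:role/example-textract-role",
    output_prefix="textract-output",
)

# With credentials configured, the job could be started with boto3, for example:
#   boto3.client("textract").start_expense_analysis(**request)
```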

We discuss the details of the next three steps in the following sections.

Verification and approval

For the verification stage, the SetMetaData Lambda function verifies whether the uploaded file is a valid expense as per the rules configured previously in the DynamoDB table. For this post, you use the following sample rules:

Verification is successful if INVOICE_RECEIPT_ID is present and matches the regex (?i)[0-9]{3}[a-z]{3}[0-9]{3}$ and if PO_NUMBER is present and matches the regex (?i)[a-z0-9]+$
Verification is unsuccessful if either PO_NUMBER or INVOICE_RECEIPT_ID is incorrect or missing in the document.
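The two sample rules can be exercised locally with a few lines of Python. This is a minimal sketch of the idea, not the SetMetaData function itself: `validate_expense` is a hypothetical helper, and the use of `re.search` is an assumption about how the patterns are applied.

```python
import re

# The two sample rules, as stored in the DynamoDB table.
RULES = [
    {
        "ruleId": 1,
        "type": "regex",
        "field": "INVOICE_RECEIPT_ID",
        "check": r"(?i)[0-9]{3}[a-z]{3}[0-9]{3}$",
        "errorTxt": "Receipt number is not valid. It is of the format: 123ABC456",
    },
    {
        "ruleId": 2,
        "type": "regex",
        "field": "PO_NUMBER",
        "check": r"(?i)[a-z0-9]+$",
        "errorTxt": "PO number is not present",
    },
]


def validate_expense(fields: dict, rules: list = RULES) -> list:
    """Return the error text of every rule that the extracted fields fail."""
    errors = []
    for rule in rules:
        value = fields.get(rule["field"])
        if rule["type"] == "regex":
            # A missing or empty field fails the rule, as does a non-matching value.
            if not value or not re.search(rule["check"], value):
                errors.append(rule["errorTxt"])
    return errors


approved = validate_expense({"INVOICE_RECEIPT_ID": "123ABC456", "PO_NUMBER": "PO98765"})
declined = validate_expense({"INVOICE_RECEIPT_ID": "12-AB"})
```

An empty error list corresponds to an approved document; any error corresponds to a declined one.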

After the files are processed, the expense verification function moves the input files to either approved or declined folders in the same S3 bucket.

S3 output

For the purposes of this solution, we use DynamoDB to store the expense validation rules. However, you can modify this solution to integrate with your own or commercial expense validation or management solutions.

Intelligent index and search

With the OpenSearchPushInvoke Lambda function, the extracted expense metadata is pushed to an OpenSearch Service index and is available for search.

The final TaskOpenSearchMapping step clears the context, which could otherwise exceed the Step Functions quota of maximum input or output size for a task, state, or workflow run.

After the OpenSearch Service index is created, you can search for keywords from the extracted text via OpenSearch Dashboards.
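As a rough illustration of the push and search steps, the following sketch builds a request body in the newline-delimited format of the OpenSearch `_bulk` API and a simple keyword query. The helper names, the `invoices` index name, and the document IDs are assumptions for illustration. The solution's Lambda function may structure its documents differently.

```python
import json


def build_bulk_body(index_name: str, documents: list) -> str:
    """Build an NDJSON body for the _bulk API: an action line, then the document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc["fields"]))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def build_keyword_query(keyword: str) -> dict:
    """Search body that looks for a keyword across the indexed fields."""
    return {"query": {"simple_query_string": {"query": keyword}}}


# Made-up documents standing in for extracted expense metadata.
documents = [
    {"id": "random_invoice1", "fields": {"VENDOR_NAME": "Example Corp", "PO_NUMBER": "PO98765"}},
    {"id": "random_invoice2", "fields": {"VENDOR_NAME": "Sample LLC", "PO_NUMBER": "PO12345"}},
]
bulk_body = build_bulk_body("invoices", documents)
query = build_keyword_query("Example Corp")
```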

OpenSearch document search

Archival, audit, and analytics

To manage the lifecycle and archival of invoices and receipts, you can configure S3 lifecycle rules to transition S3 objects from Standard to Intelligent-Tiering storage classes. S3 Intelligent-Tiering monitors access patterns and automatically moves objects to the Infrequent Access tier when they haven’t been accessed for 30 consecutive days. After 90 days of no access, the objects are moved to the Archive Instant Access tier without performance impact or operational overhead.
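A lifecycle rule of this kind can be expressed as a small configuration document. The sketch below builds one in the shape expected by the S3 PutBucketLifecycleConfiguration operation. The rule ID, the `approved/` prefix, and the 30-day transition are illustrative assumptions rather than the values used by the deployed stack.

```python
def build_lifecycle_configuration(prefix: str = "approved/", days: int = 30) -> dict:
    """Lifecycle rule that moves objects under a prefix to S3 Intelligent-Tiering."""
    return {
        "Rules": [
            {
                "ID": "archive-approved-invoices",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


lifecycle = build_lifecycle_configuration()

# With credentials configured, the rule could be applied with boto3, for example:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket=bucket_name, LifecycleConfiguration=lifecycle)
# where bucket_name (an assumed variable) is the invoice processor bucket.
```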

For auditing and analytics, this solution uses OpenSearch Service for running analytics on invoice requests. OpenSearch Service enables you to effortlessly ingest, secure, search, aggregate, view, and analyze data for a number of use cases, such as log analytics, application search, enterprise search, and more.

Log in to OpenSearch Dashboards and navigate to Stack Management, Saved Objects, then choose Import. Choose the invoices.ndjson file from the cloned repository and choose Import. This prepopulates indexes and builds the visualization.

OpenSearch import

Refresh the page and navigate to Home, Dashboard, and open Invoices. You can now select and apply filters and expand the time window to explore past invoices.

OpenSearch dashboard

Clean up

When you’re finished evaluating Amazon Textract for processing receipts and invoices, we recommend cleaning up any resources that you might have created. Complete the following steps:

Delete all content from the S3 bucket invoiceprocessorworkflow-invoiceprocessorbucketf1-*.
In AWS Cloud9, run the following commands to delete Amazon Cognito resources and CloudFormation stacks:

cognito_user_pool=$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-CognitoUserPoolId`].Value' --output text)
echo $cognito_user_pool
cdk destroy
aws cognito-idp delete-user-pool --user-pool-id $cognito_user_pool

Delete the AWS Cloud9 environment that you created from the AWS Cloud9 console.

Conclusion

In this post, we provided an overview of how we can build an invoice automation pipeline using Amazon Textract for data extraction and create a workflow for validation, archival, and search. We provided code samples on how to use the AnalyzeExpense API for extraction of critical fields from an invoice.

To get started, sign in to the Amazon Textract console to try this feature. To learn more about Amazon Textract capabilities, refer to the Amazon Textract Developer Guide or Textract Resources. To learn more about IDP, refer to the IDP with AWS AI services Part 1 and Part 2 posts.

About the authors

Sushant Pradhan is a Sr. Solutions Architect at Amazon Web Services, helping enterprise customers. His interests and experience include containers, serverless technology, and DevOps. In his spare time, Sushant enjoys spending time outdoors with his family.

Shibin Michaelraj is a Sr. Product Manager with the AWS Textract team. He is focused on building AI/ML-based products for AWS customers.

Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.

Maran Chandrasekaran is a Sr. Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.

Source: https://aws.amazon.com/blogs/machine-learning/build-a-receipt-and-invoice-processing-pipeline-with-amazon-textract/

Timestamp: March 26, 2024

Timestamp: July 6, 2022
