Automate PDF Pre-labeling For Amazon Comprehend

Amazon Comprehend is a natural-language processing (NLP) service that provides pre-trained and custom APIs to derive insights from textual data. Amazon Comprehend customers can train custom named entity recognition (NER) models to extract entities of interest, such as location, person name, and date, that are unique to their business.

To train a custom model, you first prepare training data by manually annotating entities in documents. This can be done with the Comprehend Semi-Structured Documents Annotation Tool, which creates an Amazon SageMaker Ground Truth job with a custom template, allowing annotators to draw bounding boxes around the entities directly on the PDF documents. However, for companies with existing tabular entity data in ERP systems like SAP, manual annotation can be repetitive and time-consuming.

To reduce the effort of preparing training data, we built a pre-labeling tool using AWS Step Functions that automatically pre-annotates documents by using existing tabular entity data. This significantly decreases the manual work needed to train accurate custom entity recognition models in Amazon Comprehend.

In this post, we walk you through the steps of setting up the pre-labeling tool and show examples of how it automatically annotates documents from a public dataset of sample bank statements in PDF format. The full code is available on the GitHub repo.

Solution overview

In this section, we discuss the inputs and outputs of the pre-labeling tool and provide an overview of the solution architecture.

Inputs and outputs

As input, the pre-labeling tool takes PDF documents that contain text to be annotated. For the demo, we use simulated bank statements like the following example.

[Image: simulated bank statement PDF used as input]

The tool also takes a manifest file that maps PDF documents to the entities that we want to extract from these documents. Each entity consists of two things: the expected_text to extract from the document (for example, AnyCompany Bank) and the corresponding entity_type (for example, bank_name). Later in this post, we show how to construct this manifest file from a CSV document like the following example.

[Image: CSV document listing the expected entities for each PDF]

The pre-labeling tool uses the manifest file to automatically annotate the documents with their corresponding entities. We can then use these annotations directly to train an Amazon Comprehend model.

Alternatively, you can create a SageMaker Ground Truth labeling job for human review and editing, as shown in the following screenshot.

[Image: SageMaker Ground Truth labeling job showing the pre-annotations for review]

When the review is complete, you can use the annotated data to train an Amazon Comprehend custom entity recognizer model.

Architecture

The pre-labeling tool consists of multiple AWS Lambda functions orchestrated by a Step Functions state machine. It has two versions that use different techniques to generate pre-annotations.

The first technique is fuzzy matching. This requires a pre-manifest file with expected entities. The tool uses a fuzzy matching algorithm to generate pre-annotations by comparing text similarity.

Fuzzy matching looks for strings in the document that are similar (but not necessarily identical) to the expected entities listed in the pre-manifest file. It first calculates text similarity scores between the expected text and words in the document, then matches all pairs whose score is above a threshold. Therefore, even if there are no exact matches, fuzzy matching can find variants like abbreviations and misspellings. This allows the tool to pre-label documents without requiring the entities to appear verbatim. For example, if 'AnyCompany Bank' is listed as an expected entity, fuzzy matching will annotate occurrences of 'Any Companys Bank'. This provides more flexibility than strict string matching and enables the pre-labeling tool to automatically label more entities.
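The idea of scoring candidate word sequences against an expected text can be sketched as follows. This is a minimal illustration, not the tool's implementation: the post does not say which similarity metric the tool uses, so Python's standard difflib.SequenceMatcher is used here as a stand-in, and the function name fuzzy_find, the 0.9 threshold, and the window sizes are assumptions made for the example.

```python
from difflib import SequenceMatcher


def fuzzy_find(expected_text, words, threshold=0.9, max_extra_words=1):
    """Return (start_index, candidate_text, score) tuples, best match first.

    Candidate windows contain roughly as many words as the expected text
    (plus or minus max_extra_words) so that variants such as a split or
    merged word can still be compared.
    """
    target = expected_text.lower()
    n_words = len(expected_text.split())
    hits = []
    for size in range(max(1, n_words - max_extra_words), n_words + max_extra_words + 1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            # ratio() is a similarity score between 0.0 and 1.0
            score = SequenceMatcher(None, target, candidate.lower()).ratio()
            if score >= threshold:
                hits.append((start, candidate, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical words extracted from one page of a statement
page_words = "Statement from Any Companys Bank for JANE DOE".split()
hits = fuzzy_find("AnyCompany Bank", page_words)
```

With these inputs, the three-word window 'Any Companys Bank' scores above the threshold even though it is not an exact match. Raising the threshold makes the matching stricter, and lowering it admits more variants at the cost of more false positives.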

The following diagram illustrates the architecture of this Step Functions state machine.

[Image: architecture of the fuzzy matching Step Functions state machine]

The second technique requires a pre-trained Amazon Comprehend entity recognizer model. The tool generates pre-annotations using the Amazon Comprehend model, as shown in the following diagram.

[Image: workflow of the Amazon Comprehend-based version of the tool]

The following diagram illustrates the full architecture.

[Image: full solution architecture]

In the following sections, we walk through the steps to implement the solution.

Deploy the pre-labeling tool

Clone the repository to your local machine:

git clone https://github.com/aws-samples/amazon-comprehend-automated-pdf-prelabeling-tool.git

This repository has been built on top of the Comprehend Semi-Structured Documents Annotation Tool and extends its functionalities by enabling you to start a SageMaker Ground Truth labeling job with pre-annotations already displayed on the SageMaker Ground Truth UI.

The pre-labeling tool includes both the Comprehend Semi-Structured Documents Annotation Tool resources and some resources specific to the pre-labeling tool. You can deploy the solution with AWS Serverless Application Model (AWS SAM), an open source framework that you can use to define serverless application infrastructure code.

If you have previously deployed the Comprehend Semi-Structured Documents Annotation Tool, refer to the FAQ section in Pre_labeling_tool/README.md for instructions on how to deploy only the resources specific to the pre-labeling tool.

If you haven’t deployed the tool before and are starting fresh, do the following to deploy the whole solution.

Change the current directory to the annotation tool folder:

cd amazon-comprehend-semi-structured-documents-annotation-tools

Build and deploy the solution:

make ready-and-deploy-guided

Create the pre-manifest file

Before you can use the pre-labeling tool, you need to prepare your data. The main inputs are PDF documents and a pre-manifest file. The pre-manifest file contains the location of each PDF document under 'pdf' and the location of a JSON file with expected entities to label under 'expected_entities'.

The notebook generate_premanifest_file.ipynb shows how to create this file. In the demo, the pre-manifest file looks like the following:

[
  {
    'pdf': 's3://<bucket>/data_aws_idp_workshop_data/bank_stmt_0.pdf',
    'expected_entities': 's3://<bucket>/prelabeling-inputs/expected-entities/example-demo/fuzzymatching_version/file_bank_stmt_0.json'
  },
  ...
]

Each JSON file listed in the pre-manifest file (under expected_entities) contains a list of dictionaries, one for each expected entity. The dictionaries have the following keys:

‘expected_texts’ – A list of possible text strings matching the entity.
‘entity_type’ – The corresponding entity type.
‘ignore_list’ (optional) – A list of words to ignore during matching. Use this parameter to prevent fuzzy matching from matching specific combinations of words that you know are wrong. This can be useful if you want to ignore some numbers or email addresses when looking for names.

For example, the expected_entities file for the PDF shown previously looks like the following:

[
  {
    'expected_texts': ['AnyCompany Bank'],
    'entity_type': 'bank_name',
    'ignore_list': []
  },
  {
    'expected_texts': ['JANE DOE'],
    'entity_type': 'customer_name',
    'ignore_list': ['JANE.DOE@example_mail.com']
  },
  {
    'expected_texts': ['003884257406'],
    'entity_type': 'checking_number',
    'ignore_list': []
  },
  ...
]
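As a rough sketch of how both structures could be derived from tabular data, the following code turns CSV rows into expected-entities lists and a pre-manifest. It is not the code from generate_premanifest_file.ipynb: the column name document_name, the helper names, the bucket name, and the S3 prefixes are assumptions chosen for illustration, and each CSV column other than the document name is treated as one entity type.

```python
import csv
import io


def build_expected_entities(row, entity_columns):
    """Build the list of expected-entity dictionaries for one CSV row."""
    return [
        {"expected_texts": [row[column]], "entity_type": column, "ignore_list": []}
        for column in entity_columns
        if row.get(column)  # skip empty cells
    ]


def build_premanifest(csv_text, bucket, pdf_prefix, entities_prefix, entity_columns):
    """Return (premanifest, entity_files) built from CSV text.

    entity_files maps each S3 key to the JSON content that would be uploaded there.
    """
    premanifest, entity_files = [], {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pdf_name = row["document_name"]  # assumed column holding the PDF file name
        stem = pdf_name.rsplit(".", 1)[0]
        entities_key = f"{entities_prefix}/file_{stem}.json"
        entity_files[entities_key] = build_expected_entities(row, entity_columns)
        premanifest.append({
            "pdf": f"s3://{bucket}/{pdf_prefix}/{pdf_name}",
            "expected_entities": f"s3://{bucket}/{entities_key}",
        })
    return premanifest, entity_files


# Hypothetical CSV content with one row per document
sample_csv = (
    "document_name,bank_name,customer_name\n"
    "bank_stmt_0.pdf,AnyCompany Bank,JANE DOE\n"
)
premanifest, entity_files = build_premanifest(
    sample_csv, "my-bucket", "pdfs", "expected-entities",
    ["bank_name", "customer_name"],
)
```

In this sketch the files are only built in memory. Uploading them to Amazon S3 (for example with json.dumps and the boto3 put_object call) is left out to keep the example self-contained.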

Run the pre-labeling tool

With the pre-manifest file that you created in the previous step, start running the pre-labeling tool. For more details, refer to the notebook start_step_functions.ipynb.

To start the pre-labeling tool, provide an event with the following keys:

Premanifest – Maps each PDF document to its expected_entities file. This should contain the Amazon Simple Storage Service (Amazon S3) bucket (under bucket) and the key (under key) of the file.
Prefix – Used to create the execution_id, which names the S3 folder for output storage and the SageMaker Ground Truth labeling job name.
entity_types – Displayed in the UI for annotators to label. These should include all entity types in the expected entities files.
work_team_name (optional) – Used for creating the SageMaker Ground Truth labeling job. It corresponds to the private workforce to use. If it’s not provided, only a manifest file will be created instead of a SageMaker Ground Truth labeling job. You can use the manifest file to create a SageMaker Ground Truth labeling job later on. Note that as of this writing, you can’t provide an external workforce when creating the labeling job from the notebook. However, you can clone the created job and assign it to an external workforce on the SageMaker Ground Truth console.
comprehend_parameters (optional) – Parameters to directly train an Amazon Comprehend custom entity recognizer model. If omitted, this step will be skipped.
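For illustration, an event for the demo might be assembled as in the following sketch. The lowercase spellings of the premanifest and prefix keys, the bucket name, the object key, the prefix value, and the work team name are all placeholders assumed for this example and are not taken from the repository. The notebook start_step_functions.ipynb remains the reference for the exact format. The optional comprehend_parameters key is omitted here because its contents are not described in this post.

```python
import json

# All values below are hypothetical placeholders for illustration.
event = {
    "premanifest": {
        "bucket": "my-prelabeling-bucket",
        "key": "prelabeling-inputs/premanifest.json",
    },
    "prefix": "example-demo",
    # Should include every entity type used in the expected entities files
    "entity_types": ["bank_name", "customer_name", "checking_number"],
    # Optional: leave out to get only a manifest file instead of a labeling job
    "work_team_name": "my-private-team",
}

# Step Functions expects the execution input as a JSON string
event_json = json.dumps(event)
```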

To start the state machine, run the following Python code:

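import json
import boto3

stepfunctions_client = boto3.client('stepfunctions')

# event is the input dictionary with the keys described above;
# fuzzymatching_prelabeling_step_functions_arn is the ARN of the deployed state machine
response = stepfunctions_client.start_execution(
    stateMachineArn=fuzzymatching_prelabeling_step_functions_arn,
    input=json.dumps(event)
)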

This will start a run of the state machine. You can monitor the progress of the state machine on the Step Functions console. The following diagram illustrates the state machine workflow.

[Image: state machine workflow on the Step Functions console]
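As an alternative to watching the console, the run can also be polled from code. The following is a minimal sketch that assumes a boto3 Step Functions client and the executionArn returned by start_execution. The helper name wait_for_execution, the polling interval, and the poll limit are choices made for this example.

```python
import time

# Statuses after which a Step Functions execution no longer changes
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(client, execution_arn, poll_seconds=30, max_polls=120):
    """Poll describe_execution until the run reaches a terminal status."""
    for _ in range(max_polls):
        status = client.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Execution {execution_arn} still running after {max_polls} polls")
```

For example, wait_for_execution(stepfunctions_client, response['executionArn']) returns 'SUCCEEDED' once the state machine has finished successfully.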

When the state machine is complete, do the following:

Inspect the following outputs saved in the prelabeling/ folder of the comprehend-semi-structured-docs S3 bucket:
- Individual annotation files for each page of the documents (one per page per document) in temp_individual_manifests/
- A manifest for the SageMaker Ground Truth labeling job in consolidated_manifest/consolidated_manifest.manifest
- A manifest that can be used to train a custom Amazon Comprehend model in consolidated_manifest/consolidated_manifest_comprehend.manifest
On the SageMaker console, open the SageMaker Ground Truth labeling job that was created to review the annotations
Inspect and test the custom Amazon Comprehend model that was trained
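To inspect a consolidated manifest programmatically, one option is to parse it as JSON Lines, the format SageMaker Ground Truth uses for manifest files (one JSON object per line). The following sketch only parses text that has already been downloaded. The helper names are hypothetical, and the fields inside each record depend on the tool's output, so none are assumed here.

```python
import json


def read_manifest(manifest_text):
    """Parse a JSON Lines manifest into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in manifest_text.splitlines() if line.strip()]


def record_keys(records):
    """Return the sorted set of top-level keys found across all records."""
    return sorted({key for record in records for key in record})
```

The manifest text itself can be fetched with the boto3 S3 client, for example by calling get_object with the bucket and key of consolidated_manifest/consolidated_manifest.manifest and decoding the response body as UTF-8. The number of parsed records and their top-level keys give a quick sanity check before opening the labeling job.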

As mentioned previously, the tool can only create SageMaker Ground Truth labeling jobs for private workforces. To outsource the human labeling effort, you can clone the labeling job on the SageMaker Ground Truth console and attach any workforce to the new job.

Clean up

To avoid incurring additional charges, delete the resources that you created and delete the stack that you deployed with the following command:

make delete

Conclusion

The pre-labeling tool provides a powerful way for companies to use existing tabular data to accelerate the process of training custom entity recognition models in Amazon Comprehend. By automatically pre-annotating PDF documents, it significantly reduces the manual effort required in the labeling process.

The tool has two versions: fuzzy matching and Amazon Comprehend-based, giving flexibility on how to generate the initial annotations. After documents are pre-labeled, you can quickly review them in a SageMaker Ground Truth labeling job or even skip the review and directly train an Amazon Comprehend custom model.

The pre-labeling tool enables you to quickly unlock the value of your historical entity data and use it in creating custom models tailored to your specific domain. By speeding up what is typically the most labor-intensive part of the process, it makes custom entity recognition with Amazon Comprehend more accessible than ever.

For more information about how to label PDF documents using a SageMaker Ground Truth labeling job, see Custom document annotation for extracting named entities in documents using Amazon Comprehend and Use Amazon SageMaker Ground Truth to Label Data.

About the Authors

Oskar Schnaack is an Applied Scientist at the Generative AI Innovation Center. He is passionate about diving into the science behind machine learning to make it accessible for customers. Outside of work, Oskar enjoys cycling and keeping up with trends in information theory.

Romain Besombes is a Deep Learning Architect at the Generative AI Innovation Center. He is passionate about building innovative architectures to address customers’ business problems with machine learning.

Source: https://aws.amazon.com/blogs/machine-learning/automate-pdf-pre-labeling-for-amazon-comprehend/

Timestamp: December 14, 2023