In Part 1 of this series, we discussed intelligent document processing (IDP), and how IDP can accelerate claims processing use cases in the insurance industry. We discussed how we can use AWS AI services to accurately categorize claims documents along with supporting documents. We also discussed how to extract various types of documents in an insurance claims package, such as forms, tables, or specialized documents such as invoices, receipts, or ID documents. We looked into the challenges in legacy document processes, which are time-consuming, error-prone, expensive, and difficult to process at scale, and how you can use AWS AI services to help implement your IDP pipeline.
In this post, we walk you through advanced IDP features for document extraction, querying, and enrichment. We also look at how to further use the extracted structured information from claims data to get insights using AWS Analytics and visualization services. We highlight how structured data extracted with IDP can help detect fraudulent claims using AWS Analytics services.
Solution overview
The following diagram illustrates the phases of IDP using AWS AI services. In Part 1, we discussed the first three phases of the IDP workflow. In this post, we expand on the extraction step and the remaining phases, which include integrating IDP with AWS Analytics services.
We use these analytics services for further insights and visualizations, and to detect fraudulent claims using structured, normalized data from IDP. The following diagram illustrates the solution architecture.
The phases we discuss in this post use the following key services:
- Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) models that have been pre-trained to understand and extract health data from medical text, such as prescriptions, procedures, or diagnoses.
- AWS Glue is a part of the AWS Analytics services stack, and is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development.
- Amazon Redshift is another service in the Analytics stack: a fully managed, petabyte-scale data warehouse service in the cloud.
Prerequisites
Before you get started, refer to Part 1 for a high-level overview of the insurance use case with IDP and details about the data capture and classification stages.
For more information regarding the code samples, refer to our GitHub repo.
Extraction phase
In Part 1, we saw how to use Amazon Textract APIs to extract information like forms and tables from documents, and how to analyze invoices and identity documents. In this post, we enhance the extraction phase with Amazon Comprehend to extract default and custom entities specific to custom use cases.
Insurance carriers often come across dense text in insurance claims applications, such as a patient’s discharge summary letter (see the following example image). It can be difficult to automatically extract information from such types of documents where there is no definite structure. To address this, we can use the following methods to extract key business information from the document:
Extract default entities with the Amazon Comprehend DetectEntities API
We run the following code on the sample medical transcription document:
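A minimal sketch of this call with boto3 follows. The `summarize_entities` helper is our own addition for condensing the API response to high-confidence results; it is not part of the original sample.

```python
def detect_default_entities(text, region_name="us-east-1"):
    """Call the Amazon Comprehend DetectEntities API on raw document text."""
    import boto3  # imported lazily so the pure helper below works without AWS access
    comprehend = boto3.client("comprehend", region_name=region_name)
    response = comprehend.detect_entities(Text=text, LanguageCode="en")
    return response["Entities"]

def summarize_entities(entities, min_score=0.9):
    """Reduce a DetectEntities response to high-confidence (Type, Text) pairs."""
    return [(e["Type"], e["Text"]) for e in entities if e["Score"] >= min_score]
```

For a discharge summary, `summarize_entities(detect_default_entities(text))` returns pairs like `("PERSON", ...)` and `("DATE", ...)` for the entities Amazon Comprehend detected with high confidence.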
The following screenshot shows a collection of entities identified in the input text. The output has been shortened for the purposes of this post. Refer to the GitHub repo for a detailed list of entities.
Extract custom entities with Amazon Comprehend custom entity recognition
The response from the DetectEntities API includes the default entities. However, we’re interested in knowing specific entity values, such as the patient’s name (denoted by the default entity PERSON), or the patient’s ID (denoted by the default entity OTHER). To recognize these custom entities, we train an Amazon Comprehend custom entity recognizer model. We recommend following the comprehensive steps on how to train and deploy a custom entity recognition model in the GitHub repo.
After we deploy the custom model, we can use the helper function get_entities() to retrieve custom entities like PATIENT_NAME and PATIENT_ID from the API response:
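A sketch of what this helper might look like. The endpoint ARN is a placeholder for your deployed custom recognizer, and this `get_entities` is our own reconstruction, not the exact helper from the repo. Note that when EndpointArn is supplied, DetectEntities must not also receive a LanguageCode.

```python
def detect_custom_entities(text, endpoint_arn):
    """Invoke the deployed custom entity recognizer endpoint (ARN is a placeholder)."""
    import boto3  # lazy import so get_entities() below is usable without AWS access
    comprehend = boto3.client("comprehend")
    return comprehend.detect_entities(Text=text, EndpointArn=endpoint_arn)

def get_entities(response, entity_types=("PATIENT_NAME", "PATIENT_ID")):
    """Pull the requested custom entity types out of a DetectEntities response."""
    return {e["Type"]: e["Text"] for e in response["Entities"] if e["Type"] in entity_types}
```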
The following screenshot shows our results.
Enrichment phase
In the document enrichment phase, we perform enrichment functions on healthcare-related documents to draw valuable insights. We look at the following types of enrichment:
- Extract domain-specific language – We use Amazon Comprehend Medical to extract medical-specific ontologies like ICD-10-CM, RxNorm, and SNOMED CT
- Redact sensitive information – We use Amazon Comprehend to redact personally identifiable information (PII), and Amazon Comprehend Medical for protected health information (PHI) redaction
Extract medical information from unstructured medical text
Documents such as medical providers’ notes and clinical trial reports include dense medical text. Insurance claims carriers need to identify the relationships among the extracted health information from this dense text and link them to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT codes. This is very valuable in automating claim capture, validation, and approval workflows for insurance companies to accelerate and simplify claim processing. Let’s look at how we can use the Amazon Comprehend Medical InferICD10CM API to detect possible medical conditions as entities and link them to their codes:
For the input text, which we can pass in from the Amazon Textract DetectDocumentText API, the InferICD10CM API returns the following output (the output has been abbreviated for brevity).
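A minimal sketch of the InferICD10CM call with boto3. The `top_icd10cm_codes` helper is our own addition that pairs each detected condition with its highest-scoring concept from the response.

```python
def infer_icd10cm(text):
    """Detect medical conditions and link them to ICD-10-CM codes."""
    import boto3  # lazy import so the helper below runs without AWS access
    cm = boto3.client("comprehendmedical")
    return cm.infer_icd10_cm(Text=text)["Entities"]

def top_icd10cm_codes(entities):
    """Pair each detected condition with its highest-scoring ICD-10-CM concept code."""
    pairs = []
    for e in entities:
        concepts = e.get("ICD10CMConcepts", [])
        if concepts:
            best = max(concepts, key=lambda c: c["Score"])
            pairs.append((e["Text"], best["Code"]))
    return pairs
```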
Similarly, we can use the Amazon Comprehend Medical InferRxNorm API to identify medications and the InferSNOMEDCT API to detect medical entities within healthcare-related insurance documents.
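These calls follow the same pattern; here is a short sketch. The `top_rxnorm_codes` helper is our own addition and mirrors the ICD-10-CM helper above.

```python
def infer_medications_and_concepts(text):
    """Call InferRxNorm and InferSNOMEDCT on the same clinical text."""
    import boto3  # lazy import so the helper below runs without AWS access
    cm = boto3.client("comprehendmedical")
    rx_entities = cm.infer_rx_norm(Text=text)["Entities"]
    snomed_entities = cm.infer_snomedct(Text=text)["Entities"]
    return rx_entities, snomed_entities

def top_rxnorm_codes(rx_entities):
    """Pair each medication with its highest-scoring RxNorm concept code."""
    pairs = []
    for e in rx_entities:
        concepts = e.get("RxNormConcepts", [])
        if concepts:
            best = max(concepts, key=lambda c: c["Score"])
            pairs.append((e["Text"], best["Code"]))
    return pairs
```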
Perform PII and PHI redaction
Insurance claims packages are subject to strict privacy regulations because they contain both PII and PHI data. Insurance carriers can reduce compliance risk by redacting information like policy numbers or the patient’s name.
Let’s look at an example of a patient’s discharge summary. We use the Amazon Comprehend DetectPiiEntities API to detect PII entities within the document and protect the patient’s privacy by redacting these entities:
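A minimal sketch of detection plus text-level redaction. The `redact` helper, which masks each entity span using its character offsets, is our own addition; the original post redacts at the image level using bounding boxes instead.

```python
def detect_pii(text):
    """Call the Amazon Comprehend DetectPiiEntities API."""
    import boto3  # lazy import so redact() below runs without AWS access
    comprehend = boto3.client("comprehend")
    return comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]

def redact(text, entities, mask="*"):
    """Mask each detected PII span; work right to left so offsets stay valid."""
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = e["BeginOffset"], e["EndOffset"]
        text = text[:start] + mask * (end - start) + text[end:]
    return text
```

For example, `redact("Patient John Doe, ID 12345", detect_pii(...))` would mask both the name and the ID spans.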
We get the following PII entities in the response from the detect_pii_entities() API:
We can then redact the PII entities detected in the documents by using the bounding box geometry of the entities from the document. For that, we use a helper tool called amazon-textract-overlayer. For more information, refer to Textract-Overlayer. The following screenshots compare a document before and after redaction.
Similar to the Amazon Comprehend DetectPiiEntities API, we can also use the DetectPHI API to detect PHI data in the clinical text being examined. For more information, refer to Detect PHI.
Review and validation phase
In the document review and validation phase, we can now verify whether the claim package meets the business’s requirements, because we have all the information collected from the documents in the package from earlier stages. We can do this by introducing a human in the loop to review and validate all the fields, or by an auto-approval process for low-dollar claims, before sending the package to downstream applications. We can use Amazon Augmented AI (Amazon A2I) to automate the human review process for insurance claims processing.
Now that we have all required data extracted and normalized from claims processing using AI services for IDP, we can extend the solution to integrate with AWS Analytics services such as AWS Glue and Amazon Redshift to solve additional use cases and provide further analytics and visualizations.
Detect fraudulent insurance claims
In this post, we implement a serverless architecture where the extracted and processed data is stored in a data lake and is used to detect fraudulent insurance claims using ML. We use Amazon Simple Storage Service (Amazon S3) to store the processed data. We can then use AWS Glue or Amazon EMR to cleanse the data and add additional fields to make it consumable for reporting and ML. After that, we use Amazon Redshift ML to build a fraud detection ML model. Finally, we build reports using Amazon QuickSight to get insights into the data.
Set up the Amazon Redshift external schema
For the purposes of this example, we created a sample dataset that emulates the output of an ETL (extract, transform, and load) process, and we use the AWS Glue Data Catalog as the metadata catalog. First, we create a database named idp_demo in the Data Catalog and an external schema in Amazon Redshift called idp_insurance_demo (see the following code). We use an AWS Identity and Access Management (IAM) role to grant the Amazon Redshift cluster permissions to access Amazon S3 and Amazon SageMaker. For more information about how to set up this IAM role with least privilege, refer to Cluster and configure setup for Amazon Redshift ML administration.
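A sketch of this step; the IAM role ARN is a placeholder for the role you configured with access to Amazon S3 and Amazon SageMaker.

```sql
-- Map the AWS Glue Data Catalog database to an external schema in Amazon Redshift.
-- The IAM role ARN is a placeholder; replace it with your own role.
CREATE EXTERNAL SCHEMA idp_insurance_demo
FROM DATA CATALOG
DATABASE 'idp_demo'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftMLRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```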
Create Amazon Redshift external table
The next step is to create an external table in Amazon Redshift referencing the S3 location where the file is located. In this case, our file is a comma-separated text file. We also want to skip the header row from the file, which can be configured in the table properties section. See the following code:
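A sketch of the external table definition. The column names, the S3 bucket, and the path are illustrative placeholders; match them to your actual CSV layout.

```sql
-- Column names are illustrative; align them with your CSV file.
CREATE EXTERNAL TABLE idp_insurance_demo.claims (
    id                INTEGER,
    claim_date        DATE,
    state             VARCHAR(2),
    insurance_company VARCHAR(128),
    total_charges     DECIMAL(12,2),
    fraud             INTEGER
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://<your-bucket>/idp-demo/claims/'
-- Skip the header row of the CSV file
TABLE PROPERTIES ('skip.header.line.count' = '1');
```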
Create training and test datasets
After we create the external table, we prepare our dataset for ML by splitting it into a training set and a test set. We create a new external table called claim_train, which consists of all records with ID <= 85000 from the claims table. This is the training set that we train our ML model on. We create another external table called claim_test that consists of all records with ID > 85000 to be the test set that we test the ML model on:
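One way to sketch this split is with CREATE EXTERNAL TABLE AS, which writes the query result back to Amazon S3; the bucket paths are placeholders.

```sql
-- Training set: records with id <= 85000
CREATE EXTERNAL TABLE idp_insurance_demo.claim_train
STORED AS PARQUET
LOCATION 's3://<your-bucket>/idp-demo/claim_train/'
AS SELECT * FROM idp_insurance_demo.claims WHERE id <= 85000;

-- Test set: the remaining records
CREATE EXTERNAL TABLE idp_insurance_demo.claim_test
STORED AS PARQUET
LOCATION 's3://<your-bucket>/idp-demo/claim_test/'
AS SELECT * FROM idp_insurance_demo.claims WHERE id > 85000;
```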
Create an ML model with Amazon Redshift ML
Now we create the model using the CREATE MODEL command (see the following code). We select the relevant columns from the claim_train table that can determine a fraudulent transaction. The goal of this model is to predict the value of the fraud column; therefore, fraud is added as the prediction target. After the model is trained, it creates a function named insurance_fraud_model. This function is used for inference while running SQL statements to predict the value of the fraud column for new records.
Evaluate ML model metrics
After we create the model, we can run queries to check its accuracy. We use the insurance_fraud_model function to predict the value of the fraud column for new records. Run the following query on the claim_test table to create a confusion matrix:
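A sketch of such a query, assuming the illustrative feature columns used earlier (state, insurance_company, total_charges):

```sql
-- Compare the actual fraud flag with the model's prediction on the test set.
SELECT fraud AS actual,
       insurance_fraud_model(state, insurance_company, total_charges) AS predicted,
       COUNT(*) AS record_count
FROM idp_insurance_demo.claim_test
GROUP BY 1, 2
ORDER BY 1, 2;
```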
Detect fraud using the ML model
After we create the new model, as new claims data is inserted into the data warehouse or data lake, we can use the insurance_fraud_model function to flag fraudulent transactions. We do this by first loading the new data into a temporary table. Then we use the insurance_fraud_model function to calculate the fraud flag for each new transaction and insert the data along with the flag into the final table, which in this case is the claims table.
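A sketch of this scoring flow; the staging columns, bucket path, and IAM role ARN are illustrative placeholders, and the final claims table is assumed to be writable from Amazon Redshift.

```sql
-- Stage the incoming claims (columns are illustrative).
CREATE TEMP TABLE claims_staging (
    id                INTEGER,
    claim_date        DATE,
    state             VARCHAR(2),
    insurance_company VARCHAR(128),
    total_charges     DECIMAL(12,2)
);

COPY claims_staging
FROM 's3://<your-bucket>/idp-demo/new-claims/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftMLRole'
CSV IGNOREHEADER 1;

-- Score each record and land it, flag included, in the final claims table.
INSERT INTO claims
SELECT id, claim_date, state, insurance_company, total_charges,
       insurance_fraud_model(state, insurance_company, total_charges)
FROM claims_staging;
```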
Visualize the claims data
When the data is available in Amazon Redshift, we can create visualizations using QuickSight. We can then share the QuickSight dashboards with business users and analysts. To create the QuickSight dashboard, you first need to create an Amazon Redshift dataset in QuickSight. For instructions, refer to Creating a dataset from a database.
After you create the dataset, you can create a new analysis in QuickSight using the dataset. The following are some sample reports we created:
- Total number of claims by state, grouped by the fraud field – This chart shows us the proportion of fraudulent transactions compared to the total number of transactions in a particular state.
- Sum of the total dollar value of the claims, grouped by the fraud field – This chart shows us the proportion of the dollar amount of fraudulent transactions compared to the total dollar amount of transactions in a particular state.
- Total number of transactions per insurance company, grouped by the fraud field – This chart shows us how many claims were filed for each insurance company and how many of them are fraudulent.
- Total sum of fraudulent transactions by state displayed on a US map – This chart shows only the fraudulent transactions and displays the total charges for those transactions by state on the map. A darker shade of blue indicates higher total charges. We can further analyze this by city within that state and by zip codes within the city to better understand the trends.
Clean up
To avoid incurring future charges to your AWS account, delete the resources that you provisioned during setup by following the instructions in the Cleanup section in our repo.
Conclusion
In this two-part series, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We explored a claims processing use case in the insurance industry and how IDP can help automate this use case using services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I. In Part 1, we demonstrated how to use AWS AI services for document extraction. In Part 2, we extended the extraction phase and performed data enrichment. Finally, we extended the structured data extracted from IDP for further analytics, and created visualizations to detect fraudulent claims using AWS Analytics services.
We recommend that you review the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and follow the guidelines provided. To learn more about the pricing of the solution, review the pricing details of Amazon Textract, Amazon Comprehend, and Amazon A2I.
About the Authors
Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.
Uday Narayanan is an Analytics Specialist Solutions Architect at AWS. He enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are data analytics, big data systems, and machine learning. In his spare time, he enjoys playing sports, binge-watching TV shows, and traveling.
Sonali Sahu leads the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.