In Part 1 of this series, we discussed intelligent document processing (IDP), and how IDP can accelerate claims processing use cases in the insurance industry. We discussed how we can use AWS AI services to accurately categorize claims documents along with supporting documents. We also discussed how to extract various types of documents in an insurance claims package, such as forms, tables, or specialized documents such as invoices, receipts, or ID documents. We looked into the challenges in legacy document processes, which are time-consuming, error-prone, expensive, and difficult to process at scale, and how you can use AWS AI services to help implement your IDP pipeline.
In this post, we walk you through advanced IDP features for document extraction, querying, and enrichment. We also look at how to further use the extracted structured information from claims data to get insights using AWS Analytics and visualization services. We highlight how structured data extracted with IDP can help detect fraudulent claims using AWS Analytics services.
Solution overview
The following diagram illustrates the phases of IDP using AWS AI services. In Part 1, we discussed the first three phases of the IDP workflow. In this post, we expand on the extraction step and the remaining phases, which include integrating IDP with AWS Analytics services.
We use these analytics services for further insights and visualizations, and to detect fraudulent claims using structured, normalized data from IDP. The following diagram illustrates the solution architecture.
The phases we discuss in this post use the following key services:
- Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) models that have been pre-trained to understand and extract health data from medical text, such as prescriptions, procedures, or diagnoses.
- AWS Glue is a part of the AWS Analytics services stack, and is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development.
- Amazon Redshift is another service in the Analytics stack: a fully managed, petabyte-scale data warehouse service in the cloud.
Prerequisites
Before you get started, refer to Part 1 for a high-level overview of the insurance use case with IDP and details about the data capture and classification stages.
For more information regarding the code samples, refer to our GitHub repo.
Extraction phase
In Part 1, we saw how to use Amazon Textract APIs to extract information like forms and tables from documents, and how to analyze invoices and identity documents. In this post, we enhance the extraction phase with Amazon Comprehend to extract default and custom entities specific to custom use cases.
Insurance carriers often come across dense text in insurance claims applications, such as a patient’s discharge summary letter (see the following example image). It can be difficult to automatically extract information from such types of documents where there is no definite structure. To address this, we can use the following methods to extract key business information from the document:
Extract default entities with the Amazon Comprehend DetectEntities API
We run the following code on the sample medical transcription document:
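A minimal sketch of this call with boto3 follows. The `summarize_entities` helper is our own addition for condensing the API response to high-confidence results; it is not part of the original sample.

```python
def detect_default_entities(text, region_name="us-east-1"):
    """Call the Amazon Comprehend DetectEntities API on raw document text."""
    import boto3  # imported lazily so the pure helper below works without AWS access
    comprehend = boto3.client("comprehend", region_name=region_name)
    response = comprehend.detect_entities(Text=text, LanguageCode="en")
    return response["Entities"]

def summarize_entities(entities, min_score=0.9):
    """Reduce a DetectEntities response to high-confidence (Type, Text) pairs."""
    return [(e["Type"], e["Text"]) for e in entities if e["Score"] >= min_score]
```

For a discharge summary, `summarize_entities(detect_default_entities(text))` returns pairs like `("PERSON", ...)` and `("DATE", ...)` for the entities Amazon Comprehend detected with high confidence.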
The following screenshot shows a collection of entities identified in the input text. The output has been shortened for the purposes of this post. Refer to the GitHub repo for a detailed list of entities.
Extract custom entities with Amazon Comprehend custom entity recognition
The response from the DetectEntities API includes the default entities. However, we’re interested in knowing specific entity values, such as the patient’s name (denoted by the default entity PERSON), or the patient’s ID (denoted by the default entity OTHER). To recognize these custom entities, we train an Amazon Comprehend custom entity recognizer model. We recommend following the comprehensive steps on how to train and deploy a custom entity recognition model in the GitHub repo.
After we deploy the custom model, we can use the helper function get_entities() to retrieve custom entities like PATIENT_NAME and PATIENT_ID from the API response:
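A sketch of what this helper might look like. The endpoint ARN is a placeholder for your deployed custom recognizer, and this `get_entities` is our own reconstruction, not the exact helper from the repo. Note that when EndpointArn is supplied, DetectEntities must not also receive a LanguageCode.

```python
def detect_custom_entities(text, endpoint_arn):
    """Invoke the deployed custom entity recognizer endpoint (ARN is a placeholder)."""
    import boto3  # lazy import so get_entities() below is usable without AWS access
    comprehend = boto3.client("comprehend")
    return comprehend.detect_entities(Text=text, EndpointArn=endpoint_arn)

def get_entities(response, entity_types=("PATIENT_NAME", "PATIENT_ID")):
    """Pull the requested custom entity types out of a DetectEntities response."""
    return {e["Type"]: e["Text"] for e in response["Entities"] if e["Type"] in entity_types}
```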
The following screenshot shows our results.
Enrichment phase
In the document enrichment phase, we perform enrichment functions on healthcare-related documents to draw valuable insights. We look at the following types of enrichment:
- Extract domain-specific language – We use Amazon Comprehend Medical to extract medical-specific ontologies like ICD-10-CM, RxNorm, and SNOMED CT
- Redact sensitive information – We use Amazon Comprehend to redact personally identifiable information (PII), and Amazon Comprehend Medical for protected health information (PHI) redaction
Extract medical information from unstructured medical text
Documents such as medical providers’ notes and clinical trial reports include dense medical text. Insurance claims carriers need to identify the relationships among the extracted health information from this dense text and link them to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT codes. This is very valuable in automating claim capture, validation, and approval workflows for insurance companies to accelerate and simplify claim processing. Let’s look at how we can use the Amazon Comprehend Medical InferICD10CM API to detect possible medical conditions as entities and link them to their codes:
For the input text, which we can pass in from the Amazon Textract DetectDocumentText API, the InferICD10CM API returns the following output (the output has been abbreviated for brevity).
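A minimal sketch of the InferICD10CM call with boto3. The `top_icd10cm_codes` helper is our own addition that pairs each detected condition with its highest-scoring concept from the response.

```python
def infer_icd10cm(text):
    """Detect medical conditions and link them to ICD-10-CM codes."""
    import boto3  # lazy import so the helper below runs without AWS access
    cm = boto3.client("comprehendmedical")
    return cm.infer_icd10_cm(Text=text)["Entities"]

def top_icd10cm_codes(entities):
    """Pair each detected condition with its highest-scoring ICD-10-CM concept code."""
    pairs = []
    for e in entities:
        concepts = e.get("ICD10CMConcepts", [])
        if concepts:
            best = max(concepts, key=lambda c: c["Score"])
            pairs.append((e["Text"], best["Code"]))
    return pairs
```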
Similarly, we can use the Amazon Comprehend Medical InferRxNorm API to identify medications and the InferSNOMEDCT API to detect medical entities within healthcare-related insurance documents.
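These calls follow the same pattern; here is a short sketch. The `top_rxnorm_codes` helper is our own addition and mirrors the ICD-10-CM helper above.

```python
def infer_medications_and_concepts(text):
    """Call InferRxNorm and InferSNOMEDCT on the same clinical text."""
    import boto3  # lazy import so the helper below runs without AWS access
    cm = boto3.client("comprehendmedical")
    rx_entities = cm.infer_rx_norm(Text=text)["Entities"]
    snomed_entities = cm.infer_snomedct(Text=text)["Entities"]
    return rx_entities, snomed_entities

def top_rxnorm_codes(rx_entities):
    """Pair each medication with its highest-scoring RxNorm concept code."""
    pairs = []
    for e in rx_entities:
        concepts = e.get("RxNormConcepts", [])
        if concepts:
            best = max(concepts, key=lambda c: c["Score"])
            pairs.append((e["Text"], best["Code"]))
    return pairs
```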
Perform PII and PHI redaction
Insurance claims packages are subject to strict privacy regulations because they contain both PII and PHI data. Insurance carriers can reduce compliance risk by redacting information like policy numbers or the patient’s name.
Let’s look at an example of a patient’s discharge summary. We use the Amazon Comprehend DetectPiiEntities API to detect PII entities within the document and protect the patient’s privacy by redacting these entities:
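A minimal sketch of detection plus text-level redaction. The `redact` helper, which masks each entity span using its character offsets, is our own addition; the original post redacts at the image level using bounding boxes instead.

```python
def detect_pii(text):
    """Call the Amazon Comprehend DetectPiiEntities API."""
    import boto3  # lazy import so redact() below runs without AWS access
    comprehend = boto3.client("comprehend")
    return comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]

def redact(text, entities, mask="*"):
    """Mask each detected PII span; work right to left so offsets stay valid."""
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = e["BeginOffset"], e["EndOffset"]
        text = text[:start] + mask * (end - start) + text[end:]
    return text
```

For example, `redact("Patient John Doe, ID 12345", detect_pii(...))` would mask both the name and the ID spans.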
We get the following PII entities in the response from the detect_pii_entities() API:
We can then redact the PII entities detected in the documents by using the bounding box geometry of the entities from the document. For that, we use a helper tool called amazon-textract-overlayer. For more information, refer to Textract-Overlayer. The following screenshots compare a document before and after redaction.
Similar to the Amazon Comprehend DetectPiiEntities API, we can also use the DetectPHI API to detect PHI data in the clinical text being examined. For more information, refer to Detect PHI.
Review and validation phase
In the document review and validation phase, we can now verify whether the claim package meets the business’s requirements, because we have all the information collected from the documents in the package from earlier stages. We can do this by introducing a human in the loop to review and validate all the fields, or by an auto-approval process for low-dollar claims, before sending the package to downstream applications. We can use Amazon Augmented AI (Amazon A2I) to automate the human review process for insurance claims processing.
Now that we have all required data extracted and normalized from claims processing using AI services for IDP, we can extend the solution to integrate with AWS Analytics services such as AWS Glue and Amazon Redshift to solve additional use cases and provide further analytics and visualizations.
Detect fraudulent insurance claims
In this post, we implement a serverless architecture where the extracted and processed data is stored in a data lake and is used to detect fraudulent insurance claims using ML. We use Amazon Simple Storage Service (Amazon S3) to store the processed data. We can then use AWS Glue or Amazon EMR to cleanse the data and add additional fields to make it consumable for reporting and ML. After that, we use Amazon Redshift ML to build a fraud detection ML model. Finally, we build reports using Amazon QuickSight to get insights into the data.
Set up the Amazon Redshift external schema
For the purposes of this example, we created a sample dataset that emulates the output of an ETL (extract, transform, and load) process, and we use the AWS Glue Data Catalog as the metadata catalog. First, we create a database named idp_demo in the Data Catalog and an external schema in Amazon Redshift called idp_insurance_demo (see the following code). We use an AWS Identity and Access Management (IAM) role to grant the Amazon Redshift cluster permissions to access Amazon S3 and Amazon SageMaker. For more information about how to set up this IAM role with least privilege, refer to Cluster and configure setup for Amazon Redshift ML administration.
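A sketch of this step; the IAM role ARN is a placeholder for the role you configured with access to Amazon S3 and Amazon SageMaker.

```sql
-- Map the AWS Glue Data Catalog database to an external schema in Amazon Redshift.
-- The IAM role ARN is a placeholder; replace it with your own role.
CREATE EXTERNAL SCHEMA idp_insurance_demo
FROM DATA CATALOG
DATABASE 'idp_demo'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftMLRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```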
Create Amazon Redshift external table
The next step is to create an external table in Amazon Redshift referencing the S3 location where the file is located. In this case, our file is a comma-separated text file. We also want to skip the header row from the file, which can be configured in the table properties section. See the following code:
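A sketch of the external table definition. The column names, the S3 bucket, and the path are illustrative placeholders; match them to your actual CSV layout.

```sql
-- Column names are illustrative; align them with your CSV file.
CREATE EXTERNAL TABLE idp_insurance_demo.claims (
    id                INTEGER,
    claim_date        DATE,
    state             VARCHAR(2),
    insurance_company VARCHAR(128),
    total_charges     DECIMAL(12,2),
    fraud             INTEGER
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://<your-bucket>/idp-demo/claims/'
-- Skip the header row of the CSV file
TABLE PROPERTIES ('skip.header.line.count' = '1');
```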
Create training and test datasets
After we create the external table, we prepare our dataset for ML by splitting it into a training set and a test set. We create a new external table called claim_train, which consists of all records with ID <= 85000 from the claims table. This is the training set that we train our ML model on. We create another external table called claim_test that consists of all records with ID > 85000 to be the test set that we test the ML model on:
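One way to sketch this split is with CREATE EXTERNAL TABLE AS, which writes the query result back to Amazon S3; the bucket paths are placeholders.

```sql
-- Training set: records with id <= 85000
CREATE EXTERNAL TABLE idp_insurance_demo.claim_train
STORED AS PARQUET
LOCATION 's3://<your-bucket>/idp-demo/claim_train/'
AS SELECT * FROM idp_insurance_demo.claims WHERE id <= 85000;

-- Test set: the remaining records
CREATE EXTERNAL TABLE idp_insurance_demo.claim_test
STORED AS PARQUET
LOCATION 's3://<your-bucket>/idp-demo/claim_test/'
AS SELECT * FROM idp_insurance_demo.claims WHERE id > 85000;
```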
Create an ML model with Amazon Redshift ML
Now we create the model using the CREATE MODEL command (see the following code). We select the relevant columns from the claim_train table that can determine a fraudulent transaction. The goal of this model is to predict the value of the fraud column; therefore, fraud is added as the prediction target. After the model is trained, it creates a function named insurance_fraud_model. This function is used for inference while running SQL statements to predict the value of the fraud column for new records.
Evaluate ML model metrics
After we create the model, we can run queries to check its accuracy. We use the insurance_fraud_model function to predict the value of the fraud column for new records. Run the following query on the claim_test table to create a confusion matrix:
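A sketch of such a query, assuming the illustrative feature columns used earlier (state, insurance_company, total_charges):

```sql
-- Compare the actual fraud flag with the model's prediction on the test set.
SELECT fraud AS actual,
       insurance_fraud_model(state, insurance_company, total_charges) AS predicted,
       COUNT(*) AS record_count
FROM idp_insurance_demo.claim_test
GROUP BY 1, 2
ORDER BY 1, 2;
```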
Detect fraud using the ML model
After we create the new model, as new claims data is inserted into the data warehouse or data lake, we can use the insurance_fraud_model function to flag fraudulent transactions. We do this by first loading the new data into a temporary table. Then we use the insurance_fraud_model function to calculate the fraud flag for each new transaction and insert the data along with the flag into the final table, which in this case is the claims table.
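A sketch of this scoring flow; the staging columns, bucket path, and IAM role ARN are illustrative placeholders, and the final claims table is assumed to be writable from Amazon Redshift.

```sql
-- Stage the incoming claims (columns are illustrative).
CREATE TEMP TABLE claims_staging (
    id                INTEGER,
    claim_date        DATE,
    state             VARCHAR(2),
    insurance_company VARCHAR(128),
    total_charges     DECIMAL(12,2)
);

COPY claims_staging
FROM 's3://<your-bucket>/idp-demo/new-claims/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftMLRole'
CSV IGNOREHEADER 1;

-- Score each record and land it, flag included, in the final claims table.
INSERT INTO claims
SELECT id, claim_date, state, insurance_company, total_charges,
       insurance_fraud_model(state, insurance_company, total_charges)
FROM claims_staging;
```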
Visualize the claims data
When the data is available in Amazon Redshift, we can create visualizations using QuickSight. We can then share the QuickSight dashboards with business users and analysts. To create the QuickSight dashboard, you first need to create an Amazon Redshift dataset in QuickSight. For instructions, refer to Creating a dataset from a database.
After you create the dataset, you can create a new analysis in QuickSight using the dataset. The following are some sample reports we created:
- Total number of claims by state, grouped by the fraud field – This chart shows us the proportion of fraudulent transactions compared to the total number of transactions in a particular state.
- Sum of the total dollar value of the claims, grouped by the fraud field – This chart shows us the proportion of the dollar amount of fraudulent transactions compared to the total dollar amount of transactions in a particular state.
- Total number of transactions per insurance company, grouped by the fraud field – This chart shows us how many claims were filed for each insurance company and how many of them are fraudulent.
- Total sum of fraudulent transactions by state displayed on a US map – This chart shows only the fraudulent transactions and displays the total charges for those transactions by state on the map. A darker shade of blue indicates higher total charges. We can further analyze this by city within that state and by zip codes within the city to better understand the trends.
Clean up
To avoid incurring future charges to your AWS account, delete the resources that you provisioned during setup by following the instructions in the Cleanup section in our repo.
Conclusion
In this two-part series, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We explored a claims processing use case in the insurance industry and how IDP can help automate this use case using services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I. In Part 1, we demonstrated how to use AWS AI services for document extraction. In Part 2, we extended the extraction phase and performed data enrichment. Finally, we extended the structured data extracted from IDP for further analytics, and created visualizations to detect fraudulent claims using AWS Analytics services.
We recommend that you review the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and follow the guidelines provided. To learn more about the pricing of the solution, review the pricing details of Amazon Textract, Amazon Comprehend, and Amazon A2I.
About the Authors
Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.
Uday Narayanan is an Analytics Specialist Solutions Architect at AWS. He enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are data analytics, big data systems, and machine learning. In his spare time, he enjoys playing sports, binge-watching TV shows, and traveling.
Sonali Sahu leads the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.