
Feature engineering at scale for healthcare and life sciences with Amazon SageMaker Data Wrangler

Machine learning (ML) is disrupting a lot of industries at an unprecedented pace. The healthcare and life sciences (HCLS) industry has been going through a rapid evolution in recent years embracing ML across a multitude of use cases for delivering quality care and improving patient outcomes.

In a typical ML lifecycle, data engineers and scientists spend the majority of their time on the data preparation and feature engineering steps before even getting started with the process of model building and training. Having a tool that can lower the barrier to entry for data preparation, thereby improving productivity, is a highly desirable ask for these personas. Amazon SageMaker Data Wrangler is purpose built by AWS to reduce the learning curve and enable data practitioners to accomplish data preparation, cleaning, and feature engineering tasks with less effort and time. It offers a GUI interface with many built-in functions and integrations with other AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon SageMaker Feature Store, as well as partner data sources including Snowflake and Databricks.

In this post, we demonstrate how to use Data Wrangler to prepare healthcare data for training a model to predict heart failure, given a patient’s demographics, prior medical conditions, and lab test result history.

Solution overview

The solution consists of the following steps:

  1. Acquire a healthcare dataset as input to Data Wrangler.
  2. Use Data Wrangler’s built-in transformation functions to transform the dataset. This includes dropping columns, featurizing date/time, joining datasets, imputing missing values, encoding categorical variables, scaling numeric values, balancing the dataset, and more.
  3. Use Data Wrangler’s custom transform function (Pandas or PySpark code) to implement additional transformations beyond the built-in ones and demonstrate the extensibility of Data Wrangler. This includes filtering rows, grouping data, forming new dataframes based on conditions, and more.
  4. Use Data Wrangler’s built-in visualization functions to perform visual analysis. This includes target leakage, feature correlation, quick model, and more.
  5. Use Data Wrangler’s built-in export options to export the transformed dataset to Amazon S3.
  6. Launch a Jupyter notebook to use the transformed dataset in Amazon S3 as input to train a model.

Generate a dataset

Now that we have settled on the ML problem statement, we first set our sights on acquiring the data we need. Research studies such as Heart Failure Prediction may provide data that’s already in good shape. However, we often encounter scenarios where the data is quite messy and requires joining, cleansing, and several other transformations that are very specific to the healthcare domain before it can be used for ML training. We want to find or generate data that is messy enough and walk you through the steps of preparing it using Data Wrangler. With that in mind, we picked Synthea as a tool to generate synthetic data that fits our goal. Synthea is an open-source synthetic patient generator that models the medical history of synthetic patients. To generate your dataset, complete the following steps:

  1. Follow the instructions in the quick start documentation to create an Amazon SageMaker Studio domain and launch Studio.
    This is a prerequisite step. It is optional if Studio is already set up in your account.
  2. After Studio is launched, on the Launcher tab, choose System terminal.
    This launches a terminal session that gives you a command line interface to work with.
  3. To install Synthea and generate the dataset in CSV format, run the following commands in the launched terminal session:
    $ sudo yum install -y java-1.8.0-openjdk-devel
    $ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
    $ export PATH=$JAVA_HOME/bin:$PATH
    $ git clone https://github.com/synthetichealth/synthea
    $ cd synthea
    $ git checkout v3.0.0
    $ ./run_synthea --exporter.csv.export=true -p 10000

We supply a parameter to generate the datasets with a population size of 10,000. Note that the size parameter denotes the number of living members of the population. Synthea also generates data for deceased members of the population, which may add a few extra records on top of the specified sample size.

Wait until the data generation is complete. This step usually takes around an hour or less. Synthea generates multiple datasets, including patients, medications, allergies, conditions, and more. For this post, we use three of the resulting datasets:

  • patients.csv – This dataset is about 3.2 MB and contains approximately 11,000 rows of patient data (25 columns including patient ID, birthdate, gender, address, and more)
  • conditions.csv – This dataset is about 47 MB and contains approximately 370,000 rows of medical condition data (six columns including patient ID, condition start date, condition code, and more)
  • observations.csv – This dataset is about 830 MB and contains approximately 5 million rows of observation data (eight columns including patient ID, observation date, observation code, value, and more)

There is a one-to-many relationship between the patients and conditions datasets. There is also a one-to-many relationship between the patients and observations datasets. For a detailed data dictionary, refer to CSV File Data Dictionary.

  1. To upload the generated datasets to a source bucket in Amazon S3, run the following commands in the terminal session:
    $ cd ./output/csv
    $ aws s3 sync . s3://<source bucket name>/

Launch Data Wrangler

Choose SageMaker resources in the navigation pane in Studio, and on the Projects menu, choose Data Wrangler to create a Data Wrangler data flow. For detailed steps on how to launch Data Wrangler from within Studio, refer to Get Started with Data Wrangler.


Import the data

To import the data, complete the following steps:

  1. Choose Amazon S3 and locate the patients.csv file in the S3 bucket.
  2. In the Details pane, choose First K for Sampling.
  3. Enter 1100 for Sample size.
    In the preview pane, Data Wrangler pulls the first 100 rows from the dataset and lists them as a preview.
  4. Choose Import.
    Data Wrangler selects the first 1,100 patients from the total patients (11,000 rows) generated by Synthea and imports the data. The sampling approach lets Data Wrangler only process the sample data. It enables us to develop our data flow with a smaller dataset, which results in quicker processing and a shorter feedback loop. After we create the data flow, we can submit the developed recipe to a SageMaker processing job to horizontally scale out the processing for the full or larger dataset in a distributed fashion.
  5. Repeat this process for the conditions and observations datasets.
    1. For the conditions dataset, enter 37000 for Sample size, which is 1/10 of the total 370,000 rows generated by Synthea.
    2. For the observations dataset, enter 500000 for Sample size, which is 1/10 of the total 5 million observation rows generated by Synthea.

You should see three datasets as shown in the following screenshot.


Transform the data

Data transformation is the process of changing the structure, value, or format of one or more columns in the dataset. The process is usually developed by a data engineer, and the logic behind a given transformation can be hard to decipher for people with less data engineering experience. Data transformation is part of the broader feature engineering process, and the correct sequence of steps is another important criterion to keep in mind while devising such recipes.

Data Wrangler is designed to be a low-code tool to reduce the barrier of entry for effective data preparation. It comes with over 300 preconfigured data transformations for you to choose from without writing a single line of code. In the following sections, we see how to transform the imported datasets in Data Wrangler.

Drop columns in patients.csv

We first drop some columns from the patients dataset. Dropping redundant columns removes non-relevant information from the dataset and helps us reduce the amount of computing resources required to process the dataset and train a model. In this section, we drop columns such as SSN or passport number based on the common-sense judgment that these columns have no predictive value. In other words, they don’t help our model predict heart failure. Our study is also not concerned with the influence of other columns, such as birthplace or healthcare expenses, on a patient’s heart failure, so we drop them as well. Redundant columns can also be identified by running the analyses built into Data Wrangler, such as target leakage, feature correlation, and multicollinearity. For more details on the supported analysis types, refer to Analyze and Visualize. Additionally, you can use the Data Quality and Insights Report to perform automated analyses on the datasets to arrive at a list of redundant columns to eliminate.

  1. Choose the plus sign next to Data types for the patients.csv dataset and choose Add transform.
  2. Choose Add step and choose Manage columns.
  3. For Transform, choose Drop column.
  4. For Columns to drop, choose the following columns:
    1. SSN
    2. DRIVERS
    3. PASSPORT
    4. PREFIX
    5. FIRST
    6. LAST
    7. SUFFIX
    8. MAIDEN
    9. RACE
    10. ETHNICITY
    11. BIRTHPLACE
    12. ADDRESS
    13. CITY
    14. STATE
    15. COUNTY
    16. ZIP
    17. LAT
    18. LON
    19. HEALTHCARE_EXPENSES
    20. HEALTHCARE_COVERAGE
  5. Choose Preview to review the transformed dataset, then choose Add.
    You should see the step Drop column in your list of transforms.
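Outside the Data Wrangler UI, the same cleanup can be sketched in a few lines of Pandas. The miniature frame below is hypothetical (the real patients.csv has 25 columns); only the drop pattern matters:

```python
import pandas as pd

# Hypothetical miniature patients frame with a subset of the Synthea columns
patients = pd.DataFrame({
    "Id": ["p1", "p2"],
    "BIRTHDATE": ["1960-01-01", "1975-06-15"],
    "GENDER": ["M", "F"],
    "SSN": ["999-11-1111", "999-22-2222"],
    "BIRTHPLACE": ["Boston", "Worcester"],
})

# Columns judged to have no predictive value are dropped;
# errors="ignore" tolerates names absent from this toy frame
drop_cols = ["SSN", "DRIVERS", "PASSPORT", "BIRTHPLACE", "HEALTHCARE_EXPENSES"]
patients = patients.drop(columns=drop_cols, errors="ignore")
```

The built-in transform achieves the same result without any code; the snippet is only a mental model of what the Drop column step does.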

Featurize date/time in patients.csv

Now we use the Featurize date/time function to generate a new Year feature from the BIRTHDATE column in the patients dataset. We use the new feature in a subsequent step to calculate a patient’s age at the time an observation takes place.

  1. In the Transforms pane of your Drop column page for the patients dataset, choose Add step.
  2. Choose the Featurize date/time transform.
  3. Choose Extract columns.
  4. For Input columns, add the column BIRTHDATE.
  5. Select Year and deselect Month, Day, Hour, Minute, and Second.
  6. Choose Preview, then choose Add.
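The equivalent of this transform in plain Pandas is a one-liner; a sketch on a toy frame (the derived column name BIRTHDATE_year mirrors Data Wrangler’s naming convention):

```python
import pandas as pd

# Toy frame standing in for the patients dataset
df = pd.DataFrame({"BIRTHDATE": ["1960-01-01", "1975-06-15"]})

# Featurize date/time with only Year selected: derive a year column
df["BIRTHDATE_year"] = pd.to_datetime(df["BIRTHDATE"]).dt.year
```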

Add transforms in observations.csv

Data Wrangler supports custom transforms using Python (user-defined functions), PySpark, Pandas, or PySpark (SQL). You can choose your transform type based on your familiarity with each option and your preference. For the latter three options, Data Wrangler exposes the variable df for you to access the dataframe and apply transformations to it. For a detailed explanation and examples, refer to Custom Transforms. In this section, we add three custom transforms to the observations dataset.

  1. Add a transform to observations.csv and drop the DESCRIPTION column.
  2. Choose Preview, then choose Add.
  3. In the Transforms pane, choose Add step and choose Custom transform.
  4. On the drop-down menu, choose Python (Pandas).
  5. Enter the following code:
    df = df[df["CODE"].isin(['8867-4','8480-6','8462-4','39156-5','777-3'])]

    These are LOINC codes that correspond to the following observations we’re interested in using as features for predicting heart failure:

    Heart rate: 8867-4
    Systolic blood pressure: 8480-6
    Diastolic blood pressure: 8462-4
    Body mass index (BMI): 39156-5
    Platelets [#/volume] in blood: 777-3

  6. Choose Preview, then choose Add.
  7. Add a transform to extract Year and Quarter from the DATE column.
  8. Choose Preview, then choose Add.
  9. Choose Add step and choose Custom transform.
  10. On the drop-down menu, choose Python (PySpark).
    The five types of observations may not always be recorded on the same date. For example, a patient may visit their family doctor on January 21 and have their systolic blood pressure, diastolic blood pressure, heart rate, and body mass index measured and recorded. However, a lab test that includes platelets may be done at a later date, on February 2. Therefore, it’s not always possible to join dataframes by the observation date. Here we join dataframes at a coarser granularity, on a quarterly basis.
  11. Enter the following code:
    from pyspark.sql.functions import col

    systolic_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "systolic")
          .filter(col("code") == "8480-6")
    )

    diastolic_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "diastolic")
          .filter(col("code") == "8462-4")
    )

    hr_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "hr")
          .filter(col("code") == "8867-4")
    )

    bmi_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "bmi")
          .filter(col("code") == "39156-5")
    )

    platelets_df = (
        df.select("patient", "DATE_year", "DATE_quarter", "value")
          .withColumnRenamed("value", "platelets")
          .filter(col("code") == "777-3")
    )

    df = (
        systolic_df.join(diastolic_df, ["patient", "DATE_year", "DATE_quarter"])
                   .join(hr_df, ["patient", "DATE_year", "DATE_quarter"])
                   .join(bmi_df, ["patient", "DATE_year", "DATE_quarter"])
                   .join(platelets_df, ["patient", "DATE_year", "DATE_quarter"])
    )

  12. Choose Preview, then choose Add.
  13. Choose Add step, then choose Manage rows.
  14. For Transform, choose Drop duplicates.
  15. Choose Preview, then choose Add.
  16. Choose Add step and choose Custom transform.
  17. On the drop-down menu, choose Python (Pandas).
  18. Enter the following code to take an average of data points that share the same time value:
    import pandas as pd
    df.loc[:, df.columns != 'patient']=df.loc[:, df.columns != 'patient'].apply(pd.to_numeric)
    df = df.groupby(['patient','DATE_year','DATE_quarter']).mean().round(0).reset_index()

  19. Choose Preview, then choose Add.

Join patients.csv and observations.csv

In this step, we showcase how to effectively and easily perform complex joins on datasets without writing any code via Data Wrangler’s powerful UI. To learn more about the supported types of joins, refer to Transform Data.

  1. To the right of Transform: patients.csv, choose the plus sign next to Steps and choose Join.
    You can see the transformed patients.csv file listed under Datasets in the left pane.
  2. To the right of Transform: observations.csv, choose Steps to initiate the join operation.
    The transformed observations.csv file is now listed under Datasets in the left pane.
  3. Choose Configure.
  4. For Join type, choose Inner.
  5. For Left, choose Id.
  6. For Right, choose patient.
  7. Choose Preview, then choose Add.
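Conceptually, this UI join is a Pandas inner merge on Id (left) and patient (right). A sketch with hypothetical toy frames shows the one-to-many behavior: each patient row repeats once per matching observation, and patients without observations are dropped:

```python
import pandas as pd

# Toy stand-ins for the transformed patients and observations datasets
patients = pd.DataFrame({"Id": ["p1", "p2", "p3"], "GENDER": ["M", "F", "F"]})
observations = pd.DataFrame({
    "patient": ["p1", "p1", "p2"],
    "systolic": [120, 130, 110],
})

# Inner join on Id (left) and patient (right): only patients with at
# least one matching observation row survive; p3 is dropped
joined = patients.merge(observations, left_on="Id", right_on="patient", how="inner")
```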

Add a custom transform to the joined datasets

In this step, we calculate a patient’s age at the time of observation. We also drop columns that are no longer needed.

  1. Choose the plus sign next to 1st Join and choose Add transform.
  2. Add a custom transform in Pandas:
    df['age'] = df['DATE_year'] - df['BIRTHDATE_year']
    df = df.drop(columns=['BIRTHDATE','DEATHDATE','BIRTHDATE_year','patient'])

  3. Choose Preview, then choose Add.

Add custom transforms to conditions.csv

  1. Choose the plus sign next to Transform: conditions.csv and choose Add transform.
  2. Add a custom transform in Pandas:
    df = df[df["CODE"].isin(['84114007', '88805009', '59621000', '44054006', '53741008', '449868002', '49436004'])]
    df = df.drop(columns=['DESCRIPTION','ENCOUNTER','STOP'])

Note: As we demonstrated earlier, you can drop columns either using custom code or using the built-in transformations provided by Data Wrangler. Custom transformations within Data Wrangler provide the flexibility to bring your own transformation logic in the form of code snippets in the supported frameworks. These snippets can later be searched and applied as needed.

The codes in the preceding transform are SNOMED-CT codes that correspond to the following conditions. The heart failure or chronic congestive heart failure condition becomes the label. We use the remaining conditions as features for predicting heart failure. We also drop a few columns that are no longer needed.

Heart failure: 84114007
Chronic congestive heart failure: 88805009
Hypertension: 59621000
Diabetes: 44054006
Coronary Heart Disease: 53741008
Smokes tobacco daily: 449868002
Atrial Fibrillation: 49436004

  1. Next, let’s add a custom transform in PySpark:
    from pyspark.sql.functions import col, when

    heartfailure_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "heartfailure")
          .filter((col("code") == "84114007") | (col("code") == "88805009"))
    )

    hypertension_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "hypertension")
          .filter(col("code") == "59621000")
    )

    diabetes_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "diabetes")
          .filter(col("code") == "44054006")
    )

    coronary_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "coronary")
          .filter(col("code") == "53741008")
    )

    smoke_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "smoke")
          .filter(col("code") == "449868002")
    )

    atrial_df = (
        df.select("patient", "start")
          .withColumnRenamed("start", "atrial")
          .filter(col("code") == "49436004")
    )

    df = (
        heartfailure_df
        .join(hypertension_df, ["patient"], "leftouter")
        .withColumn("has_hypertension", when(col("hypertension") < col("heartfailure"), 1).otherwise(0))
        .join(diabetes_df, ["patient"], "leftouter")
        .withColumn("has_diabetes", when(col("diabetes") < col("heartfailure"), 1).otherwise(0))
        .join(coronary_df, ["patient"], "leftouter")
        .withColumn("has_coronary", when(col("coronary") < col("heartfailure"), 1).otherwise(0))
        .join(smoke_df, ["patient"], "leftouter")
        .withColumn("has_smoke", when(col("smoke") < col("heartfailure"), 1).otherwise(0))
        .join(atrial_df, ["patient"], "leftouter")
        .withColumn("has_atrial", when(col("atrial") < col("heartfailure"), 1).otherwise(0))
    )

    We perform a left outer join to keep all entries in the heart failure dataframe. A new column has_xxx is calculated for each condition other than heart failure based on the condition’s start date. We’re only interested in medical conditions that were recorded prior to the heart failure and use them as features for predicting heart failure.

  2. Add a built-in Manage columns transform to drop the redundant columns that are no longer needed:
    1. hypertension
    2. diabetes
    3. coronary
    4. smoke
    5. atrial
  3. Extract Year and Quarter from the heartfailure column.
    This matches the granularity we used earlier in the transformation of the observations dataset.
  4. We should have a total of 6 steps for conditions.csv.

Join conditions.csv to the joined dataset

We now perform a new join to join the conditions dataset to the joined patients and observations dataset.

  1. Choose Transform: 1st Join.
  2. Choose the plus sign and choose Join.
  3. Choose Steps next to Transform: conditions.csv.
  4. Choose Configure.
  5. For Join type, choose Left outer.
  6. For Left, choose Id.
  7. For Right, choose patient.
  8. Choose Preview, then choose Add.

Add transforms to the joined datasets

Now that we have all three datasets joined, let’s apply some additional transformations.

  1. Add the following custom transform in PySpark so has_heartfailure becomes our label column:
    from pyspark.sql.functions import col, when
    df = (
        df.withColumn("has_heartfailure", when(col("heartfailure").isNotNull(), 1).otherwise(0))
    )

  2. Add the following custom transformation in PySpark:
    from pyspark.sql.functions import col
    
    df = (
        df.filter(
          (col("has_heartfailure") == 0) | 
          ((col("has_heartfailure") == 1) & ((col("date_year") <= col("heartfailure_year")) | ((col("date_year") == col("heartfailure_year")) & (col("date_quarter") <= col("heartfailure_quarter")))))
        )
    )

    We’re only interested in observations recorded before the heart failure condition is diagnosed and use them as features for predicting heart failure. Observations taken after heart failure is diagnosed may be affected by the medication a patient takes, so we want to exclude them.

  3. Drop the redundant columns that are no longer needed:
    1. Id
    2. DATE_year
    3. DATE_quarter
    4. patient
    5. heartfailure
    6. heartfailure_year
    7. heartfailure_quarter
  4. On the Analysis tab, for Analysis type, choose Table summary.
    A quick scan through the summary shows that the MARITAL column has missing data.
  5. Choose the Data tab and add a step.
  6. Choose Handle missing.
  7. For Transform, choose Fill missing.
  8. For Input columns, choose MARITAL.
  9. For Fill value, enter S.
    Our strategy here is to assume the patient is single if the marital status is missing. You can adopt a different strategy.
  10. Choose Preview, then choose Add.
  11. Fill the missing values with 0 for has_hypertension, has_diabetes, has_coronary, has_smoke, and has_atrial.
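The same fill strategy can be sketched with Pandas `fillna` (a toy frame with one has_* column standing in for all five):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MARITAL": ["M", np.nan, "S"],
    "has_hypertension": [1.0, np.nan, 0.0],
})

# Same strategy as the built-in Fill missing transform: assume
# single ("S") when marital status is absent, and 0 for has_* flags
df["MARITAL"] = df["MARITAL"].fillna("S")
df["has_hypertension"] = df["has_hypertension"].fillna(0)
```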

MARITAL and GENDER are categorical variables. Data Wrangler has a built-in function to encode categorical variables.

  1. Add a step and choose Encode categorical.
  2. For Transform, choose One-hot encode.
  3. For Input columns, choose MARITAL.
  4. For Output style, choose Columns.
    This output style produces the encoded values in separate columns.
  5. Choose Preview, then choose Add.
  6. Repeat these steps for the GENDER column.

The one-hot encoding splits the Marital column into Marital_M (married) and Marital_S (single), and splits the Gender column into Gender_M (male) and Gender_F (female). Because Marital_M and Marital_S are mutually exclusive (as are Gender_M and Gender_F), we can drop one column from each pair to avoid redundant features.

  1. Drop Marital_S and Gender_F.
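In Pandas, the encode-then-drop pattern looks like the following sketch (`pd.get_dummies` produces uppercase column names here, unlike Data Wrangler’s Marital_M style, but the idea is identical):

```python
import pandas as pd

df = pd.DataFrame({"MARITAL": ["M", "S", "M"], "GENDER": ["M", "F", "F"]})

# One-hot encode, then drop one column from each mutually exclusive
# pair (the UI equivalent of dropping Marital_S and Gender_F)
encoded = pd.get_dummies(df, columns=["MARITAL", "GENDER"])
encoded = encoded.drop(columns=["MARITAL_S", "GENDER_F"])
```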

Numeric features such as systolic, heart rate, and age have different unit standards. For a linear regression-based model, we need to normalize these numeric features first. Otherwise, some features with higher absolute values may have an unwarranted advantage over other features with lower absolute values and result in poor model performance. Data Wrangler has the built-in transform Min-max scaler to normalize the data. For a decision tree-based classification model, normalization isn’t required. Our study is a classification problem, so we don’t need to apply normalization.

Imbalanced classes are a common problem in classification. Imbalance happens when the training dataset contains a severely skewed class distribution. For example, when our dataset contains disproportionally more patients without heart failure than patients with heart failure, it can cause the model to be biased toward predicting no heart failure and perform poorly. Data Wrangler has a built-in function to tackle the problem.

  1. Add a custom transform in Pandas to convert the columns from object type to numeric type:
    import pandas as pd
    df=df.apply(pd.to_numeric)

  2. Choose the Analysis tab.
  3. For Analysis type, choose Histogram.
  4. For X axis, choose has_heartfailure.
  5. Choose Preview.
    It’s obvious that we have an imbalanced class (more data points labeled as no heart failure than data points labeled as heart failure).
  6. Go back to the Data tab. Choose Add step and choose Balance data.
  7. For Target column, choose has_heartfailure.
  8. For Desired ratio, enter 1.
  9. For Transform, choose SMOTE.
    SMOTE stands for Synthetic Minority Over-sampling Technique. It’s a technique to create new minority instances and add them to the dataset to reach class balance. For detailed information, refer to SMOTE: Synthetic Minority Over-sampling Technique.
  10. Choose Preview, then choose Add.
  11. Repeat the histogram analysis (steps 2–5). The result is a balanced class.
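To build intuition for what the built-in SMOTE transform does, here is a deliberately minimal NumPy sketch (not the production algorithm, and the feature values are made up): each synthetic minority point is a random interpolation between a minority sample and one of its nearest minority neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_oversample(X_minority, n_new, k=3):
    """Minimal SMOTE-style sketch: each synthetic point interpolates
    between a minority sample and one of its k nearest minority neighbors."""
    n = len(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Euclidean distances from sample i to every minority sample
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)

# Four hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 3.0], [1.2, 2.2]])
new_points = smote_oversample(minority, n_new=4)
```

Because the synthetic points are interpolations, they always stay inside the region spanned by the existing minority samples rather than duplicating them exactly.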

Visualize target leakage and feature correlation

Next, we’re going to perform a few visual analyses using Data Wrangler’s rich toolset of advanced ML-supported analysis types. First, we look at target leakage. Target leakage occurs when data in the training dataset is strongly correlated with the target label, but isn’t available in real-world data at inference time.

  1. On the Analysis tab, for Analysis type, choose Target leakage.
  2. For Problem type, choose Classification.
  3. For Target, choose has_heartfailure.
  4. Choose Preview.
    Based on the analysis, hr is a target leakage, and we drop it in a subsequent step. age is also flagged as target leakage. It’s reasonable to say that a patient’s age will be available at inference time, so we keep age as a feature. systolic and diastolic are flagged as likely target leakage as well. We expect to have the two measurements at inference time, so we keep them as features.
  5. Choose Add to add the analysis.

Then, we look at feature correlation. We want to select features that are correlated with the target but are uncorrelated among themselves.

  1. On the Analysis tab, for Analysis type, choose Feature correlation.
  2. For Correlation type, choose Linear.
  3. Choose Preview.

The coefficient scores indicate strong correlations between the following pairs:

  • systolic and diastolic
  • bmi and age
  • has_hypertension and has_heartfailure (label)

For features that are strongly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To mitigate the correlation, we can simply remove one feature from each pair. We drop diastolic and bmi and keep systolic and age in a subsequent step.
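The linear correlation analysis corresponds to a Pearson correlation matrix. A sketch on synthetic data (the values are fabricated to mimic the strong systolic/diastolic relationship the analysis reports):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

systolic = rng.normal(120, 15, n)
# diastolic is constructed to be strongly correlated with systolic
diastolic = 0.5 * systolic + rng.normal(0, 2, n)
hr = rng.normal(70, 10, n)  # independent of the other two

df = pd.DataFrame({"systolic": systolic, "diastolic": diastolic, "hr": hr})
corr = df.corr()  # Pearson (linear) correlation matrix
```

A high off-diagonal entry such as `corr.loc["systolic", "diastolic"]` is the signal that one of the two columns can be dropped.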

Drop diastolic and bmi columns

Add additional transform steps to drop the hr, diastolic, and bmi columns using the built-in transform.
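In code form, the same drop is a one-liner. Data Wrangler’s custom transform (Python Pandas) exposes the current dataset as a DataFrame named df, so a hedged sketch of the equivalent step (with a made-up sample frame standing in for the real dataset) looks like:

```python
import pandas as pd

# Hypothetical sample frame standing in for the Data Wrangler dataset;
# inside a custom transform, Data Wrangler supplies this frame as `df`.
df = pd.DataFrame({
    "hr": [72, 110], "diastolic": [80, 95], "bmi": [24.0, 31.5],
    "systolic": [120, 140], "age": [45, 62],
})

# Drop the leakage column and one member of each correlated pair;
# assigning back to `df` is how a custom transform applies the change.
df = df.drop(columns=["hr", "diastolic", "bmi"])
print(list(df.columns))  # → ['systolic', 'age']
```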

Generate the Data Quality and Insights Report

AWS recently announced the new Data Quality and Insights Report feature in Data Wrangler. This report automatically verifies data quality and detects abnormalities in your data. Data scientists and data engineers can use this tool to efficiently and quickly apply domain knowledge to process datasets for ML model training. This step is optional. To generate this report on our datasets, complete the following steps:

  1. On the Analysis tab, for Analysis type, choose Data Quality and Insights Report.
  2. For Target column, choose has_heartfailure.
  3. For Problem type, choose Classification.
  4. Choose Create.

In a few minutes, it generates a report with a summary, visuals, and recommendations.

Generate a Quick Model analysis

We have completed our data preparation, cleaning, and feature engineering. Data Wrangler has a built-in function that provides a rough estimate of the expected prediction quality and the predictive power of the features in our dataset.

  1. On the Analysis tab, for Analysis type, choose Quick Model.
  2. For Label, choose has_heartfailure.
  3. Choose Preview.

As per our Quick Model analysis, the feature has_hypertension has the highest feature importance score among all features.
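Quick Model works by training a lightweight model on a sample and reporting per-feature importance. A rough local analogue using scikit-learn’s RandomForestClassifier on synthetic toy data (the column names mirror our dataset; this is not Data Wrangler’s exact algorithm):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 400

# Synthetic toy columns: the label follows has_hypertension 90% of the
# time, while age is pure noise, mirroring the Quick Model result above.
has_hypertension = rng.integers(0, 2, n)
noise = rng.integers(0, 2, n)
has_heartfailure = np.where(rng.random(n) < 0.9, has_hypertension, noise)
age = rng.integers(40, 80, n)

X = np.column_stack([has_hypertension, age])
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, has_heartfailure)

# Impurity-based importances, normalized to sum to 1.
importances = dict(zip(["has_hypertension", "age"],
                       model.feature_importances_))
print(importances)
```

As expected, has_hypertension dominates the importance scores, just as it tops the Quick Model chart.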

Export the data and train the model

Now let’s export the transformed, ML-ready features to a destination S3 bucket and apply the entire feature engineering pipeline we have created so far (built on a sample) to the full dataset in a distributed fashion.

  1. Choose the plus sign next to the last box in the data flow and choose Add destination.
  2. Choose Amazon S3.
  3. Enter a Dataset name. For Amazon S3 location, choose an S3 bucket, then choose Add destination.
  4. Choose Create job to launch a distributed PySpark processing job to perform the transformation and output the data to the destination S3 bucket.
    Depending on the size of the datasets, this option lets us easily configure the cluster and scale horizontally in a no-code fashion. We don’t have to worry about partitioning the datasets or managing the cluster and Spark internals. All of this is automatically taken care of for us by Data Wrangler.
  5. In the left pane, choose Next, 2. Configure job.
  6. Then choose Run.

Alternatively, we can export the transformed output to Amazon S3 via a Jupyter notebook. With this approach, Data Wrangler automatically generates a Jupyter notebook with all the code needed to kick off a processing job that applies the data flow steps (created using a sample) to the larger full dataset, and to use the transformed dataset as features to kick off a training job later. The notebook code can be run as is or with changes. Let’s walk through how to do this via the Data Wrangler UI.

  1. Choose the plus sign next to the last step in the data flow and choose Export to.
  2. Choose Amazon S3 (via Jupyter Notebook).
  3. It automatically opens a new tab with a Jupyter notebook.
  4. In the Jupyter notebook, locate the cell in the (Optional) Next Steps section and change run_optional_steps from False to True.
    The enabled optional steps in the notebook perform the following:
    • Train a model using XGBoost
  5. Go back to the top of the notebook, and on the Run menu, choose Run All Cells.

If you use the generated notebook as is, it launches a SageMaker processing job that scales out the processing across two m5.4xlarge instances to process the full dataset on the S3 bucket. You can adjust the number of instances and instance types based on the dataset size and the time you need to complete the job.


Wait until the training job from the last cell is complete. It generates a model in the SageMaker default S3 bucket.


The trained model is ready for deployment for either real-time inference or batch transformation. Note that we used synthetic data to demonstrate Data Wrangler’s functionality and used the processed data to train the model. Given that the data is synthetic, the inference results from the trained model are not meant for real-world medical diagnosis or as a substitute for the judgment of medical practitioners.
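To make the end of the pipeline concrete, here is a hedged, local-scale sketch of the training step on synthetic features. scikit-learn’s GradientBoostingClassifier stands in for the SageMaker XGBoost training job the generated notebook actually launches, and all column values are made up:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformed, ML-ready feature set.
rng = np.random.default_rng(0)
n = 500
systolic = rng.normal(130, 15, n)
age = rng.integers(40, 85, n).astype(float)
has_hypertension = (systolic > 135).astype(int)
# Label driven mostly by hypertension, plus ~15% random positives.
y = ((has_hypertension + (rng.random(n) < 0.15)) > 0).astype(int)

X = np.column_stack([systolic, age, has_hypertension])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GradientBoostingClassifier stands in for the SageMaker XGBoost
# training job that the generated notebook launches.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"holdout accuracy: {accuracy:.2f}")
```

In the real flow, training runs remotely on SageMaker against the full transformed dataset in S3, and the resulting model artifact lands in the default SageMaker bucket as described above.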

You can also directly export your transformed dataset to Amazon S3 by choosing Export at the top of the transform preview page. The direct export option only exports the transformed sample if sampling was enabled during import. This option is best suited for smaller datasets. The transformed data can also be ingested directly into a feature store. For more information, refer to Amazon SageMaker Feature Store. The data flow can also be exported as a SageMaker pipeline that can be orchestrated and scheduled as per your requirements. For more information, see Amazon SageMaker Pipelines.

Conclusion

In this post, we showed how to use Data Wrangler to process healthcare data and perform scalable feature engineering in a tool-driven, low-code fashion. We learned how to apply the built-in transformations and analyses aptly wherever needed, combining them with custom transformations to add even more flexibility to our data preparation workflow. We also walked through the different options for scaling out the data flow recipe via distributed processing jobs, and saw how the transformed data can easily be used to train a model to predict heart failure.

There are many other features in Data Wrangler we haven’t covered in this post. Explore what’s possible in Prepare ML Data with Amazon SageMaker Data Wrangler and learn how to leverage Data Wrangler for your next data science or machine learning project.


About the Authors

Forrest Sun is a Senior Solutions Architect with the AWS Public Sector team in Toronto, Canada. He has worked in the healthcare and finance industries for the past two decades. Outside of work, he enjoys camping with his family.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
