Use Amazon SageMaker Data Wrangler For Data Preparation And Studio Labs To Learn And Experiment With ML


Amazon SageMaker Studio Lab is a free machine learning (ML) development environment based on open-source JupyterLab for anyone to learn and experiment with ML using AWS ML compute resources. It’s based on the same architecture and user interface as Amazon SageMaker Studio, but with a subset of Studio capabilities.

When you begin working on ML initiatives, you need to perform exploratory data analysis (EDA) or data preparation before proceeding with model building. Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for ML applications via a visual interface. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes.

A key accelerator of feature preparation in Data Wrangler is the Data Quality and Insights Report. This report checks data quality and helps detect abnormalities in your data, so that you can perform the required data engineering to fix your dataset. You can use the report to analyze your data and gain insights into your dataset, such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the report can bring those issues to your attention and help you identify the data preparation steps you need to perform.

Studio Lab users can benefit from Data Wrangler because data quality and feature engineering are critical for the predictive performance of your model. Data Wrangler helps with data quality and feature engineering by giving insights into data quality issues and easily enabling rapid feature iteration and engineering using a low-code UI.

In this post, we show you how to perform exploratory data analysis, prepare and transform data using Data Wrangler, and export the transformed and prepared data to Studio Lab to carry out model building.

Solution overview

The solution includes the following high-level steps:

Create an AWS account and an admin user. This is a prerequisite.
Download the churn.csv dataset.
Load the dataset to Amazon Simple Storage Service (Amazon S3).
Create a SageMaker Studio domain and launch Data Wrangler.
Import the dataset into the Data Wrangler flow from Amazon S3.
Create the Data Quality and Insights Report and draw conclusions on necessary feature engineering.
Perform the necessary data transforms in Data Wrangler.
Download the Data Quality and Insights Report and the transformed dataset.
Upload the data to a Studio Lab project for model training.

The following diagram illustrates this workflow.

Prerequisites

To use Data Wrangler and Studio Lab, you need the following prerequisites:

Build a data preparation workflow with Data Wrangler

To get started, complete the following steps:

Upload your dataset to Amazon S3.
On the SageMaker console, under Control panel in the navigation pane, choose Studio.
On the Launch app menu next to your user profile, choose Studio.
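
If you prefer to script the upload in the first step instead of using the console, the following is a minimal sketch. It assumes the boto3 library is installed and AWS credentials are configured; the bucket name, key prefix, and helper function names are hypothetical and not part of this post.

```python
from pathlib import Path


def build_s3_key(local_path: str, prefix: str = "data-wrangler/input") -> str:
    """Return the object key for a local file under a (hypothetical) prefix."""
    return f"{prefix.strip('/')}/{Path(local_path).name}"


def build_s3_uri(bucket: str, key: str) -> str:
    """Return the s3:// URI that you later browse to when importing the data."""
    return f"s3://{bucket}/{key}"


def upload_dataset(local_path: str, bucket: str, prefix: str = "data-wrangler/input") -> str:
    """Upload a local file to Amazon S3 and return its s3:// URI."""
    import boto3  # imported lazily so the helpers above work without boto3

    key = build_s3_key(local_path, prefix)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return build_s3_uri(bucket, key)


if __name__ == "__main__":
    # Placeholder bucket name; replace it with a bucket you own before uploading.
    print(build_s3_uri("my-example-bucket", build_s3_key("churn.csv")))
```

Calling `upload_dataset("churn.csv", "my-example-bucket")` would perform the actual upload; the demo above only prints the target URI so that it runs without credentials.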

After you successfully log in to Studio, you should see a development environment like the following screenshot.
To create a new Data Wrangler workflow, on the File menu, choose New, then choose Data Wrangler Flow.

The first step in Data Wrangler is to import your data. You can import data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, and Databricks. In this example, we use Amazon S3. If you just want to see how Data Wrangler works, you can always choose Use sample dataset.
Choose Import data.
Choose Amazon S3.
Choose the dataset you uploaded and choose Import.

Data Wrangler enables you to either import the entire dataset or sample a portion of it.
To quickly get insights on the dataset, choose First K for Sampling and enter 50000 for Sample size.
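
To make the idea of First K sampling concrete, here is a small sketch that keeps only the first K data rows of a CSV file. It illustrates the concept with the Python standard library and is not how Data Wrangler implements sampling; the function name and the sample values are made up.

```python
import csv
import io
from itertools import islice


def first_k_rows(csv_file, k: int = 50000):
    """Return the header and the first k data rows of an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader)  # the first line holds the column names
    rows = list(islice(reader, k))  # stops early if the file has fewer rows
    return header, rows


if __name__ == "__main__":
    # Tiny in-memory stand-in for churn.csv; the values are invented.
    sample = io.StringIO("State,Day Mins\nKS,265.1\nOH,161.6\nNJ,243.4\n")
    header, rows = first_k_rows(sample, k=2)
    print(header, len(rows))
```

Because the rows are taken in file order, a First K sample is only representative if the file is not sorted by a column that matters for the analysis.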

Understand data quality and get insights

Let’s use the Data Quality and Insights Report to perform an analysis of the data that we imported into Data Wrangler. You can use the report to understand what steps you need to take to clean and process your data. This report provides information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.

Choose the plus sign next to Data types and choose Get data insights.
For Analysis type, choose Data Quality and Insights Report.
For Target column, choose Churn?.
For Problem type, select Classification.
Choose Create.

You’re presented with a detailed report that you can review and download. The report includes several sections such as quick model, feature summary, feature correlation, and data insights. The following screenshots provide examples of these sections.
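
To get a feel for the kind of figures such a report surfaces, the following sketch computes a few of them with pandas: missing values, duplicate rows, outliers, and target balance. This is an illustration under stated assumptions, not the method Data Wrangler uses. The 1.5 × IQR outlier rule, the function name, and the sample values are our own choices.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, target: str) -> dict:
    """Compute a few simple data-quality figures for a DataFrame."""
    numeric = df.select_dtypes(include="number")
    # Assumption: flag values outside 1.5 * IQR of each numeric column as outliers.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "missing_values": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outliers": outliers.to_dict(),
        "target_balance": df[target].value_counts(normalize=True).to_dict(),
    }


if __name__ == "__main__":
    # Invented rows that only mimic the shape of the churn dataset.
    df = pd.DataFrame(
        {
            "Day Mins": [100.0, 110.0, 105.0, 95.0, 1000.0, None],
            "Churn?": ["False.", "False.", "False.", "True.", "False.", "True."],
        }
    )
    for name, value in quality_summary(df, target="Churn?").items():
        print(name, value)
```

A strongly skewed `target_balance` is the kind of class imbalance the report would bring to your attention.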

Observations from the report

From the report, we can make the following observations:

No duplicate rows were found.
The State column appears to be quite evenly distributed, so the data is balanced in terms of state population.
The Phone column has too many unique values to be of any practical use, so we can drop it in our transformation.
Based on the feature correlation section of the report, Mins and Charge are highly correlated. We can remove one of them.
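
The last two observations can be reproduced outside the report with a short pandas sketch. The thresholds (0.9 for the share of distinct values, 0.95 for the absolute Pearson correlation), the function names, and the sample rows are assumptions chosen for illustration.

```python
import pandas as pd


def high_cardinality_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return non-numeric columns where most rows hold a distinct value."""
    text_cols = df.select_dtypes(exclude="number").columns
    return [c for c in text_cols if df[c].nunique() / len(df) > threshold]


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return pairs of numeric columns whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    return [
        (cols[i], cols[j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > threshold
    ]


if __name__ == "__main__":
    # Invented rows; Day Charge is an exact multiple of Day Mins on purpose.
    df = pd.DataFrame(
        {
            "State": ["KS", "KS", "OH", "OH"],
            "Phone": ["382-4657", "371-7191", "358-1921", "375-9999"],
            "Day Mins": [100.0, 200.0, 150.0, 50.0],
            "Day Charge": [17.0, 34.0, 25.5, 8.5],
        }
    )
    print(high_cardinality_columns(df))
    print(highly_correlated_pairs(df))
```

A column such as Phone that is unique per row behaves like an identifier, which is why it carries no predictive signal.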

Transformations

Based on our observations, we want to make the following transformations:

Drop the Phone column because it has many unique values.
We also see several features that essentially have 100% correlation with one another. Including these feature pairs in some ML algorithms can create undesired problems, whereas in others it will only introduce minor redundancy and bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Eve Charge from the pair with Eve Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins.
Convert True or False in the Churn? column to a numerical value of 1 or 0.

Return to the data flow and choose the plus sign next to Data types.
Choose Add transform.
Choose Add step.
You can search for the transform you’re looking for (in our case, manage columns).
Choose Manage columns.
For Transform, choose Drop column.
For Columns to drop, choose Phone, Day Charge, Eve Charge, Night Charge, and Intl Charge.
Choose Preview, then choose Update.

Let’s add another transform to perform a categorical encode on the Churn? column.
Choose the transform Encode categorical.
For Transform, choose Ordinal encode.
For Input columns, choose the Churn? column.
For Invalid handling strategy, choose Replace with NaN.
Choose Preview, then choose Update.

Now True and False are converted to 1 and 0, respectively.
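
For readers who want to see what these two transforms amount to, here is a rough pandas equivalent. It is a sketch, not the code Data Wrangler generates. We assume the labels arrive as the strings "True." and "False." (with or without the trailing period) or as booleans, and we mirror the Replace with NaN strategy by leaving any other value as NaN. The function name and sample rows are hypothetical.

```python
import pandas as pd

# Columns dropped in the Manage columns step of this post.
COLUMNS_TO_DROP = ["Phone", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge"]


def transform_churn(df: pd.DataFrame) -> pd.DataFrame:
    """Drop redundant columns and encode the Churn? label as 1 or 0."""
    out = df.drop(columns=COLUMNS_TO_DROP, errors="ignore")
    # Assumption: labels look like "True." / "False."; strip the period first.
    labels = out["Churn?"].astype(str).str.strip().str.rstrip(".")
    # Values that are neither True nor False are not in the mapping and become NaN.
    out["Churn?"] = labels.map({"True": 1, "False": 0})
    return out


if __name__ == "__main__":
    # Invented rows that only mimic the shape of the churn dataset.
    df = pd.DataFrame(
        {
            "Phone": ["382-4657", "371-7191"],
            "Day Mins": [265.1, 161.6],
            "Day Charge": [45.07, 27.47],
            "Churn?": ["False.", "True."],
        }
    )
    print(transform_churn(df))
```

The input DataFrame is left unchanged because `drop` returns a new object, which makes it easy to compare the data before and after the transform.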

Now that we have a good understanding of the data and have prepared and transformed it, we can move the data to Studio Lab for model building.

Upload the data to Studio Lab

To start using the data in Studio Lab, complete the following steps:

Choose Export data to export to an S3 bucket.
For Amazon S3 location, enter your S3 path.
Specify the file type.
Choose Export data.
After you export the data, you can download the data from the S3 bucket to your local computer.
Now you can go to Studio Lab and upload the file to Studio Lab.

Alternatively, you can connect to Amazon S3 from Studio Lab. For more information, refer to Use external resources in Amazon SageMaker Studio Lab.
Let’s install SageMaker and import Pandas.
Import all libraries as required.
Now we can read the CSV file.
Let’s print churn to confirm the dataset is correct.
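
The notebook cells for these steps are not included in this text, so the following is only a sketch of what the read step might look like. It assumes pandas is already installed in the Studio Lab environment (for example, through a `%pip install pandas` notebook cell), and the file name `churn_processed.csv` is a hypothetical placeholder for the file you uploaded.

```python
import io
from pathlib import Path

import pandas as pd


def load_churn(source) -> pd.DataFrame:
    """Read the exported CSV (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


if __name__ == "__main__":
    # Hypothetical name of the file uploaded to the Studio Lab project.
    path = Path("churn_processed.csv")
    # Fall back to a tiny invented sample so the sketch runs anywhere.
    fallback = io.StringIO("State,Day Mins,Churn?\nKS,265.1,0\nOH,161.6,1\n")
    churn = load_churn(path if path.exists() else fallback)
    # Print the first rows to confirm the dataset looks correct.
    print(churn.head())
```

Checking `churn.dtypes` at this point is a quick way to confirm that the Churn? column arrived as a numeric type.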

Now that you have the processed dataset in Studio Lab, you can carry out further steps required for model building.

Data Wrangler pricing

You can perform all the steps in this post for EDA or data preparation within Data Wrangler and pay only for the instance, jobs, and storage you use, based on consumption. No upfront or licensing fees are required.

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees. To avoid losing work, save your data flow before shutting Data Wrangler down.

To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
Data Wrangler automatically saves your data flow every 60 seconds.
To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
Choose Shut down all to confirm.
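
The same shutdown can in principle be scripted with the SageMaker API. The sketch below assumes boto3 is installed and credentials are configured, and it uses the `list_apps` and `delete_app` operations of the SageMaker client. The name fragment used to recognize Data Wrangler apps is an assumption, so check the app names in your own domain before relying on it. Only the first page of `list_apps` results is read.

```python
def find_data_wrangler_apps(apps, name_fragment: str = "data-wrang"):
    """Filter list_apps-style records for running apps whose name contains the fragment."""
    return [
        app
        for app in apps
        if name_fragment in app.get("AppName", "") and app.get("Status") == "InService"
    ]


def shut_down_data_wrangler(region_name=None, name_fragment: str = "data-wrang"):
    """Delete running Data Wrangler apps that belong to a user profile."""
    import boto3  # imported lazily so the filter above works without boto3

    client = boto3.client("sagemaker", region_name=region_name)
    apps = client.list_apps()["Apps"]  # first page only in this sketch
    deleted = []
    for app in find_data_wrangler_apps(apps, name_fragment):
        if "UserProfileName" not in app:
            continue  # skip apps that are not tied to a user profile
        client.delete_app(
            DomainId=app["DomainId"],
            UserProfileName=app["UserProfileName"],
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
        deleted.append(app["AppName"])
    return deleted


if __name__ == "__main__":
    # Invented records that mimic the shape of a list_apps response.
    demo = [
        {"AppName": "sagemaker-data-wrang-ml-m5-4xlarge-abc", "Status": "InService"},
        {"AppName": "default", "Status": "InService"},
    ]
    print([app["AppName"] for app in find_data_wrangler_apps(demo)])
```

Save your data flow before calling `shut_down_data_wrangler()`, for the same reason as in the console steps above.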

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we saw how you can gain insights into your dataset, perform exploratory data analysis, prepare and transform data using Data Wrangler within Studio, and export the transformed and prepared data to Studio Lab and carry out model building and other steps.

With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.

About the authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about the cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with a passion for designing, creating, and promoting human-centered data and analytics experiences. He supports AWS strategic customers on their transformation toward becoming data-driven organizations.

James Wu is a Senior AI/ML Specialist Solutions Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Timestamp: September 15, 2022

Timestamp: March 4, 2024
