Amazon SageMaker Studio Lab is a free machine learning (ML) development environment based on open-source JupyterLab for anyone to learn and experiment with ML using AWS ML compute resources. It’s based on the same architecture and user interface as Amazon SageMaker Studio, but with a subset of Studio capabilities.
When you begin working on ML initiatives, you need to perform exploratory data analysis (EDA) or data preparation before proceeding with model building. Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for ML applications via a visual interface. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes.
A key accelerator of feature preparation in Data Wrangler is the Data Quality and Insights Report. This report checks data quality and helps detect abnormalities in your data, so that you can perform the required data engineering to fix your dataset. You can use the Data Quality and Insights Report to analyze your data and gain insights into your dataset, such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention and help you identify the data preparation steps you need to perform.
Studio Lab users can benefit from Data Wrangler because data quality and feature engineering are critical for the predictive performance of your model. Data Wrangler helps with data quality and feature engineering by giving insights into data quality issues and easily enabling rapid feature iteration and engineering using a low-code UI.
In this post, we show you how to perform exploratory data analysis, prepare and transform data using Data Wrangler, and export the transformed and prepared data to Studio Lab to carry out model building.
Solution overview
The solution includes the following high-level steps:
- Create an AWS account and an admin user. This is a prerequisite.
- Download the dataset churn.csv.
- Load the dataset to Amazon Simple Storage Service (Amazon S3).
- Create a SageMaker Studio domain and launch Data Wrangler.
- Import the dataset into the Data Wrangler flow from Amazon S3.
- Create the Data Quality and Insights Report and draw conclusions on necessary feature engineering.
- Perform the necessary data transforms in Data Wrangler.
- Download the Data Quality and Insights Report and the transformed dataset.
- Upload the data to a Studio Lab project for model training.
The following diagram illustrates this workflow.
Prerequisites
To use Data Wrangler and Studio Lab, you need the following prerequisites:
Build a data preparation workflow with Data Wrangler
To get started, complete the following steps:
- Upload your dataset to Amazon S3.
- On the SageMaker console, under Control panel in the navigation pane, choose Studio.
- On the Launch app menu next to your user profile, choose Studio.
After you successfully log in to Studio, you should see a development environment like the following screenshot.
- To create a new Data Wrangler workflow, on the File menu, choose New, then choose Data Wrangler Flow.
The first step in Data Wrangler is to import your data. You can import data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, and Databricks. In this example, we use Amazon S3. If you just want to see how Data Wrangler works, you can always choose Use sample dataset.
- Choose Import data.
- Choose Amazon S3.
- Choose the dataset you uploaded and choose Import.
Data Wrangler enables you to either import the entire dataset or sample a portion of it.
- To quickly get insights on the dataset, choose First K for Sampling and enter 50000 for Sample size.
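As a rough analogy outside Data Wrangler, reading only the first rows of a CSV file in pandas resembles the sampling option described above. The following is a minimal sketch, not Data Wrangler's implementation: the in-memory CSV, its column names, and the helper read_first_k are made up for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for churn.csv; the rows are made up for illustration.
csv_text = "State,Day Mins,Churn?\n" + "\n".join(
    f"S{i},{100 + i},False" for i in range(10)
)


def read_first_k(source, k):
    """Read only the first k data rows (hypothetical helper)."""
    return pd.read_csv(source, nrows=k)


sample = read_first_k(io.StringIO(csv_text), k=3)
print(sample.shape)
```

Reading a fixed number of leading rows is fast, but it only represents the whole file well if the rows are not ordered in a meaningful way.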
Understand data quality and get insights
Let’s use the Data Quality and Insights Report to perform an analysis of the data that we imported into Data Wrangler. You can use the report to understand what steps you need to take to clean and process your data. This report provides information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.
- Choose the plus sign next to Data types and choose Get data insights.
- For Analysis type, choose Data Quality and Insights Report.
- For Target column, choose Churn?.
- For Problem type, select Classification.
- Choose Create.
You’re presented with a detailed report that you can review and download. The report includes several sections such as quick model, feature summary, feature correlation, and data insights. The following screenshots provide examples of these sections.
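To make a few of the report's checks concrete, the sketch below computes missing values, duplicate rows, and target balance with pandas. It is only an approximation of the kind of summary the report gives, not how the report is produced; the rows and column names are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the churn dataset; column names are assumptions.
df = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ", "OH", "KS", "NJ"],
        "Day Mins": [265.1, 161.6, None, 299.4, 166.7, 223.4],
        "Churn?": ["False", "False", "True", "False", "False", "True"],
    }
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Share of each class in the target column, to spot imbalance.
target_share = df["Churn?"].value_counts(normalize=True)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(target_share)
```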
Observations from the report
From the report, we can make the following observations:
- No duplicate rows were found.
- The State column appears to be quite evenly distributed, so the data is balanced in terms of state population.
- The Phone column has too many unique values to be of any practical use. We can drop the Phone column in our transformation.
- Based on the feature correlation section of the report, Mins and Charge are highly correlated. We can remove one of them.
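The two observations about unique values and correlation can be checked by hand with pandas. The following sketch uses made-up rows, and it assumes for illustration that a charge is a fixed rate times the minutes, which is what a correlation of essentially 100% would suggest.

```python
import pandas as pd

# Made-up rows; column names follow the report, values are illustrative only.
df = pd.DataFrame(
    {
        "Phone": ["382-4657", "371-7191", "358-1921", "375-9999", "330-6626"],
        "State": ["KS", "OH", "NJ", "OH", "KS"],
        "Day Mins": [265.1, 161.6, 243.4, 299.4, 166.7],
    }
)
# Assumption for illustration: the charge is a fixed rate times the minutes.
df["Day Charge"] = df["Day Mins"] * 0.17

# Fraction of rows with a distinct value; 1.0 means every row is unique.
unique_ratio = df[["Phone", "State"]].nunique() / len(df)

# Pearson correlation between the two numeric columns.
corr = df["Day Mins"].corr(df["Day Charge"])

print(unique_ratio)
print("correlation:", corr)
```

A column whose unique ratio is close to 1.0 behaves like an identifier and carries little signal for a model, and a correlation close to 1.0 means one of the two columns is redundant.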
Transformation
Based on our observations, we want to make the following transformations:
- Remove the Phone column because it has many unique values.
- We also see several features that essentially have 100% correlation with one another. Including these feature pairs in some ML algorithms can create undesired problems, whereas in others it will only introduce minor redundancy and bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins.
- Convert True or False in the Churn? column to a numerical value of 1 or 0.
- Return to the data flow and choose the plus sign next to Data types.
- Choose Add transform.
- Choose Add step.
- You can search for the transform you’re looking for (in our case, manage columns).
- Choose Manage columns.
- For Transform, choose Drop column.
- For Columns to drop, choose Phone, Day Charge, Eve Charge, Night Charge, and Intl Charge.
- Choose Preview, then choose Update.
Let’s add another transform to perform a categorical encode on the Churn? column.
- Choose the transform Encode categorical.
- For Transform, choose Ordinal encode.
- For Input columns, choose the Churn? column.
- For Invalid handling strategy, choose Replace with NaN.
- Choose Preview, then choose Update.
Now True and False are converted to 1 and 0, respectively.
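For readers who want to see the same two transforms expressed in code, here is a minimal pandas sketch. It is not what Data Wrangler runs internally: the rows are made up, the labels are assumed to be the strings "True" and "False", and the explicit mapping is chosen to match the result described above.

```python
import pandas as pd

# Made-up rows; only the column names come from the walkthrough.
df = pd.DataFrame(
    {
        "Phone": ["382-4657", "371-7191", "358-1921"],
        "Day Mins": [265.1, 161.6, 243.4],
        "Day Charge": [45.07, 27.47, 41.38],
        "Eve Charge": [16.78, 16.62, 10.30],
        "Night Charge": [11.01, 11.45, 7.32],
        "Intl Charge": [2.70, 3.70, 3.29],
        "Churn?": ["False", "True", "False"],
    }
)

# Drop the identifier-like column and the redundant charge columns.
columns_to_drop = ["Phone", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge"]
transformed = df.drop(columns=columns_to_drop)

# Map the two labels to integers. Any label missing from the mapping becomes
# NaN, loosely mirroring the "Replace with NaN" invalid handling strategy.
transformed["Churn?"] = transformed["Churn?"].map({"False": 0, "True": 1})

print(transformed)
```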
Now that we have a good understanding of the data and have prepared and transformed it, we can move the data to Studio Lab for model building.
Upload the data to Studio Lab
To start using the data in Studio Lab, complete the following steps:
- Choose Export data to export to an S3 bucket.
- For Amazon S3 location, enter your S3 path.
- Specify the file type.
- Choose Export data.
- After you export the data, you can download the data from the S3 bucket to your local computer.
- Now you can go to Studio Lab and upload the file to Studio Lab.
Alternatively, you can connect to Amazon S3 from Studio Lab. For more information, refer to Use external resources in Amazon SageMaker Studio Lab.
- Let’s install SageMaker and import Pandas.
- Import all libraries as required.
- Now we can read the CSV file.
- Let’s print churn to confirm the dataset is correct.
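The reading and printing steps above can be sketched as follows. The file name churn.csv, the helper load_churn, and the sample rows are assumptions for illustration; the sketch writes a small stand-in file to a temporary directory so that it runs anywhere, whereas in Studio Lab you would pass the path of the file you uploaded.

```python
import os
import tempfile

import pandas as pd


def load_churn(path):
    """Read the exported CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


# Stand-in file with made-up rows so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn.csv")
    with open(path, "w") as f:
        f.write("State,Day Mins,Churn?\nKS,265.1,0\nOH,161.6,1\n")
    churn = load_churn(path)

# Print the first rows to confirm the columns and values look as expected.
print(churn.head())
```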
Now that you have the processed dataset in Studio Lab, you can carry out further steps required for model building.
Data Wrangler pricing
You can perform all the steps in this post for EDA or data preparation within Data Wrangler and pay for the instance, jobs, and storage based on usage or consumption. No upfront or licensing fees are required.
Clean up
When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees. To avoid losing work, save your data flow before shutting Data Wrangler down.
- To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
Data Wrangler automatically saves your data flow every 60 seconds.
- To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
- Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
- Choose Shut down all to confirm.
Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.
After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.
Conclusion
In this post, we saw how you can gain insights into your dataset, perform exploratory data analysis, prepare and transform data using Data Wrangler within Studio, and export the transformed and prepared data to Studio Lab and carry out model building and other steps.
With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.
About the authors
Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about the cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.
Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with a passion to design, create, and promote human-centered data and analytics experiences. He supports AWS Strategic customers on their transformation towards becoming data-driven organizations.
James Wu is a Senior AI/ML Specialist Solutions Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.