Prepare Data From Databricks For Machine Learning Using Amazon SageMaker Data Wrangler

Republished by Plato

Data science and data engineering teams spend a significant portion of their time in the data preparation phase of a machine learning (ML) lifecycle performing data selection, cleaning, and transformation steps. It’s a necessary and important step of any ML workflow in order to generate meaningful insights and predictions, because bad or low-quality data greatly reduces the relevance of the insights derived.

Data engineering teams are traditionally responsible for the ingestion, consolidation, and transformation of raw data for downstream consumption. Data scientists often need to do additional processing on data for domain-specific ML use cases such as natural language and time series. For example, certain ML algorithms may be sensitive to missing values, sparse features, or outliers and require special consideration. Even in cases where the dataset is in good shape, data scientists may want to transform the feature distributions or create new features in order to maximize the insights obtained from the models. To achieve these objectives, data scientists have to rely on data engineering teams to accommodate requested changes, resulting in dependency and delay in the model development process. Alternatively, data science teams may choose to perform data preparation and feature engineering internally using various programming paradigms. However, this requires an investment of time and effort in installing and configuring libraries and frameworks, which isn’t ideal because that time can be better spent optimizing model performance.

Amazon SageMaker Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes to aggregate and prepare data for ML from weeks to minutes by providing a single visual interface for data scientists to select, clean, and explore their datasets. Data Wrangler offers over 300 built-in data transformations to help normalize, transform, and combine features without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Snowflake. You can now also use Databricks as a data source in Data Wrangler to easily prepare data for ML.

The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes. With Databricks as a data source for Data Wrangler, you can now quickly and easily connect to Databricks, interactively query data stored in Databricks using SQL, and preview data before importing. Additionally, you can join your data in Databricks with data stored in Amazon S3, and data queried through Amazon Athena, Amazon Redshift, and Snowflake to create the right dataset for your ML use case.

In this post, we transform the Lending Club Loan dataset using Amazon SageMaker Data Wrangler for use in ML model training.

Solution overview

The following diagram illustrates our solution architecture.

The Lending Club Loan dataset contains complete loan data for all loans issued from 2007 through 2011, including the current loan status and latest payment information. It has 39,717 rows, 22 feature columns, and 3 target labels.

To transform our data using Data Wrangler, we complete the following high-level steps:

Download and split the dataset.
Create a Data Wrangler flow.
Import data from Databricks to Data Wrangler.
Import data from Amazon S3 to Data Wrangler.
Join the data.
Apply transformations.
Export the dataset.

Prerequisites

The post assumes you have a running Databricks cluster. If your cluster is running on AWS, verify you have the following configured:

Databricks setup

An instance profile with the required permissions to access an S3 bucket
A bucket policy with the required permissions for the target S3 bucket

Follow Secure access to S3 buckets using instance profiles for the required AWS Identity and Access Management (IAM) roles, S3 bucket policy, and Databricks cluster configuration. Ensure the Databricks cluster is configured with the proper instance profile, selected under the advanced options, to access the desired S3 bucket.

After the Databricks cluster is up and running with required access to Amazon S3, you can fetch the JDBC URL from your Databricks cluster to be used by Data Wrangler to connect to it.

Fetch the JDBC URL

To fetch the JDBC URL, complete the following steps:

In Databricks, navigate to the clusters UI.
Choose your cluster.
On the Configuration tab, choose Advanced options.
Under Advanced options, choose the JDBC/ODBC tab.
Copy the JDBC URL.

Make sure to substitute your personal access token in the URL.

Data Wrangler setup

This step assumes you have access to Amazon SageMaker, an instance of Amazon SageMaker Studio, and a Studio user.

To allow access to the Databricks JDBC connection from Data Wrangler, the Studio user requires the following permission:

secretsmanager:PutResourcePolicy

As an IAM administrative user, complete the following steps to add this permission to the IAM execution role assigned to the Studio user.

On the IAM console, choose Roles in the navigation pane.
Choose the role assigned to your Studio user.
Choose Add permissions.
Choose Create inline policy.
For Service, choose Secrets Manager.
For Actions, choose Access level.
Choose Permissions management.
Choose PutResourcePolicy.
For Resources, choose Specific and select Any in this account.
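
For readers who prefer to review the policy as JSON, the console choices above correspond roughly to a policy document like the one built in the following sketch. The account ID is a placeholder, and the resource ARN pattern is our assumption of what "Any in this account" expands to; compare it with the JSON that the IAM console generates before relying on it.

```python
import json


def build_inline_policy(account_id: str) -> dict:
    """Build an IAM policy document that allows secretsmanager:PutResourcePolicy."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "secretsmanager:PutResourcePolicy",
                # Assumption: wildcard over Region and secret name,
                # restricted to secrets in a single account.
                "Resource": f"arn:aws:secretsmanager:*:{account_id}:secret:*",
            }
        ],
    }


# 111122223333 is a placeholder account ID, not a real one.
policy = build_inline_policy("111122223333")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON can be pasted into the JSON tab of the inline policy editor instead of using the visual editor.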

Download and split the dataset

You can start by downloading the dataset. For demonstration purposes, we split the dataset by copying the feature columns id, emp_title, emp_length, home_owner, and annual_inc to create a second loans_2.csv file. We remove the aforementioned columns, except the id column, from the original loans file and rename the original file to loans_1.csv. Upload the loans_1.csv file to Databricks to create a table named loans_1, and upload loans_2.csv to an S3 bucket.
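
The split described here can be scripted. The following is a minimal sketch using only the Python standard library; the sample rows are made up for illustration, and the helper names split_loans and to_csv are our own, not part of any AWS or Databricks tooling.

```python
import csv
import io

# Feature columns moved into the second file, as listed in the text.
SPLIT_COLUMNS = ["emp_title", "emp_length", "home_owner", "annual_inc"]


def split_loans(rows, key="id", split_columns=SPLIT_COLUMNS):
    """Split each row into two rows that share only the key column."""
    part_1, part_2 = [], []
    for row in rows:
        # loans_1 keeps everything except the moved columns (id stays).
        part_1.append({c: v for c, v in row.items() if c not in split_columns})
        # loans_2 holds the id plus the moved columns.
        part_2.append({key: row[key], **{c: row[c] for c in split_columns}})
    return part_1, part_2


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows; the real file has many more columns and rows.
sample = [
    {"id": "1", "loan_amnt": "5000", "int_rate": "10.65%",
     "emp_title": "Engineer", "emp_length": "10+ years",
     "home_owner": "RENT", "annual_inc": "24000"},
    {"id": "2", "loan_amnt": "2500", "int_rate": "15.27%",
     "emp_title": "Driver", "emp_length": "1 year",
     "home_owner": "OWN", "annual_inc": "30000"},
]

loans_1, loans_2 = split_loans(sample)
loans_1_csv = to_csv(loans_1)  # content for the Databricks table loans_1
loans_2_csv = to_csv(loans_2)  # content for loans_2.csv in the S3 bucket
print(loans_2_csv)
```

Both outputs keep the id column, which is what makes the later join possible.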

Create a Data Wrangler flow

For information on Data Wrangler prerequisites, see Get Started with Data Wrangler.

Let’s get started by creating a new data flow.

On the Studio console, on the File menu, choose New.
Choose Data Wrangler Flow.
Rename the flow as desired.

Alternatively, you can create a new data flow from the Launcher.

On the Studio console, choose Amazon SageMaker Studio in the navigation pane.
Choose New data flow.

Creating a new flow can take a few minutes to complete. After the flow has been created, you see the Import data page.

Import data from Databricks into Data Wrangler

Next, we set up Databricks (JDBC) as a data source in Data Wrangler so that we can import data from it.

On the Import data tab of your Data Wrangler flow, choose Add data source.
On the drop-down menu, choose Databricks (JDBC).

On the Import data from Databricks page, you enter your cluster details.

For Dataset name, enter a name you want to use in the flow file.
For Driver, choose the driver com.simba.spark.jdbc.Driver.
For JDBC URL, enter the URL of your Databricks cluster obtained earlier.

The URL should resemble the following format: jdbc:spark://<server-hostname>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>
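
If you assemble the URL yourself rather than copying it from the cluster UI, a small helper keeps the pieces in the right order. This is a sketch of plain string formatting based on the format shown above; the function name and the example hostname and HTTP path are hypothetical values, not ones from a real workspace.

```python
def build_databricks_jdbc_url(server_hostname, http_path, personal_access_token):
    """Assemble a Databricks JDBC URL in the format shown in this post."""
    return (
        f"jdbc:spark://{server_hostname}:443/default;"
        "transportMode=http;ssl=1;"
        f"httpPath={http_path};"
        "AuthMech=3;UID=token;"
        f"PWD={personal_access_token}"
    )


# Hypothetical example values; substitute the ones from your own cluster.
jdbc_url = build_databricks_jdbc_url(
    server_hostname="dbc-example.cloud.databricks.com",
    http_path="sql/protocolv1/o/0/0000-000000-example",
    personal_access_token="<personal-access-token>",
)
print(jdbc_url)
```

Because the token is part of the URL, treat the assembled string as a secret and avoid committing it to source control.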

In the SQL query editor, specify the following SQL SELECT statement:
```
select * from loans_1
```

If you chose a different table name while uploading data to Databricks, replace loans_1 in the above SQL query accordingly.

In the SQL query section in Data Wrangler, you can query any table connected to the JDBC Databricks database. The pre-selected Enable sampling setting retrieves the first 50,000 rows of your dataset by default. Depending on the size of the dataset, deselecting Enable sampling may result in a longer import time.
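
To make the effect of sampling concrete, the following sketch shows what "the first 50,000 rows" means for a larger dataset. It only illustrates the behavior described above with a generated stand-in dataset; it is not how Data Wrangler implements sampling internally.

```python
from itertools import islice


def sample_first_rows(rows, limit=50_000):
    """Return at most `limit` rows from the start of an iterable."""
    return list(islice(rows, limit))


# Stand-in for a table with 120,000 rows, produced lazily.
all_rows = ({"id": i} for i in range(120_000))

preview = sample_first_rows(all_rows)
print(len(preview))
```

Note that taking the first rows is not a random sample, so any ordering in the source table carries over into the preview.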

Choose Run.

Running the query gives a preview of your Databricks dataset directly in Data Wrangler.

Choose Import.

Data Wrangler provides the flexibility to set up multiple concurrent connections to a single Databricks cluster, or to multiple clusters if required, enabling analysis and preparation on combined datasets.

Import the data from Amazon S3 into Data Wrangler

Next, let’s import the loans_2.csv file from Amazon S3.

On the Import tab, choose Amazon S3 as the data source.
Navigate to the S3 bucket for the loans_2.csv file.

When you select the CSV file, you can preview the data.

In the Details pane, choose Advanced configuration to make sure Enable sampling is selected and COMMA is chosen for Delimiter.
Choose Import.

After the loans_2.csv dataset is successfully imported, the data flow interface displays both the Databricks JDBC and Amazon S3 data sources.

Join the data

Now that we have imported data from Databricks and Amazon S3, let’s join the datasets using a common unique identifier column.

On the Data flow tab, for Data types, choose the plus sign for loans_1.
Choose Join.
Choose the loans_2.csv file as the Right dataset.
Choose Configure to set up the join criteria.
For Name, enter a name for the join.
For Join type, choose Inner for this post.
Choose the id column to join on.
Choose Apply to preview the joined dataset.
Choose Add to add it to the data flow.
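
As a mental model for what the inner join produces, the following sketch joins two small made-up tables on id. The _0 and _1 suffixes on the key column are an assumption modeled on the id_0 column that appears after the join in this post; the helper itself is our own illustration, not Data Wrangler code.

```python
def inner_join(left, right, key="id"):
    """Inner-join two lists of dicts on `key`, assuming `key` is unique in `right`."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for left_row in left:
        right_row = right_by_key.get(left_row[key])
        if right_row is None:
            continue  # inner join: drop rows without a match
        # Keep both key columns, suffixed to avoid a name clash.
        merged = {f"{key}_0": left_row[key]}
        merged.update({c: v for c, v in left_row.items() if c != key})
        merged[f"{key}_1"] = right_row[key]
        merged.update({c: v for c, v in right_row.items() if c != key})
        joined.append(merged)
    return joined


# Made-up rows standing in for loans_1 (Databricks) and loans_2.csv (Amazon S3).
loans_1 = [
    {"id": "1", "loan_amnt": "5000"},
    {"id": "2", "loan_amnt": "2500"},
    {"id": "3", "loan_amnt": "2400"},
]
loans_2 = [
    {"id": "1", "annual_inc": "24000"},
    {"id": "3", "annual_inc": "12252"},
    {"id": "4", "annual_inc": "49200"},
]

joined = inner_join(loans_1, loans_2)
for row in joined:
    print(row)
```

Only ids present in both inputs survive, which is why an inner join on a unique identifier is a safe way to recombine the two halves of the split dataset.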

Apply transformations

Data Wrangler comes with over 300 built-in transforms, which require no coding. Let’s use built-in transforms to prepare the dataset.

Drop column

First we drop the redundant ID column.

On the joined node, choose the plus sign.
Choose Add transform.
Under Transforms, choose + Add step.
Choose Manage columns.
For Transform, choose Drop column.
For Columns to drop, choose the column id_0.
Choose Preview.
Choose Add.
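
In code terms, dropping a column amounts to filtering keys out of every row. The following short sketch uses a made-up joined row and a helper of our own naming to show the effect of removing id_0.

```python
def drop_columns(rows, columns_to_drop):
    """Return copies of the rows without the named columns."""
    drop = set(columns_to_drop)
    return [{c: v for c, v in row.items() if c not in drop} for row in rows]


# Made-up joined row with the duplicated key column.
joined = [{"id_0": "1", "loan_amnt": "5000", "id_1": "1", "annual_inc": "24000"}]

deduplicated = drop_columns(joined, ["id_0"])
print(deduplicated)
```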

Format string

Let’s apply string formatting to remove the percentage symbol from the int_rate and revol_util columns.

On the Data tab, under Transforms, choose + Add step.
Choose Format string.
For Transform, choose Strip characters from right.

Data Wrangler allows you to apply your chosen transformation on multiple columns simultaneously.

For Input columns, choose int_rate and revol_util.
For Characters to remove, enter %.
Choose Preview.
Choose Add.
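
The equivalent operation in plain Python is str.rstrip applied to the selected columns. The following sketch uses a made-up row and a helper name of our own; it illustrates the transform's effect rather than Data Wrangler's implementation.

```python
def strip_right(rows, input_columns, characters="%"):
    """Strip trailing `characters` from string values in the given columns."""
    columns = set(input_columns)
    return [
        {
            c: v.rstrip(characters) if c in columns and isinstance(v, str) else v
            for c, v in row.items()
        }
        for row in rows
    ]


# Made-up row; grade is left untouched because it is not an input column.
rows = [{"int_rate": "10.65%", "revol_util": "83.7%", "grade": "B"}]

cleaned = strip_right(rows, ["int_rate", "revol_util"])
print(cleaned)
```

The values are still strings after stripping, so a separate type conversion is needed before treating them as numbers.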

Featurize text

Let’s now vectorize verification_status, a text feature column. We convert the text column into term frequency–inverse document frequency (TF-IDF) vectors by applying the count vectorizer and a standard tokenizer as described below. Data Wrangler also provides the option to bring your own tokenizer, if desired.

Under Transforms, choose + Add step.
Choose Featurize text.
For Transform, choose Vectorize.
For Input columns, choose verification_status.
Choose Preview.
Choose Add.
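
To show what count vectorization followed by TF-IDF weighting does to a short text column, here is a minimal standard-library sketch. The whitespace tokenizer and the smoothed IDF formula log((N + 1) / (df + 1)) are assumptions chosen for illustration; the exact tokenizer and weighting that Data Wrangler applies may differ, and the three status strings are sample values.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a standard tokenizer)."""
    return text.lower().split()


def tfidf_vectorize(documents):
    """Return (vocabulary, vectors) using term counts weighted by a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    # Assumed smoothing: log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


statuses = ["Verified", "Source Verified", "Not Verified"]
vocabulary, vectors = tfidf_vectorize(statuses)
print(vocabulary)
for status, vector in zip(statuses, vectors):
    print(status, vector)
```

In this example the token "verified" appears in every value, so its weight is zero and only "not" and "source" distinguish the rows, which is the kind of down-weighting TF-IDF is meant to provide.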

Export the dataset

After we apply multiple transformations on different column types, including text, categorical, and numeric, we’re ready to use the transformed dataset for ML model training. The last step is to export the transformed dataset to Amazon S3. In Data Wrangler, you have multiple options to choose from for downstream consumption of the transformations:

Choose Export step to automatically generate a Jupyter notebook with SageMaker Processing code for processing and exporting the transformed dataset to an S3 bucket. For more information, see Launch processing jobs with a few clicks using Amazon SageMaker Data Wrangler.
Export a Studio notebook that creates a SageMaker pipeline with your data flow, or a notebook that creates an Amazon SageMaker Feature Store feature group and adds features to an offline or online feature store.
Choose Export data to export directly to Amazon S3.

In this post, we take advantage of the Export data option in the Transform view to export the transformed dataset directly to Amazon S3.

Choose Export data.
For S3 location, choose Browse and choose your S3 bucket.
Choose Export data.

Clean up

If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional fees.

Conclusion

In this post, we covered how you can quickly and easily set up and connect Databricks as a data source in Data Wrangler, interactively query data stored in Databricks using SQL, and preview data before importing. Additionally, we looked at how you can join your data in Databricks with data stored in Amazon S3. We then applied data transformations on the combined dataset to create a data preparation pipeline. To explore more of Data Wrangler’s analysis capabilities, including target leakage and bias report generation, refer to the blog post Accelerate data preparation using Amazon SageMaker Data Wrangler for diabetic patient readmission prediction.

To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and see the latest information on the Data Wrangler product page.

About the Authors

Roop Bains is a Solutions Architect at AWS focusing on AI/ML. He is passionate about helping customers innovate and achieve their business objectives using artificial intelligence and machine learning. In his spare time, Roop enjoys reading and hiking.

Igor Alekseev is a Partner Solutions Architect at AWS in Data and Analytics. Igor works with strategic partners, helping them build complex, AWS-optimized architectures. Prior to joining AWS, as a Data/Solution Architect, he implemented many projects in Big Data, including several data lakes in the Hadoop ecosystem. As a Data Engineer, he was involved in applying AI/ML to fraud detection and office automation. Igor’s projects were in a variety of industries, including communications, finance, public safety, manufacturing, and healthcare. Earlier, Igor worked as a full stack engineer/tech lead.

Huong Nguyen is a Sr. Product Manager at AWS. She leads the user experience for SageMaker Studio. She has 13 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys reading, being in nature, and spending time with her family.

Henry Wang is a software development engineer at AWS. He recently joined the Data Wrangler team after graduating from UC Davis. He has an interest in data science and machine learning and does 3D printing as a hobby.

Timestamp: March 31, 2022

Timestamp: Jan 5, 2024
