Get More Control Of Your Amazon SageMaker Data Wrangler Workloads With Parameterized Datasets And Scheduled Jobs

Republished by Plato

Data is transforming every field and every business. However, with data growing faster than most companies can keep track of, collecting data and getting value out of it is challenging. A modern data strategy can help you create better business outcomes with data. AWS provides the most complete set of services for the end-to-end data journey to help you unlock value from your data and turn it into insight.

Data scientists can spend up to 80% of their time preparing data for machine learning (ML) projects. This preparation process is largely undifferentiated and tedious work, and can involve multiple programming APIs and custom libraries. Amazon SageMaker Data Wrangler helps data scientists and data engineers simplify and accelerate tabular and time series data preparation and feature engineering through a visual interface. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, or even third-party solutions like Snowflake or Databricks, and process your data with over 300 built-in data transformations and a library of code snippets, so you can quickly normalize, transform, and combine features without writing any code. You can also bring your custom transformations in PySpark, SQL, or Pandas.

This post demonstrates how you can schedule your data preparation jobs to run automatically. We also explore the new Data Wrangler capability of parameterized datasets, which allows you to specify the files to be included in a data flow by means of parameterized URIs.

Solution overview

Data Wrangler now supports importing data using a parameterized URI. This allows for further flexibility because you can now import all datasets matching the specified parameters, which can be of type String, Number, Datetime, and Pattern, in the URI. Additionally, you can now trigger your Data Wrangler transformation jobs on a schedule.

In this post, we create a sample flow with the Titanic dataset to show how you can start experimenting with these two new Data Wrangler features. To download the dataset, refer to Titanic - Machine Learning from Disaster.

Prerequisites

To get all the features described in this post, you need to be running the latest kernel version of Data Wrangler. For more information, refer to Update Data Wrangler. Additionally, you need to be running Amazon SageMaker Studio JupyterLab 3. To view the current version and update it, refer to JupyterLab Versioning.

File structure

For this demonstration, we follow a simple file structure that you must replicate in order to reproduce the steps outlined in this post.

In Studio, create a new notebook.

Run the following code snippet to create the folder structure that we use (make sure you’re in the desired folder in your file tree):

!mkdir titanic_dataset
!mkdir titanic_dataset/datetime_data
!mkdir titanic_dataset/datetime_data/2021
!mkdir titanic_dataset/datetime_data/2022

!mkdir titanic_dataset/datetime_data/2021/01 titanic_dataset/datetime_data/2021/02 titanic_dataset/datetime_data/2021/03 
!mkdir titanic_dataset/datetime_data/2021/04 titanic_dataset/datetime_data/2021/05 titanic_dataset/datetime_data/2021/06
!mkdir titanic_dataset/datetime_data/2022/01 titanic_dataset/datetime_data/2022/02 titanic_dataset/datetime_data/2022/03 
!mkdir titanic_dataset/datetime_data/2022/04 titanic_dataset/datetime_data/2022/05 titanic_dataset/datetime_data/2022/06

!mkdir titanic_dataset/datetime_data/2021/01/01 titanic_dataset/datetime_data/2021/02/01 titanic_dataset/datetime_data/2021/03/01 
!mkdir titanic_dataset/datetime_data/2021/04/01 titanic_dataset/datetime_data/2021/05/01 titanic_dataset/datetime_data/2021/06/01
!mkdir titanic_dataset/datetime_data/2022/01/01 titanic_dataset/datetime_data/2022/02/01 titanic_dataset/datetime_data/2022/03/01 
!mkdir titanic_dataset/datetime_data/2022/04/01 titanic_dataset/datetime_data/2022/05/01 titanic_dataset/datetime_data/2022/06/01

!mkdir titanic_dataset/train_1 titanic_dataset/train_2 titanic_dataset/train_3 titanic_dataset/train_4 titanic_dataset/train_5
!mkdir titanic_dataset/train titanic_dataset/test
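
The same folder tree can also be created from Python instead of shell commands, which avoids errors when a folder already exists on a rerun. The following is a minimal sketch; the function name create_folder_structure is ours for illustration and is not part of the original walkthrough.

```python
from pathlib import Path


def create_folder_structure(base="titanic_dataset"):
    """Create the folder tree used in this post and return the leaf paths."""
    root = Path(base)
    leaves = []

    # Date-partitioned folders: <base>/datetime_data/<year>/<month>/01
    for year in ("2021", "2022"):
        for month in range(1, 7):
            leaves.append(root / "datetime_data" / year / f"{month:02d}" / "01")

    # Folders that later receive copies of the mini-batch files
    for i in range(1, 6):
        leaves.append(root / f"train_{i}")

    # Folders for the original train.csv and test.csv
    leaves.extend([root / "train", root / "test"])

    for folder in leaves:
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe
        folder.mkdir(parents=True, exist_ok=True)
    return leaves
```

Calling create_folder_structure() from a notebook cell in the desired working folder produces the same layout as the commands above.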

Copy the train.csv and test.csv files from the original Titanic dataset to the folders titanic_dataset/train and titanic_dataset/test, respectively.

Run the following code snippet to populate the folders with the necessary files:

import os
import math
import pandas as pd
batch_size = 100

#Get a list of all the leaf nodes in the folder structure
leaf_nodes = []

for root, dirs, files in os.walk('titanic_dataset'):
    if not dirs:
        if root != "titanic_dataset/test" and root != "titanic_dataset/train":
            leaf_nodes.append(root)
            
titanic_df = pd.read_csv('titanic_dataset/train/train.csv')

#Create the mini batch files
for i in range(math.ceil(titanic_df.shape[0]/batch_size)):
    batch_df = titanic_df[i*batch_size:(i+1)*batch_size]
    
    #Place a copy of each mini batch in each one of the leaf folders
    for node in leaf_nodes:
        batch_df.to_csv(node+'/part_{}.csv'.format(i), index=False)

We split the train.csv file of the Titanic dataset into nine different files, named part_x.csv, where x is the number of the part. Part 0 has the first 100 records, part 1 the next 100, and so on until part 8, which holds the remaining records. Every leaf folder of the file tree contains a copy of the nine parts of the training data except for the train and test folders, which contain train.csv and test.csv.
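
To make the arithmetic behind the nine files explicit, the following sketch computes the row range that each mini batch covers. It assumes the Titanic train.csv has 891 rows (the size of the Kaggle training set); the helper name batch_bounds is hypothetical.

```python
import math


def batch_bounds(n_rows, batch_size=100):
    """Return the (start, stop) row offsets of each mini batch, in order."""
    n_batches = math.ceil(n_rows / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_rows))
        for i in range(n_batches)
    ]


bounds = batch_bounds(891)
print(len(bounds))   # number of part files
print(bounds[-1])    # row range covered by the last part
```

With 891 rows and a batch size of 100, this gives nine ranges, and the last one (part 8) covers rows 800 through 890.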

Parameterized datasets

Data Wrangler users can now specify parameters for the datasets imported from Amazon S3. Dataset parameters are specified in the resource's URI, and their values can be changed dynamically, which gives you more flexibility in selecting the files to import. Parameters can be of four data types:

Number – Can take the value of any integer
String – Can take the value of any text string
Pattern – Can take the value of any regular expression
Datetime – Can take the value of any of the supported date/time formats
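
To make the idea concrete, the following sketch shows how a URI that contains a parameter name in double curly brackets could be resolved into a concrete path. It illustrates the concept only and is not Data Wrangler's implementation; the bucket name my-bucket, the parameter name part_number, and the helper resolve_uri are hypothetical.

```python
import re

# Matches a parameter reference such as {{part_number}}
PARAM_REF = re.compile(r"\{\{(\w+)\}\}")


def resolve_uri(template, defaults, overrides=None):
    """Replace each {{name}} in the template with its override or default value."""
    values = {**defaults, **(overrides or {})}

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value defined for parameter '{name}'")
        return str(values[name])

    return PARAM_REF.sub(substitute, template)


template = "s3://my-bucket/titanic_dataset/train_1/part_{{part_number}}.csv"
defaults = {"part_number": 0}  # a Number parameter with default value 0

default_uri = resolve_uri(template, defaults)
override_uri = resolve_uri(template, defaults, {"part_number": 3})
print(default_uri)
print(override_uri)
```

The same template therefore selects a different file each time the parameter value changes, without editing the flow itself.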

In this section, we provide a walkthrough of this new feature. This is available only after you import your dataset to your current flow and only for datasets imported from Amazon S3.

From your data flow, choose the plus (+) sign next to the import step and choose Edit dataset.
The preferred (and easiest) method of creating new parameters is by highlighting a section of your URI and choosing Create custom parameter on the drop-down menu. You need to specify four things for each parameter you want to create:
1. Name
2. Type
3. Default value
4. Description
Here we have created a String type parameter called filename_param with a default value of train.csv. Now you can see the parameter name enclosed in double brackets, replacing the portion of the URI that we previously highlighted. Because the defined value for this parameter was train.csv, we now see the file train.csv listed on the import table.
When we try to create a transformation job, on the Configure job step, we now see a Parameters section, where we can see a list of all of our defined parameters.
Choosing the parameter gives us the option to change the parameter’s value, in this case, changing the input dataset to be transformed according to the defined flow.
Assuming we change the value of filename_param from train.csv to part_0.csv, the transformation job now takes part_0.csv (provided that a file with the name part_0.csv exists under the same folder) as its new input data.
Additionally, if you attempt to export your flow to an Amazon S3 destination (via a Jupyter notebook), you now see a new cell containing the parameters that you defined.
Note that each parameter takes its default value, but you can change it by replacing its value in the parameter_overrides dictionary (while leaving the keys of the dictionary unchanged).
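
The overrides cell in the exported notebook can be thought of along the following lines. This is a hedged sketch: the exact contents of the generated cell may differ, and the flow_parameters dictionary and apply_overrides helper are hypothetical, added only to show why the keys must stay as they are.

```python
# Default values as defined in the flow (parameter name taken from this walkthrough)
flow_parameters = {"filename_param": "train.csv"}

# Change only the values; every key must match a parameter defined in the flow
parameter_overrides = {"filename_param": "part_0.csv"}


def apply_overrides(defaults, overrides):
    """Merge overrides into the defaults, rejecting keys the flow does not define."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**defaults, **overrides}


effective_parameters = apply_overrides(flow_parameters, parameter_overrides)
print(effective_parameters)
```

Renaming a key would leave the flow's parameter at its default value at best, which is why only the values should be edited.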

Additionally, you can create new parameters from the Parameters UI.
Open it up by choosing the parameters icon ({{}}) located next to the Go option; both of them are located next to the URI path value.
A table opens with all the parameters that currently exist on your flow file (only filename_param at this point).
You can create new parameters for your flow by choosing Create Parameter.

A pop-up window opens to let you create a new custom parameter.
Here, we have created a new example_parameter as Number type with a default value of 0. This newly created parameter is now listed in the Parameters table. Hovering over the parameter displays the options Edit, Delete, and Insert.
From within the Parameters UI, you can insert one of your parameters into the URI by selecting the desired parameter and choosing Insert.
This adds the parameter to the end of your URI. You need to move it to the desired section within your URI.
Change the parameter’s default value, apply the change (from the modal), choose Go, and choose the refresh icon to update the preview list using the selected dataset based on the newly defined parameter’s value.

Let’s now explore other parameter types. Assume we now have a dataset split into multiple parts, where each file has a part number.
If we want to dynamically change the file number, we can define a Number parameter for that portion of the file name. The selected file is the one that matches the number specified in the parameter.
Now let’s demonstrate how to use a Pattern parameter. Suppose we want to import all the part_1.csv files in all of the folders under the titanic_dataset/ folder. Pattern parameters can take any valid regular expression; the UI shows some regex patterns as examples.
Create a Pattern parameter called any_pattern to match any folder or file under the titanic_dataset/ folder with default value .*. Notice that the wildcard is not a single * (asterisk) but also has a dot.
Highlight the titanic_dataset/ part of the path and create a custom parameter. This time we choose the Pattern type. This pattern selects all the files called part_1.csv from any of the folders under titanic_dataset/.
A parameter can be used more than once in a path. For example, we can use our newly created parameter any_pattern twice in our URI to match any of the part files in any of the folders under titanic_dataset/.
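
The difference between .* and a bare * is easy to check with Python's re module. The sketch below is only an illustration: the object keys are made up to mirror the folder structure created earlier, and it assumes the pattern is matched against the full key, which may not be exactly how Data Wrangler evaluates Pattern parameters.

```python
import re

# Hypothetical object keys mirroring the folder structure used in this post
keys = [
    "titanic_dataset/train_1/part_0.csv",
    "titanic_dataset/train_1/part_1.csv",
    "titanic_dataset/train_2/part_1.csv",
    "titanic_dataset/datetime_data/2021/01/01/part_1.csv",
    "titanic_dataset/train/train.csv",
]

any_pattern = ".*"  # default value of the Pattern parameter

# Pattern used once, for the folder portion. The literal parts of the path
# are escaped so that "." in ".csv" only matches a dot.
folder_regex = re.compile(
    re.escape("titanic_dataset/") + any_pattern + re.escape("/part_1.csv")
)
folder_matches = [k for k in keys if folder_regex.fullmatch(k)]
print(folder_matches)

# Pattern used twice: any part file in any folder
part_regex = re.compile(
    re.escape("titanic_dataset/") + any_pattern
    + re.escape("/part_") + any_pattern + re.escape(".csv")
)
part_matches = [k for k in keys if part_regex.fullmatch(k)]
print(part_matches)

# A bare asterisk is a shell-style wildcard, not a valid regular expression
try:
    re.compile("*")
    bare_asterisk_is_valid = True
except re.error as err:
    bare_asterisk_is_valid = False
    print("invalid pattern:", err)
```

The first pattern keeps the three part_1.csv keys, the second keeps every part file, and neither matches train/train.csv.
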
Finally, let’s create a Datetime parameter. Datetime parameters are useful when we’re dealing with paths that are partitioned by date and time, like those generated by Amazon Kinesis Data Firehose (see Dynamic Partitioning in Kinesis Data Firehose). For this demonstration, we use the data under the datetime_data folder.
Select the portion of your path that is a date/time and create a custom parameter. Choose the Datetime parameter type.
When choosing the Datetime data type, you need to fill in more details.
First of all, you must provide a date format. You can choose any of the predefined date/time formats or create a custom one.
For the predefined date/time formats, the legend provides an example of a date matching the selected format. For this demonstration, we choose the format yyyy/MM/dd.
Next, specify a time zone for the date/time values.
For example, the current date may be January 1, 2022, in one time zone, but may be January 2, 2022, in another time zone.
Finally, you can select the time range, which lets you select the range of files that you want to include in your data flow.
You can specify your time range in hours, days, weeks, months, or years. For this example, we want to get all the files from the last year.
Provide a description of the parameter and choose Create.
If you’re using multiple datasets with different time zones, the time is not converted automatically; you need to preprocess each file or source to convert it to one time zone.
The selected files are all the files under the folders corresponding to last year’s data.
Now if we create a data transformation job, we can see a list of all of our defined parameters, and we can override their default values so that our transformation jobs pick the specified files.
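
As a rough illustration of how a Datetime parameter narrows down date-partitioned paths, the sketch below lists the yyyy/MM/dd prefixes that fall within a time range ending at the current date in a given time zone. It is not Data Wrangler's implementation: the helper name date_prefixes, the mapping of yyyy/MM/dd to Python's %Y/%m/%d, the use of fixed UTC offsets in place of named time zones, and the expression of the range as a number of days are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets standing in for two named time zones (assumption for this sketch)
UTC_MINUS_8 = timezone(timedelta(hours=-8))
UTC_PLUS_9 = timezone(timedelta(hours=9))


def date_prefixes(now, tz, days_back):
    """Return yyyy/MM/dd prefixes for each day in the range, oldest first."""
    local_today = now.astimezone(tz).date()
    start = local_today - timedelta(days=days_back)
    return [
        (start + timedelta(days=offset)).strftime("%Y/%m/%d")
        for offset in range(days_back + 1)
    ]


# The same instant falls on different calendar dates depending on the time zone
instant = datetime(2022, 1, 1, 23, 30, tzinfo=timezone.utc)
date_west = instant.astimezone(UTC_MINUS_8).strftime("%Y/%m/%d")
date_east = instant.astimezone(UTC_PLUS_9).strftime("%Y/%m/%d")
print(date_west, date_east)

# Prefixes for the three days ending on the local date at UTC+9
recent = date_prefixes(instant, UTC_PLUS_9, days_back=2)
print(recent)
```

Passing a larger days_back value (for example 365) approximates the "last year" range chosen in this walkthrough; only folders whose date prefix appears in the resulting list would be selected.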

Schedule processing jobs

You can now schedule processing jobs to automate running the data transformation jobs and exporting your transformed data to either Amazon S3 or Amazon SageMaker Feature Store. You can schedule the jobs with the time and periodicity that suits your needs.

Scheduled processing jobs use Amazon EventBridge rules to schedule the job’s run. Therefore, as a prerequisite, you have to make sure that the AWS Identity and Access Management (IAM) role being used by Data Wrangler, namely the Amazon SageMaker execution role of the Studio instance, has permissions to create EventBridge rules.

Configure IAM

Proceed with the following updates on the IAM SageMaker execution role corresponding to the Studio instance where the Data Wrangler flow is running:

Attach the AmazonEventBridgeFullAccess managed policy.

Attach a policy to grant permission to create a processing job:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "sagemaker:StartPipelineExecution",
			"Resource": "arn:aws:sagemaker:Region:AWS-account-id:pipeline/data-wrangler-*"
		}
	]
}

Grant EventBridge permission to assume the role by adding the following trust policy:

{
	"Effect": "Allow",
	"Principal": {
		"Service": "events.amazonaws.com"
	},
	"Action": "sts:AssumeRole"
}

Alternatively, if you’re using a different role to run the processing job, apply the permissions policy and the trust policy shown above (the second and third steps) to that role. For details about the IAM configuration, refer to Create a Schedule to Automatically Process New Data.
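
If you prefer to prepare these changes programmatically, the following sketch builds the two policy documents in Python. The permissions policy is the one shown above with its placeholders filled in from arguments. The trust policy helper appends the EventBridge statement to an existing trust document instead of replacing it, so that principals the role already trusts stay trusted. The function names, the example Region and account ID, the sample existing trust document, and the role and policy names in the comments are assumptions for illustration.

```python
import copy
import json

EVENTBRIDGE_TRUST_STATEMENT = {
    "Effect": "Allow",
    "Principal": {"Service": "events.amazonaws.com"},
    "Action": "sts:AssumeRole",
}


def pipeline_execution_policy(region, account_id):
    """Build the policy that allows starting Data Wrangler pipelines."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:StartPipelineExecution",
                "Resource": f"arn:aws:sagemaker:{region}:{account_id}:pipeline/data-wrangler-*",
            }
        ],
    }


def add_eventbridge_trust(trust_document):
    """Return a copy of the trust document that also trusts EventBridge."""
    updated = copy.deepcopy(trust_document)
    if EVENTBRIDGE_TRUST_STATEMENT not in updated["Statement"]:
        updated["Statement"].append(copy.deepcopy(EVENTBRIDGE_TRUST_STATEMENT))
    return updated


# Assumed example: a role whose trust policy already allows SageMaker
existing_trust = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

policy = pipeline_execution_policy("us-east-1", "111122223333")
trust = add_eventbridge_trust(existing_trust)
print(json.dumps(trust, indent=2))

# Applying the documents with boto3 (requires credentials; names are placeholders):
# import boto3
# iam = boto3.client("iam")
# iam.put_role_policy(RoleName="my-role", PolicyName="DataWranglerSchedule",
#                     PolicyDocument=json.dumps(policy))
# iam.update_assume_role_policy(RoleName="my-role",
#                               PolicyDocument=json.dumps(trust))
```

Because update_assume_role_policy replaces the whole trust document, merging the new statement into the current one first is the safer choice.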

Create a schedule

To create a schedule, have your flow opened in the Data Wrangler flow editor.

On the Data Flow tab, choose Create job.
Configure the required fields and choose Next, 2. Configure job.
Expand Associate Schedules.
Choose Create new schedule.

The Create new schedule dialog opens, where you define the details of the processing job schedule.
The dialog offers great flexibility to help you define the schedule. You can have, for example, the processing job running at a specific time or every X hours, on specific days of the week.
The periodicity can be granular to the level of minutes.
Define the schedule name and periodicity, then choose Create to save the schedule.
You have the option to start the processing job right away along with the scheduling, which takes care of future runs, or leave the job to run only according to the schedule.
You can also define an additional schedule for the same processing job.
To finish the schedule for the processing job, choose Create.
You see a “Job scheduled successfully” message. Additionally, if you chose to leave the job to run only according to the schedule, you see a link to the EventBridge rule that you just created.
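
EventBridge rules express schedules either as a rate or as a cron expression with six fields (minutes, hours, day of month, month, day of week, year), evaluated in UTC. The periodicity options described above map naturally onto these two forms. The sketch below builds both kinds of expression; whether the Create new schedule dialog produces exactly these strings is an assumption, and the helper names are ours.

```python
def rate_expression(value, unit):
    """Build an EventBridge rate expression, for example rate(2 hours)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge expects the singular unit for 1 and the plural otherwise
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def weekly_cron_expression(hour, minute, days_of_week):
    """Build a cron expression that fires at hour:minute (UTC) on the given days."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour or minute out of range")
    # Day-of-month is '?' because day-of-week is specified
    return f"cron({minute} {hour} ? * {','.join(days_of_week)} *)"


every_two_hours = rate_expression(2, "hour")
weekday_mornings = weekly_cron_expression(9, 30, ["MON", "WED", "FRI"])
print(every_two_hours)
print(weekday_mornings)
```

Seeing the expression form can help when you later inspect or edit the rule in the EventBridge console.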

If you choose the schedule link, a new tab in the browser opens, showing the EventBridge rule. On this page, you can make further modifications to the rule and track its invocation history. To stop your scheduled processing job from running, delete the event rule that contains the schedule name.
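
Deleting the rule can also be scripted. The sketch below finds rules whose name contains the schedule name, removes their targets (EventBridge requires this before a rule can be deleted), and then deletes them. It takes the EventBridge client as an argument so that it can be tried against a stub; with boto3 you would pass boto3.client("events"). The function name is hypothetical, and for brevity the sketch does not paginate through long rule lists.

```python
def delete_schedule_rules(events_client, schedule_name):
    """Delete every EventBridge rule whose name contains schedule_name."""
    deleted = []
    for rule in events_client.list_rules()["Rules"]:
        name = rule["Name"]
        if schedule_name not in name:
            continue
        # A rule cannot be deleted while it still has targets
        targets = events_client.list_targets_by_rule(Rule=name)["Targets"]
        if targets:
            events_client.remove_targets(Rule=name, Ids=[t["Id"] for t in targets])
        events_client.delete_rule(Name=name)
        deleted.append(name)
    return deleted
```

The function returns the names of the rules it deleted, which is worth reviewing because a short schedule name may match more rules than intended.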

The EventBridge rule shows a SageMaker pipeline as its target, which is triggered according to the defined schedule, and the processing job invoked as part of the pipeline.

To track the runs of the SageMaker pipeline, you can go back to Studio, choose the SageMaker resources icon, choose Pipelines, and choose the pipeline name you want to track. You can now see a table with all current and past runs and status of that pipeline.

You can see more details by double-clicking a specific entry.
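
The same run history can be retrieved through the SageMaker API. The following sketch uses the list_pipeline_executions operation and takes the client as an argument; with boto3 you would pass boto3.client("sagemaker"). The helper name is hypothetical, and only the first page of results is read.

```python
def summarize_pipeline_runs(sagemaker_client, pipeline_name):
    """Return (start_time, status, arn) for each run of the pipeline, newest first."""
    response = sagemaker_client.list_pipeline_executions(
        PipelineName=pipeline_name,
        SortBy="CreationTime",
        SortOrder="Descending",
    )
    return [
        (s["StartTime"], s["PipelineExecutionStatus"], s["PipelineExecutionArn"])
        for s in response["PipelineExecutionSummaries"]
    ]
```

The pipeline name to pass is the one shown as the target of the EventBridge rule.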

Clean up

When you’re not using Data Wrangler, it’s recommended to shut down the instance on which it runs to avoid incurring additional fees.

To avoid losing work, save your data flow before shutting Data Wrangler down.

To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow. Data Wrangler automatically saves your data flow every 60 seconds.
To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated how you can use parameters to import your datasets using Data Wrangler flows and create data transformation jobs on them. Parameterized datasets allow for more flexibility on the datasets you use and allow you to reuse your flows. We also demonstrated how you can set up scheduled jobs to automate your data transformations and exports to either Amazon S3 or Feature Store, at the time and periodicity that suits your needs, directly from within Data Wrangler’s user interface.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, refer to Prepare ML Data with Amazon SageMaker Data Wrangler.

About the authors

David Laredo is a Prototyping Architect for the Prototyping and Cloud Engineering team at Amazon Web Services, where he has helped develop multiple machine learning prototypes for AWS customers. He has been working in machine learning for the last 6 years, training and fine-tuning ML models and implementing end-to-end pipelines to productionize those models. His areas of interest are NLP, ML applications, and end-to-end ML.

Givanildo Alves is a Prototyping Architect with the Prototyping and Cloud Engineering team at Amazon Web Services, helping clients innovate and accelerate by showing the art of the possible on AWS, having already implemented several prototypes around artificial intelligence. He has a long career in software engineering and previously worked as a Software Development Engineer at Amazon.com.br.

Adrian Fuentes is a Program Manager with the Prototyping and Cloud Engineering team at Amazon Web Services, innovating for customers in machine learning, IoT, and blockchain. He has over 15 years of experience managing and implementing projects and 1 year of tenure on AWS.

Time Stamp: November 15, 2022
