Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a UI-based data preparation tool that helps perform data analysis, preprocessing, and visualization with features to clean, transform, and prepare data faster. Data Wrangler pre-built flow templates help make data preparation quicker for data scientists and machine learning (ML) practitioners by helping you accelerate and understand best practice patterns for data flows using common datasets.

You can use Data Wrangler flows to perform the following tasks:

  • Vizualizacija podatkov – Pregledovanje statističnih lastnosti za vsak stolpec v naboru podatkov, izdelava histogramov, preučevanje odstopanj
  • Čiščenje podatkov – Odstranjevanje dvojnikov, izpuščanje ali polnjenje vnosov z manjkajočimi vrednostmi, odstranjevanje izstopajočih vrednosti
  • Obogatitev podatkov in inženiring funkcij – Obdelava stolpcev za izgradnjo bolj izrazitih funkcij, izbiranje podnabora funkcij za usposabljanje

This post will help you understand Data Wrangler using the following sample pre-built flows on GitHub. The repository showcases tabular data transformation, time series data transformations, and joined dataset transforms. Each requires a different type of transformations because of their basic nature. Standard tabular or cross-sectional data is collected at a specific point in time. In contrast, time series data is captured repeatedly over time, with each successive data point dependent on its past values.

Let’s look at an example of how we can use the sample data flow for tabular data.

Predpogoji

Data Wrangler je Amazon SageMaker feature available within Amazon SageMaker Studio, so we need to follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Hiter začetek navodila. Hitri začetek uporablja iste privzete nastavitve kot standardna nastavitev Studio. Izberete lahko tudi vkrcanje z uporabo AWS IAM Identity Center (naslednik AWS Single Sign-On) za avtentikacijo (glejte Vključite se v domeno Amazon SageMaker z uporabo centra za identiteto IAM).

Import the dataset and flow files into Data Wrangler using Studio

The following steps outline how to import data into SageMaker to be consumed by Data Wrangler:

Initialize Data Wrangler via the Studio UI by choosing Nov pretok podatkov.

Kloniraj GitHub repo to download the flow files into your Studio environment.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

When the clone is complete, you should be able to see the repository content in the left pane.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

Izberite datoteko Hotel-Bookings-Classification.flow to import the flow file into Data Wrangler.

If you use the time series or joined data flow, the flow will appear as a different name.After the flow has been imported, you should see the following screenshot. This shows us errors because we need to make sure that the flow file points to the correct data source in Preprosta storitev shranjevanja Amazon (Amazon S3).

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

Izberite Uredi nabor podatkov to bring up all your S3 buckets. Next, choose the dataset hotel_bookings.csv from your S3 bucket for running through the tabular data flow.

Note that if you’re using the joined data flow, you may have to import multiple datasets into Data WranglerUporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

In the right pane, make sure ODSTAVEK is chosen as the delimiter and Vzorčenje nastavljena na Najprej K. Our dataset is small enough to run Data Wrangler transformations on the full dataset, but we wanted to highlight how you can import the dataset. If you have a large dataset, consider using sampling. Choose uvoz to import this dataset to Data Wrangler.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

After the dataset is imported, Data Wrangler automatically validates the dataset and detects the data types. You can see that the errors have gone away because we’re pointing to the correct dataset. The flow editor now shows two blocks showcasing that the data was imported from a source and data types recognized. You can also edit the data types if needed.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

The following screenshot shows our data types.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

Let’s look at some of the transforms done as a part of this tabular flow. If you’re using the Časovne serije or pridružil data flows, check out some common transforms on the GitHub repo. We performed some basic exploratory data analysis using data insights reports that studied the target leakage and feature collinearity in the dataset, table summary analyses, and quick modeling capability. Explore the steps on the GitHub repo.

Now we drop columns based on the recommendations provided by the Data Insights and Quality Report.

  • For target leakage, drop reservation_status.
  • For redundant columns, drop days_in_waiting_list, hotel, reserved_room_type, arrival_date_month, reservation_status_date, babies, in arrival_date_day_of_month.
  • Based on linear correlation results, drop columns arrival_date_week_number in arrival_date_year because the correlation values for these feature (column) pairs are greater than the recommended threshold of 0.90.
  • Based on non-linear correlation results, drop reservation_status. This column was already marked to be dropped based on the target leakage analysis.
  • Process numeric values (min-max scaling) for lead_time, stays_in_weekend_nights, stays_in_weekday_nights, is_repeated_guest, prev_cancellations, prev_bookings_not_canceled, booking_changes, adr, total_of_specical_requests, in required_car_parking_spaces.
  • One-hot encode categorical variables like meal, is_repeated_guest, market_segment, assigned_room_type, deposit_type, in customer_type.
  • Balance the target variable Random oversample for class imbalance.Use the quick modeling capability to handle outliers and missing values.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

Izvoz v Amazon S3

Now we have gone through the different transforms and are ready to export the data to Amazon S3. This option creates a SageMaker processing job, which runs the Data Wrangler processing flow and saves the resulting dataset to a specified S3 bucket. Follow the next steps to set up the export to Amazon S3:

Choose the plus sign next to a collection of transformation elements and choose Dodaj cilj, Potem Amazon S3.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

  • za Ime nabora podatkov, enter a name for the new dataset, for example NYC_export.
  • za Tip datoteke, izberite CSV.
  • za Delititer, izberite Comma.
  • za Stiskanje, izberite Noben.
  • za Lokacija Amazon S3, use the same bucket name that we created earlier.
  • Izberite Dodaj cilj.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

Izberite Ustvari službo.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

za Ime delovnega mesta, enter a name or keep the autogenerated option and choose destinacija. We have only one destination, S3:testingtabulardata, but you might have multiple destinations from different steps in your workflow. Leave the KMS key ARN field empty and choose Naslednji.

Now you have to configure the compute capacity for a job. You can keep all default values for this example.

  • za Vrsta primerka, use ml.m5.4xlarge.
  • za Število primerov, use 2.
  • Lahko raziskujete Dodatna konfiguracija, but keep the default settings.
  • Izberite Run.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

Now your job has started, and it takes some time to process 6 GB of data according to our Data Wrangler processing flow. The cost for this job will be around $2 USD, because ml.m5.4xlarge costs $0.922 USD per hour and we’re using two of them.

If you choose the job name, you’re redirected to a new window with the job details.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

On the job details page, you can see all the parameters from the previous steps.

When the job status changes to Completed, you can also check the Processing time (seconds) value. This processing job takes around 5–10 minutes to complete.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

When the job is complete, the train and test output files are available in the corresponding S3 output folders. You can find the output location from the processing job configurations.

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.

After the Data Wrangler processing job is complete, we can check the results saved in our S3 bucket. Don’t forget to update the job_name variable with your job name.

You can now use this exported data for running ML models.

Čiščenje

Delete your S3 buckets in vaše Data Wrangler tok in order to delete the underlying resources and prevent unwanted costs after you finish the experiment.

zaključek

In this post, we showed how you can import the tabular pre-built data flow into Data Wrangler, plug it against our dataset, and export the results to Amazon S3. If your use cases require you to manipulate time series data or join multiple datasets, you can go through the other pre-built sample flows in the GitHub repo.

After you have imported a pre-built data prep workflow, you can integrate it with Amazon SageMaker Processing, Amazonski cevovodi SageMakerin Trgovina s funkcijami Amazon SageMaker to simplify the task of processing, sharing, and storing ML training data. You can also export this sample data flow to a Python script and create a custom ML data prep pipeline, thereby accelerating your release velocity.

Priporočamo vam, da si ogledate naše GitHub repozitorij pridobiti praktično prakso in najti nove načine za izboljšanje natančnosti modela! Če želite izvedeti več o SageMakerju, obiščite Vodnik za razvijalce Amazon SageMaker.


O avtorjih

Uporabite vzorce Github z Amazon SageMaker Data Wrangler PlatoBlockchain Data Intelligence. Navpično iskanje. Ai.Isha Dua je višji arhitekt rešitev s sedežem na območju zaliva San Francisco. Strankam AWS Enterprise pomaga pri rasti z razumevanjem njihovih ciljev in izzivov ter jih usmerja, kako lahko oblikujejo svoje aplikacije v oblaku, hkrati pa poskrbi, da so prožne in razširljive. Navdušena je nad tehnologijami strojnega učenja in okoljsko trajnostjo.

Časovni žig:

Več od Strojno učenje AWS