Create Random And Stratified Samples Of Data With Amazon SageMaker Data Wrangler

Republished by Plato.


In this post, we walk you through two sampling techniques in Amazon SageMaker Data Wrangler so you can quickly create processing workflows for your data. We cover both random sampling and stratified sampling to help you sample your data based on your specific requirements.

Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. You can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. With Data Wrangler’s data selection tool, you can choose the data you want from various data sources and import it with a single click. Data Wrangler contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code. With Data Wrangler’s visualization templates, you can quickly preview these transformations and verify that they completed as you intended by viewing them in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. After your data is prepared, you can build fully automated ML workflows with Amazon SageMaker Pipelines and save them for reuse in Amazon SageMaker Feature Store.

What is sampling and how can it help

In statistical analysis, the total set of observations is known as the population. When working with data, it’s often not computationally feasible to measure every observation from the population. Statistical sampling is a procedure that allows you to understand your data by selecting subsets from the population.

Sampling offers a practical solution that sacrifices some accuracy for the sake of practicality and ease. To ensure your sample is a good representation of the overall population, you can employ sampling strategies. Data Wrangler supports two of the most common strategies: random sampling and stratified sampling.

Random sampling

If you have a large dataset, experimentation on that dataset may be time-consuming. Data Wrangler provides random sampling so you can efficiently process and visualize your data. For example, you may want to compute the average number of purchases for a customer within a time frame, or you may want to compute the attrition rate of a subscriber. You can use a random sample to visualize approximations to these metrics.

A random sample from your dataset is chosen so that each element has an equal probability of being selected. This operation is performed in an efficient manner suitable for large datasets, so the sample size returned is approximately the size requested, and not necessarily equal to the size requested.
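One common way to get this behavior is to keep each row independently with the same probability, which needs only a single pass over the data but returns a sample whose size is only approximately the one requested. The following is a minimal sketch of that idea, not Data Wrangler's actual implementation; the function name `approximate_random_sample` and its parameters are hypothetical.

```python
import random


def approximate_random_sample(rows, target_size, seed=None):
    """Keep each row independently with probability target_size / len(rows).

    Illustrative assumption: `rows` is a sequence that supports len().
    The returned size is close to, but usually not exactly, target_size.
    """
    total = len(rows)
    if total == 0 or target_size <= 0:
        return []
    keep_probability = min(1.0, target_size / total)
    rng = random.Random(seed)
    # Every row gets the same chance of selection, in a single pass.
    return [row for row in rows if rng.random() < keep_probability]


population = range(1_000_000)
sample = approximate_random_sample(population, target_size=10_000, seed=42)
print(len(sample))  # near 10,000, but typically not exactly 10,000
```

Because each row is decided independently, the sample size follows a binomial distribution centered on the requested size, which is why the result is approximate rather than exact.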

You can use random sampling if you want to do quick approximate calculations to understand your dataset. As the sample size gets larger, the random sample can better approximate the entire dataset, but unless you include all data points, your random sample may not include all outliers and edge cases. If you want to prepare your entire dataset interactively, you can also switch to a larger instance type.

As a general rule, the sampling error in computing the population mean using a random sample tends to 0 as the sample gets larger. As the sample size increases, the error decreases as the inverse of the square root of the sample size. The takeaway: the larger the sample, the better the approximation.
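The inverse-square-root relationship can be written as the standard error of the sample mean. This is a brief worked form, assuming independent draws from a population with standard deviation \sigma and a sample of size n (both symbols are introduced here for illustration):

```latex
\mathrm{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}},
\qquad
\frac{\mathrm{SE}(\bar{x}_{4n})}{\mathrm{SE}(\bar{x}_n)}
  = \frac{\sqrt{n}}{\sqrt{4n}}
  = \frac{1}{2}
```

In other words, quadrupling the sample size only halves the expected error, and a 100-fold larger sample reduces it by a factor of 10.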

Stratified sampling

In some cases, your population can be divided into strata, or mutually exclusive buckets, such as geographic location for addresses, publication year for songs, or tax brackets for incomes. Random sampling is the most popular sampling technique, but if some strata are uncommon in your population, you can use stratified sampling in Data Wrangler to ensure that each stratum is proportionally represented in your sample. This can reduce sampling errors and help ensure you’re capturing edge cases during your experimentation.

In the real world, fraudulent credit card transactions are rare events and typically make up less than 1% of your data. If we were to sample randomly, it’s not uncommon for the sample to contain very few or no fraudulent transactions. As a result, when training a model, we would have too few fraudulent examples to learn an accurate model. We can use stratified sampling to make sure we have proportional representation of fraudulent transactions.
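To make the risk concrete, the short sketch below estimates how likely a simple random sample is to contain no rare rows at all. It assumes rows are drawn independently and uses an illustrative fraud rate of exactly 1%; the function names are hypothetical.

```python
def prob_no_rare_rows(rare_fraction, sample_size):
    """Probability that a sample of independent draws contains zero rare rows."""
    return (1.0 - rare_fraction) ** sample_size


def expected_rare_rows(rare_fraction, sample_size):
    """Average number of rare rows expected in the sample."""
    return rare_fraction * sample_size


# Illustrative assumption: 1% of all transactions are fraudulent.
for n in (100, 500, 10_000):
    print(n, expected_rare_rows(0.01, n), prob_no_rare_rows(0.01, n))
```

With a 1% fraud rate, a 100-row random sample misses fraud entirely roughly 37% of the time, and even when it does not, it contains only about one fraudulent row on average.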

In stratified sampling, the size of each stratum in the sample is proportional to the size of that stratum in the population. This works by dividing your data into strata based on your specified column, selecting random samples from each stratum in the correct proportion, and combining those samples into a stratified sample of the population.
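The three steps described above (split into strata, sample each one proportionally, combine) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Data Wrangler's implementation: the function `stratified_sample`, the `fraction` parameter, and the choice to keep at least one row per stratum are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, stratify_key, fraction, seed=None):
    """Sample the same fraction of rows from every stratum, then combine."""
    rng = random.Random(seed)

    # Step 1: divide the rows into strata based on the chosen column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[stratify_key]].append(row)

    # Step 2: draw a random sample from each stratum in proportion to its size.
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        # Assumption: keep at least one row so small strata are never dropped.
        size = min(len(members), max(1, size))
        sample.extend(rng.sample(members, size))

    # Step 3: combine and shuffle so strata are not grouped together.
    rng.shuffle(sample)
    return sample


# Hypothetical data: 200 fraudulent and 800 non-fraudulent rows.
rows = [{"id": i, "is_fraud": i % 5 == 0} for i in range(1_000)]
sample = stratified_sample(rows, "is_fraud", fraction=0.1, seed=0)
fraud_count = sum(row["is_fraud"] for row in sample)
print(len(sample), fraud_count)
```

Sampling each stratum separately guarantees the class ratio of the sample, whereas a plain random sample only matches it on average.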

Stratified sampling is a useful technique when you want to understand how different groups in your data compare with each other, and you want to ensure you have appropriate representation from each group.

Random sampling when importing from Amazon S3

In this section, we use random sampling with a dataset consisting of both fraudulent and non-fraudulent events from our fraud detection system. You can download the dataset to follow along with this post (CC 4.0 international attribution license).

At the time of this writing, you can import datasets from Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Snowflake. Our dataset is very large, containing 1 million rows. In this case, we want to sample 10,000 rows on import from Amazon S3 for some interactive experimentation within Data Wrangler.

Open SageMaker Studio and create a new Data Wrangler flow.
Under Import data, choose Amazon S3.
Choose the dataset to import.
In the Details pane, provide your dataset name and file type.
For Sampling, choose Random.
For Sample size, enter 10000.
Choose Import to load the dataset into Data Wrangler.

You can visualize two distinct steps on the data flow page in Data Wrangler. The first step indicates the loading of the sample dataset based on the sampling strategy you defined. After the data is loaded, Data Wrangler performs auto detection of the data types for each of the columns in the dataset. This step is added by default for all datasets.

You can now review the random sampled data in Data Wrangler by adding an analysis.

Choose the plus sign next to Data types and choose Analysis.
For Analysis type, choose Scatter Plot.
Choose feat_1 and feat_2 as the X axis and Y axis, respectively.
For Color by, choose is_fraud.

When you’re comfortable with the dataset, proceed to do further data transformations as per your business requirement to prepare your data for ML.

In the following screenshot, we can observe the fraudulent (dark blue) and non-fraudulent (light blue) transactions in our analysis.

In the next section, we discuss using stratified sampling to ensure the fraudulent cases are chosen proportionally.

Stratified sampling with a transform

Data Wrangler lets you sample on import as well as via a transform. In this section, we discuss using stratified sampling via a transform after you have imported your dataset into Data Wrangler.

To initiate sampling, on the Data flow tab, choose the plus sign next to the imported dataset and choose Add transform.

At the time of this writing, Data Wrangler provides more than 300 built-in transformations. In addition to the built-in transforms, you can write your own custom transforms in Pandas or PySpark.

From the Add transform list, choose Sampling.

You can now use three distinct sampling strategies: limit, random, and stratified.

For Sampling method, choose Stratified.
Use the is_fraud column as the stratify column.
Choose Preview to preview the transformation, then choose Add to add this transformation as a step to your transformation recipe.

Your data flow now reflects the added sampling step.

Now we can review the stratified sample by adding an analysis.

Choose the plus sign and choose Analysis.
For Analysis type, choose Histogram.
Choose is_fraud for both X axis and Color by.
Choose Preview.

In the following screenshot, we can observe the breakdown of fraudulent (dark blue) and non-fraudulent (light blue) cases chosen via stratified sampling in the correct proportions of 20% fraudulent and 80% non-fraudulent.
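The same proportion check that the histogram provides visually can also be done numerically by comparing class shares in the full dataset and in the sample. This is a small sketch with made-up label lists that mirror the 20%/80% split mentioned above; `class_shares` is a hypothetical helper.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows that fall into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical is_fraud values: 1 = fraudulent, 0 = non-fraudulent.
population_labels = [1] * 200 + [0] * 800
sample_labels = [1] * 20 + [0] * 80

print(class_shares(population_labels))
print(class_shares(sample_labels))
```

If the two dictionaries agree, the sample preserves the class balance of the data it was drawn from.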

Conclusion

It is essential to sample data correctly when working with extremely large datasets and to choose the right sampling strategy to meet your business requirements. The effectiveness of your sampling relies on various factors, including business outcome, data availability, and distribution. In this post, we covered how to use Data Wrangler and its built-in sampling strategies to prepare your data.

You can start using this capability today in all Regions where SageMaker Studio is available. To get started, visit Prepare ML Data with Amazon SageMaker Data Wrangler.

Acknowledgements

The authors would like to thank Jonathan Chung (Applied Scientist) for his review and valuable feedback on this article.

About the Authors

Ben Harris is a software engineer with experience designing, deploying, and maintaining scalable data pipelines and machine learning solutions across a variety of domains.

Vishaal Kapoor is a Senior Applied Scientist with AWS AI. He is passionate about helping customers understand their data in Data Wrangler. In his spare time, he mountain bikes, snowboards, and spends time with his family.

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps Hi-Tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Ajai Sharma is a Principal Product Manager for Amazon SageMaker, where he focuses on Data Wrangler, a visual data preparation tool for data scientists. Prior to AWS, Ajai was a Data Science Expert at McKinsey and Company, where he led ML-focused engagements for leading finance and insurance firms worldwide. Ajai is passionate about data science and loves to explore the latest algorithms and machine learning techniques.

Timestamp: April 26, 2022

Timestamp: January 31, 2024
