Unlock ML Insights Using The Amazon SageMaker Feature Store Feature Processor

Published by Plato

Amazon SageMaker Feature Store provides an end-to-end solution to automate feature engineering for machine learning (ML). For many ML use cases, raw data like log files, sensor readings, or transaction records needs to be transformed into meaningful features that are optimized for model training.

Feature quality is critical to ensure a highly accurate ML model. Transforming raw data into features using aggregation, encoding, normalization, and other operations is often needed and can require significant effort. Engineers must manually write custom data preprocessing and aggregation logic in Python or Spark for each use case.

This undifferentiated heavy lifting is cumbersome, repetitive, and error-prone. The SageMaker Feature Store Feature Processor reduces this burden by automatically transforming raw data into aggregated features suitable for batch training ML models. It lets engineers provide simple data transformation functions, then handles running them at scale on Spark and managing the underlying infrastructure. This enables data scientists and data engineers to focus on the feature engineering logic rather than implementation details.

In this post, we demonstrate how a car sales company can use the Feature Processor to transform raw sales transaction data into features in three steps:

Local runs of data transformations.
Remote runs at scale using Spark.
Operationalization via pipelines.

We show how SageMaker Feature Store ingests the raw data, runs feature transformations remotely using Spark, and loads the resulting aggregated features into a feature group. These engineered features can then be used to train ML models.

For this use case, we see how SageMaker Feature Store helps convert the raw car sales data into structured features. These features are subsequently used to gain insights like:

Average and maximum price of red convertibles from 2010
Models with best mileage vs. price
Sales trends of new vs. used cars over the years
Differences in average MSRP across locations

We also see how SageMaker Feature Store pipelines keep the features updated as new data comes in, enabling the company to continually gain insights over time.

Solution overview

We work with the dataset car_data.csv, which contains specifications such as model, year, status, mileage, price, and MSRP for used and new cars sold by the company. The following screenshot shows an example of the dataset.

"Image displaying a table of car data, including car model, year, mileage, price, and MSRP for various vehicles."

The solution notebook feature_processor.ipynb contains the following main steps, which we explain in this post:

Create two feature groups: one called car-data for raw car sales records and another called car-data-aggregated for aggregated car sales records.
Use the @feature_processor decorator to load data into the car-data feature group from Amazon Simple Storage Service (Amazon S3).
Run the @feature_processor code remotely as a Spark application to aggregate the data.
Operationalize the feature processor via SageMaker Pipelines and schedule runs.
Explore the feature processing pipelines and lineage in Amazon SageMaker Studio.
Use aggregated features to train an ML model.

Prerequisites

To follow this tutorial, you need the following:

For this post, we refer to the following notebook, which demonstrates how to get started with Feature Processor using the SageMaker Python SDK.

Create feature groups

To create the feature groups, complete the following steps:

Create a feature group definition for car-data as follows:

# Feature Group - Car Sales
CAR_SALES_FG_NAME = "car-data"
CAR_SALES_FG_ARN = f"arn:aws:sagemaker:{region}:{aws_account_id}:feature-group/{CAR_SALES_FG_NAME}"
CAR_SALES_FG_ROLE_ARN = offline_store_role
CAR_SALES_FG_OFFLINE_STORE_S3_URI = f"s3://{s3_bucket}/{s3_offline_store_prefix}"
CAR_SALES_FG_FEATURE_DEFINITIONS = [
    FeatureDefinition(feature_name="id", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="model", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="model_year", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="status", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="mileage", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="price", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="msrp", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="ingest_time", feature_type=FeatureTypeEnum.FRACTIONAL),
]

The features correspond to each column in the car_data.csv dataset (Model, Year, Status, Mileage, Price, and MSRP).
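The ARN and offline store URI in the definition above are plain f-strings. The following minimal sketch shows how they resolve; the helper names feature_group_arn and offline_store_uri are hypothetical, the bucket and prefix values are placeholders, and the Region and account ID are taken from the sample log output later in this post.

```python
def feature_group_arn(region: str, account_id: str, name: str) -> str:
    """Build a feature group ARN with the same pattern as CAR_SALES_FG_ARN."""
    return f"arn:aws:sagemaker:{region}:{account_id}:feature-group/{name}"


def offline_store_uri(bucket: str, prefix: str) -> str:
    """Build the offline store S3 URI with the same pattern as CAR_SALES_FG_OFFLINE_STORE_S3_URI."""
    return f"s3://{bucket}/{prefix}"


# Region and account ID as they appear in the sample log output of this post
example_arn = feature_group_arn("us-west-2", "416578662734", "car-data")
# Placeholder bucket and prefix, for illustration only
example_uri = offline_store_uri("my-example-bucket", "feature-store/offline")
print(example_arn)
print(example_uri)
```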

Add the record identifier id and event time ingest_time to the feature group:

CAR_SALES_FG_RECORD_IDENTIFIER_NAME = "id"
CAR_SALES_FG_EVENT_TIME_FEATURE_NAME = "ingest_time"

Create a feature group definition for car-data-aggregated as follows:

# Feature Group - Aggregated Car Sales
AGG_CAR_SALES_FG_NAME = "car-data-aggregated"
AGG_CAR_SALES_FG_ARN = (
    f"arn:aws:sagemaker:{region}:{aws_account_id}:feature-group/{AGG_CAR_SALES_FG_NAME}"
)
AGG_CAR_SALES_FG_ROLE_ARN = offline_store_role
AGG_CAR_SALES_FG_OFFLINE_STORE_S3_URI = f"s3://{s3_bucket}/{s3_offline_store_prefix}"
AGG_CAR_SALES_FG_FEATURE_DEFINITIONS = [
    FeatureDefinition(feature_name="model_year_status", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="avg_mileage", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="max_mileage", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="avg_price", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="max_price", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="avg_msrp", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="max_msrp", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="ingest_time", feature_type=FeatureTypeEnum.FRACTIONAL),
]

For the aggregated feature group, the features are model year status, average mileage, max mileage, average price, max price, average MSRP, max MSRP, and ingest time. We add the record identifier model_year_status and event time ingest_time to this feature group.
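The creation code later refers to AGG_CAR_SALES_FG_RECORD_IDENTIFIER_NAME and AGG_CAR_SALES_FG_EVENT_TIME_FEATURE_NAME. By analogy with the car-data constants, they would presumably be defined as in the following sketch. The check_required_features helper is hypothetical and only confirms that both names appear among the feature names listed above.

```python
# Presumed definitions, mirroring the CAR_SALES_FG_* constants for car-data
AGG_CAR_SALES_FG_RECORD_IDENTIFIER_NAME = "model_year_status"
AGG_CAR_SALES_FG_EVENT_TIME_FEATURE_NAME = "ingest_time"

# Feature names copied from AGG_CAR_SALES_FG_FEATURE_DEFINITIONS
AGG_FEATURE_NAMES = [
    "model_year_status",
    "avg_mileage",
    "max_mileage",
    "avg_price",
    "max_price",
    "avg_msrp",
    "max_msrp",
    "ingest_time",
]


def check_required_features(feature_names, record_identifier, event_time):
    """Hypothetical sanity check: both special features must be defined."""
    missing = [n for n in (record_identifier, event_time) if n not in feature_names]
    if missing:
        raise ValueError(f"Missing feature definitions: {missing}")
    return True


check_required_features(
    AGG_FEATURE_NAMES,
    AGG_CAR_SALES_FG_RECORD_IDENTIFIER_NAME,
    AGG_CAR_SALES_FG_EVENT_TIME_FEATURE_NAME,
)
```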

Now, create the car-data feature group:

# Create Feature Group - Car sale records.
car_sales_fg = FeatureGroup(
    name=CAR_SALES_FG_NAME,
    feature_definitions=CAR_SALES_FG_FEATURE_DEFINITIONS,
    sagemaker_session=sagemaker_session,
)

create_car_sales_fg_resp = car_sales_fg.create(
    record_identifier_name=CAR_SALES_FG_RECORD_IDENTIFIER_NAME,
    event_time_feature_name=CAR_SALES_FG_EVENT_TIME_FEATURE_NAME,
    s3_uri=CAR_SALES_FG_OFFLINE_STORE_S3_URI,
    enable_online_store=True,
    role_arn=CAR_SALES_FG_ROLE_ARN,
)

Create the car-data-aggregated feature group:

# Create Feature Group - Aggregated car sales records.
agg_car_sales_fg = FeatureGroup(
    name=AGG_CAR_SALES_FG_NAME,
    feature_definitions=AGG_CAR_SALES_FG_FEATURE_DEFINITIONS,
    sagemaker_session=sagemaker_session,
)

create_agg_car_sales_fg_resp = agg_car_sales_fg.create(
    record_identifier_name=AGG_CAR_SALES_FG_RECORD_IDENTIFIER_NAME,
    event_time_feature_name=AGG_CAR_SALES_FG_EVENT_TIME_FEATURE_NAME,
    s3_uri=AGG_CAR_SALES_FG_OFFLINE_STORE_S3_URI,
    enable_online_store=True,
    role_arn=AGG_CAR_SALES_FG_ROLE_ARN,
)

You can navigate to the SageMaker Feature Store option under Data on the SageMaker Studio Home menu to see the feature groups.

Image from Sagemaker Feature store with headers Feature group name and description

Use the @feature_processor decorator to load data

In this section, we locally transform the raw input data (car_data.csv) from Amazon S3 into the car-data feature group using the Feature Store Feature Processor. This initial local run allows us to develop and iterate before running remotely, and could be done on a sample of the data if desired for faster iteration.

With the @feature_processor decorator, your transformation function runs in a Spark runtime environment where the input arguments provided to your function and its return value are Spark DataFrames.

Install the Feature Processor SDK from the SageMaker Python SDK and its extras using the following command:

pip install sagemaker[feature-processor]

The number of input parameters in your transformation function must match the number of inputs configured in the @feature_processor decorator. In this case, the @feature_processor decorator has car_data.csv as its single input and the car-data feature group as its output. Setting target_stores to OfflineStore indicates that this is a batch operation:

from sagemaker.feature_store.feature_processor import (
    feature_processor,
    FeatureGroupDataSource,
    CSVDataSource,
)

@feature_processor(
    inputs=[CSVDataSource(RAW_CAR_SALES_S3_URI)],
    output=CAR_SALES_FG_ARN,
    target_stores=["OfflineStore"],
)
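To illustrate the rule about matching parameters to inputs, here is a minimal sketch of such a check using only the standard library. It is not the SDK's own validation. The helper names are hypothetical, and the sketch assumes that arguments named params and spark are optional extras that do not count as data inputs, which is consistent with the aggregate(source_feature_group, spark) function used later in this post with a single input.

```python
import inspect

# Assumption: these argument names are optional extras, not data inputs
OPTIONAL_ARGS = {"params", "spark"}


def count_data_args(fn) -> int:
    """Count the parameters of fn that would receive input DataFrames."""
    names = inspect.signature(fn).parameters
    return sum(1 for name in names if name not in OPTIONAL_ARGS)


def matches_inputs(fn, inputs) -> bool:
    """Return True when fn takes exactly one data argument per configured input."""
    return count_data_args(fn) == len(inputs)


# Stand-in functions with the same signatures as the ones in this post
def transform(raw_s3_data_as_df):
    return raw_s3_data_as_df


def aggregate(source_feature_group, spark):
    return source_feature_group


print(matches_inputs(transform, ["car_data.csv"]))   # one input, one data argument
print(matches_inputs(aggregate, ["car-data"]))       # spark is not counted
print(matches_inputs(transform, ["a.csv", "b.csv"]))  # mismatch
```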

Define the transform() function to transform the data. This function performs the following actions:
- Convert column names to lowercase.
- Add the event time to the ingest_time column.
- Remove punctuation and replace missing values with NA.

def transform(raw_s3_data_as_df):
    """Load data from S3, perform basic feature engineering, store it in a Feature Group"""
    from pyspark.sql.functions import regexp_replace
    from pyspark.sql.functions import lit
    import time

    transformed_df = (
        # Strip the dollar sign; raw strings keep the regex escapes intact
        raw_s3_data_as_df.withColumn("Price", regexp_replace("Price", r"\$", ""))
        # Rename Columns
        .withColumnRenamed("Id", "id")
        .withColumnRenamed("Model", "model")
        .withColumnRenamed("Year", "model_year")
        .withColumnRenamed("Status", "status")
        .withColumnRenamed("Mileage", "mileage")
        .withColumnRenamed("Price", "price")
        .withColumnRenamed("MSRP", "msrp")
        # Add Event Time
        .withColumn("ingest_time", lit(int(time.time())))
        # Remove punctuation and fluff; replace with NA
        .withColumn("mileage", regexp_replace("mileage", r"(,)|(mi\.)", ""))
        .withColumn("mileage", regexp_replace("mileage", "Not available", "NA"))
        .withColumn("price", regexp_replace("price", ",", ""))
        .withColumn("msrp", regexp_replace("msrp", r"(^MSRP\s\$)|(,)", ""))
        .withColumn("msrp", regexp_replace("msrp", "Not specified", "NA"))
        .withColumn("msrp", regexp_replace("msrp", r"\$\d+[a-zA-Z\s]+", "NA"))
        .withColumn("model", regexp_replace("model", r"^\d\d\d\d\s", ""))
    )

    # Print a preview, then return the DataFrame so it is ingested into the output feature group
    transformed_df.show()
    return transformed_df
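To see what the cleaning patterns do, the following minimal sketch applies them with Python's re module instead of Spark. It assumes the patterns are the escaped forms \$, mi\., \s, and \d. The helper names and the sample strings are hypothetical and are not taken from car_data.csv.

```python
import re


def clean_price(value: str) -> str:
    """Drop the dollar sign and thousands separators."""
    return re.sub(r",", "", re.sub(r"\$", "", value))


def clean_mileage(value: str) -> str:
    """Drop commas and the 'mi.' suffix, then map the missing marker to NA."""
    value = re.sub(r"(,)|(mi\.)", "", value)
    return re.sub(r"Not available", "NA", value)


def clean_msrp(value: str) -> str:
    """Drop the 'MSRP $' prefix and commas; map missing or free-text values to NA."""
    value = re.sub(r"(^MSRP\s\$)|(,)", "", value)
    value = re.sub(r"Not specified", "NA", value)
    return re.sub(r"\$\d+[a-zA-Z\s]+", "NA", value)


def clean_model(value: str) -> str:
    """Drop a leading four-digit year and the space after it."""
    return re.sub(r"^\d\d\d\d\s", "", value)


# Hypothetical raw values, for illustration only
print(repr(clean_price("$40,990.00")))
print(repr(clean_mileage("32,675 mi.")))
print(repr(clean_msrp("MSRP $49,445")))
print(repr(clean_msrp("$500 price drop")))
print(repr(clean_model("2019 Acura MDX Sport Hybrid")))
```

Note that the mileage pattern leaves the space that preceded "mi." in place, which is consistent with the trailing space visible in the mileage column of the sample output.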

Call the transform() function to store the data in the car-data feature group:

# Execute the FeatureProcessor
transform()

The output shows that the data is ingested successfully into the car-data feature group.

The output of the transformed_df.show() function is as follows:

INFO:sagemaker:Ingesting transformed data to arn:aws:sagemaker:us-west-2:416578662734:feature-group/car-data with target_stores: ['OfflineStore']
+---+--------------------+----------+------+-------+--------+-----+-----------+
| id|               model|model_year|status|mileage|   price| msrp|ingest_time|
+---+--------------------+----------+------+-------+--------+-----+-----------+
|  0|    Acura TLX A-Spec|      2022|   New|     NA|49445.00|49445| 1686627154|
|  1|    Acura RDX A-Spec|      2023|   New|     NA|50895.00|   NA| 1686627154|
|  2|    Acura TLX Type S|      2023|   New|     NA|57745.00|   NA| 1686627154|
|  3|    Acura TLX Type S|      2023|   New|     NA|57545.00|   NA| 1686627154|
|  4|Acura MDX Sport H...|      2019|  Used| 32675 |40990.00|   NA| 1686627154|
|  5|    Acura TLX A-Spec|      2023|   New|     NA|50195.00|50195| 1686627154|
|  6|    Acura TLX A-Spec|      2023|   New|     NA|50195.00|50195| 1686627154|
|  7|    Acura TLX Type S|      2023|   New|     NA|57745.00|   NA| 1686627154|
|  8|    Acura TLX A-Spec|      2023|   New|     NA|47995.00|   NA| 1686627154|
|  9|    Acura TLX A-Spec|      2022|   New|     NA|49545.00|   NA| 1686627154|
| 10|Acura Integra w/A...|      2023|   New|     NA|36895.00|36895| 1686627154|
| 11|    Acura TLX A-Spec|      2023|   New|     NA|48395.00|48395| 1686627154|
| 12|Acura MDX Type S ...|      2023|   New|     NA|75590.00|   NA| 1686627154|
| 13|Acura RDX A-Spec ...|      2023|   New|     NA|55345.00|   NA| 1686627154|
| 14|    Acura TLX A-Spec|      2023|   New|     NA|50195.00|50195| 1686627154|
| 15|Acura RDX A-Spec ...|      2023|   New|     NA|55045.00|   NA| 1686627154|
| 16|    Acura TLX Type S|      2023|   New|     NA|56445.00|   NA| 1686627154|
| 17|    Acura TLX A-Spec|      2023|   New|     NA|47495.00|47495| 1686627154|
| 18|   Acura TLX Advance|      2023|   New|     NA|52245.00|52245| 1686627154|
| 19|    Acura TLX A-Spec|      2023|   New|     NA|50595.00|50595| 1686627154|
+---+--------------------+----------+------+-------+--------+-----+-----------+
only showing top 20 rows
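The ingest_time values in this output are Unix epoch seconds, as produced by int(time.time()). As a small sketch, the value can be converted to a readable UTC timestamp with the standard library; the helper name epoch_to_utc is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# ingest_time value taken from the sample output above
ingest_time = epoch_to_utc(1686627154)
print(ingest_time.isoformat())  # → 2023-06-13T03:32:34+00:00
```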

We have successfully transformed the input data and ingested it in the car-data feature group.

Run the @feature_processor code remotely

In this section, we run the feature processing code remotely as a Spark application using the @remote decorator. Running remotely on Spark lets the job scale to large datasets, because Spark distributes processing across a cluster to handle data that is too big for a single machine. The @remote decorator runs the local Python code as a single-node or multi-node SageMaker training job.

Use the @remote decorator along with the @feature_processor decorator as follows:

@remote(spark_config=SparkConfig(), instance_type = "ml.m5.xlarge", ...)
@feature_processor(inputs=[FeatureGroupDataSource(CAR_SALES_FG_ARN)],
                   output=AGG_CAR_SALES_FG_ARN, target_stores=["OfflineStore"], enable_ingestion=False )

The spark_config parameter indicates this is run as a Spark application. The SparkConfig instance configures the Spark configuration and dependencies.

Define the aggregate() function to aggregate the data using PySpark SQL and user-defined functions (UDFs). This function performs the following actions:
- Concatenate model, year, and status to create model_year_status.
- Take the average of price to create avg_price.
- Take the max value of price to create max_price.
- Take the average of mileage to create avg_mileage.
- Take the max value of mileage to create max_mileage.
- Take the average of msrp to create avg_msrp.
- Take the max value of msrp to create max_msrp.
- Group by model_year_status.

def aggregate(source_feature_group, spark):
    """
    Aggregate the data using a SQL query and UDF.
    """
    import time
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import udf

    @udf(returnType=StringType())
    def custom_concat(*cols, delimeter: str = ""):
        return delimeter.join(cols)

    spark.udf.register("custom_concat", custom_concat)

    # Execute SQL string.
    source_feature_group.createOrReplaceTempView("car_data")
    aggregated_car_data = spark.sql(
        f"""
        SELECT
            custom_concat(model, "_", model_year, "_", status) as model_year_status,
            AVG(price) as avg_price,
            MAX(price) as max_price,
            AVG(mileage) as avg_mileage,
            MAX(mileage) as max_mileage,
            AVG(msrp) as avg_msrp,
            MAX(msrp) as max_msrp,
            "{int(time.time())}" as ingest_time
        FROM car_data
        GROUP BY model_year_status
        """
    )

    aggregated_car_data.show()

    return aggregated_car_data
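The grouping logic of the query can be sketched in plain Python, without Spark, to make the steps concrete. This is an illustration rather than the code the post runs: the helper aggregate_rows and the sample rows are hypothetical, msrp is left out for brevity, and values are cast to float before aggregating, which is a choice of this sketch.

```python
from collections import defaultdict


def aggregate_rows(rows):
    """Group rows by model_year_status and compute avg/max per numeric column."""
    groups = defaultdict(list)
    for row in rows:
        # Same key as custom_concat(model, "_", model_year, "_", status)
        key = "_".join([row["model"], row["model_year"], row["status"]])
        groups[key].append(row)

    aggregated = {}
    for key, members in groups.items():
        stats = {}
        for column in ("price", "mileage"):
            # Cast to float first so that max() compares numbers, not text
            values = [float(member[column]) for member in members]
            stats[f"avg_{column}"] = sum(values) / len(values)
            stats[f"max_{column}"] = max(values)
        aggregated[key] = stats
    return aggregated


# Hypothetical rows, shaped like the cleaned car-data records
sample_rows = [
    {"model": "Acura CL 3", "model_year": "1998", "status": "Used",
     "price": "7000.00", "mileage": "60000"},
    {"model": "Acura CL 3", "model_year": "1998", "status": "Used",
     "price": "9000.00", "mileage": "120000"},
    {"model": "Acura TLX A-Spec", "model_year": "2023", "status": "New",
     "price": "50195.00", "mileage": "0"},
]
result = aggregate_rows(sample_rows)
print(result["Acura CL 3_1998_Used"])
```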

Run the aggregate() function, which creates a SageMaker training job to run the Spark application:

# Execute the aggregate function
aggregate()

As a result, SageMaker creates a training job to run the Spark application defined earlier. It creates a Spark runtime environment using the sagemaker-spark-processing image.

We use SageMaker Training jobs here to run our Spark feature processing application. With SageMaker Training, you can reduce startup times to 1 minute or less by using warm pooling, which is unavailable in SageMaker Processing. This makes SageMaker Training better optimized for short batch jobs like feature processing where startup time is important.
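As a hedged sketch of how warm pools could be requested, the @remote decorator in the SageMaker Python SDK accepts a keep_alive_period_in_seconds argument. The settings dictionary, the validate_keep_alive helper, the 600-second value, and the 3600-second upper bound used in the check are assumptions of this example, not values from the post.

```python
# Assumed upper bound for the warm pool keep-alive period, in seconds
MAX_KEEP_ALIVE_SECONDS = 3600


def validate_keep_alive(seconds: int) -> int:
    """Hypothetical guard: reject keep-alive values outside the assumed range."""
    if not 0 <= seconds <= MAX_KEEP_ALIVE_SECONDS:
        raise ValueError(f"keep-alive must be between 0 and {MAX_KEEP_ALIVE_SECONDS} seconds")
    return seconds


# Keys mirror keyword arguments of the @remote decorator
remote_settings = {
    "instance_type": "ml.m5.xlarge",
    # Assumption: keep the provisioned instance warm for 10 minutes between runs
    "keep_alive_period_in_seconds": validate_keep_alive(600),
}

# Intended usage (not executed here, requires the SageMaker SDK and AWS access):
# @remote(spark_config=SparkConfig(), **remote_settings)
print(remote_settings)
```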

To view the details, on the SageMaker console, choose Training jobs under Training in the navigation pane, then choose the job with the name aggregate-<timestamp>.

Image shows the Sagemaker training job

The output of the aggregate() function includes telemetry code. Inside the output, you will see the aggregated data as follows:

+--------------------+------------------+---------+------------------+-----------+--------+--------+-----------+
|   model_year_status|         avg_price|max_price|       avg_mileage|max_mileage|avg_msrp|max_msrp|ingest_time|
+--------------------+------------------+---------+------------------+-----------+--------+--------+-----------+
|Acura CL 3.0_1997...|            7950.0|  7950.00|          100934.0|    100934 |    null|      NA| 1686634807|
|Acura CL 3.2 Type...|            6795.0|  7591.00|          118692.5|    135760 |    null|      NA| 1686634807|
|Acura CL 3_1998_Used|            9899.0|  9899.00|           63000.0|     63000 |    null|      NA| 1686634807|
|Acura ILX 2.0L Te...|         14014.125| 18995.00|         95534.875|     89103 |    null|      NA| 1686634807|
|Acura ILX 2.0L Te...|           15008.2| 16998.00|           94935.0|     88449 |    null|      NA| 1686634807|
|Acura ILX 2.0L Te...|           16394.6| 19985.00|           97719.4|     80000 |    null|      NA| 1686634807|
|Acura ILX 2.0L w/...|14567.181818181818| 16999.00| 96624.72727272728|     98919 |    null|      NA| 1686634807|
|Acura ILX 2.0L w/...|           16673.4| 18995.00|           84848.6|     96637 |    null|      NA| 1686634807|
|Acura ILX 2.0L w/...|12580.333333333334| 14546.00|100207.33333333333|     95782 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|         14565.375| 17590.00|         92941.125|     81842 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|           14877.9|  9995.00|           99739.5|     89252 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|           15659.5| 15660.00|           82136.0|     89942 |    null|      NA| 1686634807|
|Acura ILX 2.0L_20...|17121.785714285714| 20990.00| 78278.14285714286|     96067 |    null|      NA| 1686634807|
|Acura ILX 2.4L (A...|           17846.0| 21995.00|          101558.0|     85974 |    null|      NA| 1686634807|
|Acura ILX 2.4L Pr...|           16327.0| 16995.00|           85238.0|     95356 |    null|      NA| 1686634807|
|Acura ILX 2.4L w/...|           12846.0| 12846.00|           75209.0|     75209 |    null|      NA| 1686634807|
|Acura ILX 2.4L_20...|           18998.0| 18998.00|           51002.0|     51002 |    null|      NA| 1686634807|
|Acura ILX 2.4L_20...|17908.615384615383| 19316.00| 74325.38461538461|     89116 |    null|      NA| 1686634807|
|Acura ILX 4DR SDN...|           18995.0| 18995.00|           37017.0|     37017 |    null|      NA| 1686634807|
|Acura ILX 8-SPD_2...|           24995.0| 24995.00|           22334.0|     22334 |    null|      NA| 1686634807|
+--------------------+------------------+---------+------------------+-----------+--------+--------+-----------+
only showing top 20 rows
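In this sample output, a few max values are smaller than the corresponding averages (for example, a max_price of 9995.00 next to an avg_price of 14877.9). One possible explanation, offered here as an assumption rather than something the post states, is that the source columns are typed as strings, so MAX compares text character by character. The following sketch reproduces that effect in plain Python with hypothetical values.

```python
# Hypothetical prices stored as strings, as in the car-data feature group
prices = ["9995.00", "17590.00", "14546.00"]

# Text comparison: "9..." sorts after "1...", so the smallest number wins
string_max = max(prices)

# Numeric comparison after casting gives the expected result
numeric_max = max(float(p) for p in prices)

print(string_max)
print(numeric_max)
```

If numeric ordering is wanted, casting inside the query (for example, CAST(price AS DOUBLE)) is one option to consider; values such as NA would then become null.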

When the training job is complete, you should see the following output:

06-13 05:40 smspark-submit INFO     spark submit was successful. primary node exiting.
Training seconds: 153
Billable seconds: 153

Operationalize the feature processor via SageMaker pipelines

In this section, we demonstrate how to operationalize the feature processor by promoting it to a SageMaker pipeline and scheduling runs.

First, upload the file containing the feature processing logic (car-data-ingestion.py in the following code) to Amazon S3:

car_data_s3_uri = s3_path_join("s3://", sagemaker_session.default_bucket(),
                               'transformation_code', 'car-data-ingestion.py')
S3Uploader.upload(local_path='car-data-ingestion.py', desired_s3_uri=car_data_s3_uri)
print(car_data_s3_uri)

Next, create a Feature Processor pipeline car_data_pipeline using the .to_pipeline() function:

car_data_pipeline_name = f"{CAR_SALES_FG_NAME}-ingestion-pipeline"
car_data_pipeline_arn = fp.to_pipeline(pipeline_name=car_data_pipeline_name,
                                      step=transform,
                                      transformation_code=TransformationCode(s3_uri=car_data_s3_uri) )
print(f"Created SageMaker Pipeline: {car_data_pipeline_arn}.")

To run the pipeline, use the following code:

car_data_pipeline_execution_arn = fp.execute(pipeline_name=car_data_pipeline_name)
print(f"Started an execution with execution arn: {car_data_pipeline_execution_arn}")

Similarly, you can create a pipeline for aggregated features called car_data_aggregated_pipeline and start a run.
Schedule car_data_aggregated_pipeline to run every 24 hours:

fp.schedule(pipeline_name=car_data_aggregated_pipeline_name,
           schedule_expression="rate(24 hours)", state="ENABLED")
print(f"Created a schedule.")

In the output section, you will see the ARN of the pipeline, the pipeline execution role, and the schedule details:

{'pipeline_arn': 'arn:aws:sagemaker:us-west-2:416578662734:pipeline/car-data-aggregated-ingestion-pipeline',
 'pipeline_execution_role_arn': 'arn:aws:iam::416578662734:role/service-role/AmazonSageMaker-ExecutionRole-20230612T120731',
 'schedule_arn': 'arn:aws:scheduler:us-west-2:416578662734:schedule/default/car-data-aggregated-ingestion-pipeline',
 'schedule_expression': 'rate(24 hours)',
 'schedule_state': 'ENABLED',
 'schedule_start_date': '2023-06-13T06:05:17Z',
 'schedule_role': 'arn:aws:iam::416578662734:role/service-role/AmazonSageMaker-ExecutionRole-20230612T120731'}
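To make the rate(24 hours) expression concrete, the following sketch lists run times at fixed 24-hour intervals starting from the schedule_start_date shown above. It is plain date arithmetic under that fixed-interval assumption; the actual trigger times are decided by the scheduler, and the helper name upcoming_runs is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def upcoming_runs(start_date: str, every_hours: int, count: int):
    """Return `count` run times spaced `every_hours` apart, starting at start_date (UTC)."""
    start = datetime.strptime(start_date, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return [start + timedelta(hours=every_hours * i) for i in range(count)]


# schedule_start_date copied from the output above
runs = upcoming_runs("2023-06-13T06:05:17Z", every_hours=24, count=3)
for run in runs:
    print(run.isoformat())
```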

To get all the Feature Processor pipelines in this account, use the list_pipelines() function on the Feature Processor:

fp.list_pipelines()

The output is as follows:

[{'pipeline_name': 'car-data-aggregated-ingestion-pipeline'},
 {'pipeline_name': 'car-data-ingestion-pipeline'}]

We have successfully created SageMaker Feature Processor pipelines.

Explore feature processing pipelines and ML lineage

In SageMaker Studio, complete the following steps:

On the SageMaker Studio console, on the Home menu, choose Pipelines.

Image of Sagemaker Studio home tab highlighting pipelines option

You should see two pipelines created: car-data-ingestion-pipeline and car-data-aggregated-ingestion-pipeline.

Image of Sagemaker Studio pipelines with the list of pipelines

Choose car-data-ingestion-pipeline.

It shows the run details on the Executions tab.

Image of Sagemaker Studio of the car data ingestion pipeline

To view the feature group populated by the pipeline, choose Feature Store under Data and choose car-data.

Image of Sagemaker Studio home highlighting data

You will see the two feature groups we created in the previous steps.

Image of Sagemaker Studio with feature groups created

Choose the car-data feature group.

You will see the feature details on the Features tab.

Image of Sagemaker Studio with feature group and the features in the group

View pipeline runs

To view the pipeline runs, complete the following steps:

On the Pipeline Executions tab, select car-data-ingestion-pipeline.

This will show all the runs.

Image shows the Sagemaker Feature group tab of the pipeline executions

Choose one of the links to see the details of the run.

Image shows the sagemaker UI with the pipelines in execution

To view lineage, choose Lineage.

The full lineage for car-data shows the input data source car_data.csv and upstream entities. The lineage for car-data-aggregated shows the input car-data feature group.

Image of Sagemaker UI of the feature group of car data

Choose Load features and then choose Query upstream lineage on car-data and car-data-ingestion-pipeline to see all the upstream entities.

The full lineage for car-data feature group should look like the following screenshot.

Image shows the Sagemaker feature store with car lineage

Similarly, the lineage for the car-data-aggregated feature group should look like the following screenshot.

Image shows the aggregated feature group from Sagemaker Feature Store UI

SageMaker Studio provides a single environment to track scheduled pipelines, view runs, explore lineage, and view the feature processing code.

The aggregated features such as average price, max price, average mileage, and more in the car-data-aggregated feature group provide insight into the nature of the data. You can also use these features as a dataset to train a model to predict car prices, or for other operations. However, training the model is out of scope for this post, which focuses on demonstrating the SageMaker Feature Store capabilities for feature engineering.

Clean up

Don't forget to clean up the resources created as part of this post to avoid incurring ongoing charges.

Disable the scheduled pipeline via the fp.schedule() method with the state parameter set to DISABLED:

# Disable the scheduled pipeline
fp.schedule(
    pipeline_name=car_data_aggregated_pipeline_name,
    schedule_expression="rate(24 hours)",
    state="DISABLED",
)

Delete both feature groups:

# Delete feature groups
car_sales_fg.delete()
agg_car_sales_fg.delete()

The data residing in the S3 bucket and offline feature store can incur costs, so you should delete them to avoid any charges.

Delete the S3 objects.
Delete the records from the feature store.

Conclusion

In this post, we demonstrated how a car sales company used SageMaker Feature Store Feature Processor to gain valuable insights from their raw sales data by:

Ingesting and transforming batch data at scale using Spark
Operationalizing feature engineering workflows via SageMaker pipelines
Providing lineage tracking and a single environment to monitor pipelines and explore features
Preparing aggregated features optimized for training ML models

By following these steps, the company was able to transform previously unusable data into structured features that could then be used to train a model to predict car prices. SageMaker Feature Store enabled them to focus on feature engineering rather than the underlying infrastructure.

We hope this post helps you unlock valuable ML insights from your own data using SageMaker Feature Store Feature Processor!

For more information, refer to Feature Processing and the SageMaker example Amazon SageMaker Feature Store: Feature Processor Introduction.

About the authors

Dhaval Shah is a Senior Solutions Architect at AWS, specializing in machine learning. With a strong focus on digital native businesses, he empowers customers to leverage AWS and drive their business growth. As an ML enthusiast, Dhaval is driven by his passion for creating impactful solutions that bring positive change. In his leisure time, he indulges in his love for travel and cherishes quality moments with his family.

Ninad Joshi is a Senior Solutions Architect at AWS, helping global AWS customers design secure, scalable, and cost-effective solutions in the cloud to solve their complex real-world business challenges. His work in machine learning (ML) covers a wide range of AI/ML use cases, with a primary focus on end-to-end ML, natural language processing, and computer vision. Prior to joining AWS, Ninad worked as a software developer for 12+ years. Outside of his professional endeavors, Ninad enjoys playing chess and exploring different gambits.

Source: https://aws.amazon.com/blogs/machine-learning/unlock-ml-insights-using-the-amazon-sagemaker-feature-store-feature-processor/

Timestamp: September 19, 2023