Urmăriți-vă experimentele ML cap la cap cu Data Version Control și Amazon SageMaker Experiments PlatoBlockchain Data Intelligence. Căutare verticală. Ai.

Urmăriți-vă experimentele ML cap la cap cu Data Version Control și Amazon SageMaker Experiments

Data scientists often work towards understanding the effects of various data preprocessing and feature engineering strategies in combination with different model architectures and hyperparameters. Doing so requires you to cover large parameter spaces iteratively, and it can be overwhelming to keep track of previously run configurations and results while keeping experiments reproducible.

This post walks you through an example of how to track your experiments across code, data, artifacts, and metrics by using Experimente Amazon SageMaker în legătură cu Controlul versiunii datelor (DVC). We show how you can use DVC side by side with Amazon SageMaker processing and training jobs. We train different CatBoost models on the California housing dataset from the StatLib repository, and change holdout strategies while keeping track of the data version with DVC. In each individual experiment, we track input and output artifacts, code, and metrics using SageMaker Experiments.

Experimentele SageMaker

SageMaker Experiments is an AWS service for tracking machine learning (ML) experiments. The SageMaker Experiments Python SDK is a high-level interface to this service that helps you track experiment information using Python.

The goal of SageMaker Experiments is to make it as simple as possible to create experiments, populate them with trials, add tracking and lineage information, and run analytics across trials and experiments.

When discussing SageMaker Experiments, we refer to the following concepts:

  • Experiment – A collection of related trials. You add trials to an experiment that you want to compare together.
  • Proces – A description of a multi-step ML workflow. Each step in the workflow is described by a trial component.
  • Trial component – A description of a single step in an ML workflow, such as data cleaning, feature extraction, model training, or model evaluation.
  • Urmăritor – A Python context manager for logging information about a single trial component (for example, parameters, metrics, or artifacts).

Controlul versiunii datelor

Data Version Control (DVC) is a new type of data versioning, workflow, and experiment management software that builds upon merge (although it can work standalone). DVC reduces the gap between established engineering toolsets and data science needs, allowing you to take advantage of new caracteristici while reusing existing skills and intuition.

Data science experiment sharing and collaboration can be done through a regular Git flow (commits, branching, tagging, pull requests) the same way it works for software engineers. With Git and DVC, data science and ML teams can version experiments, manage large datasets, and make projects reproducible.

DVC has the following features:

  • DVC is a gratuit, open-source Linie de comanda instrument.
  • DVC works on top of Git repositories and has a similar command line interface and flow as Git. DVC can also work standalone, but without versiunilor capacităţi.
  • Data versioning is enabled by replacing large files, dataset directories, ML models, and so on with small metafiles (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.
  • You can use on-premises or cloud storage to store the project’s data separate from its code base. This is how data scientists can transfer large datasets or share a GPU-trained model with others.
  • DVC makes data science projects reproducible by creating lightweight conducte using implicit dependency graphs, and by codifying the data and artifacts involved.
  • DVC is platform agnostic. It runs on all major operating systems (Linux, macOS, and Windows), and works independently of the programming languages (Python, R, Julia, shell scripts, and so on) or ML libraries (Keras, TensorFlow, PyTorch, Scipy, and more) used in the project.
  • DVC is quick to instala and doesn’t require special infrastructure, nor does it depend on APIs or external services. It’s a standalone CLI tool.

SageMaker Experiments and DVC sample

Următoarele Eșantion GitHub shows how to use DVC within the SageMaker environment. In particular, we look at how to build a custom image with DVC libraries installed by default to provide a consistent development environment to your data scientists in Amazon SageMaker Studio, and how to run DVC alongside SageMaker managed infrastructure for processing and training. Furthermore, we show how to enrich SageMaker tracking information with data versioning information from DVC, and visualize them within the Studio console.

Următoarea diagramă ilustrează arhitectura soluției și fluxul de lucru.

Build a custom Studio image with DVC already installed

În acest GitHub depozit, we explain how to create a custom image for Studio that has DVC already installed. The advantage of creating an image and making it available to all Studio users is that it creates a consistent environment for the Studio users, which they could also run locally. Although the sample is based on AWS Cloud9, you can also build the container on your local machine as long as you have Docker installed and running. This sample is based on the following Dockerfile și environment.yml. The resulting Docker image is stored in Registrul Amazon de containere elastice (Amazon EMR) in your AWS account. See the following code:

# Login to ECR
aws --region ${REGION} ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom

# Create the ECR repository
aws --region ${REGION} ecr create-repository --repository-name smstudio-custom

# Build the image - it might take a few minutes to complete this step
docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}

# Push the image to ECR
docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}

Puteți acum create a new Studio domain or update an existing Studio domain that has access to the newly created Docker image.

Noi folosim Kit AWS Cloud Development (AWS CDK) to create the following resources via Formarea AWS Cloud:

  • A SageMaker execution role with the right permissions to your new or existing Studio domain
  • A SageMaker image and SageMaker image version from the Docker image conda-env-dvc-kernel pe care le-am creat mai devreme
  • An AppImageConfig that specifies how the kernel gateway should be configured
  • A Studio user (data-scientist-dvc) with the correct SageMaker execution role and the custom Studio image available to it

Pentru instrucțiuni detaliate, consultați Associate a custom image to SageMaker Studio.

Run the lab

To run the lab, complete the following steps:

  1. In the Studio domain, launch Studio for the data-scientist-dvc utilizator.
  2. Choose the Git icon, then choose Clonează un depozit.
    Clonează un depozit
  3. Enter the URL of the repository (https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo) și alegeți Clone.Clone a repo button
  4. In the file browser, choose the amazon-sagemaker-experiments-dvc-demo repertoriu.
  5. Deschideți dvc_sagemaker_script_mode.ipynb caiet.
  6. Pentru Imagine personalizată, choose the image conda-env-dvc-kernel.
  7. Alege Selectați.
    conda-env-dvc-kernel

Configure DVC for data versioning

We create a subdirectory where we prepare the data: sagemaker-dvc-sample. Within this subdirectory, we initialize a new Git repository and set the remote to a repository we create in AWS CodeCommit. The goal is to have DVC configurations and files for data tracking versioned in this repository. However, Git offers native capabilities to manage subprojects via, for example, git submodules and git subtrees, and you can extend this sample to use any of the aforementioned tools that best fit your workflow.

The main advantage of using CodeCommit with SageMaker in our case is its integration with Gestionarea identității și accesului AWS (IAM) for authentication and authorization, meaning we can use IAM roles to push and pull data without the need to fetch credentials (or SSH keys). Setting the appropriate permissions on the SageMaker execution role also allows the Studio notebook and the SageMaker training and processing job to interact securely with CodeCommit.

Although you can replace CodeCommit with any other source control service, such as GitHub, Gitlab, or Bitbucket, you need consider how to handle the credentials for your system. One possibility is to store these credentials on Manager de secrete AWS and fetch them at run time from the Studio notebook as well as from the SageMaker processing and training jobs.

Init DVC

Process and train with DVC and SageMaker

In this section, we explore two different approaches to tackle our problem and how we can keep track of the two tests using SageMaker Experiments according to the high-level conceptual architecture we showed you earlier.

Set up a SageMaker experiment

To track this test in SageMaker, we need to create an experiment. We need to also define the trial within the experiment. For the sake of simplicity, we just consider one trial for the experiment, but you can have any number of trials within an experiment, for example, if you want to test different algorithms.

We create an experiment named DEMO-sagemaker-experiments-dvc with two trials, dvc-trial-single-file și dvc-trial-multi-files, each representing a different version of the dataset.

Let’s create the DEMO-sagemaker-experiments-dvc experiment:

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

experiment_name = 'DEMO-sagemaker-experiments-dvc'

# create the experiment if it doesn't exist
try:
    my_experiment = Experiment.load(experiment_name=experiment_name)
    print("existing experiment loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_experiment = Experiment.create(
            experiment_name = experiment_name,
            description = "How to integrate DVC"
        )
        print("new experiment created")
    else:
        print(f"Unexpected {ex}=, {type(ex)}")
        print("Dont go forward!")
        raise

Test 1: Generate single files for training and validation

In this section, we create a processing script that fetches the raw data directly from Serviciul Amazon de stocare simplă (Amazon S3) as input; processes it to create the train, validation, and test datasets; and stores the results back to Amazon S3 using DVC. Furthermore, we show how you can track output artifacts generated by DVC with SageMaker when running processing and training jobs and via SageMaker Experiments.

First, we create the dvc-trial-single-file trial and add it to the DEMO-sagemaker-experiments-dvc experiment. By doing so, we keep all trial components related to this test organized in a meaningful way.

first_trial_name = "dvc-trial-single-file"

try:
    my_first_trial = Trial.load(trial_name=first_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_first_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=first_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected {ex}=, {type(ex)}")
        print("Dont go forward!")
        raise

Use DVC in a SageMaker processing job to create the single file version

In this section, we create a processing script that gets the raw data directly from Amazon S3 as input using the managed data loading capability of SageMaker; processes it to create the train, validation, and test datasets; and stores the results back to Amazon S3 using DVC. It’s very important to understand that when using DVC to store data to Amazon S3 (or pull data from Amazon S3), we’re losing SageMaker managed data loading capabilities, which can potentially have an impact on performance and costs of our processing and training jobs, especially when working with very large datasets. For more information on the different SageMaker native input mode capabilities, refer to Accesați datele de antrenament.

Finally, we unify DVC tracking capabilities with SageMaker tracking capabilities when running processing jobs via SageMaker Experiments.

The processing script expects the address of the Git repository and the branch we want to create to store the DVC metadata passed via environmental variables. The datasets themselves are stored in Amazon S3 by DVC. Although environmental variables are automatically tracked in SageMaker Experiments and visible in the trial component parameters, we might want to enrich the trial components with further information, which then become available for visualization in the Studio UI using a tracker object. In our case, the trial components parameters include the following:

  • DVC_REPO_URL
  • DVC_BRANCH
  • USER
  • data_commit_hash
  • train_test_split_ratio

The preprocessing script clones the Git repository; generates the train, validation, and test datasets; and syncs it using DVC. As mentioned earlier, when using DVC, we can’t take advantage of native SageMaker data loading capabilities. Aside from the performance penalties we might suffer on large datasets, we also lose the automatic tracking capabilities for the output artifacts. However, thanks to the tracker and the DVC Python API, we can compensate for these shortcomings, retrieve such information at run time, and store it in the trial component with little effort. The added value by doing so is to have in single view of the input and output artifacts that belong to this specific processing job.

The full preprocessing Python script is available in the GitHub repo.

with Tracker.load() as tracker:
    tracker.log_parameters({"data_commit_hash": commit_hash})
    for file_type in file_types:
        path = dvc.api.get_url(
            f"{data_path}/{file_type}/california_{file_type}.csv",
            repo=dvc_repo_url,
            rev=dvc_branch
        )
        tracker.log_output(name=f"california_{file_type}",value=path)

SageMaker gives us the possibility to run our processing script on container images managed by AWS that are optimized to run on the AWS infrastructure. If our script requires additional dependencies, we can supply a requirements.txt file. When we start the processing job, SageMaker uses pip-install to install all the libraries we need (for example, DVC-related libraries). If you need to have a tighter control of all libraries installed on the containers, you can bring your own container in SageMaker, for example for prelucrare si antrenament.

We have now all the ingredients to run our SageMaker processing job:

  • A processing script that can process several arguments (--train-test-split-ratio) and two environmental variables (DVC_REPO_URL și DVC_BRANCH)
  • A requiremets.txt fişier
  • A Git repository (in CodeCommit)
  • A SageMaker experiment and trial
from sagemaker.processing import FrameworkProcessor, ProcessingInput
from sagemaker.sklearn.estimator import SKLearn

dvc_repo_url = "codecommit::{}://sagemaker-dvc-sample".format(region)
dvc_branch = my_first_trial.trial_name

script_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version='0.23-1',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    env={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker"
    },
    role=role
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name
}

We then run the processing job with the preprocessing-experiment.py scenariu, experiment_config, dvc_repo_url, și dvc_branch we defined earlier.

%%time

script_processor.run(
    code='./source_dir/preprocessing-experiment.py',
    dependencies=['./source_dir/requirements.txt'],
    inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
    experiment_config=experiment_config,
    arguments=["--train-test-split-ratio", "0.2"]
)

The processing job takes approximately 5 minutes to complete. Now you can view the trial details for the single file dataset.

The following screenshot shows where you can find the stored information within Studio. Note the values for dvc-trial-single-file in DVC_BRANCH, DVC_REPO_URL, și data_commit_hash pe parametrii tab.

SageMaker Experiments parameters tab

Also note the input and output details on the Artefactele tab.

SageMaker Experiments artifacts tab

Create an estimator and fit the model with single file data version

To use DVC integration inside a SageMaker training job, we pass a dvc_repo_url și dvc_branch as environmental variables when you create the Estimator object.

We train on the dvc-trial-single-file branch first.

When pulling data with DVC, we use the following dataset structure:

dataset
    |-- train
    |   |-- california_train.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation.csv

Now we create a Scikit-learn Estimator using the SageMaker Python SDK. This allows us to specify the following:

  • The path to the Python source file, which should be run as the entry point to training.
  • The IAM role that controls permissions for accessing Amazon S3 and CodeCommit data and running SageMaker functions.
  • A list of dictionaries that define the metrics used to evaluate the training jobs.
  • The number and type of training instances. We use one ml.m5.large instance.
  • Hyperparameters that are used for training.
  • Environment variables to use during the training job. We use DVC_REPO_URL, DVC_BRANCH, și USER.
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters={ 
        "learning_rate" : 1,
        "depth": 6
    }
estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    base_job_name='training-with-dvc-data',
    environment={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker"
    }
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name
}

We call the fit method of the Estimator with the experiment_config we defined earlier to start the training.

%%time
estimator.fit(experiment_config=experiment_config)

The training job takes approximately 5 minutes to complete. The logs show those lines, indicating the files pulled by DVC:

Running dvc pull command
A       train/california_train.csv
A       test/california_test.csv
A       validation/california_validation.csv
3 files added and 3 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']
Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']

Test 2: Generate multiple files for training and validation

Creăm un nou dvc-trial-multi-files trial and add it to the current DEMO-sagemaker-experiments-dvc experiment.

second_trial_name = "dvc-trial-multi-files"
try:
    my_second_trial = Trial.load(trial_name=second_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_second_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=second_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected {ex}=, {type(ex)}")
        print("Dont go forward!")
        raise

Differently from the first processing script, we now create out of the original dataset multiple files for training and validation and store the DVC metadata in a different branch.

You can explore the second preprocessing Python script on GitHub.

%%time

script_processor.run(
    code='./source_dir/preprocessing-experiment-multifiles.py',
    dependencies=['./source_dir/requirements.txt'],
    inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
    experiment_config=experiment_config,
    arguments=["--train-test-split-ratio", "0.1"]
)

The processing job takes approximately 5 minutes to complete. Now you can view the trial details for the multi-file dataset.

The following screenshots show where you can find the stored information within SageMaker Experiments in the Componente de încercare section within the Studio UI. Note the values for dvc-trial-multi-files in DVC_BRANCH, DVC_REPO_URL, și data_commit_hash pe parametrii tab.

SageMaker multi files experiments parameters tab

You can also review the input and output details on the Artefactele tab.

SageMaker multi files experiments artifacts tab

We now train on the dvc-trial-multi-files branch. When pulling data with DVC, we use the following dataset structure:

dataset
    |-- train
    |   |-- california_train_1.csv
    |   |-- california_train_2.csv
    |   |-- california_train_3.csv
    |   |-- california_train_4.csv
    |   |-- california_train_5.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation_1.csv
    |   |-- california_validation_2.csv
    |   |-- california_validation_3.csv

Similar as we did before, we create a new Scikit-learn Estimator with the trial name dvc-trial-multi-files and start the training job.

%%time

estimator.fit(experiment_config=experiment_config)

The training job takes approximately 5 minutes to complete. On the training job logs output to the notebook, you can see those lines, indicating the files pulled by DVC:

Running dvc pull command
A       validation/california_validation_2.csv
A       validation/california_validation_1.csv
A       validation/california_validation_3.csv
A       train/california_train_4.csv
A       train/california_train_5.csv
A       train/california_train_2.csv
A       train/california_train_3.csv
A       train/california_train_1.csv
A       test/california_test.csv
9 files added and 9 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']
Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']

Host your model in SageMaker

After you train your ML model, you can deploy it using SageMaker. To deploy a persistent, real-time endpoint that makes one prediction at a time, we use Servicii de găzduire în timp real SageMaker.

from sagemaker.serializers import CSVSerializer

predictor = estimator.deploy(1, "ml.t2.medium", serializer=CSVSerializer())

First, we get the latest test dataset locally on the development notebook in Studio. For this purpose, we can use dvc.api.read() to load the raw data that was stored in Amazon S3 by the SageMaker processing job.

import io
import dvc.api

raw = dvc.api.read(
    "dataset/test/california_test.csv",
    repo=dvc_repo_url,
    rev=dvc_branch
)

Then we prepare the data using Pandas, load a test CSV file, and call predictor.predict to invoke the SageMaker endpoint created earlier, with data, and get predictions.

test = pd.read_csv(io.StringIO(raw), sep=",", header=None)
X_test = test.iloc[:, 1:].values
y_test = test.iloc[:, 0:1].values

predicted = predictor.predict(X_test)
for i in range(len(predicted)-1):
    print(f"predicted: {predicted[i]}, actual: {y_test[i][0]}")

Delete the endpoint

You should delete endpoints when they’re no longer in use, because they’re billed by the time deployed (for more information, see Prețuri Amazon SageMaker). Make sure to delete the endpoint to avoid unexpected costs.

predictor.delete_endpoint()

A curăța

Before you remove all the resources you created, make sure that all apps are deleted from the data-scientist-dvc user, including all KernelGateway apps, as well as the default JupiterServer app.

Then you can destroy the AWS CDK stack by running the following command:

cdk destroy

If you used an existing domain, also run the following commands:

# inject your DOMAIN_ID into the configuration file
sed -i 's/<your-sagemaker-studio-domain-id>/'"$DOMAIN_ID"'/' ../update-domain-no-custom-images.json
# update the sagemaker studio domain
aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-no-custom-images.json

Concluzie

In this post, you walked through an example of how to track your experiments across code, data, artifacts, and metrics by using SageMaker Experiments and SageMaker processing and training jobs in conjunction with DVC. We created a Docker image containing DVC, which was required for Studio as the development notebook, and showed how you can use processing and training jobs with DVC. We prepared two versions of the data and used DVC to manage it with Git. Then you used SageMaker Experiments to track the processing and training with the two versions of the data in order to have a unified view of parameters, artifacts, and metrics in a single pane of glass. Finally, you deployed the model to a SageMaker endpoint and used a testing dataset from the second dataset version to invoke the SageMaker endpoint and get predictions.

As next step, you can extend the existing notebook and introduce your own feature engineering strategy and use DVC and SageMaker to run your experiments. Let’s go build!

Pentru citiri suplimentare, consultați următoarele resurse:


Despre Autori

Paolo Di FrancescoPaolo Di Francesco este arhitect de soluții la AWS. Are experiență în telecomunicații și inginerie software. Este pasionat de machine learning și se concentrează în prezent pe utilizarea experienței sale pentru a ajuta clienții să-și atingă obiectivele pe AWS, în special în discuțiile despre MLOps. În afara serviciului, îi place să joace fotbal și să citească.

Eitan SelaEitan Sela este arhitect de soluții specializat în învățare automată cu Amazon Web Services. El lucrează cu clienții AWS pentru a oferi îndrumare și asistență tehnică, ajutându-i să creeze și să opereze soluții de învățare automată pe AWS. În timpul liber, lui Eitan îi place să facă jogging și să citească cele mai recente articole de învățare automată.

Timestamp-ul:

Mai mult de la Învățare automată AWS