Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints

As AI adoption is accelerating across the industry, customers are building sophisticated models that take advantage of new scientific breakthroughs in deep learning. These next-generation models allow you to achieve state-of-the-art, human-like performance in the fields of natural language processing (NLP), computer vision, speech recognition, medical research, cybersecurity, protein structure prediction, and many others. For instance, large language models like GPT-3, OPT, and BLOOM can translate, summarize, and write text with human-like nuances. In the computer vision space, text-to-image diffusion models like DALL-E and Imagen can create photorealistic images from natural language with a higher level of visual and language understanding from the world around us. These multi-modal models provide richer features for various downstream tasks and the ability to fine-tune them for specific domains, and they bring powerful business opportunities to our customers.

These deep learning models keep growing in terms of size, and typically contain billions of model parameters to scale model performance for a wide variety of tasks, such as image generation, text summarization, language translation, and more. There is also a need to customize these models to deliver a hyper-personalized experience to individuals. As a result, a greater number of models are being developed by fine-tuning these models for various downstream tasks. To meet the latency and throughput goals of AI applications, GPU instances are preferred over CPU instances (given the computational power GPUs offer). However, GPU instances are expensive, and costs can add up if you’re deploying more than 10 models. Although these models can potentially power impactful AI applications, it can be challenging to scale them cost-effectively because of their size and sheer number.

Amazon SageMaker multi-model endpoints (MMEs) provide a scalable and cost-effective way to deploy a large number of deep learning models. MMEs are a popular choice among customers like Zendesk, Veeva, and AT&T for hosting hundreds of CPU-based models. Previously, you had limited options to deploy hundreds of deep learning models that needed accelerated compute with GPUs. Today, we announce MME support for GPU. Now you can deploy thousands of deep learning models behind one SageMaker endpoint. MMEs can now run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can significantly reduce cost and achieve the best price performance.

In this post, we show how to run multiple deep learning models on GPU with SageMaker MMEs.

SageMaker MME

SageMaker MMEs enable you to deploy multiple models behind a single inference endpoint that may contain one or more instances. With MMEs, each instance is managed to load and serve multiple models. MMEs enable you to break the linearly increasing cost of hosting multiple models and reuse infrastructure across all the models.

The following diagram illustrates the architecture of a SageMaker MME.

The SageMaker MME dynamically downloads models from Amazon Simple Storage Service (Amazon S3) when invoked, instead of downloading all the models when the endpoint is first created. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download and load step is skipped and the model returns the inferences with low latency. For example, assume you have a model that is only used a few times a day. It is automatically loaded on demand, whereas frequently accessed models are retained in memory and invoked with consistently low latency.

SageMaker MMEs with GPU support

SageMaker MMEs with GPU work using NVIDIA Triton Inference Server. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance. Triton supports all major training and inference frameworks, such as TensorFlow, NVIDIA® TensorRT™, PyTorch, MXNet, Python, ONNX, XGBoost, Scikit-learn, RandomForest, OpenVINO, custom C++, and more. It offers dynamic batching, concurrent model execution, post-training quantization, and optimal model configuration to achieve high-performance inference. Additionally, NVIDIA Triton Inference Server has been extended to implement the MME API contract, to integrate with MME.

The following diagram illustrates an MME workflow.

The workflow steps are as follows:

  1. The SageMaker MME receives an HTTP invocation request for a particular model using TargetModel in the request along with the payload.
  2. SageMaker routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker understands the traffic pattern across all the models behind the MME and smartly routes requests.
  3. SageMaker takes care of model management behind the endpoint. It dynamically loads models into the container’s memory and unloads unused models from the shared fleet of GPU instances to give the best price performance.
  4. SageMaker dynamically downloads models from Amazon S3 to the instance’s storage volume. If the invoked model isn’t available on the instance storage volume, the model is downloaded onto the instance storage volume. If the instance storage volume reaches capacity, SageMaker deletes any unused models from the storage volume.
  5. SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. If the model is already loaded in the container memory, subsequent requests are served faster because SageMaker doesn’t need to download and load it again.
  6. SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for the best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models.

SageMaker MMEs can horizontally scale using an auto scaling policy, and provision additional GPU compute instances based on metrics such as invocations per instance and GPU utilization to serve any traffic surge to MME endpoints.

Solution overview

In this post, we show you how to use the new features of SageMaker MMEs with GPU with a computer vision use case. For demonstration purposes, we use a ResNet-50 convolutional neural network pre-trained model that can classify images into 1,000 categories. We discuss how to do the following:

  • Use an NVIDIA Triton inference container on SageMaker MMEs, using different Triton model framework backends such as PyTorch and TensorRT
  • Convert a ResNet-50 model to an optimized TensorRT engine format and deploy it with a SageMaker MME
  • Set up auto scaling policies for the MME
  • Get insights into instance and invocation metrics using Amazon CloudWatch

Create model artifacts

This section walks through the steps to prepare a ResNet-50 pre-trained model to be deployed on a SageMaker MME using Triton Inference Server model configurations. You can reproduce all the steps using the step-by-step notebook on GitHub.

For this post, we demonstrate deployment with two models. However, you can prepare and deploy hundreds of models. The models may or may not share the same framework.

Prepare a PyTorch model

First, we load a pre-trained ResNet50 model using the torchvision models package. We save the model as a model.pt file in TorchScript optimized and serialized format. TorchScript compiles a forward pass of the ResNet50 model in eager mode with example inputs, so we pass one instance of an RGB image with three color channels and dimensions of 224 x 224.
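
The following is a minimal sketch of this step, assuming torch and torchvision are installed; the model.pt file name matches the model repository layout shown next:

import torch
import torchvision.models as models

# Load the pre-trained ResNet-50 model and switch it to evaluation mode
resnet50 = models.resnet50(pretrained=True)
resnet50.eval()

# Trace the forward pass with one example RGB input of shape (1, 3, 224, 224)
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(resnet50, example_input)

# Save the TorchScript model as model.pt for the Triton PyTorch backend
torch.jit.save(traced_model, "model.pt")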

Then we need to prepare the models for Triton Inference Server. The following code shows the model repository for the PyTorch framework backend. Triton uses the model.pt file placed in the model repository to serve predictions.

resnet
├── 1
│   └── model.pt
└── config.pbtxt

The model configuration file config.pbtxt must specify the name of the model (resnet), the platform and backend properties (pytorch_libtorch), max_batch_size (128), and the input and output tensors along with the data type (TYPE_FP32) information. Additionally, you can specify instance_group and dynamic_batching properties to achieve high-performance inference. See the following code:

name: "resnet"
platform: "pytorch_libtorch"
max_batch_size: 128
input {
  name: "INPUT__0"
  data_type: TYPE_FP32
  dims: 3
  dims: 224
  dims: 224
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP32
  dims: 1000
}
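
As an illustration of the optional instance_group and dynamic_batching properties mentioned above, the following additions to config.pbtxt are a hedged sketch; the batch sizes, queue delay, and instance count are placeholders to tune for your workload:

dynamic_batching {
  preferred_batch_size: [ 32, 64, 128 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]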

Prepare a TensorRT model

NVIDIA TensorRT is an SDK for high-performance deep learning inference, and includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. We use the command line tool trtexec to generate a TensorRT serialized engine from an ONNX model format. Complete the following steps to convert a ResNet-50 pre-trained model to NVIDIA TensorRT:

  1. Export the pre-trained ResNet-50 model into an ONNX format using torch.onnx. This step runs the model one time to trace its run with a sample input, and then exports the traced model to the specified file model.onnx.
  2. Use trtexec to create a TensorRT engine plan from the model.onnx file. You can optionally reduce the precision of floating-point computations, either by simply running them in 16-bit floating point, or by quantizing floating point values so that calculations can be performed using 8-bit integers.
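
The following is a minimal sketch of these two steps, assuming the pre-trained torchvision model and a trtexec binary from the TensorRT tooling; the dynamic shape flags are assumptions chosen to match the 3 x 224 x 224 input and the maximum batch size of 128 used in this post:

import torch
import torchvision.models as models

resnet50 = models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Step 1: trace the model once and export it to ONNX with a dynamic batch dimension
torch.onnx.export(
    resnet50,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Step 2: build a serialized TensorRT engine plan from the ONNX model
# (optionally add --fp16 to reduce the precision of floating-point computations)
!trtexec --onnx=model.onnx --saveEngine=model.plan --minShapes=input:1x3x224x224 --optShapes=input:128x3x224x224 --maxShapes=input:128x3x224x224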

The following code shows the model repository structure for the TensorRT model:

resnet
├── 1
│   └── model.plan
└── config.pbtxt

For the TensorRT model, we specify tensorrt_plan as the platform and provide the input tensor specifications for an image of dimension 224 x 224 with three color channels. The output tensor with 1,000 dimensions is of type TYPE_FP32, corresponding to the different object categories. See the following code:

name: "resnet"
platform: "tensorrt_plan"
max_batch_size: 128
input {
  name: "input"
  data_type: TYPE_FP32
  dims: 3
  dims: 224
  dims: 224
}
output {
  name: "output"
  data_type: TYPE_FP32
  dims: 1000
}
model_warmup {
    name: "bs128 Warmup"
    batch_size: 128
    inputs: {
        key: "input"
        value: {
            data_type: TYPE_FP32
            dims: 3
            dims: 224
            dims: 224
            zero_data: false
        }
    }
}

Store model artifacts in Amazon S3

SageMaker expects the model artifacts in .tar.gz format. They should also satisfy Triton container requirements such as model name, version, config.pbtxt files, and more. Package the folder containing the model file as a .tar.gz archive and upload it to Amazon S3:

!mkdir -p triton-serve-pt/resnet/1/
!mv -f workspace/model.pt triton-serve-pt/resnet/1/
!tar -C triton-serve-pt/ -czf resnet_pt_v0.tar.gz resnet
model_uri_pt = sagemaker_session.upload_data(path="resnet_pt_v0.tar.gz", key_prefix="resnet-mme-gpu")
!mkdir -p triton-serve-trt/resnet/1/
!mv -f workspace/model.plan triton-serve-trt/resnet/1/
!tar -C triton-serve-trt/ -czf resnet_trt_v0.tar.gz resnet
model_uri_trt = sagemaker_session.upload_data(path="resnet_trt_v0.tar.gz", key_prefix="resnet-mme-gpu")

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker MME.

Deploy models with an MME

We now deploy a ResNet-50 model with two different framework backends (PyTorch and TensorRT) to a SageMaker MME.

Note that you can deploy hundreds of models, and the models can use the same framework. They can also use different frameworks, as shown in this post.

We use the AWS SDK for Python (Boto3) APIs create_model, create_endpoint_config, and create_endpoint to create an MME.

Define the serving container

In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME uses to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker creates the endpoint with MME container specifications. We set the container with an image that supports deploying MMEs with GPU. In the following code, mme_triton_image_uri and model_data_url are placeholders for the Triton container image URI and the S3 model prefix:

container = {
    "Image": mme_triton_image_uri,  # placeholder: URI of a Triton inference container image that supports MMEs with GPU
    "ModelDataUrl": model_data_url,  # placeholder: S3 prefix that contains the model .tar.gz artifacts
    "Mode": "MultiModel",
}

Create a multi-model object

Use the SageMaker Boto3 client to create the model using the create_model API. We pass the container definition to the create model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
    ModelName=model_name,  # placeholder: name you choose for the SageMaker model
    ExecutionRoleArn=role,
    PrimaryContainer=container,
)

Define MME configurations

Create MME configurations using the create_endpoint_config Boto3 API. Specify an accelerated GPU compute instance in InstanceType (we use the g4dn.4xlarge instance type). We recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

Based on our findings, you can get better price performance on ML-optimized instances with a single GPU core. Therefore, the MME support for GPU feature is enabled only for single-GPU core instances. For a full list of supported instances, refer to Supported GPU Instance types.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,  # placeholder: name for the endpoint configuration
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.4xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 2,
            "ModelName": model_name,  # placeholder: same model name used in create_model
            "VariantName": "AllTraffic",
        }
    ],
)

Create the MME

With the preceding endpoint configuration, we create a SageMaker MME using the create_endpoint API. SageMaker creates the MME, launches the ml.g4dn.4xlarge ML compute instances, and deploys the PyTorch and TensorRT ResNet-50 models on them. See the following code:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,  # placeholder: name you choose for the MME
    EndpointConfigName=endpoint_config_name,
)
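
Endpoint creation takes several minutes. The following minimal sketch, using the same sm_client and the placeholder endpoint_name, waits until the MME is in service:

# Block until the endpoint status transitions to InService
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

# Confirm the endpoint status
describe_response = sm_client.describe_endpoint(EndpointName=endpoint_name)
print("Endpoint status:", describe_response["EndpointStatus"])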

Invoke the target model on the MME

After we create the endpoint, we can send an inference request to the MME using the invoke_endpoint API. We specify TargetModel in the invocation call and pass in the payload for each model type.
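
As a hedged sketch, the request payloads can be built in Triton’s standard inference request format; the input names (INPUT__0 and input), the FP32 data type, and the 3 x 224 x 224 shape come from the config.pbtxt files shown earlier, while the image preprocessing here is only a placeholder:

import numpy as np

# Placeholder input: a preprocessed RGB image batch of shape (1, 3, 224, 224)
img = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Payload for the PyTorch model (input name INPUT__0 from its config.pbtxt)
pt_payload = {
    "inputs": [
        {"name": "INPUT__0", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": img.flatten().tolist()}
    ]
}

# Payload for the TensorRT model (input name input from its config.pbtxt)
trt_payload = {
    "inputs": [
        {"name": "input", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": img.flatten().tolist()}
    ]
}

The following code is a sample invocation for the PyTorch model and the TensorRT model: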

runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(pt_payload),
    TargetModel="resnet_pt_v0.tar.gz",  # PyTorch model
)
runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(trt_payload),
    TargetModel="resnet_trt_v0.tar.gz",  # TensorRT model
)

Set up auto scaling policies for the GPU MME

SageMaker MMEs support automatic scaling for your hosted models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don’t pay for provisioned instances that you aren’t using.

In the following scaling policy, we use the custom metric GPUUtilization in the TargetTrackingScalingPolicyConfiguration and set a TargetValue of 60.0 for that metric. This auto scaling policy provisions additional instances up to MaxCapacity when GPU utilization exceeds 60%.

auto_scaling_client = boto3.client('application-autoscaling')

resource_id = 'endpoint/' + endpoint_name + '/variant/' + 'AllTraffic'  # endpoint_name is the placeholder used earlier
response = auto_scaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=5
)

response = auto_scaling_client.put_scaling_policy(
    PolicyName='GPUUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', 
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 60.0, 
        'CustomizedMetricSpecification':
        {
            'MetricName': 'GPUUtilization',
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name},
                {'Name': 'VariantName','Value': 'AllTraffic'}
            ],
            'Statistic': 'Average',
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 200 
    }
)

We recommend using GPUUtilization or InvocationsPerInstance to configure auto scaling policies for your MME. For more details, see Set Autoscaling Policies for Multi-Model Endpoint Deployments.
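
As a hedged alternative to the custom GPUUtilization metric above, the following sketch uses the predefined SageMakerVariantInvocationsPerInstance metric with the same scalable target; the target value of 200 invocations per instance is illustrative only:

response = auto_scaling_client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 200.0,  # illustrative: target invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 200
    }
)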

CloudWatch metrics for GPU MMEs

SageMaker MMEs provide the following instance-level metrics to monitor:

  • LoadedModelCount – Number of models loaded in the containers
  • GPUUtilization – Percentage of GPU units used by the containers
  • GPUMemoryUtilization – Percentage of GPU memory used by the containers
  • DiskUtilization – Percentage of disk space used by the containers

These metrics allow you to plan for effective utilization of GPU instance resources. In the following graph, we see that GPUMemoryUtilization was 38.3% when more than 16 ResNet-50 models were loaded in the container. The sum of each individual CPU core’s utilization (CPUUtilization) was 60.9%, and the percentage of memory used by the containers (MemoryUtilization) was 9.36%.
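
The following is a minimal sketch of pulling one of these metrics programmatically with the CloudWatch API, assuming the placeholder endpoint_name and the /aws/sagemaker/Endpoints namespace used in the scaling policy earlier:

import boto3
from datetime import datetime, timedelta

cloudwatch_client = boto3.client("cloudwatch")

# Average GPU memory utilization of the MME over the last hour, in 5-minute periods
metric_response = cloudwatch_client.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="GPUMemoryUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for datapoint in sorted(metric_response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(datapoint["Timestamp"], round(datapoint["Average"], 2))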

SageMaker MMEs also provide model loading metrics to get model invocation-level insights:

  • ModelLoadingWaitTime – Time interval for the model to be downloaded or loaded
  • ModelUnloadingTime – Time interval to unload the model from the container
  • ModelDownloadingTime – Time to download the model from Amazon S3
  • ModelCacheHit – Number of invocations to models that are already loaded onto the container

In the following graph, we can observe that it took 8.22 seconds for a model to respond to an inference request (ModelLatency), and 24.1 milliseconds were added to the end-to-end latency due to SageMaker overhead (OverheadLatency). We can also see any error metrics from endpoint invocation API calls, such as Invocation4XXErrors and Invocation5XXErrors.

For more information about MME CloudWatch metrics, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments.

Conclusion

In this post, you learned about the new SageMaker multi-model support for GPU, which enables you to cost-effectively host hundreds of deep learning models on accelerated compute hardware. You learned how to use the NVIDIA Triton Inference Server to create a model repository configuration for different framework backends, and how to deploy an MME with auto scaling. This feature allows you to scale hundreds of hyper-personalized models that are fine-tuned to cater to unique end-user experiences in AI applications. You can also leverage this feature to achieve the desired price performance for your inference application using fractional GPUs.

To get started with MME support for GPU, see Multi-model endpoint support for GPU.


About the authors

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Vikram Elango is a Senior AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, US. Vikram helps global financial and insurance industry customers with design, implementation and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization, and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimization, and making the deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Deepti Ragha is a Software Development Engineer on the Amazon SageMaker team. Her current work focuses on building features to host machine learning models efficiently. In her spare time, she enjoys traveling, hiking, and growing plants.

Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed Deep Learning Systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.

Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA. He helps customers adopt machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Eliuth Triana is a Developer Relations Manager on the NVIDIA-AWS team. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon ML/DL workloads, EC2 products, and AWS AI services. In addition, Eliuth is a passionate mountain biker, skier, and poker player.

Maximiliano Maccanti is a Principal Engineer at AWS, currently with DynamoDB. He was on the SageMaker launch team at re:Invent 2017 and spent the following 5 years on the hosting platform adding all kinds of customer-facing features. In his free time, he collects, repairs, and plays with vintage video game consoles.
