
Design patterns for serial inference on Amazon SageMaker

As machine learning (ML) goes mainstream and gains wider adoption, ML-powered applications are becoming increasingly common to solve a range of complex business problems. The solution to these complex business problems often requires using multiple ML models. These models can be sequentially combined to perform various tasks, such as preprocessing, data transformation, model selection, inference generation, inference consolidation, and post-processing. Organizations need flexible options to orchestrate these complex ML workflows. Serial inference pipelines are one such design pattern to arrange these workflows into a series of steps, with each step enriching or further processing the output generated by the previous steps and passing the output to the next step in the pipeline.

Additionally, these serial inference pipelines should provide the following:

  • Flexible and customized implementation (dependencies, algorithms, business logic, and so on)
  • Repeatable and consistent behavior for production implementations
  • Reduced undifferentiated heavy lifting by minimizing infrastructure management

In this post, we look at some common use cases for serial inference pipelines and walk through implementation options for each of these use cases using Amazon SageMaker. We also discuss considerations for each of these implementation options.

The following table summarizes the different use cases for serial inference, along with implementation considerations and options, which are discussed in this post.

| Use Case | Use Case Description | Key Considerations | Overall Implementation Complexity | Recommended Implementation Option | Sample Code Artifacts and Notebooks |
| --- | --- | --- | --- | --- | --- |
| Serial inference pipeline (with preprocessing and postprocessing steps included) | The inference pipeline needs to preprocess incoming data before invoking a trained model for generating inferences, and then postprocess the generated inferences so that they can be easily consumed by downstream applications | Ease of implementation | Low | Inference container using the SageMaker Inference Toolkit | Deploy a Trained PyTorch Model |
| Serial inference pipeline (with preprocessing and postprocessing steps included) | The inference pipeline needs to preprocess incoming data before invoking a trained model for generating inferences, and then postprocess the generated inferences so that they can be easily consumed by downstream applications | Decoupling, simplified deployment, and upgrades | Medium | SageMaker inference pipeline | Inference Pipeline with Custom Containers and XGBoost |
| Serial model ensemble | The inference pipeline needs to host and arrange multiple models sequentially, so that each model enhances the inference generated by the previous one, before generating the final inference | Decoupling, simplified deployment and upgrades, flexibility in model framework selection | Medium | SageMaker inference pipeline | Inference Pipeline with Scikit-learn and Linear Learner |
| Serial inference pipeline (with targeted model invocation from a group) | The inference pipeline needs to invoke a specific customized model from a group of deployed models, based on request characteristics or for cost optimization, in addition to preprocessing and postprocessing tasks | Cost optimization and customization | High | SageMaker inference pipeline with multi-model endpoints (MMEs) | Amazon SageMaker Multi-Model Endpoints using Linear Learner |

In the following sections, we discuss each use case in more detail.

Serial inference pipeline using inference containers

Serial inference pipeline use cases have requirements to preprocess incoming data before invoking a pre-trained ML model for generating inferences. Additionally, in some cases, the generated inferences may need to be processed further so that they can be easily consumed by downstream applications. This is a common scenario for use cases where a streaming data source needs to be processed in real time before a model can be invoked on it. However, this use case can manifest for batch inference as well.

SageMaker provides an option to customize inference containers and use them to build a serial inference pipeline. Inference containers use the SageMaker Inference Toolkit and are built on SageMaker Multi Model Server (MMS), which provides a flexible mechanism to serve ML models. The following diagram illustrates a reference pattern of how to implement a serial inference pipeline using inference containers.

SageMaker MMS expects a Python script that implements the following functions to load the model, preprocess input data, get predictions from the model, and postprocess the output data:

  • input_fn() – Responsible for deserializing and preprocessing the input data
  • model_fn() – Responsible for loading the trained model from artifacts in Amazon Simple Storage Service (Amazon S3)
  • predict_fn() – Responsible for generating inferences from the model
  • output_fn() – Responsible for serializing and postprocessing the output data (inferences)
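
The following is a minimal sketch of such a handler script. It assumes a scikit-learn model serialized with joblib and a JSON request payload; these are illustrative assumptions, so adapt the model loading, preprocessing, and serialization logic to your own framework and data format.

# inference.py – minimal sketch of the handler functions described above.
# Assumes a scikit-learn model stored as model.joblib inside the model artifact.
import json
import os

import joblib
import numpy as np

def model_fn(model_dir):
    # Load the trained model from the artifacts SageMaker extracts from Amazon S3
    return joblib.load(os.path.join(model_dir, "model.joblib"))

def input_fn(request_body, request_content_type):
    # Deserialize and preprocess the incoming payload
    if request_content_type == "application/json":
        payload = json.loads(request_body)
        return np.array(payload["instances"], dtype=float)
    raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(input_data, model):
    # Generate inferences from the loaded model
    return model.predict(input_data)

def output_fn(prediction, accept):
    # Postprocess and serialize the inferences for the downstream consumer
    if accept == "application/json":
        return json.dumps({"predictions": prediction.tolist()})
    raise ValueError(f"Unsupported accept type: {accept}")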

For detailed steps to customize an inference container, refer to Adapting Your Own Inference Container.

Inference containers are an ideal design pattern for serial inference pipeline use cases with the following primary considerations:

  • High cohesion – The processing logic and corresponding model drive single business functionality and need to be co-located
  • Low overall latency – Minimal elapsed time between when an inference request is made and the response is received

In a serial inference pipeline, the processing logic and model are encapsulated within the same single container, so most of the invocation calls remain within the container. This helps reduce the overall number of hops, which results in better overall latency and responsiveness of the pipeline.

Also, for use cases where ease of implementation is an important criterion, inference containers can help, because the various processing steps of the pipeline are co-located within the same container.
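
As an illustration, the handler script sketched earlier can be packaged together with the model artifact and deployed behind a real-time endpoint using the SageMaker Python SDK. This is a sketch only; the artifact location, IAM role, and framework versions are placeholders, and you would substitute the framework model class that matches your own container.

from sagemaker.sklearn.model import SKLearnModel

# Wrap the model artifact and the handler script (inference.py) into a deployable model.
# The S3 path, role, and versions below are hypothetical placeholders.
model = SKLearnModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role=role,                        # assumes an IAM execution role is available as role
    entry_point="inference.py",       # the handler script sketched above
    framework_version="1.2-1",
    py_version="py3",
)

# Preprocessing, inference, and postprocessing are all hosted within a single container
predictor = model.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")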

Serial inference pipeline using a SageMaker inference pipeline

Another variation of the serial inference pipeline use case requires clearer decoupling between the various steps in the pipeline (such as data preprocessing, inference generation, data postprocessing, and formatting and serialization). This could be due to a variety of reasons:

  • Decoupling – Different steps of the pipeline have a clearly defined purpose and need to run on separate containers because of the underlying dependencies involved. This also helps keep the pipeline well structured.
  • Frameworks – Different steps of the pipeline use specific fit-for-purpose frameworks (such as scikit-learn or Spark ML) and therefore need to run on separate containers.
  • Resource isolation – Different steps of the pipeline have varying resource consumption requirements and therefore need to run on separate containers for more flexibility and control.

Furthermore, for slightly more complex serial inference pipelines, multiple steps may be involved to process a request and generate an inference. Therefore, from an operational standpoint, it may be beneficial to host these steps on separate containers for better functional isolation, and facilitate easier upgrades and enhancements (change one step without impacting other models or processing steps).

If your use case aligns with some of these considerations, a SageMaker inference pipeline provides an easy and flexible option to build a serial inference pipeline. The following diagram illustrates a reference pattern of how to implement a serial inference pipeline using multiple steps hosted on dedicated containers with a SageMaker inference pipeline.

[Figure: Serial inference pipeline implemented with a SageMaker inference pipeline]

A SageMaker inference pipeline consists of a linear sequence of 2–15 containers that process requests for inferences on data. The inference pipeline provides the option to use pre-trained SageMaker built-in algorithms or custom algorithms packaged in Docker containers. The containers are hosted on the same underlying instance, which helps reduce the overall latency and minimize cost.

The following code snippet shows how multiple processing steps and models can be combined to create a serial inference pipeline.

We start by building and specifying Spark ML and XGBoost-based models that we intend to use as part of the pipeline:

from sagemaker.model import Model
from sagemaker.pipeline_model import PipelineModel
from sagemaker.sparkml.model import SparkMLModel

# Location of the serialized Spark ML preprocessing model in Amazon S3
sparkml_data = 's3://{}/{}/{}'.format(s3_model_bucket, s3_model_key_prefix, 'model.tar.gz')
sparkml_model = SparkMLModel(model_data=sparkml_data)

# Wrap the artifacts of the previously trained XGBoost estimator (referenced here as
# xgb_model.model_data) together with the container image used to serve it
xgb_model = Model(model_data=xgb_model.model_data, image=training_image)

The models are then arranged sequentially within the pipeline model definition:

model_name = 'serial-inference-' + timestamp_prefix
endpoint_name = 'serial-inference-ep-' + timestamp_prefix
sm_model = PipelineModel(name=model_name, role=role, models=[sparkml_model, xgb_model])

The inference pipeline is then deployed behind an endpoint for real-time inference by specifying the type and number of host ML instances:

sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)

The entire assembled inference pipeline can be considered a SageMaker model that you can use to make either real-time predictions or process batch transforms directly, without any external preprocessing. Within an inference pipeline model, SageMaker handles invocations as a sequence of HTTP requests originating from an external application. The first container in the pipeline handles the initial request, performs some processing, and then dispatches the intermediate response as a request to the second container in the pipeline. This happens for each container in the pipeline, and SageMaker finally returns the final response to the calling client application.
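
Once deployed, the pipeline endpoint can be invoked like any other SageMaker real-time endpoint. The following is a minimal sketch using the SageMaker runtime client; the CSV payload shown is a placeholder and must match the schema expected by the first (Spark ML) container in the pipeline.

import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical CSV feature row; the first container transforms it before the XGBoost model
payload = "30,Private,Bachelors,Married,Tech-support,40"

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,   # endpoint created by sm_model.deploy(...)
    ContentType="text/csv",
    Accept="text/csv",
    Body=payload,
)
print(response["Body"].read().decode("utf-8"))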

SageMaker inference pipelines are fully managed. When the pipeline is deployed, SageMaker installs and runs all the defined containers on each of the Amazon Elastic Compute Cloud (Amazon EC2) instances provisioned as part of the endpoint or batch transform job. Furthermore, because the containers are co-located and hosted on the same EC2 instance, the overall pipeline latency is reduced.

Serial model ensemble using a SageMaker inference pipeline

An ensemble model is an approach in ML where multiple ML models are combined and used as part of the inference process to generate final inferences. The motivations for ensemble models could include improving accuracy, reducing model sensitivity to specific input features, and reducing single model bias, among others. In this post, we focus on the use cases related to a serial model ensemble, where multiple ML models are sequentially combined as part of a serial inference pipeline.

Let’s consider a specific example related to a serial model ensemble where we need to group a user’s uploaded images based on certain themes or topics. This pipeline could consist of three ML models:

  • Model 1 – Accepts an image as input and evaluates image quality based on image resolution, orientation, and more. This model then attempts to upscale the image quality and sends the processed images that meet a certain quality threshold to the next model (Model 2).
  • Model 2 – Accepts images validated through Model 1 and performs image recognition to identify objects, places, people, text, and other custom actions and concepts in images. The output from Model 2 that contains identified objects is sent to Model 3.
  • Model 3 – Accepts the output from Model 2 and performs natural language processing (NLP) tasks such as topic modeling to group images together based on themes. For example, images could be grouped based on location or people identified. The output (groupings) is sent back to the client application.
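
Such an ensemble can be assembled the same way as the earlier example: each model is wrapped as a SageMaker Model object and chained in a PipelineModel. The following is a minimal sketch; the three model objects, the names, and the instance type are hypothetical placeholders for the image quality, image recognition, and topic modeling models described above.

from sagemaker.pipeline_model import PipelineModel

# image_quality_model, image_recognition_model, and topic_model are assumed to be
# previously created SageMaker Model objects for the three stages described above
ensemble_model = PipelineModel(
    name="image-grouping-ensemble",
    role=role,
    models=[image_quality_model, image_recognition_model, topic_model],
)

ensemble_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",   # assumes the vision models benefit from a GPU instance
    endpoint_name="image-grouping-ensemble-ep",
)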

The following diagram illustrates a reference pattern of how to implement multiple ML models hosted on a serial model ensemble using a SageMaker inference pipeline.

[Figure: Serial model ensemble implemented with a SageMaker inference pipeline]

As discussed earlier, the SageMaker inference pipeline is managed, which enables you to focus on the ML model selection and development, while reducing the undifferentiated heavy lifting associated with building the serial ensemble pipeline.

Additionally, some of the considerations discussed earlier around decoupling, algorithm and framework choice for model development, and deployment are relevant here as well. For instance, because each model is hosted on a separate container, you have flexibility in selecting the ML framework that best fits each model and your overall use case. Furthermore, from a decoupling and operational standpoint, you can continue to upgrade or modify individual steps much more easily, without affecting other models.

The SageMaker inference pipeline is also integrated with the SageMaker Model Registry for model cataloging, versioning, metadata management, and governed deployment to production environments to support consistent operational best practices. The SageMaker inference pipeline is also integrated with Amazon CloudWatch to enable monitoring the multi-container models in inference pipelines. You can also get visibility into real-time metrics to better understand invocations and latency for each container in the pipeline, which helps with troubleshooting and resource optimization.

Serial inference pipeline (with targeted model invocation from a group) using a SageMaker inference pipeline

SageMaker multi-model endpoints (MMEs) provide a cost-effective solution to deploy a large number of ML models behind a single endpoint. The motivations for using multi-model endpoints could include invoking a specific customized model based on request characteristics (such as origin, geographic location, user personalization, and so on) or simply hosting multiple models behind the same endpoint to achieve cost optimization.
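
As an illustration, the SageMaker Python SDK provides a MultiDataModel class for hosting a group of model artifacts behind a single endpoint. The following is a minimal sketch of that building block on its own, not the full pipeline shown in the diagram below; the S3 prefix, serving image, role, and names are placeholders.

from sagemaker.multidatamodel import MultiDataModel

# Every model artifact stored under the S3 prefix becomes invocable from one endpoint.
# The prefix, image URI, and role below are hypothetical placeholders.
mme = MultiDataModel(
    name="serial-inference-mme",
    model_data_prefix="s3://my-bucket/mme-artifacts/",
    image_uri=serving_image_uri,      # container image that can serve the model artifacts
    role=role,
)

mme_predictor = mme.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    endpoint_name="serial-inference-mme-ep",
)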

When you deploy multiple models on a single multi-model enabled endpoint, all models share the compute resources and the model serving container. The SageMaker inference pipeline can be deployed on an MME, where one of the containers in the pipeline can dynamically serve requests based on the specific model being invoked. From a pipeline perspective, the models have identical preprocessing requirements and expect the same feature set, but are trained to align to a specific behavior. The following diagram illustrates a reference pattern of how this integrated pipeline would work.

[Figure: SageMaker inference pipeline with a multi-model endpoint]

With MMEs, the inference request that originates from the client application should specify the target model that needs to be invoked. The first container in the pipeline handles the initial request, performs some processing, and then dispatches the intermediate response as a request to the second container in the pipeline, which hosts multiple models. Based on the target model specified in the inference request, the model is invoked to generate an inference. The generated inference is sent to the next container in the pipeline for further processing. This happens for each subsequent container in the pipeline, and finally SageMaker returns the final response to the calling client application.
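
The following is a minimal sketch of such a client request, assuming a CSV payload and a target model artifact named model-group-a.tar.gz; both are hypothetical and must match what you actually deployed.

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,           # name of the deployed pipeline endpoint
    ContentType="text/csv",
    TargetModel="model-group-a.tar.gz",   # selects the specific model to invoke in the MME container
    Body="5.1,3.5,1.4,0.2",               # hypothetical feature row
)
print(response["Body"].read().decode("utf-8"))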

Multiple model artifacts are persisted in an S3 bucket. When a specific model is invoked, SageMaker dynamically loads it onto the container hosting the endpoint. If the model is already loaded in the container’s memory, invocation is faster because SageMaker doesn’t need to download it from Amazon S3. If instance memory utilization is high and a new model needs to be loaded, unused models are unloaded from memory. The unloaded models remain on the instance’s storage volume, however, and can be loaded into the container’s memory later without being downloaded from the S3 bucket again.

One of the key considerations when using MMEs is to understand model invocation latency behavior. As discussed earlier, models are dynamically loaded into the memory of the container on the instance hosting the endpoint when they are invoked. Therefore, the first invocation of a model may take longer. When the model is already in the container’s memory, subsequent invocations are faster. If the instance’s memory utilization is high and a new model needs to be loaded, unused models are unloaded. If the instance’s storage volume is full, unused models are deleted from the storage volume. SageMaker fully manages the loading and unloading of models, without you having to take any specific actions. However, it’s important to understand this behavior because it has implications for model invocation latency and therefore overall end-to-end latency.

Pipeline hosting options

SageMaker provides multiple instance type options to select from for deploying ML models and building out inference pipelines, based on your use case, throughput, and cost requirements. For example, you can choose CPU or GPU optimized instances to build serial inference pipelines, on a single container or across multiple containers. However, you may sometimes need the flexibility to run models on both CPU-based and GPU-based instances within the same pipeline.

You can now use NVIDIA Triton Inference Server to serve models for inference on SageMaker for heterogeneous compute requirements. Check out Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker for more details.

Conclusion

As organizations discover and build new solutions powered by ML, the tools required for orchestrating these pipelines should be flexible enough to support a given use case, while simplifying and reducing the ongoing operational overhead. SageMaker provides multiple options to design and build these serial inference workflows, based on your requirements.

We look forward to hearing from you about what use cases you’re building using serial inference pipelines. If you have questions or feedback, please share them in the comments.


About the Authors

Rahul Sharma is a Senior Solutions Architect at AWS Data Lab, helping AWS customers design and build AI/ML solutions. Prior to joining AWS, Rahul spent several years in the finance and insurance sector, helping customers build data and analytical platforms.

Anand Prakash is a Senior Solutions Architect at AWS Data Lab. Anand focuses on helping customers design and build AI/ML, data analytics, and database solutions to accelerate their path to production.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-size startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and making machine learning more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
