Introducing The Amazon SageMaker Serverless Inference Benchmarking Toolkit

Republished by Plato


Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale machine learning (ML) models. It provides a pay-per-use model, which is ideal for services where endpoint invocations are infrequent and unpredictable. Unlike a real-time hosting endpoint, which is backed by a long-running instance, compute resources for serverless endpoints are provisioned on demand, thereby eliminating the need to choose instance types or manage scaling policies.

The following high-level architecture illustrates how a serverless endpoint works. A client invokes an endpoint, which is backed by AWS managed infrastructure.

However, serverless endpoints are prone to cold starts on the order of seconds, and are therefore more suitable for intermittent or unpredictable workloads.

To help determine whether a serverless endpoint is the right deployment option from a cost and performance perspective, we have developed the SageMaker Serverless Inference Benchmarking Toolkit, which tests different endpoint configurations and compares the most optimal one against a comparable real-time hosting instance.

In this post, we introduce the toolkit and provide an overview of its configuration and outputs.

Solution overview

You can download the toolkit and install it from the GitHub repo. Getting started is easy: simply install the library, create a SageMaker model, and provide the name of your model along with a JSON lines formatted file containing a sample set of invocation parameters, including the payload body and content type. A convenience function is provided to convert a list of sample invocation arguments to a JSON lines file or a pickle file for binary payloads such as images, video, or audio.
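To make the expected input format concrete, the following is a minimal sketch of how a JSON lines file of invocation arguments could be written by hand with the Python standard library. It is not the toolkit's convenience function: the helper name write_invoke_args_jsonl and the output file name are hypothetical, and the sketch covers text payloads only.

```python
import json
import tempfile
from pathlib import Path


def write_invoke_args_jsonl(invoke_args, output_dir):
    """Write one JSON object per line (hypothetical helper, not the toolkit's API)."""
    output_file = Path(output_dir) / "invoke_args_examples.jsonl"  # assumed file name
    with output_file.open("w", encoding="utf-8") as f:
        for args in invoke_args:
            f.write(json.dumps(args) + "\n")
    return output_file


# Each entry holds the payload body and its content type
example_invoke_args = [
    {"Body": "1,2,3,4,5", "ContentType": "text/csv"},
    {"Body": "6,7,8,9,10", "ContentType": "text/csv"},
]

# Write to a throwaway directory and read the lines back
jsonl_path = write_invoke_args_jsonl(example_invoke_args, tempfile.mkdtemp())
lines = jsonl_path.read_text(encoding="utf-8").splitlines()
print(lines[0])
```

Binary payloads such as images cannot be serialized this way, which is why a pickle file is offered for them.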

Install the toolkit

First install the benchmarking library into your Python environment using pip:

pip install sm-serverless-benchmarking

You can run the following code from an Amazon SageMaker Studio instance, a SageMaker notebook instance, or any instance with programmatic access to AWS and the appropriate AWS Identity and Access Management (IAM) permissions. The requisite IAM permissions are documented in the GitHub repo. For additional guidance and example policies for IAM, refer to How Amazon SageMaker Works with IAM. This code runs a benchmark with a default set of parameters on a model that expects a CSV input with two example records. It's good practice to provide a representative set of examples to analyze how the endpoint performs with different input payloads.

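from sm_serverless_benchmarking import benchmark
from sm_serverless_benchmarking.utils import convert_invoke_args_to_jsonl

# Name of your SageMaker model (fill in your own)
model_name = ""

# Sample invocation arguments: one CSV record per example
example_invoke_args = [
    {"Body": "1,2,3,4,5", "ContentType": "text/csv"},
    {"Body": "6,7,8,9,10", "ContentType": "text/csv"},
]

# Write the examples to a JSON lines file in the current directory
example_args_file = convert_invoke_args_to_jsonl(example_invoke_args, output_path=".")

# Run the benchmarks with the default parameters
r = benchmark.run_serverless_benchmarks(model_name, example_args_file)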

Additionally, you can run the benchmark as a SageMaker Processing job, which may be a more reliable option for longer-running benchmarks with a large number of invocations. See the following code:

from sm_serverless_benchmarking.sagemaker_runner import run_as_sagemaker_job

# Fill in the IAM role, the model name, and the path to the example arguments file
run_as_sagemaker_job(
    role="",
    model_name="",
    invoke_args_examples_file="",
)

Note that this incurs the additional cost of running an ml.m5.large SageMaker Processing instance for the duration of the benchmark.

Both methods accept a number of configuration parameters, such as a list of memory configurations to benchmark and the number of times each configuration will be invoked. In most cases, the default options should suffice as a starting point, but refer to the GitHub repo for the complete list and a description of each parameter.

Benchmarking configuration

Before delving into what the benchmark does and what outputs it produces, it’s important to understand a few key concepts when it comes to configuring serverless endpoints.

There are two key configuration options: MemorySizeInMB and MaxConcurrency. MemorySizeInMB configures the amount of memory that is allocated to the instance, and can be 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. The number of vCPUs also scales proportionally to the amount of memory allocated. The MaxConcurrency parameter adjusts how many concurrent requests an endpoint is able to service. With a MaxConcurrency of 1, a serverless endpoint can only process a single request at a time.

To summarize, the MemorySizeInMB parameter provides a mechanism for vertical scalability, allowing you to adjust memory and compute resources to serve larger models, whereas MaxConcurrency provides a mechanism for horizontal scalability, allowing your endpoint to process more concurrent requests.
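As an illustration of where these two options live, the following sketch builds a production variant dictionary with a ServerlessConfig block and validates the memory size against the allowed values. The helper name build_serverless_variant, the variant name, and the model and config names are assumptions for illustration. The boto3 call is shown only as a comment because it requires AWS credentials.

```python
# Memory sizes allowed for serverless endpoints, as listed in this post
VALID_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_variant(model_name, memory_size_in_mb, max_concurrency):
    """Build a production variant dict with a ServerlessConfig block (hypothetical helper)."""
    if memory_size_in_mb not in VALID_MEMORY_SIZES_MB:
        raise ValueError(f"MemorySizeInMB must be one of {VALID_MEMORY_SIZES_MB}")
    if max_concurrency < 1:
        raise ValueError("MaxConcurrency must be at least 1")
    return {
        "ModelName": model_name,
        "VariantName": "AllTraffic",  # assumed variant name
        "ServerlessConfig": {
            "MemorySizeInMB": memory_size_in_mb,  # vertical scaling
            "MaxConcurrency": max_concurrency,    # horizontal scaling
        },
    }


variant = build_serverless_variant("my-model", 2048, 5)  # "my-model" is a placeholder
print(variant["ServerlessConfig"])

# With AWS credentials configured, the variant could be passed to boto3, for example:
# import boto3
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="my-serverless-config",  # placeholder name
#     ProductionVariants=[variant],
# )
```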

The cost of operating an endpoint is largely determined by the memory size, and there is no cost associated with increasing the max concurrency. However, there is a per-Region account limit for max concurrency across all endpoints. Refer to SageMaker endpoints and quotas for the latest limits.

Benchmarking outputs

Given this, the goal of benchmarking a serverless endpoint is to determine the most cost-effective and reliable memory size setting, and the minimum max concurrency that can handle your expected traffic patterns.
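As a rough starting point for the max concurrency, Little's law says that the average number of requests in flight equals the arrival rate multiplied by the average latency. The sketch below applies that rule of thumb. It is not how the toolkit selects a value, and the traffic figures and the headroom factor are made-up assumptions.

```python
import math


def estimate_max_concurrency(requests_per_second, mean_latency_seconds, headroom=1.0):
    """Estimate concurrent requests in flight via Little's law (L = lambda * W)."""
    in_flight = requests_per_second * mean_latency_seconds * headroom
    # A serverless endpoint needs a MaxConcurrency of at least 1
    return max(1, math.ceil(in_flight))


# Hypothetical traffic: 20 requests per second at 250 ms mean latency
baseline = estimate_max_concurrency(20, 0.25)
# Same traffic with 50% headroom for bursts
with_headroom = estimate_max_concurrency(20, 0.25, headroom=1.5)
print(baseline, with_headroom)
```

An estimate like this is only a first guess, and it does not account for cold starts or bursty arrivals.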

By default, the tool runs two benchmarks. The first is a stability benchmark, which deploys an endpoint for each of the specified memory configurations and invokes each endpoint with the provided sample payloads. The goal of this benchmark is to determine the most effective and stable MemorySizeInMB setting. The benchmark captures the invocation latencies and computes the expected per-invocation cost for each endpoint. It then compares the cost against a similar real-time hosting instance.
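The cost comparison can be sketched as follows. This is a simplified illustration, not the toolkit's calculation: it assumes the billed duration equals the measured invocation latency, and every price and latency below is a hypothetical placeholder rather than an AWS price or a measured result.

```python
import statistics


def cost_per_invocation(latencies_ms, price_per_second):
    """Expected cost of one invocation: mean duration in seconds times the per-second price."""
    mean_seconds = statistics.mean(latencies_ms) / 1000
    return mean_seconds * price_per_second


def break_even_invocations(per_invocation_cost, instance_price_per_hour, hours=730):
    """Monthly invocation count at which serverless cost equals an always-on instance."""
    return (instance_price_per_hour * hours) / per_invocation_cost


# All figures below are hypothetical placeholders
latencies_ms = [100, 120, 80, 100]
serverless_price_per_second = 0.00004  # assumed price for one memory size
instance_price_per_hour = 0.10         # assumed real-time instance price

cost = cost_per_invocation(latencies_ms, serverless_price_per_second)
break_even = break_even_invocations(cost, instance_price_per_hour)
print(f"cost per invocation: ${cost:.8f}")
print(f"break-even invocations per month: {break_even:,.0f}")
```

Below the break-even volume the serverless endpoint is cheaper under these assumptions; above it, the always-on instance is.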

When the benchmarking is complete, the tool generates several outputs in the specified result_save_path directory with the following directory structure:

├── benchmarking_report
├── concurrency_benchmark_raw_results
├── concurrency_benchmark_summary_results
├── cost_analysis_summary_results
├── stability_benchmark_raw_results
└── stability_benchmark_summary_results

The benchmarking_report directory contains a consolidated report with all the summarized outputs that we outline in this post. Additional directories contain raw and intermediate outputs that you can use for additional analyses. Refer to the GitHub repo for a more detailed description of each output artifact.
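For a quick look at what a run produced, a small helper like the one below can list the files in each output subdirectory. It is a generic sketch that makes no assumption about the file formats the toolkit writes. The helper name list_benchmark_outputs and the demo file name are hypothetical, and the demonstration runs against a throwaway directory rather than real benchmark results.

```python
import tempfile
from pathlib import Path


def list_benchmark_outputs(result_save_path):
    """Map each output subdirectory name to the sorted names of the files it contains."""
    root = Path(result_save_path)
    return {
        sub.name: sorted(p.name for p in sub.iterdir() if p.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


# Demonstration on a throwaway directory that mimics two of the output folders
demo_root = Path(tempfile.mkdtemp())
for name in ("benchmarking_report", "stability_benchmark_raw_results"):
    (demo_root / name).mkdir()
# Hypothetical file name, used only to populate the demo directory
(demo_root / "benchmarking_report" / "report.html").write_text("placeholder")

outputs = list_benchmark_outputs(demo_root)
for directory, files in outputs.items():
    print(directory, files)
```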

Let's examine a few actual benchmarking outputs for an endpoint serving a computer vision MobileNetV2 TensorFlow model. If you'd like to reproduce this example, refer to the example notebooks directory in the GitHub repo.

The first output within the consolidated report is a summary table that provides the minimum, mean, median, and maximum latency metrics for each successful MemorySizeInMB configuration. As shown in the following table, the average invocation latency (invocation_latency_mean) continued to improve as the memory configuration was increased to 3072 MB, but stopped improving thereafter.