Deploy BLOOM-176B And OPT-30B On Amazon SageMaker With Large Model Inference Deep Learning Containers And DeepSpeed

Republicat de Platon

Urmaritori: 0

The last few years have seen rapid development in the field of deep learning. Although hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly encounter issues deploying their large deep learning models for applications such as natural language processing (NLP).

In an earlier post, we discussed capabilities and configurable settings in Amazon SageMaker model deployment that can make inference with these large models easier. Today, we announce a new Amazon SageMaker Deep Learning Container (DLC) that you can use to get started with large model inference in a matter of minutes. This DLC packages some of the most popular open-source libraries for model parallel inference, such as DeepSpeed and Hugging Face Accelerate.

In this post, we use a new SageMaker large model inference DLC to deploy two of the most popular large NLP models: BigScience’s BLOOM-176B și a lui Meta OPT-30B from the Hugging Face repository. In particular, we use Deep Java Library (DJL) serving and tensor parallelism techniques from DeepSpeed to achieve 0.1 second latency per token in a text generation use case.

You can find our complete example notebooks in our GitHub depozit.

Large model inference techniques

Language models have recently exploded in both size and popularity. With easy access from model zoos such as Hugging Face and improved accuracy and performance in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, large models are often too big to fit within the memory of a single accelerator. For example, the BLOOM-176B model can require more than 350 gigabytes of accelerator memory, which far exceeds the capacity of hardware accelerators available today. This necessitates the use of model parallel techniques from libraries like DeepSpeed and Hugging Face Accelerate to distribute a model across multiple accelerators for inference. In this post, we use the SageMaker large model inference container to generate and compare latency and throughput performance using these two open-source libraries.

DeepSpeed and Accelerate use different techniques to optimize large language models for inference. The key difference is DeepSpeed’s use of optimized kernels. These kernels can dramatically improve inference latency by reducing bottlenecks in the computation graph of the model. Optimized kernels can be difficult to develop and are typically specific to a particular model architecture; DeepSpeed supports popular large models such as OPT and BLOOM with these optimized kernels. In contrast, Hugging Face’s Accelerate library doesn’t include optimized kernels at the time of writing. As we discuss in our results section, this difference is responsible for much of the performance edge that DeepSpeed has over Accelerate.

A second difference between DeepSpeed and Accelerate is the type of model parallelism. Accelerate uses pipeline parallelism to partition a model between the hidden layers of a model, whereas DeepSpeed uses tensor parallelism to partition the layers themselves. Pipeline parallelism is a flexible approach that supports more model types and can improve throughput when larger batch sizes are used. Tensor parallelism requires more communication between GPUs because model layers can be spread across multiple devices, but can improve inference latency by engaging multiple GPUs simultaneously. You can learn more about parallelism techniques in Introducere în paralelismul modelului și Paralelism de model.

Prezentare generală a soluțiilor

To effectively host large language models, we need features and support in the following key areas:

Building and testing solutions – Given the iterative nature of ML development, we need the ability to build, rapidly iterate, and test how the inference endpoint will behave when these models are hosted, including the ability to fail fast. These models can typically be hosted only on larger instances like p4dn or g5, and given the size of the models, it can take a while to spin up an inference instance and run any test iteration. Local testing usually has constraints because you need a similar instance in size to test, and these models aren’t easy to obtain.
Deploying and running at scale – The model files need to be loaded onto the inference instances, which presents a challenge in itself given the size. Tar / Un-Tar as an example for the Bloom-176B takes about 1 hour to create and another hour to load. We need an alternate mechanism to allow easy access to the model files.
Loading the model as singleton – For a multi-worker process, we need to ensure the model gets loaded only once so we don’t run into race conditions and further spend unnecessary resources. In this post, we show a way to load directly from Serviciul Amazon de stocare simplă (Amazon S3). However, this only works if we use the default settings of the DJL. Furthermore, any scaling of the endpoints needs to be able to spin up in a few minutes, which calls for reconsidering how the models might be loaded and distributed.
Sharding frameworks – These models typically need to be , usually by a tensor parallelism mechanism or by pipeline sharding as the typical sharding techniques, and we have advanced concepts like ZeRO sharding built on top of tensor sharding. For more information about sharding techniques, refer to Paralelism de model. To achieve this, we can have various combinations and use frameworks from NIVIDIA, DeepSpeed, and others. This needs the ability to test BYOC or use 1P containers and iterate over solutions and run benchmarking tests. You might also want to test various hosting options like asynchronous, serverless, and others.
Hardware selection – Your choice in hardware is determined by all the aforementioned points and further traffic patterns, use case needs, and model sizes.

In this post, we use DeepSpeed’s optimized kernels and tensor parallelism techniques to host BLOOM-176B and OPT-30B on SageMaker. We also compare results from Accelerate to demonstrate the performance benefits of optimized kernels and tensor parallelism. For more information on DeepSpeed and Accelerate, refer to DeepSpeed Inference: Permiterea inferenței eficiente a modelelor de transformatoare la o scară fără precedent și Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate.

We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about the DJL and DJLServing, refer to Implementați modele mari pe Amazon SageMaker folosind DJLServing și inferența paralelă a modelului DeepSpeed.

It’s worth noting that optimized kernels can result in precision changes and a modified computation graph, which could theoretically result in changed model behavior. Although this could occasionally change the inference outcome, we do not expect these differences to materially impact the basic evaluation metrics of a model. Nevertheless, practitioners are advised to confirm the model outputs are as expected when using these kernels.

The following steps demonstrate how to deploy a BLOOM-176B model in SageMaker using DJLServing and a SageMaker large model inference container. The complete example is also available in our GitHub depozit.

Using the DJLServing SageMaker DLC image

Use the following code to use the DJLServing SageMaker DLC image after replacing the region with your specific region you are running the notebook in:

763104351884.dkr.ecr..amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113
# example uri might be like 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113

Creați fișierul nostru model

Mai întâi, creăm un fișier numit serving.properties that contains only one line of code. This tells the DJL model server to use the DeepSpeed engine. The file contains the following code:

engine=DeepSpeed

serving.properties is a file defined by DJLServing that is used to configure per-model configuration.

În continuare, ne creăm model.py file, which defines the code needed to load and then serve the model. In our code, we read in the TENSOR_PARALLEL_DEGREE environment variable (the default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Note that DeepSpeed provides a few built-in partition definitions, including one for BLOOM models. We use it by specifying replace_method și relpace_with_kernel_inject. If you have a customized model and need DeepSpeed to partition effectively, you need to change relpace_with_kernel_inject la false și adăugați injection_policy pentru a face partiția de rulare să funcționeze. Pentru mai multe informații, consultați Inițializare pentru inferență. For our example, we used the pre-partitioned BLOOM model on DeepSpeed.

Secondly, in the model.py file, we also load the model from Amazon S3 after the endpoint has been spun up. The model is loaded into the /tmp space on the container because SageMaker maps the /tmp la Magazin Amazon Elastic Block (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB. For instances like p4dn, which come pre-built with the volume instance, we can continue to leverage the /tmp on the container. See the following code:

from djl_python import Input, Output
import os
import deepspeed
import torch
import torch.distributed as dist
import sys
import subprocess
import time
from glob import glob
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from transformers.models.opt.modeling_opt import OPTDecoderLayer

predictor = None

def check_config():
    local_rank = os.getenv('LOCAL_RANK')
    
    if not local_rank:
        return False
    return True
    
def get_model():

    if not check_config():
        raise Exception("DJL:DeepSpeed configurations are not default. This code does not support non default configurations") 
    
    tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model_dir = "/tmp/model"
    bucket = os.environ.get("MODEL_S3_BUCKET")
    key_prefix = os.environ.get("MODEL_S3_PREFIX")
    print(f"rank: {local_rank}")
    if local_rank == 0:
        if f"{model_dir}/DONE" not in glob(f"{model_dir}/*"):
            print("Starting Model downloading files")
            try:
                proc_run = subprocess.run(
                    ["aws", "s3", "cp", "--recursive", f"s3://{bucket}/{key_prefix}", model_dir]
                )
                print("Model downloading finished")
                # write file when download complete. Could use dist.barrier() but this makes it easier to check if model is downloaded in case of retry
                with open(f"{model_dir}/DONE", "w") as f:
                    f.write("download_complete")
                    
                proc_run.check_returncode() # to throw the error in case there was one
                
            except subprocess.CalledProcessError as e:
                print ( "Model download failed: Error:nreturn code: ", e.returncode, "nOutput: ", e.stderr )
                raise # FAIL FAST  
                               
    dist.barrier()
                
    
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    # has to be FP16 as Int8 model loading not yet supported
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_dir), torch_dtype=torch.bfloat16
        )
    model = model.eval()
    
    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        dtype=torch.int8,
        base_dir = model_dir,
        checkpoint=os.path.join(model_dir, "ds_inference_config.json"),
        replace_method='auto',
        replace_with_kernel_inject=True
    )

    model = model.module
    dist.barrier()
    return model, tokenizer

DJLServing manages the runtime installation on any pip packages defined in requirement.txt. This file will have:

awscli
boto3

We have created a directory called code si model.py, serving.properties, și requirements.txt files are already created in this directory. To view the files, you can run the following code from the terminal:

mkdir -p code
cat code/model.py 
cat code/serving.properties 
cat code/requirements.txt

The following figure shows the structure of the model.tar.gz.

Lastly, we create the model file and upload it to Amazon S3:

tar cvfz model.tar.gz code
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

Download and store the model from Hugging Face (Optional)

We have provided the steps in this section in case you want to download the model to Amazon S3 and use it from there. The steps are provided in the Jupyter file on GitHub. The following screenshot shows a snapshot of the steps.

Creați un model SageMaker

Acum creăm un Modelul SageMaker. Noi folosim Registrul Amazon de containere elastice (Amazon ECR) imagine furnizată de și artefactul model de la pasul anterior pentru a crea modelul SageMaker. În configurarea modelului, configuram TENSOR_PARALLEL_DEGREE=8, ceea ce înseamnă că modelul este partiționat de-a lungul a 8 GPU-uri. Vezi următorul cod:

PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {
            "MODEL_S3_BUCKET": bucket,
            "MODEL_S3_PREFIX": s3_model_prefix,
            "TENSOR_PARALLEL_DEGREE": "8",
},

After you run the preceding cell in the Jupyter file, you see output similar to the following:

{
    "ModelArn": "arn:aws:sagemaker:us-east-1::model/bloom-djl-ds-"
}

Creați un punct final SageMaker

You can use any instances with multiple GPUs for testing. In this demo, we use a p4d.24xlarge instance. In the following code, note how we set the ModelDataDownloadTimeoutInSeconds, ContainerStartupHealthCheckTimeoutInSeconds, și VolumeSizeInGB parametrii pentru a se potrivi cu dimensiunea mare a modelului. The VolumeSizeInGB parametrul este aplicabil instanțelor GPU care acceptă atașarea volumului EBS.

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 200,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)'

În cele din urmă, creăm un punct final SageMaker:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)

Îl vezi tipărit în următorul cod:

{
    "EndpointArn": "arn:aws:sagemaker:us-east-1::endpoint/bloom-djl-ds-"
}

Pornirea punctului final poate dura ceva timp. Puteți încerca încă de câteva ori dacă dați peste InsufficientInstanceCapacity eroare sau puteți trimite o solicitare către AWS pentru a crește limita din contul dvs.

Reglarea performanței

Dacă intenționați să utilizați această postare și notebook-ul însoțitor cu un alt model, poate doriți să explorați unii dintre parametrii reglabili pe care îi oferă SageMaker, DeepSpeed și DJL. Experimentarea iterativă cu acești parametri poate avea un impact material asupra latenței, debitului și costului modelului mare găzduit. Pentru a afla mai multe despre parametrii de reglare, cum ar fi numărul de lucrători, gradul de paralelism tensor, dimensiunea cozii de joburi și altele, consultați DJL Serving configurations și Implementați modele mari pe Amazon SageMaker folosind DJLServing și inferența paralelă a modelului DeepSpeed.

REZULTATE

In this post, we used DeepSpeed to host BLOOM-176B and OPT-30B on SageMaker ML instances. The following table summarizes our performance results, including a comparison with Hugging Face’s Accelerate. Latency reflects the number of milliseconds it takes to produce a 256-token string four times (batch_size=4) from the model. Throughput reflects the number of tokens produced per second for each test. For Hugging Face Accelerate, we used the library’s default loading with GPU memory mapping. For DeepSpeed, we used its faster checkpoint loading mechanism.

Model	Bibliotecă	Model Precision	Dimensiunea lotului	Parallel Degree	instanță	Time to Load (E)	Latency (4 x 256 Token Output)			.
.	.	.	.	.	.	.	P50 (Domnișoară)	P90 (Domnișoară)	P99 (Domnișoară)	tranzitată (tokens/sec)
BLOOM-176B	DeepSpeed	INT8	4	8	p4d.24xlarge	74.9	27,564	27,580	32,179	37.1
BLOOM-176B	Accelera	INT8	4	8	p4d.24xlarge	669.4	92,694	92,735	103,292	11.0
OPT-30B	DeepSpeed	FP16	4	4	g5.24xlarge	239.4	11,299	11,302	11,576	90.6
OPT-30B	Accelera	FP16	4	4	g5.24xlarge	533.8	63,734	63,737	67,605	16.1

From a latency perspective, DeepSpeed is about 3.4 times faster for BLOOM-176B and 5.6 times faster for OPT-30B than Accelerate. DeepSpeed’s optimized kernels are responsible for much of this difference in latency. Given these results, we recommend using DeepSpeed over Accelerate if your model of choice is supported.

It’s also worth noting that model loading times with DeepSpeed were much shorter, making it a better option if you anticipate needing to quickly scale up your number of endpoints. Accelerate’s more flexible pipeline parallelism technique may be a better option if you have models or model precisions that aren’t supported by DeepSpeed.

These results also demonstrate the difference in latency and throughput of different model sizes. In our tests, OPT-30B generates 2.4 times the number of tokens per unit time than BLOOM-176B on an instance type that is more than three times cheaper. On a price per unit throughput basis, OPT-30B on a g5.24xl instance is 8.9 times better than BLOOM-176B on a p4d.24xl instance. If you have strict latency, throughput, or cost limitations, consider using the smallest model possible that will still achieve functional requirements.

A curăța

As part of best practices it is always recommended to delete idle instances. The below code shows you how to delete the instances.

# - Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)

# - In case the end point failed we still want to delete the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

Optionally delete the model check point from your S3

!aws s3 rm --recursive s3:///{s3_model_prefix}

Concluzie

In this post, we demonstrated how to use SageMaker large model inference containers to host two large language models, BLOOM-176B and OPT-30B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker ML instance.

For more details about Amazon SageMaker and its large model inference capabilities, refer to Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas și Inferință în timp real.

Despre autori

Simon Zamarin este un arhitect de soluții AI/ML al cărui obiectiv principal este să îi ajute pe clienți să extragă valoare din activele lor de date. În timpul liber, lui Simon îi place să petreacă timpul cu familia, să citească SF și să lucreze la diverse proiecte de bricolaj.

Rupinder Grewal este un arhitect specializat în soluții Sr Ai/ML cu AWS. În prezent, se concentrează pe servirea modelelor și a MLOps-ului pe SageMaker. Înainte de acest rol, a lucrat ca inginer de învățare automată, construind și găzduind modele. În afara serviciului, îi place să joace tenis și să meargă cu bicicleta pe traseele montane.

Frank Liu este inginer software pentru AWS Deep Learning. El se concentrează pe construirea de instrumente inovatoare de deep learning pentru inginerii software și oamenii de știință. În timpul liber, îi place drumețiile cu prietenii și familia.

Alan Tan este manager de produs senior, iar SageMaker conduce eforturile pentru inferența modelelor mari. Este pasionat de aplicarea Machine Learning-ului în domeniul Analytics. În afara serviciului, se bucură de aer liber.

Dhawal Patel este arhitect principal de învățare automată la AWS. El a lucrat cu organizații, de la întreprinderi mari până la startup-uri mijlocii, pe probleme legate de calculul distribuit și inteligența artificială. El se concentrează pe învățarea profundă, inclusiv pe domeniile NLP și Computer Vision. El îi ajută pe clienți să obțină inferențe de model de înaltă performanță pe SageMaker.

Qing Lan este inginer de dezvoltare software în AWS. El a lucrat la mai multe produse provocatoare în Amazon, inclusiv soluții de inferență ML de înaltă performanță și un sistem de înregistrare de înaltă performanță. Echipa Qing a lansat cu succes primul model cu miliard de parametri în Amazon Advertising, cu o latență foarte scăzută necesară. Qing are cunoștințe aprofundate despre optimizarea infrastructurii și accelerarea Deep Learning.

Qingwei Li este specialist în învățare automată la Amazon Web Services. Și-a primit doctoratul. în Cercetare operațională, după ce a spart contul de grant de cercetare al consilierului său și nu a reușit să dea Premiul Nobel pe care l-a promis. În prezent, ajută clienții din industria serviciilor financiare și a asigurărilor să construiască soluții de învățare automată pe AWS. În timpul liber, îi place să citească și să predea.

Robert Van Dusen este Senior Product Manager cu Amazon SageMaker. El conduce optimizarea modelelor de învățare profundă pentru aplicații precum inferența modelului mare.

Siddharth Venkatesan este inginer software în AWS Deep Learning. În prezent, se concentrează pe construirea de soluții pentru inferența modelelor mari. Înainte de AWS, a lucrat în organizația Amazon Grocery, creând noi funcții de plată pentru clienții din întreaga lume. În afara serviciului, îi place să schieze, în aer liber și să urmărească sporturi.

Timestamp-ul: Noiembrie 4, 2022Noiembrie 4, 2022

Mai mult de la Învățare automată AWS

Implementați mii de ansambluri de modele cu puncte finale multimodel Amazon SageMaker pe GPU pentru a minimiza costurile de găzduire | Amazon Web Services

Învățare automată AWS

Nodul sursă: 1822010

Timestamp-ul: Aprilie 4, 2023

Integrați platformele SaaS cu Amazon SageMaker pentru a activa aplicațiile bazate pe ML | Amazon Web Services

Cluster sursă:

Învățare automată AWS

Nodul sursă: 1856614

Timestamp-ul: Iulie 6, 2023

Implementați BLOOM-176B și OPT-30B pe Amazon SageMaker cu modele mari de inferență Deep Learning Containers și DeepSpeed

Republicat de Platon

Large model inference techniques

Prezentare generală a soluțiilor

Using the DJLServing SageMaker DLC image

Creați fișierul nostru model

Download and store the model from Hugging Face (Optional)

Creați un model SageMaker

Creați un punct final SageMaker

Reglarea performanței

REZULTATE

A curăța

Concluzie

Despre autori

Mai mult de la Învățare automată AWS

AWS dezvăluie noi funcții și îmbunătățiri ale serviciului AI la re:Invent 2022

Utilizați serviciile AWS AI și ML pentru a promova accesibilitatea și includerea persoanelor cu deficiențe de vedere sau de comunicare

Implementați RStudio în mediul dvs. AWS și accesați lacul de date folosind permisiunile AWS Lake Formation

Bongo Learn oferă feedback în timp real pentru a îmbunătăți rezultatele învățării cu Amazon Transcribe

Implementați un model de diarizare a difuzorului Hugging Face (PyAnnote) pe Amazon SageMaker ca punct final asincron | Amazon Web Services

Creați o soluție de verificare a vaccinării folosind funcția Interogări din Amazon Text | Amazon Web Services

Urmărirea poziției mingii în nor cu PGA TOUR | Amazon Web Services

Integrați platformele SaaS cu Amazon SageMaker pentru a activa aplicațiile bazate pe ML | Amazon Web Services

Despre noi

Căutare verticală și Ai

Platformă

Rămâneți conectat

Cont