Optimize Price-performance Of LLM Inference On NVIDIA GPUs Using The Amazon SageMaker Integration With NVIDIA NIM Microservices

Republished by Plato

NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances hosted by SageMaker.

NIM, part of the NVIDIA AI Enterprise software platform listed on AWS Marketplace, is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications. It provides natural language processing (NLP) and understanding capabilities, whether you're developing chatbots, summarizing documents, or implementing other NLP-powered applications. For quick deployment, you can use pre-built NVIDIA containers to host popular LLMs that are optimized for specific NVIDIA GPUs, or you can use NIM tools to create your own containers.

In this post, we provide a high-level introduction to NIM and show how you can use it with SageMaker.

An introduction to NVIDIA NIM

NIM provides optimized and pre-generated engines for a variety of popular models for inference. These microservices support a variety of LLMs, such as Llama 2 (7B, 13B, and 70B), Mistral-7B-Instruct, Mixtral-8x7B, NVIDIA Nemotron-3 22B Persona, and Code Llama 70B, out of the box using pre-built NVIDIA TensorRT engines tailored for specific NVIDIA GPUs for maximum performance and utilization. These models are curated with the optimal hyperparameters for model-hosting performance for deploying applications with ease.

If your model is not in NVIDIA’s set of curated models, NIM offers essential utilities such as the Model Repo Generator, which facilitates the creation of a TensorRT-LLM-accelerated engine and a NIM-format model directory through a straightforward YAML file. Furthermore, an integrated community backend of vLLM provides support for cutting-edge models and emerging features that may not have been seamlessly integrated into the TensorRT-LLM-optimized stack.

In addition to creating optimized LLMs for inference, NIM provides advanced hosting technologies, including optimized scheduling techniques such as in-flight batching, which breaks the overall text generation process for an LLM into multiple iterations on the model. With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the NIM runtime immediately evicts finished sequences from the batch. The runtime then begins running new requests while other requests are still in flight, making the best use of your compute instances and GPUs.
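To make the scheduling difference concrete, the following is a toy simulation rather than NIM's actual scheduler. It assumes every in-flight sequence produces one token per model iteration, ignores prompt processing, and takes each request's output length (a positive integer) as known in advance. The function names and the sample workload are illustrative only.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Model iterations when a batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch waits for its longest sequence.
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """Model iterations when finished sequences are evicted every iteration."""
    queue = deque(lengths)   # requests waiting to be scheduled
    active = []              # tokens still to generate per in-flight sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # Run one iteration: every active sequence emits one token.
        steps += 1
        # Evict sequences that just finished; keep the rest in flight.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: two long generations mixed with short ones.
output_lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(output_lengths, batch_size=4))    # 16
print(inflight_batching_steps(output_lengths, batch_size=4))  # 10
```

In this workload the short requests no longer hold their batch slots until the long request finishes, so the same eight requests complete in fewer model iterations.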

Deploying NIM on SageMaker

NIM integrates with SageMaker, allowing you to host your LLMs with performance and cost optimization while benefiting from the capabilities of SageMaker. When you use NIM on SageMaker, you can use capabilities such as scaling out the number of instances to host your model, performing blue/green deployments, and evaluating workloads using shadow testing, all with best-in-class observability and monitoring with Amazon CloudWatch.
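As a rough sketch of how those SageMaker capabilities map onto API requests, the helpers below only build the request dictionaries for the boto3 SageMaker client calls `create_model`, `create_endpoint_config`, `create_endpoint`, and `update_endpoint`. The post does not give deployment steps, so everything specific here is an assumption: the helper names, the container image URI, the role ARN, the instance type, the traffic weights, and the alarm name are all placeholders, and a NIM container obtained through AWS Marketplace may need additional settings.

```python
def build_endpoint_requests(name, image_uri, role_arn,
                            instance_type="ml.g5.12xlarge",
                            instance_count=2, shadow_model_name=None):
    """Build kwargs for create_model, create_endpoint_config, create_endpoint."""
    def variant(variant_name, model_name):
        return {
            "VariantName": variant_name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            # Scale out by raising the instance count behind the variant.
            "InitialInstanceCount": instance_count,
            "InitialVariantWeight": 1.0,
        }

    config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [variant("primary", name)],
    }
    if shadow_model_name:
        # Shadow testing: a second model receives a copy of production traffic.
        config["ShadowProductionVariants"] = [variant("shadow", shadow_model_name)]
    return {
        "model": {
            "ModelName": name,
            "ExecutionRoleArn": role_arn,
            "PrimaryContainer": {"Image": image_uri},
        },
        "endpoint_config": config,
        "endpoint": {"EndpointName": name,
                     "EndpointConfigName": config["EndpointConfigName"]},
    }


def build_blue_green_update(endpoint_name, new_config_name, alarm_name):
    """Build kwargs for update_endpoint using a canary blue/green rollout."""
    return {
        "EndpointName": endpoint_name,
        "EndpointConfigName": new_config_name,
        "DeploymentConfig": {
            "BlueGreenUpdatePolicy": {
                "TrafficRoutingConfiguration": {
                    "Type": "CANARY",
                    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 50},
                    "WaitIntervalInSeconds": 300,
                },
                "TerminationWaitInSeconds": 120,
            },
            # Roll back automatically if this CloudWatch alarm fires.
            "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": alarm_name}]},
        },
    }


def deploy(requests):
    """Send the requests to SageMaker (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders work without AWS access
    sm = boto3.client("sagemaker")
    sm.create_model(**requests["model"])
    sm.create_endpoint_config(**requests["endpoint_config"])
    sm.create_endpoint(**requests["endpoint"])


# Placeholder values; replace with your own image URI and IAM role.
requests = build_endpoint_requests(
    name="nim-llm-demo",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<nim-image>:<tag>",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
)
```

Keeping the request construction separate from the `deploy` call makes it possible to review or unit test the configuration before any AWS resources are created.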

Conclusion

Using NIM to deploy optimized LLMs can be a great option for both performance and cost, and it simplifies LLM deployment. In the future, NIM will also allow for Parameter-Efficient Fine-Tuning (PEFT) customization methods like LoRA and P-tuning. NIM also plans to broaden its LLM support through the Triton Inference Server, TensorRT-LLM, and vLLM backends.

We encourage you to learn more about NVIDIA microservices and how to deploy your LLMs using SageMaker and try out the benefits available to you. NIM is available as a paid offering as part of the NVIDIA AI Enterprise software subscription available on AWS Marketplace.

In the near future, we will post an in-depth guide for NIM on SageMaker.

About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures and new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimization, and making the deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Qing Lan is a Software Development Engineer at AWS. He has been working on several challenging products at Amazon, including high-performance ML inference solutions and high-performance logging systems. Qing's team successfully launched the first billion-parameter model in Amazon Advertising with very low latency. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He is passionate about distributed deep learning systems. Outside of work, he enjoys reading books, playing the guitar, and making pizza.

Harish Tummalacherla is a Software Engineer with the Deep Learning Performance team at SageMaker. He works on performance engineering for serving large language models efficiently on SageMaker. In his spare time, he enjoys running, cycling, and ski mountaineering.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA. He empowers Amazon's AI MLOps and DevOps teams, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models, spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that use NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.

Source: https://aws.amazon.com/blogs/machine-learning/optimize-price-performance-of-llm-inference-on-nvidia-gpus-using-the-amazon-sagemaker-integration-with-nvidia-nim-microservices/

Timestamp: March 20, 2021
