Enable pod-based GPU metrics in Amazon CloudWatch

Republished by Plato

In February 2022, Amazon Web Services added support for NVIDIA GPU metrics in Amazon CloudWatch, making it possible to push metrics from the Amazon CloudWatch agent to Amazon CloudWatch and monitor your code for optimal GPU utilization. Since then, this feature has been integrated into many of our managed Amazon Machine Images (AMIs), such as the Deep Learning AMI and the AWS ParallelCluster AMI. To obtain instance-level GPU utilization metrics, you can use Packer or Amazon ImageBuilder to bootstrap your own custom AMI and use it in various managed service offerings such as AWS Batch, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS). However, for many container-based service offerings and workloads, it is ideal to capture utilization metrics at the container, pod, or namespace level.

This post details how to set up container-based GPU metrics and provides an example of collecting these metrics from EKS pods.

Solution overview

To demonstrate container-based GPU metrics, we create an EKS cluster with g5.2xlarge instances; however, this works with any supported NVIDIA accelerated instance family.

We deploy the NVIDIA GPU Operator to enable use of GPU resources and the NVIDIA DCGM Exporter to enable GPU metrics collection. Then we explore two architectures. The first connects the metrics from NVIDIA DCGM Exporter to CloudWatch via a CloudWatch agent, as shown in the following diagram.

GPU monitoring architecture with CloudWatch

The second architecture (see the following diagram) connects the metrics from DCGM Exporter to Prometheus; we then use a Grafana dashboard to visualize those metrics.

GPU monitoring architecture with Grafana

Prerequisites

To simplify reproducing the entire stack from this post, we use a container that has all the required tooling (aws cli, eksctl, helm, and so on) already installed. To clone the container project from GitHub, you need git. To build and run the container, you need Docker. To deploy the architecture, you need AWS credentials. To enable access to Kubernetes services using port forwarding, you also need kubectl.

These prerequisites can be installed on your local machine, on an EC2 instance with NICE DCV, or on AWS Cloud9. In this post, we use a c5.2xlarge Cloud9 instance with a 40 GB local storage volume. When using Cloud9, disable AWS managed temporary credentials by visiting Cloud9->Preferences->AWS Settings, as shown in the screenshot below.

Build and run the aws-do-eks container

Open a terminal shell in your preferred environment and run the following commands:

git clone https://github.com/aws-samples/aws-do-eks
cd aws-do-eks
./build.sh
./run.sh
./exec.sh

The output is as follows:

root@e5ecb162812f:/eks#

You now have a shell in a container environment that has all the tools needed to complete the following tasks. We refer to it as the "aws-do-eks shell". Run the commands in the following sections in this shell unless instructed otherwise.

Create an EKS cluster with a node group

This group includes a GPU instance family of your choice; in this example, we use the g5.2xlarge instance type.

The aws-do-eks project comes with a collection of cluster configurations. You can set your desired cluster configuration with a single configuration change.

In the container shell, run ./env-config.sh and then set CONF=conf/eksctl/yaml/eks-gpu-g5.yaml
To verify the cluster configuration, run ./eks-config.sh

You should see the following cluster manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-g5
  version: "1.25"
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
managedNodeGroups:
  - name: sys
    instanceType: m5.xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
  - name: g5
    instanceType: g5.2xlarge
    instancePrefix: g5-2xl
    privateNetworking: true
    efaEnabled: false
    minSize: 0
    desiredCapacity: 1
    maxSize: 10
    volumeSize: 80
    iam:
      withAddonPolicies:
        cloudWatch: true
iam:
  withOIDC: true

To create the cluster, run the following command in the container:

./eks-create.sh

The output is as follows:

root@e5ecb162812f:/eks# ./eks-create.sh
/eks/impl/eksctl/yaml /eks
./eks-create.sh
Mon May 22 20:50:59 UTC 2023
Creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml ...
eksctl create cluster -f /eks/conf/eksctl/yaml/eks-gpu-g5.yaml
2023-05-22 20:50:59 [ℹ] eksctl version 0.133.0
2023-05-22 20:50:59 [ℹ] using region us-east-1
2023-05-22 20:50:59 [ℹ] subnets for us-east-1a - public:192.168.0.0/19 private:192.168.128.0/19
2023-05-22 20:50:59 [ℹ] subnets for us-east-1b - public:192.168.32.0/19 private:192.168.160.0/19
2023-05-22 20:50:59 [ℹ] subnets for us-east-1c - public:192.168.64.0/19 private:192.168.192.0/19
2023-05-22 20:50:59 [ℹ] subnets for us-east-1d - public:192.168.96.0/19 private:192.168.224.0/19
2023-05-22 20:50:59 [ℹ] nodegroup "sys" will use "" [AmazonLinux2/1.25]
2023-05-22 20:50:59 [ℹ] nodegroup "g5" will use "" [AmazonLinux2/1.25]
2023-05-22 20:50:59 [ℹ] using Kubernetes version 1.25
2023-05-22 20:50:59 [ℹ] creating EKS cluster "do-eks-yaml-g5" in "us-east-1" region with managed nodes
2023-05-22 20:50:59 [ℹ] 2 nodegroups (g5, sys) were included (based on the include/exclude rules)
2023-05-22 20:50:59 [ℹ] will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
2023-05-22 20:50:59 [ℹ] will create a CloudFormation stack for cluster itself and 2 managed nodegroup stack(s)
2023-05-22 20:50:59 [ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-1 --cluster=do-eks-yaml-g5'
2023-05-22 20:50:59 [ℹ] Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "do-eks-yaml-g5" in "us-east-1"
2023-05-22 20:50:59 [ℹ] CloudWatch logging will not be enabled for cluster "do-eks-yaml-g5" in "us-east-1"
2023-05-22 20:50:59 [ℹ] you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-east-1 --cluster=do-eks-yaml-g5'
2023-05-22 20:50:59 [ℹ] 2 sequential tasks: { create cluster control plane "do-eks-yaml-g5", 2 sequential sub-tasks: { 4 sequential sub-tasks: { wait for control plane to become ready, associate IAM OIDC provider, 2 sequential sub-tasks: { create IAM role for serviceaccount "kube-system/aws-node", create serviceaccount "kube-system/aws-node", }, restart daemonset "kube-system/aws-node", }, 2 parallel sub-tasks: { create managed nodegroup "sys", create managed nodegroup "g5", }, } }
2023-05-22 20:50:59 [ℹ] building cluster stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:51:00 [ℹ] deploying stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:51:30 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:52:00 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:53:01 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:54:01 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:55:01 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:56:02 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:57:02 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:58:02 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 20:59:02 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:00:03 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:01:03 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:02:03 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:03:04 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-cluster"
2023-05-22 21:05:07 [ℹ] building iamserviceaccount stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:10 [ℹ] deploying stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:10 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:40 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node"
2023-05-22 21:05:40 [ℹ] serviceaccount "kube-system/aws-node" already exists
2023-05-22 21:05:41 [ℹ] updated serviceaccount "kube-system/aws-node"
2023-05-22 21:05:41 [ℹ] daemonset "kube-system/aws-node" restarted
2023-05-22 21:05:41 [ℹ] building managed nodegroup stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:41 [ℹ] building managed nodegroup stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:05:42 [ℹ] deploying stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:42 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:05:42 [ℹ] deploying stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:05:42 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:06:12 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:06:12 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:06:55 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:07:11 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:08:29 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:08:45 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-sys"
2023-05-22 21:09:52 [ℹ] waiting for CloudFormation stack "eksctl-do-eks-yaml-g5-nodegroup-g5"
2023-05-22 21:09:53 [ℹ] waiting for the control plane to become ready
2023-05-22 21:09:53 [✔] saved kubeconfig as "/root/.kube/config"
2023-05-22 21:09:53 [ℹ] 1 task: { install Nvidia device plugin }
W0522 21:09:54.155837 1668 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
2023-05-22 21:09:54 [ℹ] created "kube-system:DaemonSet.apps/nvidia-device-plugin-daemonset"
2023-05-22 21:09:54 [ℹ] as you are using the EKS-Optimized Accelerated AMI with a GPU-enabled instance type, the Nvidia Kubernetes device plugin was automatically installed. to skip installing it, use --install-nvidia-plugin=false.
2023-05-22 21:09:54 [✔] all EKS cluster resources for "do-eks-yaml-g5" have been created
2023-05-22 21:09:54 [ℹ] nodegroup "sys" has 1 node(s)
2023-05-22 21:09:54 [ℹ] node "ip-192-168-18-137.ec2.internal" is ready
2023-05-22 21:09:54 [ℹ] waiting for at least 1 node(s) to become ready in "sys"
2023-05-22 21:09:54 [ℹ] nodegroup "sys" has 1 node(s)
2023-05-22 21:09:54 [ℹ] node "ip-192-168-18-137.ec2.internal" is ready
2023-05-22 21:09:55 [ℹ] kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2023-05-22 21:09:55 [✔] EKS cluster "do-eks-yaml-g5" in "us-east-1" region is ready
Mon May 22 21:09:55 UTC 2023
Done creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml
/eks

To verify that the cluster was created successfully, run the following command:

kubectl get nodes -L node.kubernetes.io/instance-type

The output is similar to the following:

NAME STATUS ROLES AGE VERSION INSTANCE_TYPE
ip-192-168-18-137.ec2.internal Ready <none> 47m v1.25.9-eks-0a21954 m5.xlarge
ip-192-168-214-241.ec2.internal Ready <none> 46m v1.25.9-eks-0a21954 g5.2xlarge

In this example, we have one m5.xlarge and one g5.2xlarge instance in our cluster; therefore, we see two nodes listed in the preceding output.
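If you prefer to script this check, one option is to filter the node listing by instance type. The following is a minimal sketch and not part of the aws-do-eks project: the function name nodes_of_type is hypothetical, and the sample text is copied from the preceding output so the snippet runs without a cluster.

```shell
# Hypothetical helper (not part of aws-do-eks): print the names of Ready nodes
# whose last column (INSTANCE_TYPE) equals the requested instance type.
# Expects the output of
#   kubectl get nodes -L node.kubernetes.io/instance-type
# on stdin.
nodes_of_type() {
  awk -v want="$1" 'NR > 1 && $2 == "Ready" && $NF == want { print $1 }'
}

# Sample copied from the listing above, so no cluster is needed to try it.
sample='NAME STATUS ROLES AGE VERSION INSTANCE_TYPE
ip-192-168-18-137.ec2.internal Ready <none> 47m v1.25.9-eks-0a21954 m5.xlarge
ip-192-168-214-241.ec2.internal Ready <none> 46m v1.25.9-eks-0a21954 g5.2xlarge'

gpu_nodes="$(printf '%s\n' "$sample" | nodes_of_type g5.2xlarge)"
echo "$gpu_nodes"
```

Against a live cluster, pipe the kubectl command into the function instead of the sample text.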

During cluster creation, the NVIDIA device plugin is installed. You need to remove it after the cluster is created because we use the NVIDIA GPU Operator instead.

Delete the plugin with the following command:

kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset

We get the following output:

daemonset.apps "nvidia-device-plugin-daemonset" deleted

Install the NVIDIA Helm repo

Install the NVIDIA Helm repo with the following command:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

Deploy the DCGM exporter with the NVIDIA GPU Operator

To deploy the DCGM exporter, complete the following steps:

Prepare the DCGM exporter GPU metrics configuration

curl https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv > dcgm-metrics.csv

You have the option to edit the dcgm-metrics.csv file. You can add or remove any metrics as needed.

Create the gpu-operator namespace and the DCGM exporter ConfigMap

kubectl create namespace gpu-operator && \
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

The output is as follows:

namespace/gpu-operator created
configmap/metrics-config created

Apply the GPU Operator to the EKS cluster

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set dcgmExporter.config.name=metrics-config --set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS --set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv --set toolkit.enabled=false

The output is as follows:

NAME: gpu-operator-1684795140
LAST DEPLOYED: Day Month Date HH:mm:ss YYYY
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Confirm that the DCGM exporter pod is running

kubectl -n gpu-operator get pods | grep dcgm

The output is as follows:

nvidia-dcgm-exporter-lkmfr       1/1     Running    0   1m

If you check the logs, you should see the "Starting webserver" message:

kubectl -n gpu-operator logs -f $(kubectl -n gpu-operator get pods | grep dcgm | cut -d ' ' -f 1)

The output is as follows:

Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
time="2023-05-22T22:40:08Z" level=info msg="Starting dcgm-exporter"
time="2023-05-22T22:40:08Z" level=info msg="DCGM successfully initialized!"
time="2023-05-22T22:40:08Z" level=info msg="Collecting DCP Metrics"
time="2023-05-22T22:40:08Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-05-22T22:40:08Z" level=info msg="Initializing system entities of type: GPU"
time="2023-05-22T22:40:09Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2023-05-22T22:40:09Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2023-05-22T22:40:09Z" level=info msg="Initializing system entities of type: NvLink"
time="2023-05-22T22:40:09Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2023-05-22T22:40:09Z" level=info msg="Kubernetes metrics collection enabled!"
time="2023-05-22T22:40:09Z" level=info msg="Pipeline starting"
time="2023-05-22T22:40:09Z" level=info msg="Starting webserver"
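To look for this message in a script rather than by eye, you could search the log text for it. The following is a minimal sketch; exporter_started is a hypothetical helper name, and the sample lines are taken from the preceding log so the snippet runs without a cluster.

```shell
# Hypothetical helper: succeed (exit status 0) if the log text on stdin
# contains the "Starting webserver" message from dcgm-exporter.
exporter_started() {
  grep -q 'msg="Starting webserver"'
}

# Two lines copied from the log output above.
sample_log='time="2023-05-22T22:40:09Z" level=info msg="Pipeline starting"
time="2023-05-22T22:40:09Z" level=info msg="Starting webserver"'

if printf '%s\n' "$sample_log" | exporter_started; then
  status="started"
else
  status="not started"
fi
echo "$status"
```

With a live cluster, feed the function the output of kubectl logs without the -f flag, so that the command returns instead of following the log.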

NVIDIA DCGM Exporter exposes a Prometheus metrics endpoint, which can be ingested by the CloudWatch agent. To see the endpoint, use the following command:

kubectl -n gpu-operator get services | grep dcgm

We get the following output:

nvidia-dcgm-exporter    ClusterIP   10.100.183.207   <none>   9400/TCP   10m

To generate some GPU utilization, we deploy a pod that runs the gpu-burn binary:

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/gpu-metrics/gpu-burn-deployment.yaml

The output is as follows:

deployment.apps/gpu-burn created

This deployment uses a single GPU to produce a continuous pattern of 100% utilization for 20 seconds followed by 0% utilization for 20 seconds.

To make sure the endpoint works, you can run a temporary container that uses curl to read the content of http://nvidia-dcgm-exporter:9400/metrics:

kubectl -n gpu-operator run -it --rm curl --restart='Never' --image=curlimages/curl --command -- curl http://nvidia-dcgm-exporter:9400/metrics

We get the following output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 1455
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 6250
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 65
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 299.437000
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 15782796862
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 100
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 38
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 2230
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 20501
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.808369
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 0.315787
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES gauge
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 3985328
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES gauge
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 21715174
pod "curl" deleted
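Each sample line in this output carries namespace and pod labels, which is what makes pod-level reporting possible. As an illustration, the following sketch pulls the pod and value for a single metric out of the exposition text. The helper name metric_by_pod is hypothetical, the label set in the sample is abbreviated from the preceding output, and the sed pattern assumes the namespace label appears before the pod label, as it does in that output.

```shell
# Hypothetical helper: for the metric named in $1, print
# "<namespace>/<pod> <value>" for each sample line read from stdin
# (Prometheus text exposition format).
metric_by_pod() {
  # grep keeps only sample lines for the metric and drops "# HELP"/"# TYPE";
  # sed extracts the namespace and pod labels and the trailing value.
  grep "^$1{" |
    sed -E 's|.*namespace="([^"]*)".*pod="([^"]*)".*\} *([^ ]+)$|\1/\2 \3|'
}

# Sample lines in the shape shown above (labels abbreviated).
sample='# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",device="nvidia0",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 100
DCGM_FI_DEV_GPU_TEMP{gpu="0",device="nvidia0",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 65'

result="$(printf '%s\n' "$sample" | metric_by_pod DCGM_FI_DEV_GPU_UTIL)"
echo "$result"
```

To try it against the live endpoint, pipe the output of the curl command above into metric_by_pod instead of the sample text.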

Configure and deploy the CloudWatch agent

To configure and deploy the CloudWatch agent, complete the following steps:

Download the YAML file and edit it

curl -O https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/k8s/1.3.15/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-eks.yaml

The file contains a cwagent configmap and a prometheus configmap. For this post, we edit both.

Edit the prometheus-eks.yaml file

Open the prometheus-eks.yaml file in your favorite editor and replace the cwagentconfig.json section with the following content:

apiVersion: v1
data:
  # cwagent json config
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration": [
                {
                  "source_labels": ["Service"],
                  "label_matcher": ".*dcgm.*",
                  "dimensions": [["Service","Namespace","ClusterName","job","pod"]],
                  "metric_selectors": [
                    "^DCGM_FI_DEV_GPU_UTIL$",
                    "^DCGM_FI_DEV_DEC_UTIL$",
                    "^DCGM_FI_DEV_ENC_UTIL$",
                    "^DCGM_FI_DEV_MEM_CLOCK$",
                    "^DCGM_FI_DEV_MEM_COPY_UTIL$",
                    "^DCGM_FI_DEV_POWER_USAGE$",
                    "^DCGM_FI_DEV_ROW_REMAP_FAILURE$",
                    "^DCGM_FI_DEV_SM_CLOCK$",
                    "^DCGM_FI_DEV_XID_ERRORS$",
                    "^DCGM_FI_PROF_DRAM_ACTIVE$",
                    "^DCGM_FI_PROF_GR_ENGINE_ACTIVE$",
                    "^DCGM_FI_PROF_PCIE_RX_BYTES$",
                    "^DCGM_FI_PROF_PCIE_TX_BYTES$",
                    "^DCGM_FI_PROF_PIPE_TENSOR_ACTIVE$"
                  ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }
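The metric_selectors entries are anchored regular expressions, so each one matches exactly one metric name, and a metric that is not listed is not published as a CloudWatch metric. The following sketch illustrates that matching behavior on a few names. It is an approximation under stated assumptions: it uses grep -E as a stand-in for the agent's own regular expression engine, it lists only three of the selectors, and is_selected is a hypothetical helper name.

```shell
# Three of the anchored selectors from the configuration above.
selectors='^DCGM_FI_DEV_GPU_UTIL$
^DCGM_FI_DEV_POWER_USAGE$
^DCGM_FI_PROF_DRAM_ACTIVE$'

# Hypothetical helper: exit status 0 if the metric name in $1 matches any
# selector. The selectors are joined with "|" into one extended regex.
is_selected() {
  combined="$(printf '%s' "$selectors" | tr '\n' '|')"
  printf '%s\n' "$1" | grep -Eq -- "$combined"
}

# DCGM_FI_DEV_GPU_TEMP is exported by DCGM but is not in the selector list.
for name in DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_GPU_TEMP; do
  if is_selected "$name"; then
    echo "$name: matches a selector"
  else
    echo "$name: no selector matches"
  fi
done
```

Because of the ^ and $ anchors, a longer name that merely contains a listed name does not match, so adding a metric to CloudWatch means adding its own selector line.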

In the prometheus config section, append the following job definition for the DCGM exporter:

- job_name: 'kubernetes-pod-dcgm-exporter'
  sample_limit: 10000
  metrics_path: /api/v1/metrics/prometheus
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: keep
    regex: '^DCGM.*$'
  - source_labels: [__address__]
    action: replace
    regex: ([^:]+)(?::\d+)?
    replacement: ${1}:9400
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: Namespace
  - source_labels: [__meta_kubernetes_pod]
    action: replace
    target_label: pod
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    target_label: container_name
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_controller_name
    target_label: pod_controller_name
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_controller_kind
    target_label: pod_controller_kind
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_phase
    target_label: pod_phase
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_node_name
    target_label: NodeName
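The __address__ rule in this job keeps the host part of each discovered target and replaces any port with 9400, the port on which the DCGM exporter service listens. The following sketch imitates that one rule with sed so you can see its effect on sample addresses. It is only an approximation: Prometheus applies its own regular expression engine and anchors the pattern itself, the sed pattern is an equivalent written in POSIX extended syntax, rewrite_address is a hypothetical name, and the addresses are illustrative.

```shell
# Hypothetical helper: mimic the __address__ relabel rule by keeping the host
# and forcing the port to 9400, whether or not a port was present.
rewrite_address() {
  printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?$/\1:9400/'
}

# Illustrative targets: one discovered with a port, one without.
with_port="$(rewrite_address 192.168.214.241:8080)"
without_port="$(rewrite_address 192.168.214.241)"
echo "$with_port"
echo "$without_port"
```

Both calls yield the same host with port 9400, which is why the job scrapes the exporter port regardless of which container port service discovery reported.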

Save the file and apply the cwagent-dcgm configuration to your cluster

kubectl apply -f ./prometheus-eks.yaml

We get the following output:

namespace/amazon-cloudwatch created
configmap/prometheus-cwagentconfig created
configmap/prometheus-config created
serviceaccount/cwagent-prometheus created
clusterrole.rbac.authorization.k8s.io/cwagent-prometheus-role created
clusterrolebinding.rbac.authorization.k8s.io/cwagent-prometheus-role-binding created
deployment.apps/cwagent-prometheus created

Confirm that the CloudWatch agent pod is running

kubectl -n amazon-cloudwatch get pods

We get the following output:

NAME READY STATUS RESTARTS AGE
cwagent-prometheus-7dfd69cc46-s4cx7 1/1 Running 0 15m

Visualize metrics on the CloudWatch console

To visualize the metrics in CloudWatch, complete the following steps:

On the CloudWatch console, under Metrics in the navigation pane, choose All metrics
In the Custom namespaces section, choose the new entry for ContainerInsights/Prometheus

For more information about the ContainerInsights/Prometheus namespace, refer to Scraping additional Prometheus sources and importing those metrics.

CloudWatch - ContainerInsights/Prometheus

Drill down to the metric names and choose DCGM_FI_DEV_GPU_UTIL
On the Graphed metrics tab, set Period to 5 seconds

CloudWatch - Period setting

Set the refresh interval to 10 seconds

You will see the metrics collected from the DCGM exporter, which show the gpu-burn pattern switching on and off every 20 seconds.

CloudWatch - gpu-burn pattern

On the Browse tab, you can see the data, including the pod name for each metric.

CloudWatch - pod name for the metric

The EKS API metadata has been combined with the DCGM metrics data, resulting in the pod-based GPU metrics shown.

This concludes the first approach of exporting DCGM metrics to CloudWatch via the CloudWatch agent.

In the next section, we configure the second architecture, which exports the DCGM metrics to Prometheus, and we visualize them with Grafana.

Use Prometheus and Grafana to visualize GPU metrics from DCGM

Complete the following steps:

Add the Prometheus community helm chart

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

This chart deploys both Prometheus and Grafana. We need to make some edits to the chart before running the install command.

Save the chart configuration values to a file in /tmp

helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values

Edit the chart configuration file

Edit the saved file (/tmp/kube-prometheus-stack.values) and set the following option by searching for the setting name and setting its value:

prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

Add the following ConfigMap to the additionalScrapeConfigs section

additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

Deploy the Prometheus stack with the updated values

helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values /tmp/kube-prometheus-stack.values

We get the following output:

NAME: kube-prometheus-stack-1684965548
LAST DEPLOYED: Wed May 24 21:59:14 2023
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1684965548"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

Confirm that the Prometheus pods are running

kubectl get pods -n prometheus

We get the following output:

NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-1684-alertmanager-0 2/2 Running 0 6m55s
kube-prometheus-stack-1684-operator-6c87649878-j7v55 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-grafana-dcd7b4c96-bzm8p 3/3 Running 0 6m58s
kube-prometheus-stack-1684965548-kube-state-metrics-7d856dptlj5 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-2fbl5 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-m7zmv 1/1 Running 0 6m58s
prometheus-kube-prometheus-stack-1684-prometheus-0 2/2 Running 0 6m55s

The Prometheus and Grafana pods are in the Running state.
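Reading the STATUS column by eye works for a handful of pods. As a rough sketch, assuming the default kubectl get pods column layout shown above (NAME, READY, STATUS, RESTARTS, AGE), a small filter can list only the pods that are not yet running. The helper name not_running is hypothetical.

```shell
# not_running
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints the
# name of every pod whose STATUS column is not "Running". It prints nothing
# when all pods are running. Note that it also flags pods in other healthy
# terminal states such as Completed.
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}
```

For example, kubectl get pods -n prometheus | not_running should print nothing once the stack is up. If you prefer to block until the pods are ready, kubectl wait --for=condition=Ready pods --all -n prometheus --timeout=300s is an alternative.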

Next, we validate that DCGM metrics are flowing into Prometheus.

Port-forward the Prometheus UI

There are different ways to expose the Prometheus UI running in EKS to requests originating outside of the cluster. We will use kubectl port-forwarding. So far, we have been executing commands inside the aws-do-eks container. To access the Prometheus service running in the cluster, we will create a tunnel from the host. The host is where the aws-do-eks container is running, so run the following command outside of the container, in a new terminal shell on the host. We will refer to this as the "host shell".

kubectl -n prometheus port-forward svc/$(kubectl -n prometheus get svc | grep prometheus | grep -v alertmanager | grep -v operator | grep -v grafana | grep -v metrics | grep -v exporter | grep -v operated | cut -d ' ' -f 1) 8080:9090 &
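The inner command in the one-liner above picks the Prometheus service by elimination: it keeps service names that contain "prometheus" and drops the Alertmanager, operator, Grafana, metrics, exporter, and "operated" services. The sketch below isolates that filter so you can inspect its result before forwarding. The helper name pick_prometheus_svc is hypothetical, and the single grep -Ev is an equivalent shorthand for the chain of grep -v calls.

```shell
# pick_prometheus_svc
# Hypothetical helper: reads `kubectl get svc` output on stdin and prints the
# name of the Prometheus service, using the same elimination logic as the
# port-forward one-liner.
pick_prometheus_svc() {
  grep prometheus \
    | grep -Ev 'alertmanager|operator|grafana|metrics|exporter|operated' \
    | cut -d ' ' -f 1
}
```

For example, kubectl -n prometheus get svc | pick_prometheus_svc should print exactly one service name. If it prints none or several, the release naming in your cluster differs from what the filter expects and the port-forward command needs adjusting.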

While the port-forwarding process is running, we can access the Prometheus UI from the host as described below.

Open the Prometheus UI
- If you are using Cloud9, navigate to Preview->Preview Running Application to open the Prometheus UI in a tab inside the Cloud9 IDE, then choose the pop-out icon in the upper-right corner of the tab to open it in a new window.
- If you are on your local host or connected to an EC2 instance via remote desktop, open a browser and visit the URL http://localhost:8080.

Prometheus - DCGM metrics

Enter DCGM to see the DCGM metrics that are flowing into Prometheus
Select DCGM_FI_DEV_GPU_UTIL, choose Execute, and then navigate to the Graph tab to see the expected GPU utilization pattern

Prometheus - gpu-burn pattern
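The same metric can also be checked from the host shell without the browser, using the Prometheus HTTP API endpoint /api/v1/query. The following is a minimal sketch that assumes the port-forward to local port 8080 from the previous step is still running. The helper name prom_query is hypothetical.

```shell
# prom_query EXPR [BASE_URL]
# Hypothetical helper: sends an instant PromQL query to the Prometheus HTTP API
# and prints the JSON response. BASE_URL defaults to the forwarded local port.
prom_query() {
  # -G turns the URL-encoded data into a query string on a GET request.
  curl -sG --data-urlencode "query=$1" "${2:-http://localhost:8080}/api/v1/query"
}
```

For example, prom_query 'DCGM_FI_DEV_GPU_UTIL' returns the current samples, and prom_query 'avg by (pod) (DCGM_FI_DEV_GPU_UTIL)' groups them by the pod label that is used later in the Grafana legend.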

Stop the Prometheus port-forwarding process

Run the following command line in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')
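The command above finds the process by searching the process list. As an alternative sketch, not the approach used in this post, you can record the process ID when the port-forward starts and stop exactly that process later. The helper names start_forward and stop_forward, and the PID file path in the usage example, are hypothetical.

```shell
# start_forward PIDFILE COMMAND...
# Hypothetical helper: runs COMMAND in the background and saves its PID.
start_forward() {
  pidfile=$1
  shift
  "$@" &
  echo "$!" > "$pidfile"
}

# stop_forward PIDFILE
# Hypothetical helper: stops the process recorded in PIDFILE with a regular
# SIGTERM (rather than kill -9) and removes the file.
stop_forward() {
  pidfile=$1
  [ -f "$pidfile" ] || return 0
  pid=$(cat "$pidfile")
  kill "$pid" 2>/dev/null || true
  # Reap the process if it is a child of this shell.
  wait "$pid" 2>/dev/null || true
  rm -f "$pidfile"
}
```

For example, start_forward /tmp/prometheus-pf.pid kubectl -n prometheus port-forward svc/<service-name> 8080:9090 starts the tunnel, where <service-name> stands for the Prometheus service in your cluster, and stop_forward /tmp/prometheus-pf.pid stops it from the same shell.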

Now we can visualize the DCGM metrics via the Grafana dashboard.

Retrieve the password to log in to the Grafana UI

kubectl -n prometheus get secret $(kubectl -n prometheus get secrets | grep grafana | cut -d ' ' -f 1) -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
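The base64 --decode step is needed because Kubernetes stores the values under a Secret's data field base64-encoded, so the jsonpath expression returns an encoded string rather than the password itself. The short sketch below shows the decoding step in isolation. The helper name decode_secret_value is hypothetical, and the encoded string in the usage example is only a sample, not necessarily the value your cluster returns.

```shell
# decode_secret_value
# Hypothetical helper: reads a base64-encoded Secret value on stdin and prints
# the decoded text followed by a newline.
decode_secret_value() {
  base64 --decode
  echo
}
```

For example, printf '%s' 'cHJvbS1vcGVyYXRvcg==' | decode_secret_value prints the decoded sample string.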

Port-forward the Grafana service

Run the following command line in your host shell:

kubectl port-forward -n prometheus svc/$(kubectl -n prometheus get svc | grep grafana | cut -d ' ' -f 1) 8080:80 &

Log in to the Grafana UI

Access the Grafana UI login screen the same way you accessed the Prometheus UI earlier. If you are using Cloud9, select Preview->Preview Running Application, then open it in a new window. If you are using your local host or an EC2 instance with remote desktop, visit the URL http://localhost:8080. Log in with the user name admin and the password you retrieved earlier.

Grafana - login

In the navigation pane, choose Dashboards

Grafana - dashboards

Choose New and Import

Grafana - load by ID from grafana.com
We are going to import the default DCGM Grafana dashboard described in NVIDIA DCGM Exporter Dashboard.

In the field import via grafana.com, enter 12239 and choose Load
Choose Prometheus as the data source
Choose Import

Grafana - import dashboard

You will see a dashboard similar to the one in the following screenshot.

Grafana - dashboard

To demonstrate that these metrics are pod-based, we are going to modify the GPU Utilization pane in this dashboard.

Choose the pane and the options menu (three dots)
Expand the Options section and edit the Legend field
Replace the value there with Pod {{pod}}, then choose Save

Grafana - pod-based metric
The legend now shows the gpu-burn pod name associated with the displayed GPU utilization.

Stop port-forwarding the Grafana UI service

Run the following in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')

In this post, we demonstrated using open-source Prometheus and Grafana deployed to the EKS cluster. If desired, this deployment can be substituted with Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Clean up

To clean up the resources you created, run the following script from the aws-do-eks container shell:

./eks-delete.sh

Conclusion

In this post, we used NVIDIA DCGM Exporter to collect GPU metrics and visualize them with either CloudWatch or Prometheus and Grafana. We invite you to use the architectures demonstrated here to enable GPU utilization monitoring with NVIDIA DCGM in your own AWS environment.

Additional resources

About the authors

Amr Ragab is a former Principal Solutions Architect, EC2 Accelerated Computing at AWS. He is devoted to helping customers run computational workloads at scale. In his spare time, he likes traveling and finding new ways to integrate technology into daily life.

Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He is a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open-source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges. During the past 10 years, Alex has worked on democratizing AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.

Keita Watanabe is a Senior Solutions Architect of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry's best cloud-based self-managed machine learning solutions. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the e-commerce industry. Keita holds a Ph.D. in Science from the University of Tokyo.

Source: https://aws.amazon.com/blogs/machine-learning/enable-pod-based-gpu-metrics-in-amazon-cloudwatch/