tencent cloud

Tencent Kubernetes Engine

소식 및 공지 사항
릴리스 노트
제품 릴리스 기록
제품 소개
제품 장점
제품 아키텍처
시나리오
제품 기능
리전 및 가용존
빠른 시작
신규 사용자 가이드
표준 클러스터를 빠르게 생성
Demo
클라우드에서 컨테이너화된 애플리케이션 배포 Check List
TKE 표준 클러스터 가이드
Tencent Kubernetes Engine(TKE)
클러스터 관리
네트워크 관리
스토리지 관리
Worker 노드 소개
Kubernetes Object Management
워크로드
클라우드 네이티브 서비스 가이드
Tencent Managed Service for Prometheus
TKE Serverless 클러스터 가이드
TKE 클러스터 등록 가이드
실습 튜토리얼
Serverless 클러스터
네트워크
로그
모니터링
유지보수
DevOps
탄력적 스케일링
자주 묻는 질문
클러스터
TKE Serverless 클러스터
유지보수
서비스
이미지 레지스트리
원격 터미널
문서Tencent Kubernetes Engine

Using qGPU Online/Offline Hybrid Deployment

포커스 모드
폰트 크기
마지막 업데이트 시간: 2024-12-24 17:20:02
This document describes how to use qGPU online/offline hybrid deployment.

Step 1. Deploy add-ons

You need to deploy nano-gpu-scheduler and nano-gpu-agent.

Deploying nano-gpu-scheduler

nano-gpu-scheduler involves ClusterRole and ClusterRoleBinding as well as Deployment and Service. Deploy it by using the following YAML. Below is the scheduling policy:
By default, online Pods are preferentially scheduled to GPU cards without offline Pods according to the spread algorithm.
By default, offline Pods are preferentially scheduled to GPU cards without online Pods according to the bin packing algorithm.
kind: Deployment
apiVersion: apps/v1
metadata:
name: qgpu-scheduler
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: qgpu-scheduler
template:
metadata:
labels:
app: qgpu-scheduler
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ''
spec:
hostNetwork: true
tolerations:
- effect: NoSchedule
operator: Exists
key: node-role.kubernetes.io/master
serviceAccount: qgpu-scheduler
containers:
- name: qgpu-scheduler
image: ccr.ccs.tencentyun.com/lionelxchen/mixed-scheduler:v61
command: ["qgpu-scheduler", "--priority=binpack"]
env:
- name: PORT
value: "12345"
resources:
limits:
memory: "800Mi"
cpu: "1"
requests:
memory: "800Mi"
cpu: "1"
---
apiVersion: v1
kind: Service
metadata:
name: qgpu-scheduler
namespace: kube-system
labels:
app: qgpu-scheduler
spec:
ports:
- port: 12345
name: http
targetPort: 12345
selector:
app: qgpu-scheduler
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: qgpu-scheduler
rules:
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- apiGroups:
- ""
resources:
- pods
verbs:
- update
- patch
- get
- list
- watch
- apiGroups:
- ""
resources:
- bindings
- pods/binding
verbs:
- create
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- list
- watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: qgpu-scheduler
namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: qgpu-scheduler
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: qgpu-scheduler
subjects:
- kind: ServiceAccount
name: qgpu-scheduler
namespace: kube-system`

Deploying nano-gpu-agent

nano-gpu-agent involves ClusterRole and ClusterRoleBinding as well as Deployment and Service. Deploy it by using the following YAML.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: qgpu-manager
namespace: kube-system
spec:
selector:
matchLabels:
app: qgpu-manager
template:
metadata:
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ""
labels:
app: qgpu-manager
spec:
serviceAccount: qgpu-manager
hostNetwork: true
nodeSelector:
qgpu-device-enable: "enable"
initContainers:
- name: qgpu-installer
image: ccr.ccs.tencentyun.com/lionelxchen/mixed-manager:v27
command: ["/usr/bin/install.sh"]
securityContext:
privileged: true
volumeMounts:
- name: host-root
mountPath: /host
containers:
- image: ccr.ccs.tencentyun.com/lionelxchen/mixed-manager:v27
command: ["/usr/bin/qgpu-manager", "--nodename=$(NODE_NAME)", "--dbfile=/host/var/lib/qgpu/meta.db"]
name: qgpu-manager
resources:
limits:
memory: "300Mi"
cpu: "1"
requests:
memory: "300Mi"
cpu: "1"
env:
- name: KUBECONFIG
value: /etc/kubernetes/kubelet.conf
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
securityContext:
privileged: true
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
- name: pod-resources
mountPath: /var/lib/kubelet/pod-resources
- name: host-var
mountPath: /host/var
- name: host-dev
mountPath: /host/dev
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
- name: pod-resources
hostPath:
path: /var/lib/kubelet/pod-resources
- name: host-var
hostPath:
type: Directory
path: /var
- name: host-dev
hostPath:
type: Directory
path: /dev
- name: host-root
hostPath:
type: Directory
path: /
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: qgpu-manager
rules:
- apiGroups:
- ""
resources:
- "*"
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- apiGroups:
- ""
resources:
- pods
verbs:
- update
- patch
- get
- list
- watch
- apiGroups:
- ""
resources:
- nodes/status
verbs:
- patch
- update
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: qgpu-manager
namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: qgpu-manager
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: qgpu-manager
subjects:
- kind: ServiceAccount
name: qgpu-manager
namespace: kube-system

Step 2. Configure the node label

All qGPU nodes in the cluster will be labeled "qgpu-device-enable=enable". In addition, you need to add the "mixed-qgpu-enable=enable" label to nodes that require online/offline deployment.

Step 3. Configure business attributes

Offline Pods
Online Pods
General Pods
You can use tke.cloud.tencent.com/app-class: offline to identify an offline Pod and use tke.cloud.tencent.com/qgpu-core-greedy to apply for computing power for it. Note that an offline Pod doesn't support multiple cards, and the computing power applied for must be no more than 100 cores.
apiVersion: v1
kind: Pod
annotations:
tke.cloud.tencent.com/app-class: offline
spec:
containers:
- name: offline-container
resources:
requests:
tke.cloud.tencent.com/qgpu-core-greedy: xx # Offline computing power
tke.cloud.tencent.com/qgpu-memory: xx
You can use tke.cloud.tencent.com/app-class: online to identify an online Pod. You need to apply for only video memory but not computing power.
apiVersion: v1
kind: Pod
annotations:
tke.cloud.tencent.com/app-class: online
spec:
containers:
- name: online-container
resources:
requests:
tke.cloud.tencent.com/qgpu-memory: xx
The tke.cloud.tencent.com/app-class annotation is not involved. A general Pod supports multiple cards.
apiVersion: v1
kind: Pod
spec:
containers:
- name: common-container
resources:
requests:
tke.cloud.tencent.com/qgpu-core: xx
tke.cloud.tencent.com/qgpu-memory: xx


도움말 및 지원

문제 해결에 도움이 되었나요?

피드백