Data Lake Compute (DLC) supports purchasing GPU resources and mounting them to machine learning resource groups for inference and training jobs.
This document walks you through model training with GPU resources, based on the example code we provide.
Note:
Resource group: a secondary division of the computing resources within a Spark Standard Engine. A resource group belongs to its parent Standard Engine, and resource groups under the same engine share the engine's resources.
The computing units (CUs) and GPUs (number of GPU cards) of a DLC Spark Standard Engine can be allocated to multiple machine learning resource groups as needed. You can set minimum and maximum limits for CUs and GPUs per resource group to efficiently manage compute resource isolation and workloads in complex scenarios such as multi-tenant and multi-task environments.
Currently, the purchase of GPU resources, machine learning resource groups, WeData Notebook Exploration, and machine learning are all allowlist features. If needed, submit a ticket to contact the DLC and WeData teams to enable the purchase of GPU resources, machine learning resource groups, Notebook, and MLflow services.
Activating Accounts and Products
DLC account and product activation must be performed with the Tencent Cloud root account. Once the root account completes these operations, by default all sub-accounts under the root account can use the features; adjustments can be made through CAM if needed. For the detailed operation guide, see the Complete Process for New User Activation. Activation of the WeData MLflow service is also performed at root account granularity: once completed by the root account, all sub-accounts under that root account can use the service.
You need to provide your region, APPID, root account UIN, VPC ID, and subnet ID. The VPC ID and subnet ID are used for the network interconnection of the MLflow service.
Note:
Since multiple features of the product require network access, to ensure network connectivity it is recommended that subsequent operations (including purchasing execution resource groups and creating Notebook workspaces) be performed within this VPC and subnet.
Purchasing GPU Computing Resources on DLC
After the product service activation is completed, you can purchase GPU computing resources on Data Lake Compute (DLC).
2. Select "Create resource".
3. Purchase a Standard Engine with a monthly subscription as needed, select GPU as the computing type, and choose among the available machine specifications and instance counts. For model details and pricing, refer to the billing documentation.
Note:
1. Purchase should be made with the root account or an account having financial permission.
2. The billing mode for GPU computing resources currently only supports monthly subscription mode.
3. The initial launch may require several minutes of waiting after purchase. If startup cannot be completed for a long time, submit a ticket.
4. GPU computing resources do not currently support the business scenarios related to Data Exploration and Data Jobs.
Creating Machine Learning Resource Groups
After purchasing the Standard Engine, return to the Standard Engine page. You need to create a machine learning resource group under this engine before you can use machine learning features.
1. Click Manage resource group/Engine name.
2. After going to the resource group management page, click the Create resource group button in the upper-left corner.
3. Create a resource group for machine learning.
Business scenario selection: Machine learning.
Framework type: select a suitable framework based on your actual business scenario: ML open-source framework (supporting single-node computing mode) or Spark MLlib (supporting Spark cluster mode).
Note:
1. For the ML open-source framework, if you need to use GPU resources, select the built-in image tensorflow2.20-gpu-py311-cu124 or pytorch2.6-gpu-py311-cu124.
2. For the Spark MLlib framework, if you need to use GPU resources, select the built-in image spark3.5-tensorflow2.20-gpu-py311-cu124 or spark3.5-pytorch2.6-gpu-py311-cu124.
Resource configuration: Select resources as needed.
Note:
1. If you select the ML open-source framework, GPU resources support allocation in 1-card increments.
2. If you select the Spark MLlib framework, GPU resources support allocation in 1-card increments.
After the configuration is completed, click Confirm to return to the Resource group management page. After several minutes, you can click the Refresh button at the top of the list page for confirmation.
Going to the WeData-Notebook Feature for Demo Practice
After the resource group and demo dataset are created, go to WeData for model training practice with Notebook and MLflow.
Creating WeData Projects and Associating Them with DLC Engines
1. Create a project or select an existing project. For details, see Project List.
2. Select the required DLC engine in the storage and computing engine configuration.
Purchasing Execution Resource Groups and Associating Them with Projects
If you need to schedule Notebook tasks periodically in the orchestration space, purchase a Scheduling resource group and associate it with the designated project. For details, see Scheduling Resource Group Configuration.
Operation Steps:
1. Go to "Execution Resource Group > Scheduling Resource Group > Standard Scheduling Resource Group" and click Create.
2. Configure the resource group.
Region: The region where the scheduling resource group is located should be consistent with the region where the storage and computing engine is located. For example, if you purchase a DLC engine in the Singapore region of the international site, you need to purchase a scheduling resource group in the same region.
VPC and subnet: It is recommended to select the VPC and subnet in Standard-S 1.1. If other VPCs and subnets are selected, you need to ensure that the selected VPCs and subnets are interconnected with the VPCs and subnets in Standard-S 1.1.
Specifications: Select specifications according to the task volume.
3. After the Scheduling resource group is created, click Associate project in the operation column of the resource group list to associate this scheduling resource group with the desired project.
Creating Notebook Workspaces
1. In the WeData Console, go to "Project List > Project > Offline Development > Notebook Exploration" and create a workspace or use an existing one.
2. When creating a workspace, select the engine with the purchased GPU resources.
3. After creation is completed, enter the created workspace to continue subsequent operations.
Creating Notebook Files
In the left-side Resource Explorer, you can create folders and Notebook files.
Note: Notebook files must use the .ipynb extension. A built-in demo Notebook file is available in the Resource Explorer, ready to use out of the box.
Selecting Kernels
1. Click Select Kernel in the top-right corner of the corresponding Notebook file.
2. Click to select a kernel, then choose "DLC resource group" in the pop-up dropdown option.
3. Click DLC resource group, then select the resource group you created under the DLC data engine as needed. When trying out the demo provided in this document, select a resource group that uses a TensorFlow image under the ML open-source framework. The kernel naming rule is: framework type-(machine learning resource group name). When selecting, you can distinguish kernels by the name and framework type of the machine learning resource groups you created.
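As a small illustration of the naming rule above (the exact display string may vary by console version, and the group name `my-ml-group` here is a hypothetical example), a kernel label can be split back into its framework type and resource group name:

```python
def parse_kernel_name(kernel_name):
    """Split a kernel label of the assumed form
    'framework type-(resource group name)' into its two parts."""
    framework, _, remainder = kernel_name.partition("-(")
    group = remainder.rstrip(")")
    return framework, group

# Hypothetical kernel label as it might appear in the kernel picker.
framework, group = parse_kernel_name("ML open-source framework-(my-ml-group)")
print(framework)  # ML open-source framework
print(group)      # my-ml-group
```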
Running Notebook Files
1. After selecting the kernel, refresh the page. In the corresponding Notebook file, click Run All or run a code block individually; the Kernel Configuration window will then pop up, where you can edit or confirm the initial configuration.
2. Hands-on tutorial: use TensorFlow matrix multiplication to stress-test the GPU, with integrated real-time resource monitoring and output of computing performance metrics.
import tensorflow as tf
import time
import subprocess
import threading

def setup_gpu():
    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        raise RuntimeError("No GPU detected")
    for gpu in gpus:
        # Allocate GPU memory on demand instead of reserving it all upfront.
        tf.config.experimental.set_memory_growth(gpu, True)
    print(f"GPUs detected: {len(gpus)}")

def monitor_gpu(interval=5, duration=30):
    # Poll nvidia-smi for utilization, memory usage, and temperature.
    end_time = time.time() + duration
    while time.time() < end_time:
        try:
            result = subprocess.run(
                ['nvidia-smi',
                 '--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu',
                 '--format=csv,noheader,nounits'],
                capture_output=True, text=True, check=True)
            gpu_util, mem_used, mem_total, temp = result.stdout.strip().split(', ')
            print(f"GPU: {gpu_util}% | Mem: {mem_used}/{mem_total}MB | Temp: {temp}C")
        except Exception as e:
            print(f"Monitor error: {e}")
        time.sleep(interval)

def run_gpu_stress(size=4096, duration=30):
    with tf.device("/GPU:0"):
        a = tf.random.normal([size, size], dtype=tf.float32)
        b = tf.random.normal([size, size], dtype=tf.float32)

        @tf.function
        def matmul_step():
            return tf.matmul(a, b)

        _ = matmul_step()  # warm-up run to trigger graph tracing
        print(f"Running GPU stress for {duration}s at full capacity")
        start = time.time()
        iters = 0
        while time.time() - start < duration:
            _ = matmul_step()
            iters += 1
        print(f"Completed {iters} iterations in {time.time() - start:.2f}s")

if __name__ == "__main__":
    try:
        setup_gpu()
        # Run the monitor in a background thread alongside the stress test.
        monitor_thread = threading.Thread(target=monitor_gpu, args=(5, 30))
        monitor_thread.start()
        run_gpu_stress(size=4096, duration=30)
        monitor_thread.join()
        print("Test completed successfully")
    except Exception as e:
        print(f"Error: {e}")
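The stress test above reports only a raw iteration count. As a minimal sketch (the `matmul_gflops` helper below is not part of the demo), that count can be converted into approximate throughput, using the standard 2·n³ floating-point-operation estimate for an n×n matrix multiplication:

```python
def matmul_gflops(size, iterations, elapsed_seconds):
    """Approximate throughput in GFLOPS for repeated size x size matrix
    multiplications: each matmul costs about 2 * size^3 FLOPs."""
    total_flops = 2 * (size ** 3) * iterations
    return total_flops / elapsed_seconds / 1e9

# Hypothetical figures: 500 iterations of a 4096 x 4096 matmul in 30 seconds.
print(f"{matmul_gflops(4096, 500, 30.0):.1f} GFLOPS")  # 2290.6 GFLOPS
```

Note that because TensorFlow dispatches GPU kernels asynchronously, timings taken this way are only indicative; for a stricter measurement you would synchronize (for example, by materializing the result with `.numpy()`) before stopping the clock.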