Built-In Large Model Training Data Format Description

Last updated: 2025-09-26 21:12:13

Overview

In Task-Based Modeling, the built-in large model fine-tuning module uses platform training templates, so you only need to prepare a training dataset to launch fine-tuning with one click. This document describes the data format requirements for supervised fine-tuning (SFT) and pretraining (PT).
Note:
The platform supports dataset files in JSONL format (each line of a JSONL file is one data sample). Files must be UTF-8 encoded. If the training data path contains multiple files, they are merged for training.
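Before uploading, it can help to check locally that every file parses as JSONL and to see what the merged dataset would look like. The following is a minimal sketch, not the platform's actual loader: the function name `load_jsonl_dir`, the `*.jsonl` glob, the alphabetical merge order, and the skipping of blank lines are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_jsonl_dir(data_dir):
    """Read every *.jsonl file under data_dir as UTF-8 and merge the samples.

    Hypothetical helper: the platform's own merge order and file filter
    are not documented here, so sorted *.jsonl is an assumption.
    """
    samples = []
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # assumption: blank lines are skipped locally
                try:
                    samples.append(json.loads(line))
                except json.JSONDecodeError as exc:
                    raise ValueError(
                        f"{path.name}:{line_no}: invalid JSON ({exc})"
                    ) from exc
    return samples
```

A file that is not valid UTF-8 raises `UnicodeDecodeError` when read, and a malformed line raises `ValueError` with the file name and line number, so both problems surface before training starts.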

Supervised Fine-Tuning (SFT) Data

The messages field in each data sample describes a multi-role dialogue. A multi-round dialogue must end with an assistant message. The available roles are:
| Role | Required | Description | Field |
| --- | --- | --- | --- |
| system | No | System persona setting | Persona information |
| user | Yes | User-submitted input | User input content |
| assistant | Yes | Model response | Model response content |
Single-round dialogue sample:
{"messages": [{"role": "system", "content": "You are the TIONE intelligent Q&A robot, capable of understanding user questions and providing fluent answers"}, {"role": "user", "content": "Read the following text and summarize it in one sentence.\nOn Friday, a federal judge ruled in favor of a former University of California Los Angeles basketball star who sued to terminate NCAA control over university athletes' name, image and portrait rights."}, {"role": "assistant", "content": "The judge supported a former UCLA basketball star challenging NCAA control over athlete information."}]}
Multi-round dialogue sample:
{"messages": [{"role": "system", "content": "You are the TIONE intelligent Q&A robot, capable of understanding user questions and providing fluent answers"}, {"role": "user", "content": "Read the following text and summarize it in one sentence.\nOn Friday, a federal judge ruled in favor of a former University of California Los Angeles basketball star who sued to terminate NCAA control over university athletes' name, image and portrait rights."}, {"role": "assistant", "content": "The judge supported a former UCLA basketball star challenging NCAA control over athlete information."}, {"role": "user", "content": "Read the following text and summarize it in one sentence.\nPresident Obama apologized on Thursday to Americans who lost their insurance plans under his federal health law, despite his repeated assurances they could keep their coverage if they wanted.\nIn an exclusive interview with NBC News, he said: 'I'm sorry they found themselves in this situation based on my assurances.'"}, {"role": "assistant", "content": "Obama expressed regret in an NBC interview."}]}
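The role rules above can be checked locally before a training job is submitted. The sketch below is one possible validator, written under the assumption that the rules in this section are the complete set; `check_sft_sample` and its error messages are hypothetical names, not part of the platform.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}
REQUIRED_ROLES = {"user", "assistant"}


def check_sft_sample(sample):
    """Return a list of problems found in one SFT sample (empty means OK)."""
    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    roles = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
            role = None
        roles.append(role)
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")

    # user and assistant are marked as required in the role table
    for role in sorted(REQUIRED_ROLES - set(roles)):
        problems.append(f"missing required role {role!r}")
    # the conversation must end with the assistant
    if roles and roles[-1] != "assistant":
        problems.append("conversation must end with the assistant")
    return problems
```

Calling it on each parsed line and printing any non-empty result gives a quick report of samples that would not satisfy the documented format.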

Dataset with Chain of Thought

The messages field in each data sample describes a multi-role dialogue. In responses with role=assistant, the thinking content is enclosed in <think> and </think>. A multi-round dialogue must end with an assistant message. The available roles are:
| Role | Required | Description | Field |
| --- | --- | --- | --- |
| system | No | System persona setting | Persona information |
| user | Yes | User-submitted input | User input content |
| assistant | Yes | Model response, with the thinking content enclosed in <think> and </think> | Model response content |
Sample:
{"messages": [{"role": "system", "content": "You are the TIONE intelligent Q&A robot, capable of understanding user questions and providing fluent answers"}, {"role": "user", "content": "Read the following text and summarize it in one sentence.\nOn Friday, a federal judge ruled in favor of a former UCLA basketball star who sued to terminate the NCAA's control over college athletes' names, images, and portrait rights."}, {"role": "assistant", "content": "<think>The user asked me to read the above text and summarize it in one sentence. I need to find the subject, object, and predicate of this event, identify suitable modifiers to limit it, and connect them to form a summarized sentence.</think>A judge supported a former UCLA basketball star challenging the NCAA's control over athletes' info."}]}
Note: The chain-of-thought format for Hunyuan series models differs slightly, as follows.
The messages field in each data sample describes a multi-role dialogue. In responses with role=assistant, the thinking content is enclosed in <think> and </think>. A multi-round dialogue must end with an assistant message. In the last round, the answer content must be enclosed in <answer>\n and \n</answer>, with a line break \n separating the thinking content from the answer content. The available roles are:
| Role | Required | Description | Field |
| --- | --- | --- | --- |
| system | No | System persona setting | Persona information |
| user | Yes | User-submitted input | User input content |
| assistant | Yes | Model response. In the last round, the thinking content is enclosed in <think>\n and \n</think>, and the answer content is enclosed in <answer>\n and \n</answer> | Model response content |
Sample:
{"messages": [{"role": "system", "content": "You are the TIONE intelligent Q&A bot, capable of understanding user questions and providing fluent answers"}, {"role": "user", "content": "Read the following text and summarize it in one sentence.\nOn Friday, a federal judge ruled in favor of a former UCLA basketball star who sued to terminate NCAA's control over college athletes' names, images, and portrait rights."}, {"role": "assistant", "content": "<think>\nThe user asks me to read the above text and summarize it in one sentence. I need to identify the subject, object, and predicate of this event, find suitable modifiers to limit it, and connect them to form a summarized sentence.\n</think>\n<answer>\nA judge supported a former UCLA basketball star challenging the NCAA's control over athlete information.\n</answer>"}]}
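The two tag layouts differ only in where line breaks and the <answer> wrapper go, so a small helper can keep them consistent when samples are generated programmatically. This is a sketch based solely on the two samples above; `format_cot` and its `hunyuan` flag are hypothetical names.

```python
import json


def format_cot(thinking, answer, hunyuan=False):
    """Build the assistant content for a chain-of-thought sample."""
    if hunyuan:
        # Hunyuan layout: tags on their own lines, answer wrapped in <answer>
        return f"<think>\n{thinking}\n</think>\n<answer>\n{answer}\n</answer>"
    # General layout: thinking wrapped in <think>, answer follows directly
    return f"<think>{thinking}</think>{answer}"


def make_cot_line(question, thinking, answer, hunyuan=False):
    """Serialize one single-round sample as a JSONL line."""
    sample = {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": format_cot(thinking, answer, hunyuan)},
        ]
    }
    # ensure_ascii=False keeps non-ASCII text readable in the UTF-8 file
    return json.dumps(sample, ensure_ascii=False)
```

Because `json.dumps` escapes the line breaks as `\n`, each sample stays on a single line of the JSONL file.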

Dataset with Tool Calls

The messages field in each data sample describes a multi-role dialogue. The tools field is a JSON-formatted string that defines the tools the model can call. Use tool call datasets only with models that support tool calls (such as Qwen3). A multi-round dialogue must end with an assistant or tool_call message. The available roles are:
| Role | Required | Description | Field |
| --- | --- | --- | --- |
| system | No | System persona setting | Persona information |
| user | Yes | User-submitted input. Must occupy odd-numbered positions (counting from 1, excluding the system message) | User input content |
| assistant | Yes | Model response. Must occupy even-numbered positions (counting from 1, excluding the system message) | Model response content |
| tool_call | Yes | Tool call. Must occupy even-numbered positions (counting from 1, excluding the system message) | Tool name and arguments |
| tool | Yes | Tool (SDK/API) call result. Must occupy odd-numbered positions (counting from 1, excluding the system message) | Returned result of the call |
Sample ending with assistant:
{"messages": [{"role": "system", "content": "You are TIONE's intelligent Q&A bot, capable of understanding user questions and providing fluent answers"}, {"role": "user", "content": "Please answer the following question: Xiao Li is 25 this year, and Xiao Wang is 20. What is the sum of their ages"}, {"role": "tool_call", "content": "{\"name\":\"calculate_age_sum\", \"arguments\": {\"age_1\": 25, \"age_2\": 20}}"}, {"role": "tool", "content": "/no_think{\"age_sum\": 45}"}, {"role": "assistant", "content": "<think>\n\n</think>\n<answer>\nAccording to my calculation, the sum of Xiao Li and Xiao Wang's ages is 45\n</answer>"}], "tools": "[{\"name\": \"calculate_age_sum\", \"description\": \"Calculate and return the sum of two people's ages\", \"parameters\": {\"type\": \"object\", \"properties\": {\"age_1\": {\"type\": \"number\", \"description\": \"The first person's age\"}, \"age_2\": {\"type\": \"number\", \"description\": \"The second person's age\"}}, \"required\": [\"age_1\", \"age_2\"]}}]"}
Sample ending with tool_call:
{"messages": [{"role": "system", "content": "You are TIONE's intelligent Q&A bot, capable of understanding user questions and providing fluent answers"}, {"role": "user", "content": "/no_think Please answer the following question: Xiao Li is 25 this year, and Xiao Wang is 20. What is the sum of their ages"}, {"role": "tool_call", "content": "<think>\n\n</think>\n<answer>\n{\"name\":\"calculate_age_sum\", \"arguments\": {\"age_1\": 25, \"age_2\": 20}}\n</answer>"}], "tools": "[{\"name\": \"calculate_age_sum\", \"description\": \"Calculate and return the sum of two people's ages\", \"parameters\": {\"type\": \"object\", \"properties\": {\"age_1\": {\"type\": \"number\", \"description\": \"The first person's age\"}, \"age_2\": {\"type\": \"number\", \"description\": \"The second person's age\"}}, \"required\": [\"age_1\", \"age_2\"]}}]"}
Tool call content description:
Tool definitions in the tools field (expanded for readability):
"tools": "[
  {
    \"name\": \"calculate_age_sum\",
    \"description\": \"Calculate and return the sum of two people's ages\",
    \"parameters\": {
      \"type\": \"object\",
      \"properties\": {
        \"age_1\": {
          \"type\": \"number\",
          \"description\": \"The first person's age\"
        },
        \"age_2\": {
          \"type\": \"number\",
          \"description\": \"The second person's age\"
        }
      },
      \"required\": [\"age_1\", \"age_2\"]
    }
  }
]"
Tool call message in the dialogue, role=tool_call (expanded for readability):
{
  "role": "tool_call",
  "content": "{
    \"name\":\"calculate_age_sum\",
    \"arguments\": {
      \"age_1\": 25,
      \"age_2\": 20
    }
  }"
}
Tool call result message in the dialogue, role=tool (expanded for readability):
{
  "role": "tool",
  "content": "{
    \"age_sum\": 45
  }"
}
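In the samples above, the tools field and the content of tool_call and tool messages are JSON documents stored as strings, so the quotes inside them must be escaped. Writing those escapes by hand is error-prone; serializing the inner object first and then the whole sample produces them automatically. The sketch below illustrates this and also checks the odd/even position rule from the role table. The helper name `positions_ok` is hypothetical, and the model-specific markers shown in the samples (`/no_think`, `<think>`, `<answer>`) are deliberately left out.

```python
import json

ODD_ROLES = {"user", "tool"}             # positions 1, 3, 5, ...
EVEN_ROLES = {"assistant", "tool_call"}  # positions 2, 4, 6, ...


def positions_ok(messages):
    """Check the position rule, counting from 1 and skipping system messages."""
    turns = [m for m in messages if m["role"] != "system"]
    for pos, msg in enumerate(turns, start=1):
        expected = ODD_ROLES if pos % 2 == 1 else EVEN_ROLES
        if msg["role"] not in expected:
            return False
    # the dialogue must end with assistant or tool_call
    return bool(turns) and turns[-1]["role"] in EVEN_ROLES


tools = [{
    "name": "calculate_age_sum",
    "description": "Calculate and return the sum of two people's ages",
    "parameters": {
        "type": "object",
        "properties": {
            "age_1": {"type": "number", "description": "The first person's age"},
            "age_2": {"type": "number", "description": "The second person's age"},
        },
        "required": ["age_1", "age_2"],
    },
}]
call = {"name": "calculate_age_sum", "arguments": {"age_1": 25, "age_2": 20}}

sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful Q&A bot"},
        {"role": "user", "content": "What is the sum of 25 and 20?"},
        {"role": "tool_call", "content": json.dumps(call)},          # inner JSON as a string
        {"role": "tool", "content": json.dumps({"age_sum": 45})},    # inner JSON as a string
        {"role": "assistant", "content": "The sum of their ages is 45"},
    ],
    "tools": json.dumps(tools),  # tool definitions as a string as well
}
line = json.dumps(sample, ensure_ascii=False)  # one JSONL line, escapes added here
```

Parsing `line` and then parsing the tools string a second time recovers the original tool list, which is a convenient round-trip check.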

Pretraining (PT) Data

The text field of each data sample holds unlabeled text.
Sample:
{"text": "The layout of Luzhen's taverns differs from others: each has a large L-shaped counter facing the street, with hot water ready inside to warm wine at any time. Laborers, after finishing work at noon or evening, often spend four coppers on a bowl of wine—this was twenty years ago, now it costs ten—stand by the counter, drink it hot, and rest. For one more copper, they can get a dish of salted bamboo shoots or fennel beans to go with the wine. Spending over ten coppers buys a meat dish, but most customers are short-coated laborers who rarely afford such luxury. Only those in long gowns stroll into the inner room, order wine and dishes, and sit drinking leisurely."}
{"text": "From the age of twelve, I worked as a waiter at the Xianheng Tavern by the town gate. The boss said I looked too foolish to serve long-gowned customers, so I was assigned tasks outside. The short-coated customers, though easier to deal with, often rambled on endlessly. They would insist on watching the yellow wine being ladled from the vat, checking if the pot had water at the bottom, and seeing it placed in hot water before feeling confident. Under such serious supervision, diluting the wine was nearly impossible. Therefore, after a few days, the boss said I couldn’t handle the job. Luckily, the recommender’s influence was strong, so I wasn’t dismissed but changed to the dull task of warming wine exclusively."}
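Raw documents can be converted into this format with a few lines of code. The following is a sketch only: `write_pt_jsonl` is a hypothetical helper, and dropping empty documents is an assumption rather than a documented platform requirement.

```python
import json


def write_pt_jsonl(texts, out_path):
    """Write one {"text": ...} object per line, UTF-8 encoded.

    Returns the number of samples written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for text in texts:
            text = text.strip()
            if not text:
                continue  # assumption: empty documents are not useful samples
            # json.dumps escapes embedded line breaks, so each sample stays on one line
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_pt_jsonl(paragraphs, "pretrain.jsonl")` would turn a list of paragraph strings into a file ready to place in the training data path.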

Custom Format

Besides placing data files that follow the formats above in the training data path, you can provide a dataset_info.json file, written according to the LLaMA-Factory dataset format, to specify the training datasets. If dataset_info.json exists in the training data path, the platform reads only the datasets listed in that file.
Example:
Training data directory (for example, /opt/ml/input/data/train)
/opt/ml/input/data/train
├── dataset1.json
└── dataset_info.json
Training data file dataset1.json:
[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are the TIONE intelligent Q&A robot, capable of understanding user questions and providing fluent answers"
      },
      {
        "role": "user",
        "content": "Read the following text and summarize it in one sentence.\nText: (CNN) - A federal judge ruled in favor of a former UCLA basketball star on Friday, who sued to terminate NCAA's control over college athletes' name, image, and portrait rights."
      },
      {
        "role": "assistant",
        "content": "The judge supported a former UCLA basketball star who challenged NCAA's information control."
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are the TIONE intelligent Q&A robot, capable of understanding user questions and providing fluent answers"
      },
      {
        "role": "user",
        "content": "Read the following text and summarize it in one sentence.\nText: President Obama apologized on Thursday to Americans who lost their insurance plans due to the federal health law he supported, despite his repeated claims that they could retain their coverage if they wished."
      },
      {
        "role": "assistant",
        "content": "Obama expressed regret in an interview with NBC about the cancellation of the insurance plan."
      }
    ]
  }
]

Dataset description file dataset_info.json:
{
  "dataset1": {
    "file_name": "dataset1.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
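Standard JSON does not allow a trailing comma after the last entry of an object, which is an easy mistake to make when editing dataset_info.json by hand. Generating the file from a dictionary avoids that class of error. The sketch below mirrors the example fields shown above; `write_dataset_info` is a hypothetical helper, and the set of fields the platform accepts beyond those shown here is not confirmed.

```python
import json
from pathlib import Path


def write_dataset_info(train_dir, file_name="dataset1.json", dataset_name="dataset1"):
    """Write a dataset_info.json describing one sharegpt-style dataset file."""
    info = {
        dataset_name: {
            "file_name": file_name,
            "formatting": "sharegpt",
            "columns": {"messages": "messages"},
            "tags": {
                "role_tag": "role",
                "content_tag": "content",
                "user_tag": "user",
                "assistant_tag": "assistant",
            },
        }
    }
    path = Path(train_dir) / "dataset_info.json"
    # json.dumps always emits strictly valid JSON (no trailing commas)
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

Loading the written file back with `json.loads` is a quick way to confirm it parses before the training job reads it.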


