Project Overview
This project fine-tunes google/gemma-4-E4B-it with LoRA to generate Chinese dialogue in the style of Kal'tsit from Arknights. The goal is not to train a full model from scratch, but to adapt a large instruction-tuned base model with a lightweight PEFT adapter so that it better follows a specific character voice: calm, restrained, analytical, and context-aware.
The project covers the full workflow from story data collection, text cleaning, character-specific dataset construction, prompt design, SFT formatting, LoRA training, validation monitoring, test generation, and optional adapter merging. The main training workflow is documented in gemma4_emotion_lora_arknights.ipynb.
Data Collection and Processing
The raw data was collected from Arknights story pages through ASTR story reader URLs. The URL list is stored in urls.txt and contains 263 story links. The data collection script is download_arknights_story.py.
The collection pipeline works as follows:
- Parse the language code and story file path from each ASTR page URL.
- Convert the page route into a raw JSON URL from the
ArknightsStoryJson repository, using the pattern zh_CN/gamedata/story/{story_path}.json.
- Download each story JSON with
requests.
- Read
storyList and extract attributes.name as the speaker and attributes.content as the dialogue text.
- Mark lines without a speaker as narration, and preserve
Sticker text as on-screen text.
- Clean color tags, HTML-like tags, escaped newlines, and redundant whitespace.
- Save each story as both readable
.txt and structured .jsonl files.
Each structured line uses the following format:
{"speaker": "凯尔希", "text": "dialogue text"}
The character dataset is then built with build_character_dataset.py:
- Load all
.jsonl files from the result folder in sorted order.
- Add source file names and line indices to every record for traceability.
- Select only records whose speaker exactly matches
凯尔希.
- Filter very short or very long responses. The default range is 2 to 300 Chinese characters.
- Use the previous 3 story lines as the dialogue context for each target response.
- Convert each sample into an instruction/input/output SFT record.
- Shuffle with seed 42 and split the dataset into train, validation, and test sets with an 80/10/10 ratio.
Final dataset size:
Table with columns: Split, Samples| Split | Samples |
|---|
| Train | 2680 |
| Validation | 335 |
| Test | 335 |
| Total | 3350 |
Prompt Design
Each training sample is converted into a chat-style prompt and assistant completion. The notebook uses the official tokenizer chat template to format the data for Gemma.
System prompt:
你正在扮演《明日方舟》中的凯尔希。
请根据用户给出的上下文进行回复。
要求:
1. 只输出凯尔希的回复内容。
2. 不要解释你为什么这样回复。
3. 不要输出“凯尔希:”这个角色名前缀。
4. 语气应冷静、克制、理性,句子可以偏长。
5. 回复应尽量贴合上下文,而不是机械复述已有台词。
User prompt template:
请根据上下文,以凯尔希的说话风格进行回复。
上下文:
[Character A]:previous line
[Character B]:previous line
[凯尔希]:previous line
The assistant completion is the target Kal'tsit response. In other words, the model is trained to generate the next character-style reply from context, rather than to classify text into labels.
Fine-Tuning Setup
The base model is google/gemma-4-E4B-it, downloaded from ModelScope and loaded from a local directory. Training uses single-GPU BF16 LoRA fine-tuning with PEFT and TRL SFTTrainer. The notebook loads the model with AutoModelForCausalLM and saves the final adapter and tokenizer.
Core training configuration:
Table with columns: Item, Value| Item | Value |
|---|
| Base model | google/gemma-4-E4B-it |
| Fine-tuning method | LoRA / PEFT |
| Trainer | TRL SFTTrainer |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Target modules | all-linear |
| Trainable parameters |
Training workflow:
- Load local JSONL files into a
DatasetDict.
- Convert instruction/input/output records into chat-style prompt/completion examples.
- Load the Gemma 4 tokenizer and base model.
- Configure LoRA and verify that trainable parameters are correctly attached.
- Run supervised fine-tuning for 2 epochs with
SFTTrainer.
- Evaluate on the validation set every 50 steps and save checkpoints.
- Generate responses for all 335 test examples.
- Save the LoRA adapter, tokenizer, training metrics, and test generations.
Results
Training completed successfully at global_step=336, and the best checkpoint was checkpoint-336, which is also the final step. Total training time was about 1993 seconds, or 33.2 minutes.
Validation metrics:
Table with columns: Step, Eval loss, Eval token accuracy| Step | Eval loss | Eval token accuracy |
|---|
| 50 | 3.1132 | 0.4541 |
| 100 | 2.9867 | 0.4716 |
| 150 | 2.9440 | 0.4739 |
| 200 | 2.9127 | 0.4758 |
| 250 | 2.8917 | 0.4769 |
| 300 | 2.8853 | 0.4801 |
The final test generation file is kaltsit_test_generations.csv, with 335 generated responses. There were no empty outputs and no generated responses with the unwanted 凯尔希: role prefix. The average target response length was 23.42 Chinese characters, while the average generated response length was 16.53 characters.
Qualitatively, the model learned part of the target style, especially restrained phrasing, concise responses, and role-prefix control. However, the test set also shows limitations. The response ...... appears 38 times, and 20 generated responses are 4 characters or shorter. This suggests that the LoRA adapter is valid and learned useful stylistic behavior, but it is not yet a high-quality story continuation model.
My interpretation of the result:
- The training run completed successfully, and the adapter files are valid.
- The language model LoRA weights were updated.
- The model shows measurable style-control behavior.
- Contextual reasoning and narrative continuation still need improvement.
- Since the training data is text-only, this experiment should be viewed as Chinese character-style text fine-tuning, not multimodal capability fine-tuning.
Repository Artifacts
Main artifacts:
Table with columns: File, Description| File | Description |
|---|
adapter_model.safetensors | LoRA adapter weights |
adapter_config.json | PEFT/LoRA configuration |
tokenizer.json / tokenizer_config.json | Tokenizer files |
chat_template.jinja | Gemma 4 chat template |
train_metrics.json | Training summary metrics |
|
This repository currently contains the LoRA adapter, not a fully merged model. To deploy a merged model, the matching google/gemma-4-E4B-it base model must be loaded, the adapter must be attached with PeftModel.from_pretrained, and the weights can then be merged with merge_and_unload(). The full processor should also be saved with the merged model.
Future Improvements
- Apply LoRA only to the language model modules instead of all linear modules, since the dataset is text-only.
- Filter or downweight very short target responses such as
...... and ——.
- Add a small manually curated evaluation set for character consistency, contextual relevance, and naturalness.
- Use longer context windows or scene-level samples to improve narrative continuity.
- Compare multiple LoRA configurations, including different ranks, target modules, and data filtering strategies.