
README

License: other

Runtime

Recommended runtime repository:

```text
https://github.com/MapleRhythm/ASA-ArknightStoryAgent
```

Download into the release tree:

```bash
huggingface-cli download MapleRhythm/asa-arknightstoryagent-4b-lora \
  --local-dir model/lora/asa-arknightstoryagent-4b-lora
```

Run with the GPU release config:

```bash
bash scripts/run_gpu_reranker_qwen35_4b.sh --answer-only "岁兽是什么?"
```

Current runtime defaults:

  • Base model path: model/qwen3.5-4b
  • LoRA path: model/lora/asa-arknightstoryagent-4b-lora
  • Context size: 10000
  • Max generation tokens: 1536
  • answer_grounding_mode: quote
  • conclusion_prompt_mode: minimal
  • Web context: disabled by default
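
For application-side checks, the defaults listed above can be mirrored in a small Python structure. This is only a sketch: the class name `RuntimeDefaults`, its field names, and the `prompt_budget` helper are hypothetical and are not part of the release runtime. It also assumes that the context size covers both the prompt and the generated tokens, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeDefaults:
    """Hypothetical mirror of the documented runtime defaults."""

    base_model_path: str = "model/qwen3.5-4b"
    lora_path: str = "model/lora/asa-arknightstoryagent-4b-lora"
    context_size: int = 10000
    max_generation_tokens: int = 1536
    answer_grounding_mode: str = "quote"
    conclusion_prompt_mode: str = "minimal"
    web_context_enabled: bool = False

    def prompt_budget(self) -> int:
        # Assumption: context_size is shared by the prompt and the generation.
        return self.context_size - self.max_generation_tokens


defaults = RuntimeDefaults()
print(defaults.prompt_budget())
```

Under that assumption, roughly 8464 tokens remain for the question, retrieved evidence, and prompt scaffolding.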

The adapter was trained from an internal merged Qwen3.5-4B checkpoint. The release runtime is the supported way to load it; direct AutoPeftModel usage may require adapting local base-model paths.
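
PEFT adapters record their base model in the `base_model_name_or_path` field of `adapter_config.json`, so one way to adapt the path for direct loading is to rewrite that field to point at a local base-model directory. The sketch below assumes the downloaded adapter follows that standard layout; the helper name `point_adapter_at_base` is hypothetical. Whether a stock local Qwen3.5-4B copy behaves the same as the internal merged checkpoint is not confirmed by the source, so the release runtime remains the supported path.

```python
import json
from pathlib import Path
from typing import Optional


def point_adapter_at_base(adapter_dir: str, base_model_path: str) -> Optional[str]:
    """Rewrite base_model_name_or_path in adapter_config.json.

    Returns the previous value so the change can be logged or reverted.
    """
    config_file = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_file.read_text(encoding="utf-8"))
    previous = config.get("base_model_name_or_path")
    config["base_model_name_or_path"] = base_model_path
    config_file.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return previous
```

A call such as `point_adapter_at_base("model/lora/asa-arknightstoryagent-4b-lora", "model/qwen3.5-4b")` would use the two paths from the runtime defaults.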

Training

Method: LoRA preference tuning with LLaMA-Factory / PEFT.

Key hyperparameters:

  • LoRA rank: 8
  • LoRA alpha: 16
  • LoRA dropout: 0.05
  • Learning rate: 8e-7
  • Epochs: 2
  • Scheduler: cosine
  • Effective batch size: 4
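
The hyperparameters above imply two derived quantities that are easy to check. In standard LoRA the low-rank update is scaled by alpha / rank, and an effective batch size is usually the product of per-device batch size, gradient-accumulation steps, and device count. The sketch below uses hypothetical names (`LoraHyperparams`, `effective_batch`); the source does not say how the effective batch size of 4 was split, so the decomposition in the usage line is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    """Hypothetical container for the documented training settings."""

    rank: int = 8
    alpha: int = 16
    dropout: float = 0.05
    learning_rate: float = 8e-7
    epochs: int = 2
    scheduler: str = "cosine"
    effective_batch_size: int = 4

    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank


def effective_batch(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Usual definition of effective batch size across devices."""
    return per_device * grad_accum * num_devices


hp = LoraHyperparams()
print(hp.scaling())
# Illustrative split only; the real split is not documented.
print(effective_batch(per_device=1, grad_accum=4, num_devices=1))
```

With rank 8 and alpha 16 the scaling factor is 2.0.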

The training objective targeted ASA runtime failures: keeping grounded action JSON stable, avoiding over-abstention when evidence is sufficient, and improving answer behavior under the current RAG chain.

Evaluation Snapshot

Training-time eval metrics:

  • eval loss: 0.5542
  • rewards/chosen: 0.0427
  • rewards/rejected: 0.0250
  • rewards/margins: 0.0177
  • KL: 195.9924
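
As a quick consistency check, the reported reward margin matches the difference between the chosen and rejected rewards. The snippet assumes only that the margin is logged as chosen minus rejected, which the numbers above are consistent with.

```python
chosen = 0.0427
rejected = 0.0250

# Margin as the gap between the chosen and rejected rewards.
margin = chosen - rejected
print(f"{margin:.4f}")
```

The margin is positive but small relative to the reward values themselves.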

Pipeline evaluation before the runtime truncation-recovery patch:

  • eval50: 50 questions, 3 JSON errors, 13 abstain-like answers, 34 direct answers.
  • hard10: 10 questions, 1 JSON error, 3 abstain-like answers, 6 direct answers.
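
The counts above convert directly into rates, which makes the two suites easier to compare. The helper name `outcome_rates` is hypothetical; the inputs are the counts reported above.

```python
def outcome_rates(total: int, json_errors: int, abstain_like: int, direct: int) -> dict:
    """Convert outcome counts into fractions of the question total."""
    if json_errors + abstain_like + direct != total:
        raise ValueError("outcome counts do not add up to the question total")
    return {
        "json_error": json_errors / total,
        "abstain_like": abstain_like / total,
        "direct": direct / total,
    }


eval50 = outcome_rates(50, json_errors=3, abstain_like=13, direct=34)
hard10 = outcome_rates(10, json_errors=1, abstain_like=3, direct=6)
print(eval50)
print(hard10)
```

This gives 6% JSON errors, 26% abstain-like answers, and 68% direct answers on eval50, against 10%, 30%, and 60% on hard10.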

Runtime truncation-recovery regression after the patch:

  • 4 truncation-prone questions, 0 JSON errors.

The full eval50 + hard10 suite should be rerun after each runtime or model change before treating this as a production-quality release.

Limitations

  • The model should not be used without retrieval evidence. It can hallucinate or over-infer from weak evidence.
  • Low-confidence retrieval cases, especially fusion_score=0 or missing dense/sparse/MiniRAG/evidence-chain scores, should be handled conservatively by the application.
  • Subjective questions such as "what bad things did X do" need framing by viewpoint; the model may otherwise mix factual actions with moral judgments.
  • This adapter does not include the base model, game text, retrieval indexes, or reranker weights.
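
The low-confidence retrieval case described above could be guarded on the application side with a check like the one below. This is a minimal sketch, not the runtime's own logic: only `fusion_score` is named in the source, while the other dictionary keys (`dense`, `sparse`, `minirag`, `evidence_chain`) and the function name `is_low_confidence` are hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical key names for the per-retriever scores mentioned in the text.
REQUIRED_SCORES = ("dense", "sparse", "minirag", "evidence_chain")


def is_low_confidence(scores: Mapping[str, Optional[float]]) -> bool:
    """Return True when retrieval evidence should be treated conservatively."""
    fusion = scores.get("fusion_score")
    if fusion is None or fusion == 0:
        return True
    # Any absent or null component score also counts as low confidence.
    return any(scores.get(name) is None for name in REQUIRED_SCORES)
```

An application might route low-confidence cases to an explicit abstain response or request more evidence instead of calling the model.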

License And Data Notes

This adapter is released under the "other" license tag because downstream use depends on the base model license and on the rights to the source story corpus used to build the retrieval system. The adapter repository does not include raw game story text or prebuilt story indexes.

Model provider

MapleRhythm

Model tree

Base

Qwen/Qwen3.5-4B

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text
