bibbbu/lora-qwen25-1p5b-finbot-v2 API & Inference Endpoint

Why fine-tune?

The base 1.5B model could not reliably emit valid structured output for this task (0% schema validity on the held-out evaluation set). Rather than scaling up the model or adding heavy post-processing, this adapter was trained via teacher–student distillation to fix the failure mode directly at the model level.

Evaluation

Results on a held-out evaluation set, base model vs. base + this adapter:

Metric	Base Qwen2.5-1.5B-Instruct	+ FinBot LoRA v2
JSON schema validity	0%	100%
Safety compliance	100%	100% (maintained)
Mean output tokens	638	150 (−77%)
Mean latency	5,134 ms	2,902 ms (−43%)

Evaluated on 62 held-out examples balanced across English (21), Vietnamese (21), and Chinese (20), spanning planning, investment, and trading tasks. Latency measured per-example on an NVIDIA A100-80GB. Schema validity reached 100% in every language individually, not just in aggregate. Evaluation covers cascading JSON schema validation, safety/refusal behaviour (unsafe-claim screening, e.g. "guaranteed profit"), structured internal-analysis presence, and output efficiency.
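The two headline checks, schema validity and unsafe-claim screening, can be sketched as below. This is a minimal illustration, not the evaluation harness itself: the required keys in `REQUIRED_KEYS` are hypothetical stand-ins for the FinBot schema, which this card does not reproduce, and the pattern list contains only the one example named above.

```python
import json
import re

# HYPOTHETICAL required top-level keys; the real FinBot schema is not shown in this card.
REQUIRED_KEYS = {"recommendation", "analysis", "disclaimer"}

# "guaranteed profit" is the only unsafe claim the card names; a real screen would cover more.
UNSAFE_PATTERNS = [re.compile(r"guaranteed\s+profit", re.IGNORECASE)]


def is_schema_valid(raw: str) -> bool:
    """True if raw parses as a JSON object that contains every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj)


def is_safe(raw: str) -> bool:
    """True if no unsafe-claim pattern appears anywhere in the raw output."""
    return not any(pattern.search(raw) for pattern in UNSAFE_PATTERNS)


def rate(flags) -> float:
    """Percentage of True values in an iterable of booleans (0.0 if empty)."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```

Applied per language, `rate(is_schema_valid(o) for o in outputs)` gives the kind of per-language validity figure reported above.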

Training

Method: Supervised fine-tuning (LoRA/PEFT, via TRL SFTTrainer) on teacher–student distilled examples, filtered and validated against the FinBot output schema before training.
Data: 615 validated multilingual instruction–response pairs (from 618 raw across 9 source files; 2 removed by the multilingual safety filter, 1 deduplicated, 0 schema failures). Balanced across languages — EN 208 / VI 208 / ZH 199 — and tasks — planning 210 / trading 209 / investment 196. Split 553 train / 62 eval (90/10, seed 42). No real user data was used; all training data is synthetic.
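The counts above are internally consistent, which a short sketch can confirm. The splitting tool actually used is not stated in this card; the code below assumes a plain seeded shuffle with the eval size rounded up, which happens to reproduce the 553 / 62 split. Whether the real split was stratified by language or task is not known.

```python
import math
import random

# Counts taken from the card.
RAW, SAFETY_REMOVED, DUPLICATES, SCHEMA_FAILURES = 618, 2, 1, 0
n_valid = RAW - SAFETY_REMOVED - DUPLICATES - SCHEMA_FAILURES  # 615


def split_indices(n: int, eval_fraction: float = 0.1, seed: int = 42):
    """Shuffle indices with a fixed seed and hold out ceil(eval_fraction * n) for eval.

    ASSUMPTION: rounding the eval size up; the card only states "90/10, seed 42".
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_eval = math.ceil(eval_fraction * n)
    return indices[n_eval:], indices[:n_eval]


train_idx, eval_idx = split_indices(n_valid)
print(len(train_idx), len(eval_idx))  # 553 62
```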
LoRA configuration:
- rank (r): 16, alpha: 32, dropout: 0.05
- target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP)
- trainable parameters: 18.5M of 1.56B (1.18%)
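The trainable-parameter figure can be checked by hand: a rank-r LoRA pair on a linear layer with shape (d_in, d_out) adds r × (d_in + d_out) weights. The sketch below assumes the layer dimensions of the public Qwen2.5-1.5B-Instruct config (hidden size 1536, MLP size 8960, 28 layers, 2 KV heads of dimension 128); those numbers are not stated in this card.

```python
# ASSUMPTION: dimensions from the public Qwen2.5-1.5B-Instruct config, not from this card.
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 2 * 128  # 2 key/value heads x head_dim 128
RANK = 16         # from the card

# (in_features, out_features) for each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")                      # 18,464,768
print(f"{100 * total / 1.56e9:.2f}%")    # 1.18%
```

Under these assumptions the count rounds to the 18.5M and 1.18% reported above.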
Training run: NVIDIA A100-80GB (Google Colab), bf16 LoRA — no quantization or gradient checkpointing needed at this VRAM. 2 epochs, learning rate 2e-4 (cosine schedule, 3% warmup), effective batch size 6, max sequence length 2048 with sequence packing. The runbook is VRAM-adaptive and falls back to QLoRA (4-bit NF4) with gradient checkpointing on <24 GB GPUs.
Reproducibility: seed 42 throughout; full run provenance (hyperparameters, dataset stats, metrics, library versions) recorded in the adapter's manifest.json.
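The VRAM-adaptive behaviour described in the training run can be sketched as a small decision function. This is an illustration only: `RunMode`, `choose_run_mode`, and the `HYPERPARAMS` key names are hypothetical and are not taken from the runbook. The values themselves (24 GB threshold, NF4, and the hyperparameters) come from the card.

```python
from dataclasses import dataclass
from typing import Optional

QLORA_VRAM_THRESHOLD_GB = 24  # from the card: QLoRA fallback on <24 GB GPUs

# Hyperparameters as listed in the card; the key names are illustrative.
HYPERPARAMS = {
    "epochs": 2,
    "learning_rate": 2e-4,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,
    "effective_batch_size": 6,
    "max_seq_length": 2048,
    "packing": True,
    "seed": 42,
}


@dataclass(frozen=True)
class RunMode:
    """HYPOTHETICAL container for the memory-related switches."""
    load_in_4bit: bool
    quant_type: Optional[str]
    gradient_checkpointing: bool


def choose_run_mode(vram_gb: float) -> RunMode:
    """Pick plain bf16 LoRA on large GPUs, QLoRA (4-bit NF4) below the threshold."""
    if vram_gb < QLORA_VRAM_THRESHOLD_GB:
        return RunMode(load_in_4bit=True, quant_type="nf4", gradient_checkpointing=True)
    return RunMode(load_in_4bit=False, quant_type=None, gradient_checkpointing=False)
```

With PyTorch, the available memory in GB could be read as `torch.cuda.get_device_properties(0).total_memory / 1024**3` and passed to `choose_run_mode`.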

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_id = "bibbbu/lora-qwen25-1p5b-finbot-v2"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are FinBot, a financial planning assistant. Respond with a single JSON object matching the FinBot recommendation schema."},
    {"role": "user", "content": "Profile: age 30, income bucket B, risk tolerance: moderate, goal: retirement savings, horizon: 25 years, language: en. Generate a recommendation."},
]
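# Request a dict so that generate() also receives the attention mask
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))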

The adapter runs on CUDA and Apple Silicon (MPS). For the full system — deterministic dialogue state machine, slot parsing, multilingual i18n, PII bucketing, prompt-injection defence, and JSON validation/repair pipeline — see the FinBot repository.
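When the adapter is used outside the FinBot system, the decoded string still has to be parsed before use. The helper below is a minimal sketch of that step, not the repository's validation/repair pipeline: `extract_first_json_object` is a hypothetical name, and it only locates and parses the first JSON object in the text without checking it against the FinBot schema.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first parseable JSON object found in text, or None.

    Tolerates stray text before or after the object by trying each "{" in turn.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # not a valid object at this position; try the next brace
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None
```

A `None` result is the signal to retry generation or reject the output rather than display it.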

Intended use & limitations

Intended use: research and demonstration of structured-output fine-tuning for small, locally deployed LLMs; portfolio/educational use; as the recommendation component inside the FinBot system, where its output is schema-validated before display.

Limitations:

- ⚠️ Not financial advice. This model is a research prototype. Its outputs must not be used to make real investment, trading, or financial-planning decisions.
- Trained entirely on synthetic, distilled data, so it inherits the biases and blind spots of the teacher models and the prompt distribution used for generation.
- A 1.5B-parameter model can and will hallucinate; the adapter improves format reliability and conciseness, not factual financial knowledge.
- The model expects inputs in the FinBot prompt format (pre-bucketed user context). Free-form financial Q&A is out of distribution.
- Coverage is limited to English, Vietnamese, and Chinese.
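To stay inside the expected prompt format, the user message can be assembled from already-bucketed fields rather than typed free-form. The sketch below mirrors the example prompt in the Usage section; `build_messages` is a hypothetical helper, and the language codes `vi` and `zh` are assumptions (the card only shows `en`).

```python
SYSTEM_PROMPT = (
    "You are FinBot, a financial planning assistant. Respond with a single "
    "JSON object matching the FinBot recommendation schema."
)

# ASSUMPTION: "vi" and "zh" as codes for Vietnamese and Chinese; only "en" appears in the card.
SUPPORTED_LANGUAGES = {"en", "vi", "zh"}


def build_messages(age, income_bucket, risk_tolerance, goal, horizon_years, language):
    """Build chat messages in the format of the card's usage example.

    income_bucket is expected to be pre-bucketed (e.g. "B"), never a raw income figure.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    user = (
        f"Profile: age {age}, income bucket {income_bucket}, "
        f"risk tolerance: {risk_tolerance}, goal: {goal}, "
        f"horizon: {horizon_years} years, language: {language}. "
        "Generate a recommendation."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```

Calling `build_messages(30, "B", "moderate", "retirement savings", 25, "en")` reproduces the messages used in the Usage example.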

Citation

bibtex
@misc{vu2026finbotlora,
  title  = {FinBot LoRA: Structured-Output Fine-Tuning of Qwen2.5-1.5B for a Multilingual Local Financial Assistant},
  author = {Vu, Tuong Vy},
  year   = {2026},
  url    = {https://huggingface.co/bibbbu/lora-qwen25-1p5b-finbot-v2}
}
