bibbbu
lora-qwen25-1p5b-finbot-v2
License: apache-2.0

Why fine-tune?
The base 1.5B model could not reliably emit valid structured output for this task (0% schema validity on the held-out evaluation set). Rather than scaling up the model or adding heavy post-processing, this adapter was trained via teacher–student distillation to fix the failure mode directly at the model level.
Evaluation
Results on a held-out evaluation set, base model vs. base + this adapter:
| Metric | Base Qwen2.5-1.5B-Instruct | + FinBot LoRA v2 |
|---|---|---|
| JSON schema validity | 0% | 100% |
| Safety compliance | 100% | 100% (maintained) |
| Mean output tokens | 638 | 150 (−77%) |
| Mean latency | 5,134 ms | 2,902 ms (−43%) |
Evaluated on 62 held-out examples balanced across English (21), Vietnamese (21), and Chinese (20), spanning planning, investment, and trading tasks. Latency measured per-example on an NVIDIA A100-80GB. Schema validity reached 100% in every language individually, not just in aggregate. Evaluation covers cascading JSON schema validation, safety/refusal behaviour (unsafe-claim screening, e.g. "guaranteed profit"), structured internal-analysis presence, and output efficiency.
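The per-language validity figures above could be tallied along the following lines. This is a minimal sketch, not the FinBot evaluation harness: `REQUIRED_KEYS` is a hypothetical stand-in for the real FinBot schema, and the check is a single-stage simplification of the cascading validation described above.

```python
import json
from collections import defaultdict

# Hypothetical required keys; the real FinBot schema is not reproduced here.
REQUIRED_KEYS = {"summary", "allocation", "disclaimer"}


def is_schema_valid(raw: str) -> bool:
    """Return True if raw parses as a JSON object containing every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()


def validity_by_language(samples):
    """Map each language code to the fraction of its outputs that are valid.

    samples is an iterable of (language, raw_model_output) pairs.
    """
    totals = defaultdict(int)
    valid = defaultdict(int)
    for lang, raw in samples:
        totals[lang] += 1
        valid[lang] += is_schema_valid(raw)  # True counts as 1
    return {lang: valid[lang] / totals[lang] for lang in totals}
```

Reporting the rate per language rather than only in aggregate keeps a failure in one language from being hidden by the other two.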
Training
- Method: Supervised fine-tuning (LoRA/PEFT, via TRL `SFTTrainer`) on teacher–student distilled examples, filtered and validated against the FinBot output schema before training.
- Data: 615 validated multilingual instruction–response pairs (from 618 raw across 9 source files; 2 removed by the multilingual safety filter, 1 deduplicated, 0 schema failures). Balanced across languages (EN 208 / VI 208 / ZH 199) and tasks (planning 210 / trading 209 / investment 196). Split 553 train / 62 eval (90/10, seed 42). No real user data was used; all training data is synthetic.
- LoRA configuration:
  - rank (r): `16`, alpha: `32`, dropout: `0.05`
  - target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` (attention + MLP)
  - trainable parameters: 18.5M of 1.56B (1.18%)
- Training run: NVIDIA A100-80GB (Google Colab), bf16 LoRA; no quantization or gradient checkpointing was needed at this VRAM. 2 epochs, learning rate `2e-4` (cosine schedule, 3% warmup), effective batch size 6, max sequence length 2048 with sequence packing. The runbook is VRAM-adaptive and falls back to QLoRA (4-bit NF4) with gradient checkpointing on GPUs with less than 24 GB of VRAM.
- Reproducibility: seed 42 throughout; full run provenance (hyperparameters, dataset stats, metrics, library versions) recorded in the adapter's `manifest.json`.
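The trainable-parameter figure can be sanity-checked by hand. A LoRA adapter of rank r on a linear layer with input size d_in and output size d_out adds r × (d_in + d_out) weights. The sketch below applies that to the seven target modules; the layer dimensions are assumptions taken from the base model's published configuration, not from this card.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)


# Assumed Qwen2.5-1.5B dimensions (from the base model config, not this card).
HIDDEN = 1536        # hidden size
INTERMEDIATE = 8960  # MLP intermediate size
KV_DIM = 256         # 2 key/value heads x head dimension 128
LAYERS = 28          # decoder layers
R = 16               # LoRA rank used by this adapter

# (input dim, output dim) of each target module in one decoder layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGETS.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # 18,464,768
```

Under these assumptions the count comes to about 18.5M, consistent with the figure reported in the list above.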
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_id = "bibbbu/lora-qwen25-1p5b-finbot-v2"

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are FinBot, a financial planning assistant. Respond with a single JSON object matching the FinBot recommendation schema."},
    {"role": "user", "content": "Profile: age 30, income bucket B, risk tolerance: moderate, goal: retirement savings, horizon: 25 years, language: en. Generate a recommendation."},
]

# Render the chat template to input IDs and move them to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
The adapter runs on CUDA and Apple Silicon (MPS). For the full system — deterministic dialogue state machine, slot parsing, multilingual i18n, PII bucketing, prompt-injection defence, and JSON validation/repair pipeline — see the FinBot repository.
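When the adapter is used outside the full system, the generated text still has to be parsed before it can be trusted. The following is a minimal sketch of one way to do that with the standard library; `extract_json_object` is a hypothetical helper, not part of the FinBot validation/repair pipeline, and it only recovers a JSON object from the text without checking it against any schema.

```python
import json


def extract_json_object(text: str):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass
        else:
            if isinstance(obj, dict):
                return obj
        # Try the next opening brace.
        start = text.find("{", start + 1)
    return None
```

A `None` result is best treated as a failed generation (retry or refuse to display) rather than patched up silently.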
Intended use & limitations
Intended use: research and demonstration of structured-output fine-tuning for small, locally deployed LLMs; portfolio/educational use; as the recommendation component inside the FinBot system, where its output is schema-validated before display.
Limitations:
- ⚠️ Not financial advice. This model is a research prototype. Its outputs must not be used to make real investment, trading, or financial-planning decisions.
- Trained entirely on synthetic, distilled data — it inherits the biases and blind spots of the teacher models and the prompt distribution used for generation.
- A 1.5B-parameter model can and will hallucinate; the adapter improves format reliability and conciseness, not factual financial knowledge.
- The model expects inputs in the FinBot prompt format (pre-bucketed user context). Free-form financial Q&A is out of distribution.
- Coverage is limited to English, Vietnamese, and Chinese.
Citation
```bibtex
@misc{vu2026finbotlora,
  title  = {FinBot LoRA: Structured-Output Fine-Tuning of Qwen2.5-1.5B for a Multilingual Local Financial Assistant},
  author = {Vu, Tuong Vy},
  year   = {2026},
  url    = {https://huggingface.co/bibbbu/lora-qwen25-1p5b-finbot-v2}
}
```