Model Description
This model is a fine-tuned version of Qwen2.5-3B-Instruct, optimized using Unsloth and quantized to 4-bit (bitsandbytes).
The primary fine-tuning objective focuses on Vietnamese Legal Domain Mastery and Context-Based Question Answering (RAG). Unlike general-purpose LLMs that reply strictly from pre-trained parametric memory, this model is specifically aligned to prioritize and reason directly over the provided context (legal articles, decrees, and circulars), drastically reducing hallucination and ensuring high-fidelity legal consulting.
Key Enhancements:
- Strict Context Adherence: Tuned to extract facts and formulate arguments heavily based on the input context—making it perfect for Retrieval-Augmented Generation (RAG) pipelines.
- Legal Formalism: Adopts the authoritative, formal, and precise tone required in the Vietnamese administrative and legal sectors.
- Hardware Efficiency: Operates smoothly within a very low VRAM footprint during inference, leaving ample head-room for long contexts and tool-calling structures on mid-range GPUs.
Model Details
Table with columns: Property, Value| Property | Value |
|---|
| Base Model | Qwen/Qwen2.5-3B-Instruct |
| Parameters | 3 Billion |
| Quantization | 4-bit (bitsandbytes / bnb-4bit) |
| Fine-tuning Method | QLoRA (Rank 16, Alpha 32) |
| Primary Task | Context-Driven Legal Q&A / Legal Agent |
Deployment & Inference (vLLM)
To host this model using vLLM on standard cloud environments.
Start the vLLM Server:
!python -m vllm.entrypoints.openai.api_server \
--model unsloth/Qwen2.5-3B-Instruct-bnb-4bit \
--max-model-len 2048 \ # Can be changed
--dtype float16 \
--api-key 'your-api-key-here' \
--max-num-seqs 16 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \ # Can be changed
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--port 8000 &