Qwen2.5-3B-legal-vn API & Inference Endpoint

Model Description

This model is a fine-tuned version of Qwen2.5-3B-Instruct, optimized using Unsloth and quantized to 4-bit (bitsandbytes).

The primary fine-tuning objective focuses on Vietnamese Legal Domain Mastery and Context-Based Question Answering (RAG). Unlike general-purpose LLMs that reply strictly from pre-trained parametric memory, this model is specifically aligned to prioritize and reason directly over the provided context (legal articles, decrees, and circulars), drastically reducing hallucination and ensuring high-fidelity legal consulting.

Key Enhancements:

Strict Context Adherence: Tuned to extract facts and formulate arguments heavily based on the input context—making it perfect for Retrieval-Augmented Generation (RAG) pipelines.
Legal Formalism: Adopts the authoritative, formal, and precise tone required in the Vietnamese administrative and legal sectors.
Hardware Efficiency: Operates smoothly within a very low VRAM footprint during inference, leaving ample head-room for long contexts and tool-calling structures on mid-range GPUs.

Model Details

Table with columns: Property, Value
Property	Value
Base Model	Qwen/Qwen2.5-3B-Instruct
Parameters	3 Billion
Quantization	4-bit (bitsandbytes / bnb-4bit)
Fine-tuning Method	QLoRA (Rank 16, Alpha 32)
Primary Task	Context-Driven Legal Q&A / Legal Agent

Deployment & Inference (vLLM)

To host this model using vLLM on standard cloud environments.

Start the vLLM Server:

bash
!python -m vllm.entrypoints.openai.api_server \
    --model unsloth/Qwen2.5-3B-Instruct-bnb-4bit \
    --max-model-len 2048 \ # Can be changed
    --dtype float16 \
    --api-key 'your-api-key-here' \
    --max-num-seqs 16 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \ # Can be changed
    --enforce-eager \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 8000 &

Model Description

This model is a fine-tuned version of Qwen2.5-3B-Instruct, optimized using Unsloth and quantized to 4-bit (bitsandbytes).

Key Enhancements:

Strict Context Adherence: Tuned to extract facts and formulate arguments heavily based on the input context—making it perfect for Retrieval-Augmented Generation (RAG) pipelines.
Legal Formalism: Adopts the authoritative, formal, and precise tone required in the Vietnamese administrative and legal sectors.
Hardware Efficiency: Operates smoothly within a very low VRAM footprint during inference, leaving ample head-room for long contexts and tool-calling structures on mid-range GPUs.

Model Details

Table with columns: Property, Value
Property	Value
Base Model	Qwen/Qwen2.5-3B-Instruct
Parameters	3 Billion
Quantization	4-bit (bitsandbytes / bnb-4bit)
Fine-tuning Method	QLoRA (Rank 16, Alpha 32)
Primary Task	Context-Driven Legal Q&A / Legal Agent

Deployment & Inference (vLLM)

To host this model using vLLM on standard cloud environments.

Start the vLLM Server:

bash
!python -m vllm.entrypoints.openai.api_server \
    --model unsloth/Qwen2.5-3B-Instruct-bnb-4bit \
    --max-model-len 2048 \ # Can be changed
    --dtype float16 \
    --api-key 'your-api-key-here' \
    --max-num-seqs 16 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \ # Can be changed
    --enforce-eager \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 8000 &

Qwen2.5-3B-legal-vn

README

Model Description

Key Enhancements:

Model Details

Deployment & Inference (vLLM)

Start the vLLM Server:

Explore FriendliAI today

README

Model Description

Key Enhancements:

Model Details

Deployment & Inference (vLLM)

Start the vLLM Server: