DSMJ910/qwen2.5-3b-hinglish-lora API & Inference Endpoint

Headline result

Metric	Qwen-base	GPT-4o-mini	GPT-4o	This model
Hinglish marker density	8.9%	29.5%	24.6%	31.6%
English drift rate	32%	0%	4%	0%
Devanagari injection bug	12.5%	0%	2.5%	0%
Claude judge register score (/5)	1.24	2.50	2.12	3.98
Claude judge total (/20)	6.72	13.56	12.90	12.48

Bottom line: Matches or exceeds GPT-4o-mini on Hinglish register naturalness (register score 3.98 vs 2.50 out of 5), at a serving cost that ranges from parity to ~3.4× lower depending on infrastructure choice. Trails GPT-4o-mini by ~8% on the Claude judge total (12.48 vs 13.56 out of 20), with the gap coming from the non-register axes (intent accuracy, factuality). Best suited to style-sensitive conversational use cases at sustained traffic, where dedicated GPU instances become economical compared with per-token API pricing.
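The comparisons in the bottom line can be re-derived from the table. The sketch below assumes the /20 judge total is the sum of four /5 axis scores with register as one of them; the card does not state this explicitly. The helper name `relative_gap` is illustrative.

```python
# Scores copied from the headline table: (register score /5, judge total /20)
scores = {
    "gpt-4o-mini": (2.50, 13.56),
    "this-model": (3.98, 12.48),
}


def relative_gap(value, baseline):
    """Fraction by which `value` falls short of `baseline` (negative if it exceeds it)."""
    return (baseline - value) / baseline


reg_ft, total_ft = scores["this-model"]
reg_api, total_api = scores["gpt-4o-mini"]

total_gap = relative_gap(total_ft, total_api)
register_gap = relative_gap(reg_ft, reg_api)
# Assumption: total minus register approximates the three remaining axes combined.
rest_gap = relative_gap(total_ft - reg_ft, total_api - reg_api)

print(f"judge total gap:        {total_gap:.1%}")
print(f"register gap:           {register_gap:.1%}")
print(f"non-register axes gap:  {rest_gap:.1%}")
```

A negative gap means the fine-tune scores higher than GPT-4o-mini on that measure.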

Cost comparison (measured)

Benchmarked on NVIDIA T4 (HuggingFace transformers, fp16, batch=16, ~294 tok/sec).

Infrastructure	$/M tokens	vs GPT-4o-mini
AWS T4 on-demand	$0.50	parity
GCP T4 on-demand	$0.33	1.5× cheaper
AWS T4 reserved (1yr)	$0.30	1.7× cheaper
RunPod community	$0.18	2.8× cheaper
AWS T4 spot	$0.15	3.4× cheaper
GPT-4o-mini API	$0.51 (blended 20%/80% in/out)	baseline

Note: a vLLM or TGI deployment would likely improve self-hosted throughput by ~50-100%, shifting the comparison further in the fine-tune's favor. This was not benchmarked here due to environment constraints.
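The $/M tokens figures follow from a GPU's hourly price and the measured throughput. A minimal sketch of that arithmetic is below. The hourly rate of $0.526 and the GPT-4o-mini list prices ($0.15 input, $0.60 output per million tokens) are assumptions for illustration and are not stated in this card; substitute current prices before relying on the output.

```python
def cost_per_million_tokens(hourly_usd, tokens_per_sec):
    """Self-hosted serving cost in USD per million generated tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def blended_api_price(input_usd_per_m, output_usd_per_m, input_share):
    """Per-token API price in USD per million tokens for a given input/output mix."""
    return input_share * input_usd_per_m + (1 - input_share) * output_usd_per_m


# Measured throughput from the benchmark above.
THROUGHPUT_TOK_PER_SEC = 294

# Assumed hourly rate for a single-T4 on-demand instance (hypothetical value).
self_hosted = cost_per_million_tokens(0.526, THROUGHPUT_TOK_PER_SEC)

# Assumed GPT-4o-mini list prices, blended 20% input / 80% output.
api = blended_api_price(0.15, 0.60, input_share=0.20)

print(f"self-hosted: ${self_hosted:.2f}/M tokens")
print(f"API blended: ${api:.2f}/M tokens")
print(f"ratio:       {api / self_hosted:.1f}x")
```

Because cost scales inversely with throughput, doubling tokens per second at the same hourly rate halves the self-hosted figure.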

How to use

python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "DSMJ910/qwen2.5-3b-hinglish-lora")

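# Build a chat-formatted prompt and place it on the same device as the model
messages = [{"role": "user", "content": "Bhai weekend pe Bangalore mein kya karein?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))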

Training details

Base model: Qwen2.5-3B-Instruct (4-bit NF4 quantization)
Adapter: LoRA rank=16, alpha=32, dropout=0
Target modules: all linear layers (q/k/v/o projections + MLP gate/up/down)
Trainable parameters: ~30M (~1% of base model)
Training data: 10,594 synthetic Hinglish instruction examples (see dataset link)
Hyperparameters: lr=2e-4, batch_size=16 (effective), 2 epochs, AdamW 8-bit, linear schedule, bf16
Hardware: Single Blackwell GPU (95 GB VRAM)
Training time: 9.2 minutes
Adapter size: 125 MB
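The ~30M trainable-parameter figure can be checked by hand: a LoRA adapter of rank r on a linear layer with `in` inputs and `out` outputs adds r × (in + out) weights. The sketch below assumes the layer dimensions from the published Qwen2.5-3B configuration (hidden size 2048, MLP size 11008, 36 layers, 16 attention heads, 2 key/value heads); these values are not stated in this card.

```python
# Assumed Qwen2.5-3B dimensions (taken from the base model's config, not this card)
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
NUM_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS        # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM      # 256 (grouped-query attention)

RANK = 16  # LoRA rank from the training details above

# (in_features, out_features) for each targeted linear layer in one block
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Weights in the A (rank x in) and B (out x rank) matrices of one adapter."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")  # about 29.9M under these assumptions
```

Under these assumptions the total comes to just under 30M, consistent with the figure listed above.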

Evaluation

Quantitative

50-prompt hand-curated Hinglish eval set (4 categories: casual, customer support, Q&A, sentiment)
Automated metrics: Hinglish marker density, English drift detection, Devanagari injection check
LLM-as-judge: Claude Sonnet 4.6 evaluating pairwise on 4 axes (Register, Intent, Quality, Culture)
Methodological note: The judge is Claude, which comes from a different vendor than the training-data generator (GPT-4o-mini), to avoid evaluation circularity.
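The card does not define the three automated metrics precisely, so the following is only a sketch of one plausible implementation. The marker word list, the drift threshold, and all function names are hypothetical; the only fixed fact used is that the Devanagari script occupies the Unicode block U+0900 to U+097F.

```python
import re

# Hypothetical marker list: common Romanized Hindi words (not the card's actual list)
HINGLISH_MARKERS = {
    "hai", "hain", "nahi", "kya", "bhai", "yaar", "mein", "ka", "ki", "ke",
    "ko", "se", "aur", "toh", "bhi", "kar", "karo", "hoga", "tha", "accha",
}

DEVANAGARI = re.compile(r"[\u0900-\u097F]")  # Devanagari Unicode block
WORD = re.compile(r"[A-Za-z']+")              # Roman-script word tokens


def marker_density(text):
    """Share of Roman-script words that appear in the marker list."""
    words = [w.lower() for w in WORD.findall(text)]
    if not words:
        return 0.0
    return sum(w in HINGLISH_MARKERS for w in words) / len(words)


def has_devanagari(text):
    """True if the response contains any Devanagari character."""
    return DEVANAGARI.search(text) is not None


def is_english_drift(text, threshold=0.05):
    """Assumed definition: a response whose marker density falls below the threshold."""
    return marker_density(text) < threshold


def summarize(responses):
    """Average the three metrics over a non-empty list of model responses."""
    n = len(responses)
    return {
        "marker_density": sum(marker_density(r) for r in responses) / n,
        "english_drift_rate": sum(is_english_drift(r) for r in responses) / n,
        "devanagari_rate": sum(has_devanagari(r) for r in responses) / n,
    }


print(summarize(["Bhai weekend pe kya plan hai", "Sure, here are some weekend ideas."]))
```

A word-list approach is cheap and reproducible but sensitive to spelling variants in Romanized Hindi, so the absolute densities depend heavily on the list chosen.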

Known limitations

Roman script only. Training data is 100% Roman Hinglish; mixed-script inputs (Devanagari) may not be handled robustly. A future v2 is planned to address this.
Conversational > instructional. Model defaults to "friendly chat" mode which sometimes reduces precision on classification tasks (e.g., confuses sentiment vs intent classification).
Synthetic training data. All training examples were generated by GPT-4o-mini; this introduces stylistic patterns specific to GPT-4o-mini that the fine-tune inherits.
Small eval set. N=50 prompts; larger evaluation would tighten confidence intervals.

Citation

If you use this model, please cite:

bibtex
@misc{hinglish-qwen-3b-2026,
  title={Qwen2.5-3B Hinglish: QLoRA Fine-tuning for Indian Code-Mixed Conversation},
  author={Muskan Jaiswal},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/DSMJ910/qwen2.5-3b-hinglish-lora}
}
