License: apache-2.0
Headline result
| Metric | Qwen-base | GPT-4o-mini | GPT-4o | This model |
|---|---|---|---|---|
| Hinglish marker density | 8.9% | 29.5% | 24.6% | 31.6% |
| English drift rate | 32% | 0% | 4% | 0% |
| Devanagari injection bug | 12.5% | 0% | 2.5% | 0% |
| Claude judge register score (/5) | 1.24 | 2.50 | 2.12 | 3.98 |
| Claude judge total (/20) | 6.72 | 13.56 | 12.90 | 12.48 |
Bottom line: Matches or exceeds GPT-4o-mini on Hinglish register naturalness, at a serving cost that ranges from parity to ~3.4× lower depending on infrastructure choice. Trails GPT-4o-mini by ~8% on content quality (intent accuracy + factuality). Best suited to style-sensitive conversational use cases with sustained traffic, where dedicated GPU instances become economical compared with per-token API pricing.
Cost comparison (measured)
Benchmarked on NVIDIA T4 (HuggingFace transformers, fp16, batch=16, ~294 tok/sec).
| Infrastructure | $/M tokens | vs GPT-4o-mini |
|---|---|---|
| AWS T4 on-demand | $0.50 | parity |
| GCP T4 on-demand | $0.33 | 1.5× cheaper |
| AWS T4 reserved (1yr) | $0.30 | 1.7× cheaper |
| RunPod community | $0.18 | 2.8× cheaper |
| AWS T4 spot | $0.15 | 3.4× cheaper |
| GPT-4o-mini API | $0.51 (blended 20%/80% in/out) | baseline |
Note: a vLLM or TGI deployment would likely improve self-hosted throughput by ~50-100%, shifting the comparison further in the fine-tune's favor. This was not benchmarked here due to environment constraints.
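The per-token figures above follow from simple arithmetic on throughput and hourly price. The sketch below shows one way to reproduce that calculation. The hourly GPU price and the GPT-4o-mini list prices used here are assumptions for illustration, not values stated in this card; only the ~294 tok/sec throughput and the 20%/80% input/output split come from the text above.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars per one million generated tokens for a GPU billed by the hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def blended_api_price(input_usd: float, output_usd: float, input_share: float) -> float:
    """Blended $/M tokens for an API that prices input and output separately."""
    return input_share * input_usd + (1 - input_share) * output_usd


THROUGHPUT = 294  # tok/sec, the measured T4 figure quoted above

# ASSUMPTION: illustrative on-demand hourly price for a single-T4 instance.
aws_t4_on_demand = cost_per_million_tokens(hourly_usd=0.526, tokens_per_sec=THROUGHPUT)

# ASSUMPTION: $0.15 input / $0.60 output per M tokens, mixed 20% input / 80% output.
gpt4o_mini = blended_api_price(input_usd=0.15, output_usd=0.60, input_share=0.2)

print(f"AWS T4 on-demand:    ${aws_t4_on_demand:.2f}/M tokens")
print(f"GPT-4o-mini blended: ${gpt4o_mini:.2f}/M tokens")
```

Under these assumed prices the two results land close to the on-demand and baseline rows of the table. Swapping in your own hourly rate and measured throughput gives the break-even point for your deployment.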
How to use
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "DSMJ910/qwen2.5-3b-hinglish-lora")

messages = [{"role": "user", "content": "Bhai weekend pe Bangalore mein kya karein?"}]
# return_dict=True yields input_ids and attention_mask; send both to the model's device
# (device_map="auto" may not place the model on "cuda")
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training details
- Base model: Qwen2.5-3B-Instruct (4-bit NF4 quantization)
- Adapter: LoRA rank=16, alpha=32, dropout=0
- Target modules: all linear layers (q/k/v/o projections + MLP gate/up/down)
- Trainable parameters: ~30M (~1% of base model)
- Training data: 10,594 synthetic Hinglish instruction examples (see dataset link)
- Hyperparameters: lr=2e-4, batch_size=16 (effective), 2 epochs, AdamW 8-bit, linear schedule, bf16
- Hardware: Single Blackwell GPU (95 GB VRAM)
- Training time: 9.2 minutes
- Adapter size: 125 MB
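As a sanity check on the "~30M trainable parameters" figure, the LoRA parameter count can be derived from the rank and the shapes of the targeted linear layers. The sketch below assumes the layer dimensions from the public Qwen2.5-3B configuration (hidden size 2048, MLP size 11008, 36 layers, 16 attention heads, 2 key/value heads); those dimensions are not stated in this card and should be checked against the model's `config.json`.

```python
# ASSUMPTION: architecture dimensions taken from the public Qwen2.5-3B config,
# not from this model card.
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
NUM_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS    # per-head width
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # grouped-query attention shrinks k/v projections

RANK = 16  # LoRA rank from the training details above


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)


per_layer = (
    lora_params(HIDDEN, HIDDEN)          # q_proj
    + lora_params(HIDDEN, KV_DIM)        # k_proj
    + lora_params(HIDDEN, KV_DIM)        # v_proj
    + lora_params(HIDDEN, HIDDEN)        # o_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # gate_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # up_proj
    + lora_params(INTERMEDIATE, HIDDEN)  # down_proj
)
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

With these assumed dimensions the count comes out just under 30 million, consistent with the figure listed above.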
Evaluation
Quantitative
- 50-prompt hand-curated Hinglish eval set (4 categories: casual, customer support, Q&A, sentiment)
- Automated metrics: Hinglish marker density, English drift detection, Devanagari injection check
- LLM-as-judge: Claude Sonnet 4.6 evaluating pairwise on 4 axes (Register, Intent, Quality, Culture)
- Methodological note: Used Claude (different vendor than training data generator GPT-4o-mini) to avoid evaluation circularity.
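The three automated metrics listed above can be approximated with a few lines of text processing. The sketch below is illustrative only: the marker word list, the drift threshold, and the function names are hypothetical, since the card does not publish the evaluation code or lexicon. The Devanagari check relies on the Unicode Devanagari block (U+0900 to U+097F).

```python
import re

# ASSUMPTION: a tiny illustrative marker list; the real evaluation lexicon is not given.
HINGLISH_MARKERS = {
    "hai", "hain", "nahi", "kya", "bhai", "yaar",
    "mein", "ka", "ki", "ke", "toh", "bhi",
}

DEVANAGARI = re.compile(r"[\u0900-\u097F]")  # Unicode Devanagari block
WORD = re.compile(r"[A-Za-z']+")             # Roman-script word tokens


def marker_density(text: str) -> float:
    """Fraction of Roman-script words that appear in the marker list."""
    words = [w.lower() for w in WORD.findall(text)]
    if not words:
        return 0.0
    return sum(w in HINGLISH_MARKERS for w in words) / len(words)


def has_devanagari(text: str) -> bool:
    """True if any Devanagari character appears in a reply meant to be Roman-only."""
    return bool(DEVANAGARI.search(text))


def is_english_drift(text: str, threshold: float = 0.05) -> bool:
    """Hypothetical rule: flag a reply whose marker density falls below the threshold."""
    return marker_density(text) < threshold


reply = "Bhai weekend pe kya plan hai"
print(marker_density(reply), has_devanagari(reply), is_english_drift(reply))
```

A production version would need a much larger lexicon and probably a language-identification model for drift detection; this only shows the shape of the checks.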
Known limitations
- Roman script only. Training data is 100% Roman Hinglish; mixed-script inputs (containing Devanagari) may not be handled robustly. A future v2 is planned to address this.
- Conversational > instructional. Model defaults to "friendly chat" mode which sometimes reduces precision on classification tasks (e.g., confuses sentiment vs intent classification).
- Synthetic training data. All training examples were generated by GPT-4o-mini; this introduces stylistic patterns specific to GPT-4o-mini that the fine-tune inherits.
- Small eval set. N=50 prompts; larger evaluation would tighten confidence intervals.
Citation
If you use this model, please cite:
```bibtex
@misc{hinglish-qwen-3b-2026,
  title={Qwen2.5-3B Hinglish: QLoRA Fine-tuning for Indian Code-Mixed Conversation},
  author={Muskan Jaiswal},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/DSMJ910/qwen2.5-3b-hinglish-lora}
}
```
Model provider
DSMJ910
Model tree
Base
Qwen/Qwen2.5-3B-Instruct
Adapter
this model
Modalities
Input
Text
Output
Text