
License: apache-2.0

Headline result

| Metric | Qwen-base | GPT-4o-mini | GPT-4o | This model |
|---|---|---|---|---|
| Hinglish marker density | 8.9% | 29.5% | 24.6% | 31.6% |
| English drift rate | 32% | 0% | 4% | 0% |
| Devanagari injection bug | 12.5% | 0% | 2.5% | 0% |
| Claude judge register score (/5) | 1.24 | 2.50 | 2.12 | 3.98 |
| Claude judge total (/20) | 6.72 | 13.56 | 12.90 | 12.48 |

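Bottom line: Matches or exceeds GPT-4o-mini on Hinglish register naturalness (register score 3.98 vs 2.50, marker density 31.6% vs 29.5%), at a serving cost ranging from parity to ~3.4× lower depending on infrastructure choice. Trails GPT-4o-mini by ~8% on the Claude judge total (12.48 vs 13.56 out of 20); because its register score is higher, the shortfall comes from the non-register axes, i.e. content quality (intent accuracy + factuality). Best suited to style-sensitive conversational use cases with sustained traffic, where dedicated GPU instances become economical compared with per-token API pricing.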
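The relationship between the register score and the judge total can be checked with a little arithmetic. The sketch below assumes the /20 total is the plain sum of four /5 axes (register plus three others), which the card implies but does not state; the scores are copied from the headline table, and the helper names are illustrative.

```python
# Scores copied from the headline table: (register /5, total /20).
# Assumption: total = register + three other /5 axes.
scores = {
    "Qwen-base": (1.24, 6.72),
    "GPT-4o-mini": (2.50, 13.56),
    "GPT-4o": (2.12, 12.90),
    "This model": (3.98, 12.48),
}


def non_register_subtotal(register: float, total: float) -> float:
    """Points (out of 15) earned on the axes other than register."""
    return round(total - register, 2)


def relative_gap(ours: float, theirs: float) -> float:
    """Fractional shortfall of `ours` relative to `theirs` (positive = we trail)."""
    return (theirs - ours) / theirs


subtotals = {name: non_register_subtotal(r, t) for name, (r, t) in scores.items()}
total_gap = relative_gap(scores["This model"][1], scores["GPT-4o-mini"][1])
content_gap = relative_gap(subtotals["This model"], subtotals["GPT-4o-mini"])

for name, sub in subtotals.items():
    print(f"{name:12s} non-register subtotal: {sub:.2f} / 15")
print(f"Gap vs GPT-4o-mini on total:              {total_gap:.1%}")
print(f"Gap vs GPT-4o-mini on non-register axes:  {content_gap:.1%}")
```

Under that assumption, the ~8% figure corresponds to the overall total, while the shortfall on the non-register axes alone is larger; the per-axis scores are not published here, so this split is only an estimate.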

Cost comparison (measured)

Benchmarked on NVIDIA T4 (HuggingFace transformers, fp16, batch=16, ~294 tok/sec).

| Infrastructure | $/M tokens | vs GPT-4o-mini |
|---|---|---|
| AWS T4 on-demand | $0.50 | parity |
| GCP T4 on-demand | $0.33 | 1.5× cheaper |
| AWS T4 reserved (1yr) | $0.30 | 1.7× cheaper |
| RunPod community | $0.18 | 2.8× cheaper |
| AWS T4 spot | $0.15 | 3.4× cheaper |
| GPT-4o-mini API | $0.51 (blended 20%/80% in/out) | baseline |
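The per-million-token figures follow from throughput and hourly instance price. The sketch below shows one way to reproduce that arithmetic; the hourly rate and the GPT-4o-mini list prices are assumptions for illustration (the card states neither), so substitute current values from your provider.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD to generate one million tokens on an instance billed by the hour."""
    seconds_per_million = 1_000_000 / tokens_per_sec
    return hourly_usd * seconds_per_million / 3600


def blended_api_price(input_usd: float, output_usd: float, input_share: float) -> float:
    """Weighted per-million-token API price for a given input/output token mix."""
    return input_share * input_usd + (1 - input_share) * output_usd


THROUGHPUT = 294  # tok/sec, the measured T4 figure quoted above

# Assumed GPT-4o-mini list prices in USD per million tokens (not from this card).
baseline = blended_api_price(input_usd=0.15, output_usd=0.60, input_share=0.20)

# Hypothetical hourly rate for a T4 instance; replace with your provider's price.
example_hourly = 0.526
self_hosted = cost_per_million_tokens(example_hourly, THROUGHPUT)

print(f"API baseline: ${baseline:.2f}/M tokens")
print(f"Self-hosted:  ${self_hosted:.2f}/M tokens")
print(f"Ratio:        {baseline / self_hosted:.1f}x")
```

Note that this treats the GPU as fully utilized; at lower utilization the effective self-hosted cost per token rises proportionally.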

Note: a vLLM or TGI deployment would likely improve self-hosted throughput by ~50-100%, shifting the comparison further in the fine-tune's favor. This was not benchmarked here due to environment constraints.
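To see what such a throughput gain would mean for cost, the sketch below rescales the measured $/M figures by a speedup factor. It assumes the hourly price stays the same and that the ~50-100% estimate holds, which was not measured; `projected_cost` is an illustrative helper.

```python
# $/M tokens at the measured ~294 tok/sec, copied from the cost table.
measured_cost = {
    "AWS T4 on-demand": 0.50,
    "GCP T4 on-demand": 0.33,
    "AWS T4 reserved (1yr)": 0.30,
    "RunPod community": 0.18,
    "AWS T4 spot": 0.15,
}
API_BASELINE = 0.51  # GPT-4o-mini blended $/M tokens, from the same table


def projected_cost(cost_usd: float, speedup: float) -> float:
    """Cost per million tokens if throughput rises by `speedup` at the same hourly price."""
    return cost_usd / speedup


for name, cost in measured_cost.items():
    best = projected_cost(cost, 2.0)   # +100% throughput
    worst = projected_cost(cost, 1.5)  # +50% throughput
    print(
        f"{name:24s} ${best:.2f}-${worst:.2f}/M  "
        f"({API_BASELINE / worst:.1f}x-{API_BASELINE / best:.1f}x cheaper)"
    )
```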

How to use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model; device_map="auto" places it on the available device(s)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "DSMJ910/qwen2.5-3b-hinglish-lora")

messages = [{"role": "user", "content": "Bhai weekend pe Bangalore mein kya karein?"}]
# return_dict=True also returns the attention mask; send inputs to the model's
# device instead of a hard-coded "cuda"
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

Training details

  • Base model: Qwen2.5-3B-Instruct (4-bit NF4 quantization)
  • Adapter: LoRA rank=16, alpha=32, dropout=0
  • Target modules: all linear layers (q/k/v/o projections + MLP gate/up/down)
  • Trainable parameters: ~30M (~1% of base model)
  • Training data: 10,594 synthetic Hinglish instruction examples (see dataset link)
  • Hyperparameters: lr=2e-4, batch_size=16 (effective), 2 epochs, AdamW 8-bit, linear schedule, bf16
  • Hardware: Single Blackwell GPU (95 GB VRAM)
  • Training time: 9.2 minutes
  • Adapter size: 125 MB
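The trainable-parameter figure can be sanity-checked from the LoRA rank and the shapes of the adapted layers. The sketch below assumes the published Qwen2.5-3B dimensions (hidden size 2048, MLP width 11008, 36 layers, 2 key/value heads of dimension 128); these are not stated in this card, so verify them against the model's `config.json`.

```python
# Assumed Qwen2.5-3B dimensions (check against the model's config.json).
HIDDEN = 2048         # hidden_size
INTERMEDIATE = 11008  # intermediate_size (MLP width)
LAYERS = 36           # num_hidden_layers
KV_DIM = 2 * 128      # num_key_value_heads * head_dim (grouped-query attention)
RANK = 16             # LoRA rank from the list above

# (in_features, out_features) for each adapted linear layer in one block.
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in modules.values())
total = per_layer * LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total:     {total:,}")
print(f"fp32 adapter size:         {total * 4 / 1e6:.0f} MB")
```

Under these assumptions the count comes to about 29.9M, consistent with the ~30M figure above, and storing the weights in fp32 would take roughly 120 MB, close to the reported adapter size once file overhead is included.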

Evaluation

Quantitative

  • 50-prompt hand-curated Hinglish eval set (4 categories: casual, customer support, Q&A, sentiment)
  • Automated metrics: Hinglish marker density, English drift detection, Devanagari injection check
  • LLM-as-judge: Claude Sonnet 4.6 evaluating pairwise on 4 axes (Register, Intent, Quality, Culture)
  • Methodological note: Claude was chosen as the judge because it comes from a different vendor than GPT-4o-mini, which generated the training data; this avoids evaluation circularity.
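The card does not publish its implementation of the automated metrics, so the following is only a minimal sketch of how they could be computed. The marker lexicon, the tokenization rule, and the drift threshold are all hypothetical; only the Devanagari Unicode range (U+0900 to U+097F) is a fixed fact.

```python
import re

# Devanagari Unicode block: U+0900 to U+097F.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")

# Hypothetical marker lexicon for illustration; the card does not publish its list.
HINGLISH_MARKERS = {
    "hai", "hain", "kya", "nahi", "bhai", "yaar",
    "mein", "aur", "ka", "ki", "ke", "toh",
}


def has_devanagari(text: str) -> bool:
    """True if the text contains any Devanagari character."""
    return bool(DEVANAGARI.search(text))


def marker_density(text: str, markers: set[str] = HINGLISH_MARKERS) -> float:
    """Fraction of Roman-script word tokens that appear in the marker lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in markers for tok in tokens) / len(tokens)


def is_english_drift(text: str, threshold: float = 0.05) -> bool:
    """Flag a response whose marker density falls below a (hypothetical) threshold."""
    return marker_density(text) < threshold


print(marker_density("Bhai kya scene hai"))
print(has_devanagari("Bhai, यह अच्छा hai"))
print(is_english_drift("The weather is nice today"))
```

A real lexicon would need to be much larger and would have to handle spelling variants, since Roman-script Hinglish has no standard orthography.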

Known limitations

  1. Roman script only. Training data is 100% Roman-script Hinglish; mixed-script inputs (containing Devanagari) may not be handled robustly. A future v2 is planned to address this.
  2. Conversational > instructional. The model defaults to a "friendly chat" mode, which sometimes reduces precision on classification tasks (e.g., it confuses sentiment classification with intent classification).
  3. Synthetic training data. All training examples were generated by GPT-4o-mini; this introduces stylistic patterns specific to GPT-4o-mini that the fine-tune inherits.
  4. Small eval set. N=50 prompts; larger evaluation would tighten confidence intervals.
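To make the small-sample caveat concrete, the sketch below computes a Wilson score interval for an observed rate. It assumes each prompt is an independent pass/fail trial and that the rate's denominator is the full 50-prompt set; the 500-prompt case is purely hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 0 failures observed out of 50 prompts (assumed denominator).
low, high = wilson_interval(0, 50)
print(f"n=50:  {low:.1%} to {high:.1%}")

# Same observed rate with a hypothetical 500-prompt set.
low, high = wilson_interval(0, 500)
print(f"n=500: {low:.1%} to {high:.1%}")
```

Under these assumptions, a 0% observed rate at N=50 is still compatible with a true rate of up to roughly 7%, whereas the same observation over 500 prompts would bound it below 1%.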

Citation

If you use this model, please cite:

```bibtex
@misc{hinglish-qwen-3b-2026,
  title={Qwen2.5-3B Hinglish: QLoRA Fine-tuning for Indian Code-Mixed Conversation},
  author={Muskan Jaiswal},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/DSMJ910/qwen2.5-3b-hinglish-lora}
}
```

Model provider: DSMJ910

Model tree

  • Base: Qwen/Qwen2.5-3B-Instruct
  • Adapter: this model

Modalities

  • Input: Text
  • Output: Text
