License: apache-2.0

🚀 Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# This repo holds a LoRA adapter on top of TinyLlama; transformers loads the
# base model automatically when `peft` is installed.
model = AutoModelForCausalLM.from_pretrained(
    "Havoc999/tiny-chatbot",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Havoc999/tiny-chatbot")

# Same Alpaca-style prompt format the model was fine-tuned on
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "Explain the water cycle in simple terms.\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.15,
)

# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
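The Quick Start prompt leaves out the optional `### Input:` block. The training format adds that block for Alpaca examples that carry extra context. The sketch below builds an inference prompt that mirrors the `format_alpaca` layout used for training. The helper name `build_prompt` is hypothetical and not part of this repo.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Build an Alpaca-style inference prompt (hypothetical helper).

    The "### Input:" block is included only when `input_text` is non-empty,
    matching the training-time formatting. The prompt stops right after
    "### Response:\n" so the model continues with the answer.
    """
    input_section = f"### Input:\n{input_text}\n\n" if input_text.strip() else ""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"{input_section}"
        "### Response:\n"
    )


print(build_prompt("Summarise the text.", "Water evaporates, condenses into clouds and falls as rain."))
```

You can pass the returned string to `tokenizer(...)` exactly like `prompt` in the Quick Start.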
Multi-turn (Chat Template)
```python
from transformers import pipeline

# Reuses `model` and `tokenizer` from the Quick Start snippet
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "user", "content": "What is photosynthesis?"},
]

# TinyLlama-Chat supports the built-in chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# `generated_text` includes the prompt; pass return_full_text=False to get only the reply
print(pipe(prompt, max_new_tokens=200)[0]["generated_text"])
```
📊 Benchmark Results
All benchmarks were evaluated after fine-tuning, using greedy decoding unless otherwise noted.
MMLU — Elementary Mathematics
| Metric | Value |
|---|---|
| Samples evaluated | 50 |
| Correct | 15 |
| Invalid outputs | 4 |
| Accuracy | 30.00% |
| Random baseline (4-way) | 25.00% |
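Accuracy is 5 percentage points (pp) above the 4-way random baseline, with invalid outputs counted as incorrect (15 / 50 = 30%). With only 50 samples, that margin is smaller than one standard error (about 6 pp), so it is not clearly distinguishable from chance. The result points to at most marginal elementary math ability, consistent with the small 1.1 B parameter count and an English instruction dataset that contains limited mathematical content.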
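The MMLU figures are consistent with invalid outputs being scored as incorrect (15 / 50 = 30%). The sketch below shows that scoring rule. It also computes the binomial standard error, which shows how much uncertainty remains at 50 samples. The function names and the answer-extraction rule (the first standalone letter A to D in the output) are assumptions. This is not the evaluation code used for this card.

```python
import math
import re

# Hypothetical extraction rule: the first standalone capital letter A-D is the answer;
# an output with no such letter counts as invalid.
ANSWER_RE = re.compile(r"\b([ABCD])\b")


def score_mcq(outputs, gold):
    """Return (accuracy, n_invalid); invalid outputs are scored as incorrect."""
    correct = invalid = 0
    for text, answer in zip(outputs, gold):
        match = ANSWER_RE.search(text)
        if match is None:
            invalid += 1
        elif match.group(1) == answer:
            correct += 1
    return correct / len(gold), invalid


def binomial_se(p, n):
    """Standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)


acc = 15 / 50  # MMLU Elementary Mathematics result from the table
print(f"accuracy={acc:.2%}, margin over chance={acc - 0.25:+.2%}, "
      f"SE at chance={binomial_se(0.25, 50):.2%}")
```

At chance level (p = 0.25, n = 50), the standard error is about 6.1 pp.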
HellaSwag (commonsense NLI)
| Metric | Score | Samples |
|---|---|---|
| Accuracy | 0.4550 | 200 |
| Accuracy (normalised) | 0.5600 | 200 |
HellaSwag offers four candidate endings per context, so random chance is 0.25. A normalised accuracy of 0.56 is therefore well above random. HellaSwag is widely used as a proxy for commonsense language understanding.
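Multiple-choice benchmarks such as HellaSwag, PIQA and ARC are usually scored by comparing the model's log-likelihood for each candidate rather than by free-form generation. The sketch below implements the two metrics. It assumes the lm-evaluation-harness convention; the card does not name its evaluation tool.

- `acc` picks the candidate with the highest total log-likelihood.
- `acc_norm` first divides each total by the candidate's length in characters, which removes the bias toward short endings.

The log-likelihood values and helper names are made up for illustration.

```python
def pick(logliks, endings, normalise):
    """Index of the chosen ending; optionally normalise by character length."""
    scores = [ll / len(e) if normalise else ll for ll, e in zip(logliks, endings)]
    return max(range(len(scores)), key=scores.__getitem__)


def mc_accuracy(examples, normalise=False):
    """examples: list of (logliks, endings, gold_index) tuples."""
    hits = sum(pick(ll, ends, normalise) == gold for ll, ends, gold in examples)
    return hits / len(examples)


# Made-up log-likelihoods for one 4-way item: the long correct ending (index 2)
# has the lowest total log-likelihood but the best per-character score.
item = (
    [-8.0, -9.5, -14.0, -10.0],
    ["runs away.", "sits down.", "carefully folds the towel into a neat square.", "falls."],
    2,
)
print(mc_accuracy([item]), mc_accuracy([item], normalise=True))  # 0.0 1.0
```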
PIQA (physical intuition QA)
| Metric | Score | Samples |
|---|---|---|
| Accuracy | 0.7450 | 200 |
| Accuracy (normalised) | 0.7400 | 200 |
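PIQA tests physical intuition and everyday procedural knowledge, with two candidate solutions per question, so random chance is 0.50. A score of 0.74 is a solid result for a 1.1 B model. It suggests that the world knowledge from base pre-training largely survives instruction fine-tuning, although the base model's own score is not reported here for comparison.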
ARC Challenge (grade-school science)
| Metric | Score | Samples |
|---|---|---|
| Accuracy | 0.3050 | 200 |
| Accuracy (normalised) | 0.3500 | 200 |
ARC-Challenge targets questions that require reasoning beyond simple retrieval. Most questions have four answer options, so chance is roughly 0.25. The 0.35 normalised score reflects the model's limitations on multi-step reasoning at this scale.
Summary
| Benchmark | Metric | Score |
|---|---|---|
| MMLU Elem. Math | Accuracy | 30.00% |
| HellaSwag | Acc (norm) | 56.00% |
| PIQA | Acc (norm) | 74.00% |
| ARC Challenge | Acc (norm) | 35.00% |
📋 Training Details
| Setting | Value |
|---|---|
| Base model | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| Dataset | tatsu-lab/alpaca |
| Train split | 45,000 examples |
| Eval split | 2,000 examples |
| Fine-tuning method | LoRA (PEFT) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~12.6 M / 1.1 B (~1.13%) |
| Precision | float16 (AMP) |
| Epochs | 3 |
| Per-GPU batch size | 4 |
| Gradient accumulation | 4 steps |
| Effective global batch | 32 (4 × 2 GPUs × 4 accum) |
| Peak learning rate | 2e-4 |
| LR scheduler | Cosine annealing |
| Warmup ratio | 3% |
| Gradient checkpointing | Enabled |
| NEFTune noise alpha | 5 |
| Hardware | Kaggle Dual T4 (2 × 16 GiB VRAM) |
| Loss masking | Completion-only (response tokens only) |
| Early stopping patience | 3 evaluations |
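The trainable-parameter count follows from the LoRA rank and the shapes of the targeted projections. A rank-r adapter on a layer with d_in inputs and d_out outputs adds r·(d_in + d_out) weights. The sketch below assumes TinyLlama-1.1B's public config: 22 layers, hidden size 2048, MLP size 5632, and grouped-query attention with 4 key/value heads of dimension 64, so `k_proj` and `v_proj` output 256 features. On the real model, PEFT's `model.print_trainable_parameters()` reports the same quantity.

```python
# Assumed TinyLlama-1.1B shapes (from its public config.json)
NUM_LAYERS = 22
HIDDEN = 2048
INTERMEDIATE = 5632
KV_DIM = 4 * 64  # 4 key/value heads x head_dim 64 (grouped-query attention)

# (in_features, out_features) for each LoRA target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Weights added by LoRA per module: A is (rank x in), B is (out x rank); bias='none'."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return NUM_LAYERS * per_layer


rank, alpha = 16, 32
trainable = lora_params(rank)
base_total = 1_100_048_384  # assumed TinyLlama-1.1B parameter count
print(f"{trainable:,} trainable ({trainable / (base_total + trainable):.2%} of all), "
      f"scaling alpha/r = {alpha / rank}")
```

This prints 12,615,680 trainable weights, about 1.13% of all parameters. The adapter update is scaled by lora_alpha / r = 2.0.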
⚙️ Reproduce
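```python
# Install dependencies. This script uses the older TRL API (dataset_text_field and
# max_seq_length passed to SFTTrainer, DataCollatorForCompletionOnlyLM). Recent TRL
# releases have moved or removed these, so an older TRL is assumed (e.g. "trl<0.12").
# pip install transformers datasets peft "trl<0.12" accelerate bitsandbytes huggingface_hub
# Launch on both T4s with: accelerate launch --num_processes 2 train.py
import torch
from accelerate import PartialState
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 1. Load dataset (simplified: uses the full Alpaca train split, without the
#    45k/2k train/eval split and early stopping listed under Training Details)
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# 2. Format examples; the "### Input:" block is added only when an input exists
def format_alpaca(ex):
    input_section = f"### Input:\n{ex['input']}\n\n" if ex["input"].strip() else ""
    return {
        "text": (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{ex['instruction']}\n\n"
            f"{input_section}"
            f"### Response:\n{ex['output']}"
        )
    }

dataset = dataset.map(format_alpaca, batched=False)

# 3. Load model + LoRA
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,  # T4 GPUs lack native bf16 support
    device_map={"": PartialState().local_process_index},  # one full copy per GPU under DDP
)
model.config.use_cache = False      # KV cache is not used during training with checkpointing
model.enable_input_require_grads()  # ensures gradients reach LoRA through checkpointed layers

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    bias="none", task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)

# Keep trainable LoRA weights in fp32 so fp16 AMP can unscale their gradients
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.float()

# 4. Train on response tokens only. The template is encoded with a leading "\n" and
#    the first two ids dropped (the workaround from the TRL docs for Llama-style
#    tokenizers), so its ids match how "### Response:\n" is tokenized mid-text.
response_template_ids = tokenizer.encode("\n### Response:\n", add_special_tokens=False)[2:]
collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    data_collator=collator,
    args=TrainingArguments(
        output_dir="./chatbot-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        neftune_noise_alpha=5,
        fp16=True,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},  # plays better with DDP
        save_strategy="steps", save_steps=200, save_total_limit=3,
        eval_strategy="no",
    ),
)
trainer.train()
```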
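The training settings determine the number of optimizer steps and the learning-rate curve ahead of time. The sketch below makes three assumptions:

- the 45,000-example train split from Training Details;
- the 32-example effective batch;
- the linear-warmup-then-half-cosine shape used by the `cosine` scheduler in transformers, with warmup steps rounded up from `warmup_ratio`.

The helper names are hypothetical. The Trainer's exact step count can differ slightly depending on how it handles partial batches.

```python
import math


def total_steps(num_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps, dropping the last partial batch of each epoch (simplification)."""
    return (num_examples // effective_batch) * epochs


def cosine_with_warmup(step: int, total: int, warmup: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then half-cosine decay to zero at `total`."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


steps = total_steps(45_000, effective_batch=4 * 2 * 4, epochs=3)
warmup = math.ceil(steps * 0.03)  # warmup_ratio = 3%
for s in (0, warmup, steps // 2, steps):
    print(s, f"{cosine_with_warmup(s, steps, warmup, 2e-4):.2e}")
```

Under these assumptions the run takes 4,218 optimizer steps. The first 127 of them warm up to the 2e-4 peak.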
⚠️ Limitations
- English only — the base model and Alpaca dataset are English-focused; other languages may produce incoherent outputs.
- Hallucination — like all generative models, this one can confidently state incorrect facts. Always verify important claims.
- Limited reasoning — at 1.1 B parameters, multi-step logical and mathematical reasoning is unreliable (see ARC / MMLU results above).
- No RLHF safety alignment — this model has not undergone reinforcement learning from human feedback. It inherits TinyLlama's base alignment only and may produce inappropriate responses to adversarial prompts.
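- Short context — fine-tuned with a maximum sequence length of 512 tokens (longer training examples were truncated), so quality on longer prompts or conversations is untested even though the TinyLlama base supports 2,048 tokens.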
- Not production-ready — intended as a learning artefact and research baseline, not a deployed consumer product.
📜 License
This model is released under the Apache 2.0 license, matching the TinyLlama base model. The tatsu-lab/alpaca training data itself is licensed CC BY-NC 4.0. See LICENSE for full terms.
Fine-tuned on Kaggle Dual T4 GPU · TRL SFTTrainer · LoRA via PEFT
Model provider: Havoc999
Model tree: base TinyLlama/TinyLlama-1.1B-Chat-v1.0; adapter: this model
Modalities: Text input, Text output