Surpem/Supertron2.1-0.6B API & Inference Endpoint

Model Description

Supertron2.1-0.6B is an instruction-tuned language model built on top of Qwen3-0.6B. It is designed to be a small, efficient daily-driver model for reasoning, math, coding, general knowledge, writing, and assistant-style conversation while remaining lightweight enough to run on consumer hardware.

The model keeps the Qwen3 architecture, tokenizer, and chat format, so it works with standard Hugging Face Transformers workflows without custom code. Supertron2.1-0.6B is intended for users who want a compact generalist model that can answer questions, explain concepts, write code, solve structured problems, and follow natural language instructions.

Developed by: Surpem
Model type: Causal Language Model
Architecture: Dense Transformer, 0.6B parameter class
Fine-tuned from: Qwen/Qwen3-0.6B
License: Apache 2.0

Capabilities

Reasoning

Supertron2.1-0.6B is designed for clear, structured reasoning. It can break down questions into useful steps, compare options, explain tradeoffs, and provide concise conclusions when asked.

Math

The model can assist with arithmetic, algebra, word problems, step-by-step explanations, and checking calculations. It is useful for learning, practice, and lightweight problem solving.

Coding

Supertron2.1-0.6B can write, debug, and explain code across common programming languages including Python, JavaScript, TypeScript, C++, Java, Rust, and shell scripting. It can help with implementation details, algorithmic reasoning, refactoring suggestions, and code explanations.

Science & General Knowledge

The model can explain concepts across STEM, technology, history, business, and general knowledge domains. It is suitable for short research assistance, study support, summaries, and clear explanations of technical ideas.

Instruction Following

Supertron2.1-0.6B follows direct natural language instructions and can adapt to requested formats such as concise answers, bullet lists, tables, JSON-like structures, code blocks, and longer explanations.
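
When a reply is meant to be machine-readable, it can help to parse it defensively, since a small model may wrap the JSON in prose or a code fence. The following is a minimal sketch using only the standard library; `extract_json` is an illustrative helper name, not part of any library or of this model's tooling.

```python
import json


def extract_json(reply):
    """Return the first JSON object found in a model reply, or None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            obj, _ = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    return None
```

For example, `extract_json('Here you go: {"city": "Paris"}')` returns `{"city": "Paris"}`, and a reply with no JSON object returns `None`.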

Get Started

Install the required packages:

bash
pip install -U transformers torch accelerate

Load the model:

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Surpem/Supertron2.1-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Generate a response:

python
messages = [
    {"role": "user", "content": "Explain the difference between LoRA and full fine-tuning."}
]

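# Render the conversation with the model's chat template and append the assistant prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample with the general-chat settings listed under Recommended Generation Settings.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))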

Recommended Generation Settings

For coding, math, and deterministic answers:

python
generation_config = {
    "max_new_tokens": 512,
    "do_sample": False,
}

For general chat and writing:

python
generation_config = {
    "max_new_tokens": 768,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "do_sample": True,
}
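
One way to apply these presets is to keep them in a dictionary and unpack the chosen one into `model.generate`. The sketch below assumes a model and tokenizer loaded as in Get Started; the names `GENERATION_PRESETS`, `pick_preset`, and `generate_reply`, and the task labels, are illustrative and not part of the model's API.

```python
# Illustrative helper names; the preset values are the ones recommended above.
GENERATION_PRESETS = {
    "deterministic": {"max_new_tokens": 512, "do_sample": False},
    "chat": {
        "max_new_tokens": 768,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "do_sample": True,
    },
}


def pick_preset(task):
    """Return a copy of the preset for a task label (hypothetical labels)."""
    deterministic_tasks = {"coding", "math"}
    name = "deterministic" if task in deterministic_tasks else "chat"
    return dict(GENERATION_PRESETS[name])


def generate_reply(model, tokenizer, messages, task="chat"):
    """Format messages with the chat template and generate with the preset."""
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, **pick_preset(task))
    # Keep only the tokens generated after the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

Calling `generate_reply(model, tokenizer, messages, task="math")` would then use greedy decoding, while any other label falls back to the sampling preset.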

Hardware Requirements

Precision	Min VRAM	Recommended VRAM
bfloat16 / float16	2 GB	4 GB+
8-bit quantized	1.5 GB	3 GB+
4-bit quantized	1 GB	2 GB+
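
The minimums in the table sit above the raw weight footprint, which can be estimated as parameter count times bytes per parameter. The sketch below is a rough estimate only: it assumes a nominal 0.6 billion parameters and ignores activations, the KV cache, quantization metadata, and framework overhead, which is why real usage is higher.

```python
NOMINAL_PARAMS = 0.6e9  # assumption: "0.6B parameter class" taken as 0.6 billion

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {
    "bfloat16 / float16": 2.0,
    "8-bit quantized": 1.0,
    "4-bit quantized": 0.5,
}


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(NOMINAL_PARAMS, nbytes)
    print(f"{precision}: ~{gb:.1f} GB of weights")
```

Under these assumptions the weights alone take roughly 1.2 GB, 0.6 GB, and 0.3 GB respectively, leaving the rest of each budget for the runtime and generation state.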

For 4-bit quantized inference, which additionally requires the bitsandbytes package (pip install bitsandbytes):

python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "Surpem/Supertron2.1-0.6B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

Local Inference

The official checkpoint in this repository is the Transformers version. A separate GGUF repository is available for llama.cpp, Ollama, LM Studio, and other local inference runtimes:

Surpem/Supertron2.1-0.6B-GGUF

Use this repository when you want the original PyTorch/Transformers model. Use the GGUF repository when you want quantized local inference.
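
For the GGUF route, one option is the llama-cpp-python bindings, which can fetch a file from the Hub and run chat completions. The sketch below assumes llama-cpp-python and huggingface_hub are installed. The filename pattern is a guess, since the GGUF repository's file names are not listed here, so check the repository and adjust it. `load_gguf` and `chat_once` are illustrative helper names.

```python
GGUF_REPO = "Surpem/Supertron2.1-0.6B-GGUF"
GGUF_PATTERN = "*Q4_K_M.gguf"  # assumption: glob for a 4-bit file; verify against the repo


def load_gguf(repo_id=GGUF_REPO, filename=GGUF_PATTERN, n_ctx=4096):
    """Download (if needed) and load a GGUF file with llama-cpp-python."""
    from llama_cpp import Llama  # imported lazily so this module loads without it

    return Llama.from_pretrained(repo_id=repo_id, filename=filename, n_ctx=n_ctx)


def chat_once(llm, prompt, max_tokens=512):
    """Send a single user message and return the assistant's text."""
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.7,
        top_p=0.8,
    )
    return result["choices"][0]["message"]["content"]
```

A typical call would be `chat_once(load_gguf(), "Summarize what GGUF is in two sentences.")`.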

Intended Use

Supertron2.1-0.6B is intended for:

lightweight assistant experiments
local coding help
math practice and explanations
general question answering
summarization and rewriting
prototype agent workflows
educational and research use

Limitations

The model may hallucinate facts or produce outdated information.
Math and code answers can be incorrect and should be verified.
Complex reasoning tasks may exceed the capability of a 0.6B parameter model.
The model may produce repetitive or low-quality text with poor sampling settings.
It is not intended for legal, medical, financial, safety-critical, or identity-sensitive decisions without independent expert review.

Citation

bibtex
@misc{surpem2026supertron21_06b,
      title={Supertron2.1-0.6B -- Efficient Instruction-Tuned Language Model},
      author={Surpem},
      year={2026},
      url={https://huggingface.co/Surpem/Supertron2.1-0.6B},
}
