License: apache-2.0

Key Results (Rigorous Evaluation, 95% CI)

| Metric | Score | 95% CI | n |
|---|---|---|---|
| Socratic question rate | 100% | [98%, 100%] | 200 |
| Relevance to specific student error | 74.5% | [68%, 80%] | 200 |
| Answer avoidance rate | 96% | [92%, 98%] | 200 |
| Answer leak rate | 1% | [0.2%, 5.4%] | 100 |
| Grade-appropriate language | 100% | [98%, 100%] | 200 |

All metrics were evaluated with heuristic scoring (no LLM-as-judge) under production conditions, with mission context, vocabulary hints, and misconception targeting enabled.

How It Works

The model is trained to be Socratic: when a student makes an error, instead of correcting them, it asks a question that helps them discover the error themselves.

Student: "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms."

Model: "If you had 1/3 of a pizza and 1/4 of the same pizza, would you really have less than 1/3 of a pizza total? Try drawing both fractions on the same circle."
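The arithmetic behind the tutor's question can be checked directly. The snippet below is only an illustrative sketch using Python's standard `fractions` module, not part of the model or its pipeline. It shows that the student's answer, 2/7, is smaller than 1/3 on its own, which is the contradiction the question points at, and that the correct sum is 7/12.

```python
from fractions import Fraction

# The student's result from adding numerators and denominators separately
student_answer = Fraction(1 + 1, 3 + 4)          # 2/7
correct_sum = Fraction(1, 3) + Fraction(1, 4)    # common denominator: 4/12 + 3/12

# A sum of two positive parts cannot be smaller than one of the parts,
# so 2/7 cannot be the total.
print(student_answer < Fraction(1, 3))  # True
print(correct_sum)                      # 7/12
```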

Usage

With PEFT (recommended)

python

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model (requires Llama access)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")

# Build prompt
system = "You are a Socratic math tutor for grade 6-8 students. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences."
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "I think 1/3 + 1/4 = 2/7 because I added the tops and bottoms"},
]

# add_generation_prompt=True appends the assistant header so the model answers as the tutor
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=150, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

With 4-bit Quantization (for consumer GPUs)

python

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 4-bit weights with float16 compute (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "QuantumLearningMachines/qlm-math-tutor")

# Load the tokenizer and generate exactly as in the PEFT example above

System Prompt

The model expects the standard Llama chat format, with a system prompt that instructs Socratic tutoring behavior. A simple system prompt is sufficient:

markdown

You are a Socratic math tutor. Never give the answer. Ask guiding questions. Keep responses to 2-3 sentences.
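The two system prompts shown in this card differ only in an optional grade band. As a minimal sketch, a small helper can build the message list for either variant; `build_messages` and `grade_band` are hypothetical names introduced here for illustration and are not part of the model or any library.

```python
# Template covering both prompt variants that appear in this card.
BASE_PROMPT = (
    "You are a Socratic math tutor{audience}. Never give the answer. "
    "Ask guiding questions. Keep responses to 2-3 sentences."
)


def build_messages(student_text, grade_band=None):
    """Return a chat message list (hypothetical helper, not a library API)."""
    audience = f" for grade {grade_band} students" if grade_band else ""
    return [
        {"role": "system", "content": BASE_PROMPT.format(audience=audience)},
        {"role": "user", "content": student_text},
    ]


messages = build_messages("I think 1/3 + 1/4 = 2/7", grade_band="6-8")
print(messages[0]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in the same way as the `messages` list in the Usage section.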

Training

  • Base model: meta-llama/Llama-3.1-8B-Instruct

  • Method: LoRA

  • Training data: Synthetic tutoring interactions across K-12 mathematics

  • Hardware: NVIDIA L4 GPU (24 GB) on Hugging Face

  • Training time: ~4 hours

  • Final loss: 0.306

Limitations

  1. Synthetic training data: The model was trained on synthetic data, not real classroom tutoring transcripts. This limits scaffolding specificity — 28% of responses target the specific error, while 68% ask relevant but generic guiding questions.

  2. Answer leak rate: 1% of responses contain the correct answer (detected by exact numeric matching). An answer-leak filter is deployed in production.

  3. Math only: Trained exclusively on K-12 mathematics. Performance on other STEM subjects is untested.

  4. No longitudinal validation: No classroom outcome data yet. Benchmark results measure response quality, not learning gains.

  5. Heuristic evaluation: All evaluation uses keyword/heuristic scoring, not human expert annotation. Human evaluation with math teachers is planned.
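The answer-leak check in item 2 is described only as exact numeric matching; the production filter itself is not published. The sketch below shows one way such a check could work, under stated assumptions: it pulls non-negative integers, decimals, and simple fractions out of a response and compares them by value with the known correct answer. The names `extract_numbers` and `contains_answer` are hypothetical, and comparing by value (so that 14/24 matches 7/12) is an assumption, not a documented behavior.

```python
import re
from fractions import Fraction

# Simple fractions first, then integers/decimals. Negative numbers and
# numbers written as words are deliberately out of scope for this sketch.
_NUMBER = re.compile(r"\d+\s*/\s*\d+|\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values found in text, as exact Fractions."""
    values = set()
    for token in _NUMBER.findall(text):
        try:
            values.add(Fraction(re.sub(r"\s+", "", token)))
        except (ValueError, ZeroDivisionError):
            continue  # e.g. "3/0" is not a usable number
    return values


def contains_answer(response, correct_answer):
    """True if the response states a number equal to the correct answer."""
    return Fraction(correct_answer) in extract_numbers(response)


tutor_reply = (
    "If you had 1/3 of a pizza and 1/4 of the same pizza, would you really "
    "have less than 1/3 of a pizza total?"
)
print(contains_answer(tutor_reply, "7/12"))             # False
print(contains_answer("So the sum is 7/12.", "7/12"))   # True
```

A check like this would miss answers spelled out in words ("seven twelfths"), so it can only give a lower bound on the true leak rate.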

Evaluation Methodology

All metrics are reported with 95% confidence intervals. The tutor model was evaluated on n=200 (Socratic quality), n=50 (scaffolding), and n=100 (answer leak). No LLM-as-judge was used; all scoring is heuristic to avoid circularity.
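The card does not say which interval method was used. The reported bounds are numerically consistent with a Wilson score interval, so the sketch below uses that method as an assumption; `wilson_interval` is a name introduced here for illustration. It reproduces the answer-leak row (1 leak in 100 responses).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(1, 100)
print(f"{low:.1%}, {high:.1%}")  # 0.2%, 5.4%
```

Unlike the normal-approximation interval, the Wilson interval stays informative at 0% and 100%, which is why a 200/200 result can still carry a lower bound near 98%.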

Full benchmark results: quantumlearningmachines.com/research/external-benchmark-results

Part of a Larger System

This tutor model is one component of the QLM platform — an integrated system for adaptive math learning. The model weights are open. The measurement and orchestration systems that train and improve the model are proprietary.

Citation

bibtex

@misc{qlm-math-tutor-2026,
  title={QLM Socratic Math Tutor: An Open-Source Llama 3.1 8B LoRA for K-12 Mathematics},
  author={Quantum Learning Machines},
  year={2026},
  url={https://huggingface.co/QuantumLearningMachines/qlm-math-tutor},
}
