CS-552 Ma Que — General Knowledge Model

General knowledge expert for the EPFL CS-552 (Modern NLP, Spring 2026) group project "Building a robust small language model for edge devices" (Group 11, Ma Que). This is an individual model; the team also maintains a merged group model (https://huggingface.co/cs-552-2026-ma-que/group_model).

What it is

Base model: Qwen/Qwen3-1.7B-Base (https://huggingface.co/Qwen/Qwen3-1.7B-Base) (1.7B params).
Post-training: LoRA SFT using nlp_project/sft_thinking.py with the configuration in nlp_project/cfgs_thinking.yml. The run uses LoRA rank 256, alpha 256, dropout 0, and targets q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. Training uses bf16, gradient checkpointing, Liger kernel, assistant-only loss, max sequence length 6144, learning rate 2e-4, cosine scheduler, 5% warmup, batch size 20, gradient accumulation 1, and 2 epochs.
Task: general knowledge multiple-choice answering with reasoning-style responses. The model reads a question with lettered options and commits to one boxed option letter.
Output contract: the final answer is emitted once as \boxed{X} (X is a single capital option letter), for the course OpenCompass/vLLM parser. Thinking mode is ON.
How behaviour is set: tokenizer.apply_chat_template(messages, add_generation_prompt=True) applies the Qwen3 chat template. Training examples include assistant reasoning traces wrapped in <think>...</think> tags, and the model is intended to reason in thinking mode before producing the final boxed answer.
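The LoRA settings listed under Post-training can be written down as a small configuration sketch. This is an illustration only: the contents of sft_thinking.py and cfgs_thinking.yml are not reproduced here, and the use of peft's LoraConfig (and the helper names LORA_SETTINGS, lora_scaling, build_lora_config) is an assumption rather than the project's actual code.

```python
# Hyperparameters copied from the Post-training description above.
LORA_SETTINGS = {
    "r": 256,
    "lora_alpha": 256,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def lora_scaling(settings: dict) -> float:
    """Standard LoRA scaling factor alpha / r applied to the low-rank update."""
    return settings["lora_alpha"] / settings["r"]


def build_lora_config(settings: dict = LORA_SETTINGS):
    """Build a peft LoraConfig; peft is imported lazily so the rest runs without it."""
    from peft import LoraConfig  # assumption: the training script relies on peft

    return LoraConfig(task_type="CAUSAL_LM", **settings)


print(lora_scaling(LORA_SETTINGS))  # 1.0
```

With rank and alpha both set to 256, the scaling factor is 1, so the low-rank update is added to the frozen weights without extra rescaling.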

Inference / decoding

The sampling values below are exactly those in generation_config.json; max_new_tokens is not set there and must be passed at generation time:

┌────────────────┬────────────────────────────────────────────────────────────────┐
│ param          │ value                                                          │
├────────────────┼────────────────────────────────────────────────────────────────┤
│ do_sample      │ true                                                           │
│ temperature    │ 0.6                                                            │
│ top_k          │ 20                                                             │
│ top_p          │ 0.95                                                           │
│ max_new_tokens │ not set in generation_config.json; use 16384 for final grading │
└────────────────┴────────────────────────────────────────────────────────────────┘

Usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "cs-552-2026-ma-que/general_knowledge_model"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",
    device_map="auto",
)

# One user turn: the question followed by lettered options.
messages = [{"role": "user", "content":
    "Which planet is known as the Red Planet?\n"
    "A) Venus\nB) Mars\nC) Jupiter\nD) Mercury"}]

# Render the Qwen3 chat template as text, ending with the assistant prompt.
text = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Sampling settings come from the repo's generation_config.json;
# max_new_tokens is not set there, so it is passed explicitly.
out = model.generate(
    **tok(text, return_tensors="pt").to(model.device),
    max_new_tokens=16384,
)

print(tok.decode(out[0], skip_special_tokens=True))   # -> ... \boxed{B}
```
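To score a decoded completion, the boxed letter has to be pulled out of the text. The helper below is a minimal sketch of that step, not the course OpenCompass/vLLM parser; the name extract_boxed_letter and the choice to take the last match are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches \boxed{X} where X is one capital letter, allowing spaces inside the braces.
BOXED_RE = re.compile(r"\\boxed\{\s*([A-Z])\s*\}")


def extract_boxed_letter(text: str) -> Optional[str]:
    """Return the letter of the last \\boxed{X} in text, or None if there is none."""
    matches = BOXED_RE.findall(text)
    return matches[-1] if matches else None


print(extract_boxed_letter("Mars is the Red Planet. \\boxed{B}"))  # B
print(extract_boxed_letter("no final answer here"))                # None
```

Taking the last match is a defensive choice: the output contract says the answer is emitted once, but a reasoning trace could in principle mention a boxed letter before the final one.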

The model is also vLLM-loadable for batched evaluation.
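A batched vLLM run might look like the sketch below. It assumes a recent vLLM release in which LLM.chat accepts a list of conversations and applies the model's chat template; build_messages and run_batch are hypothetical helper names, and the sampling values mirror the decoding table above. vLLM is imported lazily so the prompt-building part runs without it.

```python
from typing import Dict, List

REPO = "cs-552-2026-ma-que/general_knowledge_model"


def build_messages(question: str, options: Dict[str, str]) -> List[dict]:
    """Format one question with lettered options as a single user turn."""
    lines = [question] + [f"{letter}) {text}" for letter, text in options.items()]
    return [{"role": "user", "content": "\n".join(lines)}]


def run_batch(conversations: List[List[dict]]) -> List[str]:
    """Generate one completion per conversation with vLLM (needs vllm and a GPU)."""
    from vllm import LLM, SamplingParams  # imported lazily

    # Same decoding values as the table above; max_tokens is vLLM's
    # counterpart of max_new_tokens.
    params = SamplingParams(temperature=0.6, top_k=20, top_p=0.95, max_tokens=16384)
    llm = LLM(model=REPO)
    outputs = llm.chat(conversations, params)
    return [out.outputs[0].text for out in outputs]


batch = [build_messages(
    "Which planet is known as the Red Planet?",
    {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
)]
print(batch[0][0]["content"])
```

Calling run_batch(batch) would return one generated string per question, each expected to end with a boxed option letter.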

Training data

The thinking model was trained with sft_thinking.py using the dataset TangYeqing/maque-data, as configured in cfgs_thinking.yml.

The script loads the dataset's train split and performs a local train_test_split(test_size=0.005, seed=42). Each row contains conversational messages; if a row has no system message, the training script prepends:

You are a helpful assistant that provides step-by-step solutions to math problems.
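The default-system-prompt rule described above can be sketched as a small pure function. This is not the code from sft_thinking.py; ensure_system_message is a hypothetical name, and the sketch assumes each message is a dict with "role" and "content" keys.

```python
from typing import Dict, List

DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful assistant that provides step-by-step solutions "
    "to math problems."
)


def ensure_system_message(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return messages with the default system prompt prepended if none exists."""
    if any(m["role"] == "system" for m in messages):
        return list(messages)  # already has a system message: leave unchanged
    return [{"role": "system", "content": DEFAULT_SYSTEM_PROMPT}] + list(messages)


row = [{"role": "user", "content": "Which planet is known as the Red Planet?"}]
print(ensure_system_message(row)[0]["role"])  # system
```

The function returns a new list rather than mutating its input, so it can be applied to dataset rows without side effects.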

The assistant turns already contain literal <think>...</think> reasoning traces, which are preserved during SFT. Training uses assistant-only loss, so the loss is computed only on the assistant response, including the thinking trace and final boxed answer.

Intended use & limitations

Research artifact for the CS-552 general knowledge benchmark; produces boxed multiple-choice answers only. It is not an authoritative factual system and may encode outdated, incomplete, or incorrect facts, so outputs should be verified before any real-world or safety-critical use.

Citation

Built on Qwen3-1.7B (Qwen Team, 2025), arXiv:2505.09388 — https://arxiv.org/abs/2505.09388