Necent/GigaChat3.1-10B-A1.8B API & Inference Endpoint

Model architecture

GigaChat 3.1 Lightning uses a custom MoE architecture with the following key components.

Mixture-of-Experts (MoE)

The model has 10B total parameters with 1.8B active parameters at inference time. This allows it to scale model capacity aggressively while keeping the active compute budget much lower than that of an equally large dense model.
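To make the total-versus-active distinction concrete, here is a minimal sketch of top-k expert routing. The expert count, `top_k=2`, the toy scalar experts, and renormalizing the selected weights are all illustrative assumptions, not the actual GigaChat 3.1 Lightning configuration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, top_k):
    """Return [(expert_index, weight)] for the top_k experts, weights renormalized to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_logits, top_k):
    """Weighted sum over the selected experts only; the other experts are never evaluated."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, top_k))

# Hypothetical toy setup: 8 scalar "experts" (expert i multiplies by i + 1), 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
selected = route_top_k(logits, top_k=2)
output = moe_forward(3.0, experts, logits, top_k=2)
```

Only two of the eight experts run for this input, which is the mechanism that keeps active compute well below what the total parameter count would suggest.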

Multi-head Latent Attention (MLA)

Instead of standard multi-head attention, the model uses MLA, which compresses the KV cache into a latent representation. This reduces memory usage and improves inference throughput, especially in long-context settings.
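A rough way to see the memory effect is to compare per-token cache sizes. The sketch below assumes the MLA formulation in which each token caches one compressed latent plus a small decoupled positional key; all dimensions are hypothetical and are not taken from the GigaChat 3.1 Lightning config.

```python
def kv_cache_bytes_mha(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Standard multi-head attention caches one key and one value vector per head."""
    return seq_len * n_layers * 2 * n_heads * head_dim * bytes_per_elem

def kv_cache_bytes_mla(seq_len, n_layers, latent_dim, rope_dim, bytes_per_elem=2):
    """MLA caches one shared latent plus a small positional key per token and layer."""
    return seq_len * n_layers * (latent_dim + rope_dim) * bytes_per_elem

# Hypothetical dimensions chosen only for illustration (2 bytes per element, e.g. BF16).
cfg = dict(seq_len=32_768, n_layers=24)
mha = kv_cache_bytes_mha(n_heads=32, head_dim=128, **cfg)
mla = kv_cache_bytes_mla(latent_dim=512, rope_dim=64, **cfg)
print(f"MHA: {mha / 2**30:.2f} GiB, MLA: {mla / 2**30:.2f} GiB, ratio: {mha / mla:.1f}x")
```

The saving grows linearly with sequence length, which is why the benefit is most visible in long-context settings.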

Multi-Token Prediction (MTP)

The model is trained with MTP, which allows it to predict multiple tokens per forward pass. In production systems, this can be used with speculative or parallel decoding techniques to improve throughput.
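As a back-of-the-envelope illustration of why speculative decoding helps, the sketch below estimates how many tokens one verification pass yields. It assumes every draft token is accepted independently with the same probability, which is a simplification, and the acceptance rate used is a made-up value.

```python
def expected_tokens_per_step(accept_rate, num_speculative_tokens):
    """Expected tokens emitted per verification pass.

    The main model always contributes one token. Draft token i is kept only if
    all earlier drafts were accepted, which happens with probability accept_rate**i.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return sum(accept_rate ** i for i in range(num_speculative_tokens + 1))

# Hypothetical acceptance rate with one speculative token per step.
tokens = expected_tokens_per_step(accept_rate=0.8, num_speculative_tokens=1)
```

Measured throughput gains are smaller than this ratio suggests, because the draft head and the verification step add their own compute.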

Training data

The base GigaChat 3 training corpus spans 10 languages and includes books, academic material, code datasets, and mathematics datasets. All data goes through deduplication, language filtering, and automatic quality checks based on heuristics and classifiers.
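The following is a minimal sketch of what exact deduplication plus a heuristic quality check can look like. It is not the pipeline used for GigaChat: the thresholds and the function names are hypothetical, and language filtering and classifier-based checks are left out.

```python
import hashlib
import re

def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()

def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Reject very short documents and documents dominated by non-alphanumeric symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def clean_corpus(docs):
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if passes_heuristics(doc):
            kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "too short",
    "#### $$$$ %%%% ^^^^ &&&& ****",
]
kept = clean_corpus(docs)
```

Production pipelines typically add near-duplicate detection on top of exact hashing; that step is omitted here for brevity.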

Synthetic data remains a major contributor to quality. Across the broader training corpus, we used approximately 5.5 trillion synthetic tokens, including:

question-answer data generated from source texts,
reverse-prompt chains for structured data generation,
model-authored notes embedded inside texts,
millions of synthetic tasks with solutions in mathematics and olympiad-style programming,
synthetic tests for code and reasoning tasks.

For the 3.1 release, we made major data improvements:

Hard-domain expansion at Stage 1.5: stronger coverage of mathematics, finance, physics, engineering, biology, chemistry, and medicine.
Stricter quality validation: our internal Revisor pipeline was extended with stronger checks for Markdown, LaTeX, and answer-format correctness.
LLM-judge validation: SFT and DPO data is validated with judges selected for the task type and response structure.
On-policy DPO data: preference pairs were generated from preview-model behavior, making them better aligned with real model failure modes.
Better product-oriented data: we expanded data for search-and-citation scenarios, file-aware code interpretation, personalization, and agentic dialogues with executable tool calls.
Improved answer style: we also revised formatting and writing guidelines to improve readability, correctness, and overall response quality.

Post-training improvements

DPO in native FP8

Unlike the preview release, GigaChat 3.1 Lightning includes a full DPO stage. This stage was redesigned for the MoE setup and trained in native FP8, not just quantized after training.

Important changes include:

MTP heads trained during DPO for better consistency between main-model predictions and MTP predictions,
weighted gamma with exponential decay over long sequences,
stronger tuning of batch size and DPO contribution,
better robustness against loop-inducing failure modes.

In our experiments, native FP8 DPO not only recovered the quality that could be lost with post-training FP8 quantization, but in some cases even exceeded the BF16 result while using substantially less memory.
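For reference, the standard DPO objective for a single preference pair can be sketched as follows. This is the textbook formulation only: the FP8 training, MTP-head training, and gamma weighting described above are not modeled, and `beta=0.1` and the log-probabilities are arbitrary illustrative values.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference model does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical log-probabilities: the policy favors the chosen answer more than the reference.
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-20.0,
                ref_chosen=-12.0, ref_rejected=-15.0)
```

The loss equals log 2 when the policy and reference agree, and decreases as the policy widens its preference for the chosen response.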

Faster post-training

We also optimized the SFT pipeline with a combination of sequence packing, dynamic sequence parallelism, and additional pipeline optimizations. This reduced training cost significantly and improved GPU utilization, especially on long-context workloads.
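Sequence packing can be illustrated with a simple greedy bin-packing routine. This is a generic first-fit-decreasing sketch, not the packing algorithm used in the GigaChat pipeline; the sequence lengths and `max_len` are made up.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing; returns bins of sequence indices."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has length {n} > max_len {max_len}")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins

lengths = [700, 300, 900, 100, 512, 512]
bins = pack_sequences(lengths, max_len=1024)
packed_utilization = sum(lengths) / (len(bins) * 1024)
padded_utilization = sum(lengths) / (len(lengths) * 1024)
```

Here six sequences fit into three packed rows instead of six padded ones. A real trainer also needs attention masks or position resets so that packed sequences do not attend to each other.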

Inference

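One of the key advantages of GigaChat3.1-10B-A1.8B is its inference speed. The model, especially in MTP mode, delivers throughput comparable to that of significantly smaller dense models. We measured this with vLLM 0.17.1rc1.dev158+g600a039f5 at concurrency 32 on a single H100 80GB SXM5 GPU. In the table below, tps is tokens per second, TPOT is time per output token, and the last column is the change in output tps relative to Lightning BF16.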

| Model | Output tps | Total tps | TPOT | Diff vs Lightning BF16 |
|---|---|---|---|---|
| GigaChat-3.1-Lightning BF16 | 2,866 | 5,832 | 9.52 | +0.0% |
| GigaChat-3.1-Lightning BF16 + MTP | 3,346 | 6,810 | 8.25 | +16.7% |
| GigaChat-3.1-Lightning FP8 | 3,382 | 6,883 | 7.63 | +18.0% |
| GigaChat-3.1-Lightning FP8 + MTP | 3,958 | 8,054 | 6.92 | +38.1% |
| YandexGPT-5-Lite-8B | 3,081 | 6,281 | 7.62 | +7.5% |
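The Diff column is consistent with the relative change in output tps against the BF16 baseline. That reading is inferred from the numbers rather than stated in the source; the short check below reproduces the published values from the table.

```python
output_tps = {
    "GigaChat-3.1-Lightning BF16": 2866,
    "GigaChat-3.1-Lightning BF16 + MTP": 3346,
    "GigaChat-3.1-Lightning FP8": 3382,
    "GigaChat-3.1-Lightning FP8 + MTP": 3958,
    "YandexGPT-5-Lite-8B": 3081,
}

baseline = output_tps["GigaChat-3.1-Lightning BF16"]

# Relative change in output throughput, in percent, rounded to one decimal place.
diff_pct = {
    name: round((tps / baseline - 1) * 100, 1)
    for name, tps in output_tps.items()
}

for name, diff in diff_pct.items():
    print(f"{name}: {diff:+.1f}%")
```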

Benchmark Results

| Domain | Metric | GigaChat-3-Lightning | GigaChat-3.1-Lightning | Qwen3-1.7B-Instruct | Qwen3-4B-Instruct | SmolLM3 | gemma-3-4b-it |
|---|---|---|---|---|---|---|---|
| General | MMLU RU | 0.683 | 0.6803 | - | 0.597 | 0.500 | 0.519 |
| General | RUBQ | 0.652 | 0.6646 | - | 0.317 | 0.636 | 0.382 |
| General | MMLU PRO | 0.606 | 0.6176 | 0.410 | 0.685 | 0.501 | 0.410 |
| General | MMLU EN | 0.740 | 0.7298 | 0.600 | 0.708 | 0.599 | 0.594 |
| General | BBH | 0.453 | 0.5758 | 0.3317 | 0.717 | 0.416 | 0.131 |
| General | SuperGPQA | 0.273 | 0.2939 | 0.209 | 0.375 | 0.246 | 0.201 |
| Code | Human Eval Plus | 0.695 | 0.7317 | 0.628 | 0.878 | 0.701 | 0.713 |
| Total | Average | 0.586 | 0.631 | 0.458 | 0.612 | 0.514 | 0.421 |

Arena Results

| Arena | GigaChat-2-Lite-30.1 | GigaChat-3-Lightning | GigaChat-3.1-Lightning | YandexGPT-5-Lite-8B | SmolLM3 | gemma-3-4b-it | Qwen3-4B | Qwen3-4B-Instruct-2507 |
|---|---|---|---|---|---|---|---|---|
| Arena Hard Logs V3 | 23.7 | 14.3 | 46.7 | 17.9 | 18.1 | 38.7 | 27.7 | 61.5 |
| Validator SBS Pollux | 32.5 | 24.3 | 55.7 | 10.3 | 13.7 | 34.0 | 19.8 | 56.1 |
| Total Average | 28.1 | 19.3 | 51.2 | 14.1 | 15.9 | 36.35 | 23.75 | 58.8 |

Usage Example

1. `transformers`

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "ai-sage/GigaChat3.1-10B-A1.8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.generation_config = GenerationConfig.from_pretrained(model_name)
messages = [
    {"role": "user", "content": "Докажи теорему о неподвижной точке"}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(
    **inputs,
    max_new_tokens=1000,
)
prompt_len = inputs["input_ids"].shape[1]
result = tokenizer.decode(
    outputs[0][prompt_len:],
    skip_special_tokens=True,
)
print(result)

2. `vLLM`

Start the server

shell
vllm serve ai-sage/GigaChat3.1-10B-A1.8B \
  --dtype "auto" \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'

Request example

shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai-sage/GigaChat3.1-10B-A1.8B",
    "messages": [
      {
        "role": "user",
        "content": "Докажи теорему о неподвижной точке"
      }
    ],
    "max_tokens": 400,
    "temperature": 0
  }'
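The same request can be sent from Python. The sketch below uses only the standard library and assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route on `localhost:8000`, as in the curl example; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

def build_chat_request(prompt, model="ai-sage/GigaChat3.1-10B-A1.8B",
                       max_tokens=400, temperature=0):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8000", timeout=120, **kwargs):
    """POST the request and return the assistant's text reply."""
    payload = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("Докажи теорему о неподвижной точке")` returns the generated text. Passing `base_url="http://localhost:30000"` targets an SGLang server started on that port instead.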

3. `SGLang`

Start the server

shell
python -m sglang.launch_server \
  --model-path ai-sage/GigaChat3.1-10B-A1.8B \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype auto \
  --mem-fraction-static 0.88 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2

Request example

shell
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai-sage/GigaChat3.1-10B-A1.8B",
    "messages": [
      {
        "role": "user",
        "content": "Докажи теорему о неподвижной точке"
      }
    ],
    "max_tokens": 1000,
    "temperature": 0
  }'

Function calling

1. `transformers`

python
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

FUNCTION_CALL_TOKEN = "<|function_call|>"

def parse_function_and_content(completion_str: str):
    completion_str = completion_str.strip()

    if FUNCTION_CALL_TOKEN not in completion_str:
        return None, completion_str or None

    content_part, function_part = completion_str.split(FUNCTION_CALL_TOKEN, 1)

    content = content_part.strip() or None
    function_part = function_part.strip()

    for suffix in ("</s>", "<s>"):
        if function_part.endswith(suffix):
            function_part = function_part[: -len(suffix)].strip()

    try:
        function_call = json.loads(function_part)
    except json.JSONDecodeError:
        return None, content if content is not None else completion_str

    if not (
        isinstance(function_call, dict)
        and "name" in function_call
        and "arguments" in function_call
        and isinstance(function_call["arguments"], dict)
    ):
        return None, content

    return function_call, content


model_name = "ai-sage/GigaChat3.1-10B-A1.8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.generation_config = GenerationConfig.from_pretrained(model_name)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Получить информацию о текущей погоде в указанном городе.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "Название города (например, Москва, Казань)."
                    }
                },
                "required": ["city"]
            }
        }
    }
]

messages = [
    {"role": "user", "content": "Какая сейчас погода в Москве?"}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,
    )

prompt_len = inputs["input_ids"].shape[1]
completion = tokenizer.decode(
    outputs[0][prompt_len:],
    skip_special_tokens=False,
)

function_call, content = parse_function_and_content(completion)
print(function_call, content)

2. `vLLM`

Requires vLLM at commit 293f036 or later.

Start the server

shell
vllm serve ai-sage/GigaChat3.1-10B-A1.8B \
  --dtype "auto" \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}' \
  --enable-auto-tool-choice \
  --tool-call-parser gigachat3

Request example

shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "ai-sage/GigaChat3.1-10B-A1.8B",
  "temperature": 0,
  "messages": [
    {
      "role": "user",
      "content": "Какая сейчас погода в Москве?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Получить информацию о текущей погоде в указанном городе.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "Название города (например, Москва, Казань)."
            }
          },
          "required": ["city"]
        }
      }
    }
  ]
}'
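When the server returns a tool call, the client is responsible for running the function and sending the result back. The sketch below assumes the OpenAI-compatible response shape, where `tool_calls` carries the function name and a JSON string of arguments; the `get_weather` stub, `TOOL_REGISTRY`, and the sample message are hypothetical.

```python
import json

def get_weather(city):
    # Hypothetical stub; a real implementation would query a weather service.
    return {"city": city, "temperature_c": 5, "condition": "cloudy"}

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(assistant_message):
    """Execute each tool call in an assistant message and return `tool` role messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOL_REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args), ensure_ascii=False),
        })
    return results

# Sample assistant message in the assumed response shape (values are made up).
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Москва"}'},
    }],
}
tool_messages = run_tool_calls(assistant_message)
```

Appending the assistant message and `tool_messages` to the conversation and posting it again lets the model produce a final answer. Whether the follow-up is accepted unchanged depends on the model's chat template and the server's tool parser.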

3. `SGLang`

Requires SGLang at commit 30a35ec or later.

Start the server

shell
python -m sglang.launch_server \
  --model-path ai-sage/GigaChat3.1-10B-A1.8B \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype auto \
  --mem-fraction-static 0.88 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --tool-call-parser gigachat3

Request example

shell
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "ai-sage/GigaChat3.1-10B-A1.8B",
  "temperature": 0,
  "messages": [
    {
      "role": "user",
      "content": "Какая сейчас погода в Москве?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Получить информацию о текущей погоде в указанном городе.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "Название города (например, Москва, Казань)."
            }
          },
          "required": ["city"]
        }
      }
    }
  ]
}'
