Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: mit

Credits and Attribution

  • Base Model: microsoft/FastContext-1.0-4B-RL by Microsoft (MIT License). Built on Qwen3-4B-Instruct by Alibaba Qwen Team.
  • Quantization Tool: NVIDIA Model Optimizer (ModelOpt) v0.44.0 by NVIDIA.
  • Calibration Data: CNN/DailyMail by See et al. (Apache 2.0).
  • Paper: Zhang et al., "FastContext: Training Efficient Repository Explorer for Coding Agents," arXiv:2606.14066, 2026.
  • Quantization © 2026 r0b0tlab; base model © Microsoft, MIT License; calibration data © See et al., Apache 2.0; distributed under MIT License.

Quantization Details

PropertyValue
Source modelmicrosoft/FastContext-1.0-4B-RL (BF16, 7.6 GB)
QuantizationNVFP4 (W4A4, group_size=16)
ToolNVIDIA ModelOpt 0.44.0 (NVFP4_DEFAULT_CFG)
CalibrationCNN/DailyMail, 512 samples × 1024 tokens × batch 16
Output size2.7 GB (2.8× compression)
Quantized layers903 (all attention QKV/O + MLP linear layers)
ExcludedNorms, biases, lm_head (tied to embed_tokens)
tie_word_embeddingsTrue

Benchmark Results (NVIDIA GB10 / SM121)

Identical prompt, vLLM 0.23.0, FlashInfer attention, FP8 KV cache:

MetricBF16 BaselineNVFP4 (this model)Ratio
Decode throughput22.8 tok/s66.3 tok/s2.9× faster
TTFT (time to first token)43 ms22 ms2.0× faster
Model size7.6 GB2.7 GB2.8× smaller
GPU power~15 W~11 W1.4× less
GPU temp~47°C47°CSame

Matmul-level microbenchmark confirms 2.8–4.5× speedup across all layer types:

  • MLP down_proj [2560, 9728]: 4.48×
  • MLP gate_proj [9728, 2560]: 2.81×
  • Attention Q proj [4096, 2560]: 3.07×
  • Attention O proj [2560, 4096]: 3.89×

How to Serve

bash

vllm serve r0b0tlab/FastContext-1.0-4B-RL-NVFP4 \
--quantization modelopt \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype auto \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--gpu-memory-utilization 0.40 \
--max-model-len 131072 \
--max-num-seqs 16 \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--port 30000

Requires an NVFP4-capable NVIDIA GPU. vLLM falls back to EMULATION on older GPUs.

Notes and Limitations

  • This is a post-hoc PTQ quantization, not QAD (Quantization-Aware Distillation). Minor quality regression is possible.
  • The hermes tool-call parser outputs <tool_call> XML in the content field. The FastContext CLI parses this internally.
  • tie_word_embeddings=true: embed_tokens.weight serves as both input embedding and output projection. ModelOpt's tied weight handling correctly preserves this.
  • Benchmark results are from a single NVIDIA GB10 (SM121) device and may vary on other hardware.

BibTeX

bibtex

@misc{zhang2026fastcontext,
title={FastContext: Training Efficient Repository Explorer for Coding Agents},
author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Rao Fu and Shengyu Fu},
year={2026},
eprint={2606.14066},
archivePrefix={arXiv},
primaryClass={cs.SE}
}

Model provider

r0b0tlab

Model tree

Base

microsoft/FastContext-1.0-4B-RL

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today