r0b0tlab

FastContext-1.0-4B-RL-NVFP4

Credits and Attribution

Base Model: microsoft/FastContext-1.0-4B-RL by Microsoft (MIT License). Built on Qwen3-4B-Instruct by Alibaba Qwen Team.
Quantization Tool: NVIDIA Model Optimizer (ModelOpt) v0.44.0 by NVIDIA.
Calibration Data: CNN/DailyMail by See et al. (Apache 2.0).
Paper: Zhang et al., "FastContext: Training Efficient Repository Explorer for Coding Agents," arXiv:2606.14066, 2026.
Quantization © 2026 r0b0tlab; base model © Microsoft, MIT License; calibration data © See et al., Apache 2.0; distributed under MIT License.

Quantization Details

Table with columns: Property, Value
Property	Value
Source model	microsoft/FastContext-1.0-4B-RL (BF16, 7.6 GB)
Quantization	NVFP4 (W4A4, group_size=16)
Tool	NVIDIA ModelOpt 0.44.0 (`NVFP4_DEFAULT_CFG`)
Calibration	CNN/DailyMail, 512 samples × 1024 tokens × batch 16
Output size	2.7 GB (2.8× compression)
Quantized layers	903 (all attention QKV/O + MLP linear layers)
Excluded	Norms, biases, lm_head (tied to embed_tokens)
`tie_word_embeddings`	True

Benchmark Results (NVIDIA GB10 / SM121)

Identical prompt, vLLM 0.23.0, FlashInfer attention, FP8 KV cache:

Table with columns: Metric, BF16 Baseline, NVFP4 (this model), Ratio
Metric	BF16 Baseline	NVFP4 (this model)	Ratio
Decode throughput	22.8 tok/s	66.3 tok/s	2.9× faster
TTFT (time to first token)	43 ms	22 ms	2.0× faster
Model size	7.6 GB	2.7 GB	2.8× smaller
GPU power	~15 W

Matmul-level microbenchmark confirms 2.8–4.5× speedup across all layer types:

MLP down_proj [2560, 9728]: 4.48×
MLP gate_proj [9728, 2560]: 2.81×
Attention Q proj [4096, 2560]: 3.07×
Attention O proj [2560, 4096]: 3.89×

How to Serve

bash
vllm serve r0b0tlab/FastContext-1.0-4B-RL-NVFP4 \
    --quantization modelopt \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --gpu-memory-utilization 0.40 \
    --max-model-len 131072 \
    --max-num-seqs 16 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 30000

Requires an NVFP4-capable NVIDIA GPU. vLLM falls back to EMULATION on older GPUs.

Notes and Limitations

This is a post-hoc PTQ quantization, not QAD (Quantization-Aware Distillation). Minor quality regression is possible.
The hermes tool-call parser outputs <tool_call> XML in the content field. The FastContext CLI parses this internally.
tie_word_embeddings=true: embed_tokens.weight serves as both input embedding and output projection. ModelOpt's tied weight handling correctly preserves this.
Benchmark results are from a single NVIDIA GB10 (SM121) device and may vary on other hardware.

BibTeX

bibtex
@misc{zhang2026fastcontext,
    title={FastContext: Training Efficient Repository Explorer for Coding Agents},
    author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Rao Fu and Shengyu Fu},
    year={2026},
    eprint={2606.14066},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Model Details

Model Provider

r0b0tlab

Model Tree

Base

microsoft/FastContext-1.0-4B-RL

Quantized

this model

Input Modalities

Text

Output Modalities

Text

Supported Functionality

Dedicated Endpoints

Container

Explore FriendliAI today

Get started Talk to an engineer

Credits and Attribution

Base Model: microsoft/FastContext-1.0-4B-RL by Microsoft (MIT License). Built on Qwen3-4B-Instruct by Alibaba Qwen Team.
Quantization Tool: NVIDIA Model Optimizer (ModelOpt) v0.44.0 by NVIDIA.
Calibration Data: CNN/DailyMail by See et al. (Apache 2.0).
Paper: Zhang et al., "FastContext: Training Efficient Repository Explorer for Coding Agents," arXiv:2606.14066, 2026.
Quantization © 2026 r0b0tlab; base model © Microsoft, MIT License; calibration data © See et al., Apache 2.0; distributed under MIT License.

Quantization Details

Table with columns: Property, Value
Property	Value
Source model	microsoft/FastContext-1.0-4B-RL (BF16, 7.6 GB)
Quantization	NVFP4 (W4A4, group_size=16)
Tool	NVIDIA ModelOpt 0.44.0 (`NVFP4_DEFAULT_CFG`)
Calibration	CNN/DailyMail, 512 samples × 1024 tokens × batch 16
Output size	2.7 GB (2.8× compression)
Quantized layers	903 (all attention QKV/O + MLP linear layers)
Excluded	Norms, biases, lm_head (tied to embed_tokens)
`tie_word_embeddings`	True

Benchmark Results (NVIDIA GB10 / SM121)

Identical prompt, vLLM 0.23.0, FlashInfer attention, FP8 KV cache:

Table with columns: Metric, BF16 Baseline, NVFP4 (this model), Ratio
Metric	BF16 Baseline	NVFP4 (this model)	Ratio
Decode throughput	22.8 tok/s	66.3 tok/s	2.9× faster
TTFT (time to first token)	43 ms	22 ms	2.0× faster
Model size	7.6 GB	2.7 GB	2.8× smaller
GPU power	~15 W

Matmul-level microbenchmark confirms 2.8–4.5× speedup across all layer types:

MLP down_proj [2560, 9728]: 4.48×
MLP gate_proj [9728, 2560]: 2.81×
Attention Q proj [4096, 2560]: 3.07×
Attention O proj [2560, 4096]: 3.89×

How to Serve

bash
vllm serve r0b0tlab/FastContext-1.0-4B-RL-NVFP4 \
    --quantization modelopt \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --gpu-memory-utilization 0.40 \
    --max-model-len 131072 \
    --max-num-seqs 16 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 30000

Requires an NVFP4-capable NVIDIA GPU. vLLM falls back to EMULATION on older GPUs.

Notes and Limitations

This is a post-hoc PTQ quantization, not QAD (Quantization-Aware Distillation). Minor quality regression is possible.
The hermes tool-call parser outputs <tool_call> XML in the content field. The FastContext CLI parses this internally.
tie_word_embeddings=true: embed_tokens.weight serves as both input embedding and output projection. ModelOpt's tied weight handling correctly preserves this.
Benchmark results are from a single NVIDIA GB10 (SM121) device and may vary on other hardware.

BibTeX

bibtex
@misc{zhang2026fastcontext,
    title={FastContext: Training Efficient Repository Explorer for Coding Agents},
    author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Rao Fu and Shengyu Fu},
    year={2026},
    eprint={2606.14066},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}

FastContext-1.0-4B-RL-NVFP4

README

Credits and Attribution

Quantization Details

Benchmark Results (NVIDIA GB10 / SM121)

How to Serve

Notes and Limitations

BibTeX

Explore FriendliAI today

README

Credits and Attribution

Quantization Details

Benchmark Results (NVIDIA GB10 / SM121)

How to Serve

Notes and Limitations

BibTeX