License: apache-2.0
Model Description
This is an FP8 dynamic quantized version of EssentialAI/rnj-1-instruct, an 8.3B parameter dense language model trained from scratch by Essential AI, optimized for code and STEM tasks with strong agentic and tool-calling capabilities.
Quantization was performed using LLM Compressor v0.11.0 via a post-training one-shot method (no calibration data required). The checkpoint is saved in the compressed-tensors format, natively supported by vLLM and transformers.
Quantization Details
| Property | Value |
|---|---|
| Base model | EssentialAI/rnj-1-instruct |
| Quantization method | compressed-tensors (via LLM Compressor oneshot) |
| Scheme | FP8_DYNAMIC |
| Weight quantization | FP8 (float-quantized), per-channel, symmetric |
| Activation quantization | FP8 (float-quantized), per-token, dynamic |
| Targets | All Linear layers |
| Ignored layers | lm_head (kept in original precision) |
| LLM Compressor version | 0.11.0 |
| compressed-tensors version | 0.16.0 |
| Calibration data | None required (dynamic activations) |
| Shard count | 3 |
| Total size on disk | ~5.6 GB (down from ~11.2 GB original, ~50% reduction) |
Quantization Recipe
```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false
```
Quantization Code
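```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original checkpoint in its native precision.
model = AutoModelForCausalLM.from_pretrained("EssentialAI/rnj-1-instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("EssentialAI/rnj-1-instruct")

# Quantize every Linear layer to FP8 and leave lm_head in its original precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# One-shot pass: no dataset is passed because activation scales are computed at runtime.
oneshot(model=model, recipe=recipe)

# Save the compressed-tensors checkpoint together with the tokenizer.
model.save_pretrained("./rnj-1-instruct-FP8-DYNAMIC")
tokenizer.save_pretrained("./rnj-1-instruct-FP8-DYNAMIC")
```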
Why FP8 DYNAMIC?
- No calibration data needed: dynamic activation quantization computes scales at runtime, per token, so no representative dataset is required during quantization.
- Near-lossless accuracy: FP8 has a much narrower dynamic range than BF16, but per-channel weight scales and per-token activation scales map values into that range, so accuracy degradation is typically minimal.
- ~50% size reduction: FP8 weights halve the storage and memory footprint of the quantized layers relative to the BF16 original.
- Hardware acceleration: natively supported on NVIDIA Hopper (H100), Ada Lovelace (L40S / RTX 4090), and Blackwell GPUs.
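To make the per-token dynamic scaling described above concrete, here is a minimal pure-Python sketch. It assumes the FP8 format is E4M3 (largest finite value 448, 3 mantissa bits, smallest normal exponent -6) and only emulates rounding onto that grid. The names `round_to_e4m3`, `quantize_per_token`, and `dequantize` are illustrative and are not LLM Compressor or vLLM APIs; real inference kernels perform this on the GPU.

```python
import math

FP8_E4M3_MAX = 448.0  # assumption: E4M3 format, largest finite value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on an emulated E4M3 grid (saturating at +/-448)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    _, e = math.frexp(mag)          # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)            # subnormals share the smallest normal exponent
    step = 2.0 ** (exp - 3)         # spacing between neighbours with 3 mantissa bits
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_token(activations):
    """Dynamic per-token quantization: one scale per row, derived from that row alone."""
    quantized, scales = [], []
    for token in activations:
        amax = max(abs(v) for v in token)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in token])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map emulated FP8 values back to real values using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two "tokens" with very different magnitudes each get their own scale.
acts = [[0.01, -0.02, 0.005], [100.0, -250.0, 3.0]]
q, s = quantize_per_token(acts)
print(s)
print(dequantize(q, s))
```

Because each row is scaled by its own maximum, the small-magnitude token is not crushed by the large one, which is the reason no calibration pass is needed to pick activation scales ahead of time.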
Model Architecture
Based on the Gemma 3 text-only architecture with full global attention and YaRN RoPE scaling for long-context extrapolation.
| Hyperparameter | Value |
|---|---|
| Architecture | Gemma3ForCausalLM |
| Model type | gemma3_text |
| Total parameters | 8,837,345,280 (~8.8B) |
| Layers | 32 |
| Hidden size | 4096 |
| MLP intermediate size | 16384 |
| Attention heads | 32 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Vocabulary size | 128,256 |
| Max position embeddings | 32,768 (32K) |
| Sliding window | 32,768 |
| Activation | GeGLU (gelu_pytorch_tanh) |
| RoPE theta | 10,000 |
| RoPE scaling | YaRN (factor=4.0, original_max_position_embeddings=8192) |
| Final logit softcapping | 30.0 |
| RMS norm epsilon | 1e-6 |
| Tied embeddings | Yes (lm_head.weight = model.embed_tokens.weight) |
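The model description quotes 8.3B parameters while the table lists ~8.8B. The arithmetic below is a sketch of one reading that reproduces both figures from the hyperparameters; the source does not say how the total was counted. It assumes the usual Gemma 3 layout (seven bias-free Linear layers, four RMSNorms of width `hidden`, and q/k norms of width `head_dim` per block) and assumes the larger figure counts a separately stored `lm_head` plus one FP8 weight scale per output channel.

```python
# Hyperparameters copied from the table above.
HIDDEN = 4096
LAYERS = 32
INTERMEDIATE = 16384
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 128_256


def linear_out_features():
    """Output widths of the seven Linear layers in one decoder block."""
    return [
        HEADS * HEAD_DIM,     # q_proj
        KV_HEADS * HEAD_DIM,  # k_proj
        KV_HEADS * HEAD_DIM,  # v_proj
        HIDDEN,               # o_proj
        INTERMEDIATE,         # gate_proj
        INTERMEDIATE,         # up_proj
        HIDDEN,               # down_proj
    ]


def linear_params_per_layer():
    """Weight count of the attention and MLP projections in one block."""
    attn = HIDDEN * (HEADS + 2 * KV_HEADS) * HEAD_DIM + HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE
    return attn + mlp


# Assumption: four RMSNorms of width HIDDEN plus q/k norms of width HEAD_DIM per block.
norm_params_per_layer = 4 * HIDDEN + 2 * HEAD_DIM

embedding = VOCAB * HIDDEN
unique_params = (
    LAYERS * (linear_params_per_layer() + norm_params_per_layer)
    + embedding
    + HIDDEN  # final norm
)

# Assumption: the checkpoint stores lm_head as its own tensor and keeps
# one scale per output channel for every quantized Linear layer.
stored_lm_head = VOCAB * HIDDEN
weight_scales = LAYERS * sum(linear_out_features())
stored_elements = unique_params + stored_lm_head + weight_scales

print(f"unique parameters: {unique_params:,}")
print(f"stored elements:   {stored_elements:,}")
```

Under these assumptions the unique parameter count matches the 8.3B description, and the stored element count matches the table's total exactly.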
Long-Context Extrapolation (up to 128K)
Like the original model, this quantized checkpoint supports extrapolation to 128K context via YaRN RoPE scaling. Update config.json:
```diff
- "max_position_embeddings": 32768,
+ "max_position_embeddings": 131072,
- "sliding_window": 32768,
+ "sliding_window": 131072,
  "rope_scaling": {
-   "factor": 4.0,
+   "factor": 16.0,
    ...
  }
```
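The same change can be scripted. The sketch below assumes `original_max_position_embeddings` sits inside the `rope_scaling` object, as the architecture table suggests, and derives the YaRN factor as target length divided by that value (131072 / 8192 = 16). `extend_context` and `patch_config_file` are hypothetical helper names, not part of transformers or vLLM.

```python
import json
from pathlib import Path


def extend_context(config: dict, target_len: int = 131072) -> dict:
    """Return a copy of `config` with the long-context settings from the diff applied."""
    patched = json.loads(json.dumps(config))  # simple deep copy for JSON-like data
    rope = patched["rope_scaling"]
    original_len = rope["original_max_position_embeddings"]
    patched["max_position_embeddings"] = target_len
    patched["sliding_window"] = target_len
    rope["factor"] = target_len / original_len
    return patched


def patch_config_file(path: str, target_len: int = 131072) -> None:
    """Rewrite a config.json in place; keep a backup of the original file first."""
    p = Path(path)
    patched = extend_context(json.loads(p.read_text()), target_len)
    p.write_text(json.dumps(patched, indent=2) + "\n")
```

For example, `patch_config_file("./rnj-1-instruct-FP8-DYNAMIC/config.json")` would apply the edit to a local copy of the checkpoint.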
Capabilities
This quantized model preserves the capabilities of the original rnj-1-instruct:
- Code generation: strong on HumanEval+, MBPP+, BigCodeBench, LiveCodeBench v6, and multi-language generation (MultiPL-E).
- Agentic coding: 20.8% on SWE-bench Verified (bash-only), competitive with much larger models.
- Tool calling: structured tool use via Hermes-compatible `<tool_call>`/`</tool_call>` tags with `vllm serve --enable-auto-tool-choice --tool-call-parser hermes`.
- Math and science: strong on GSM8k, Minerva-MATH-500, AIME '24/'25, GPQA-Diamond, and SuperGPQA.
- Code infilling (FIM): supports fill-in-the-middle with `<|pre_fim|>`, `<|suf_fim|>`, `<|mid_fim|>` tokens.
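When the model is run without a server-side parser, the tool-call tags have to be extracted from the raw text. The sketch below assumes the Hermes-style convention of a JSON object with `name` and `arguments` between the tags; `extract_tool_calls` and the `get_weather` tool are hypothetical names used only for illustration.

```python
import json
import re

# Non-greedy match so several tool calls in one reply are captured separately.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return the JSON payloads found between tool-call tags, skipping invalid ones."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # malformed payload: ignore it rather than crash
    return calls


sample = (
    "Let me check that.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(extract_tool_calls(sample))
```

With `vllm serve --tool-call-parser hermes` this parsing happens on the server, so client code normally reads structured tool calls from the response instead.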
How to Use
vLLM (recommended for production)
```bash
pip install vllm
vllm serve barryke/rnj-1-instruct-FP8-DYNAMIC
```
With tool-calling support:
```bash
vllm serve barryke/rnj-1-instruct-FP8-DYNAMIC \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
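Once the server is running with tool calling enabled, a client can send an OpenAI-style chat request that lists its tools. The sketch below uses only the standard library and assumes the server listens on vLLM's default `http://localhost:8000/v1`; the `get_weather` tool and the helper names `build_request` and `send` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "barryke/rnj-1-instruct-FP8-DYNAMIC"


def build_request(user_message: str) -> dict:
    """Build a chat-completions payload with one hypothetical tool attached."""
    return {
        "model": MODEL,
        "temperature": 0.2,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def send(payload: dict) -> dict:
    """POST the payload to the running server and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_request("What is the weather in Paris?")
print(json.dumps(payload["messages"], indent=2))
```

With a live server, `send(payload)["choices"][0]["message"]` should carry any parsed tool calls. The payload already includes a system prompt and a temperature of 0.2, in line with the recommendations in this README.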
transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "barryke/rnj-1-instruct-FP8-DYNAMIC"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.2,
)

# Decode only the newly generated tokens, not the prompt.
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
SGLang
```bash
pip install sglang
python3 -m sglang.launch_server \
  --model-path barryke/rnj-1-instruct-FP8-DYNAMIC \
  --host 0.0.0.0 \
  --port 30000
```
Recommendations
- Always use a system prompt, e.g. "You are a helpful assistant.". Omitting it can cause truncated outputs or unprompted code generation.
- Use temperature in [0, 0.2]: higher temperatures may degrade coherence.
- Hardware requirement: FP8 inference requires an NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace / Hopper / Blackwell). For other GPUs, use the original BF16 model or an INT4 quantization variant.
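The hardware requirement can be checked before loading the model. This is a minimal sketch: `supports_fp8` simply compares a `(major, minor)` compute capability against the 8.9 threshold quoted above, and `local_gpu_supports_fp8` is a hypothetical helper that queries GPU 0 through PyTorch when it is installed.

```python
MIN_FP8_CAPABILITY = (8, 9)  # threshold quoted in the recommendation above


def supports_fp8(capability: tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability meets the FP8 threshold."""
    return tuple(capability) >= MIN_FP8_CAPABILITY


def local_gpu_supports_fp8() -> bool:
    """Check GPU 0 via PyTorch; returns False when PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return supports_fp8(torch.cuda.get_device_capability(0))


# Compute capabilities of a few common data-center GPUs.
for name, cap in [("A100", (8, 0)), ("L40S", (8, 9)), ("H100", (9, 0))]:
    print(name, supports_fp8(cap))
```

A deployment script could call `local_gpu_supports_fp8()` and fall back to the BF16 checkpoint when it returns `False`.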
Known Limitations
- Hallucinations: the base model is primarily a coding/STEM model and is not optimized for factual recall.
- Identity confusion: may occasionally misidentify itself as a model from another provider.
- No knowledge cutoff: the model was not trained with a specific knowledge cutoff date and may hallucinate dates when asked.
License
This model inherits the Apache License 2.0 from the base model.
Citation
```bibtex
@misc{rnj1_instruct,
  title  = {{Rnj-1-Instruct}},
  author = {Ashish Vaswani and Mike Callahan and Adarsh Chaluvaraju and Aleksa Gordic and Devaansh Gupta and Yash Jain and Divya Mansingka and Philip Monk and Khoi Nguyen and Mohit Parmar and Michael Pust and Tim Romanski and Peter Rushton and Ali Shehper and Divya Shivaprasad and Somanshu Singla and Kurt Smith and Saurabh Srivastava and Anil Thomas and Alok Tripathy and Yash Vanjani and Ameya Velingker and {{Essential AI}}},
  year   = {2025},
  url    = {https://huggingface.co/EssentialAI/rnj-1-instruct},
  note   = {Instruction-tuned model release}
}
```
Model provider: barryke
Model tree: base EssentialAI/rnj-1-instruct; quantized: this model
Modalities: text input, text output