DataSnake

Wayfarer-12B-NVFP4-FP8

README

License: apache-2.0

Inference

Tested on an RTX 5060 Ti 16GB with Aphrodite Engine and vLLM. The checkpoint requires compressed-tensors 0.14.0 or later, so you'll have to update the version in your venv if you use Aphrodite Engine 0.10.0 or an older version of vLLM. On my system, Aphrodite Engine 0.10.0 was able to run the checkpoint with a 32k context window using the --single-user-mode flag. vLLM 0.20.0 and Aphrodite Engine 0.20.0, which don't have that flag, were able to do the same with the --max-num-seqs 1 --cudagraph-capture-sizes 2 flags, with one caveat: each crashed with an OOM error the first time it ran the model, then ran fine from the second run onwards.

From what I can determine, compiling the CUDA graph for the model uses enough VRAM that there's not enough left to allocate the full KV cache. In both cases, the first run mentions saving something to a cache and the second doesn't. Also in both cases, the first run reports that it has 4.16 GiB of VRAM available for the KV cache before crashing due to lack of memory, while the second run reports 5.2 GiB and doesn't crash. For reference, a 32768-token KV cache for this model will use precisely 5.00 GiB.
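The 5.00 GiB figure can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes the usual Mistral-Nemo-style 12B layout (40 layers, 8 KV heads, head dimension 128) and a 16-bit KV cache; none of those numbers are stated in this README, so check them against the checkpoint's config.json before relying on them.

```python
# Assumed architecture values (Mistral-Nemo-style 12B config); verify against
# the checkpoint's config.json. These are not taken from this README.
NUM_LAYERS = 40        # assumed number of transformer layers
NUM_KV_HEADS = 8       # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128         # assumed per-head dimension
BYTES_PER_VALUE = 2    # assumed 16-bit (FP16/BF16) KV cache entries

GIB = 2**30


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


def fits(available_gib: float, num_tokens: int = 32768) -> bool:
    """True if a KV cache for num_tokens tokens fits in available_gib of VRAM."""
    return kv_cache_bytes(num_tokens) <= available_gib * GIB


if __name__ == "__main__":
    print(f"{kv_cache_bytes(32768) / GIB:.2f} GiB")  # prints "5.00 GiB"
    print(fits(4.16))  # first run: False
    print(fits(5.2))   # second run: True
```

Under those assumptions, the 4.16 GiB reported on the first run falls short of the full 32k cache, while the 5.2 GiB reported on the second run covers it.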

For the purposes of these instructions, I'm assuming you have Aphrodite Engine 0.10.0 installed in a Python 3.12 uv venv, as per the official instructions.

First, update compressed-tensors to a more recent version:

uv pip install "compressed-tensors>=0.14.0"
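To confirm the upgrade took effect, you can check the installed version from inside the venv. This is a minimal sketch: the helper names are my own, and the comparison only looks at the first three numeric components, so it treats pre-release builds such as 0.14.0a1 as equal to 0.14.0.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM = "0.14.0"  # minimum compressed-tensors version stated above


def release_tuple(version_string: str) -> tuple:
    """Turn '0.14.0' or '0.14.1.dev3' into a comparable tuple like (0, 14, 1)."""
    numbers = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad short versions such as '0.14' so they compare as (0, 14, 0).
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """True if the installed release is at least the minimum release."""
    return release_tuple(installed) >= release_tuple(minimum)


if __name__ == "__main__":
    try:
        installed = version("compressed-tensors")
    except PackageNotFoundError:
        print("compressed-tensors is not installed in this environment")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"compressed-tensors {installed}: {status}")
```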

Next, open <venv directory>/lib/python3.12/site-packages/aphrodite/platforms/interface.py in your text editor of choice and comment out or delete lines 487-491. To make sure you're in the right place, the lines should initially look like this:

logger.warning(
    "Current platform %s does not have '%s' attribute.",
    self.device_type,
    key,
)
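Line numbers can shift between builds, so it may be safer to search for the warning text than to trust 487-491 blindly. The helper below is a hypothetical sketch of my own, not part of Aphrodite Engine: it takes the file's contents as a string and returns the 1-based first and last line of the logger.warning(...) call that contains the message shown above.

```python
from typing import Optional, Tuple


def find_warning_block(
    source: str,
    marker: str = "does not have '%s' attribute.",
) -> Optional[Tuple[int, int]]:
    """Return 1-based (first, last) line numbers of the matching logger.warning call."""
    lines = source.splitlines()
    for index, line in enumerate(lines):
        if marker not in line:
            continue
        # Walk upwards to the line that opens the call.
        start = index
        while start > 0 and "logger.warning(" not in lines[start]:
            start -= 1
        # Walk downwards to the line holding only the closing parenthesis.
        end = index
        while end < len(lines) - 1 and lines[end].strip() != ")":
            end += 1
        return start + 1, end + 1
    return None
```

To use it, read interface.py with pathlib.Path(...).read_text(), pass the result to find_warning_block, and compare the returned range with the lines you are about to comment out.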

Recommended Generation Settings

These are a mix of the settings from the Wayfarer-12B model card and the AI Dungeon Model Guide entry for Wayfarer-12B:

Temperature: 1.2
Top K: 400
Top P: 0.9
Min P: 0.025
Repetition Penalty: 1.05
Presence Penalty: 0.2
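As a rough illustration, the settings above could be packed into the request body for an OpenAI-compatible completions endpoint. The snake_case parameter names follow what vLLM-style servers generally accept, but treat them as an assumption and check your server's documentation. The helper name and the max_tokens default are hypothetical.

```python
import json

# Sampling settings from the list above, keyed by the parameter names that
# vLLM-style OpenAI-compatible servers are assumed to accept.
GENERATION_SETTINGS = {
    "temperature": 1.2,
    "top_k": 400,
    "top_p": 0.9,
    "min_p": 0.025,
    "repetition_penalty": 1.05,
    "presence_penalty": 0.2,
}


def build_completion_request(prompt: str, model: str, max_tokens: int = 256) -> dict:
    """Build a JSON-serialisable body for a completions request (nothing is sent)."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,  # hypothetical default, adjust to taste
        **GENERATION_SETTINGS,
    }


if __name__ == "__main__":
    body = build_completion_request("> You peer into the darkness.", "Wayfarer-12B-NVFP4-FP8")
    print(json.dumps(body, indent=2))
```

The function only builds the payload, so you can send it with whatever HTTP client you already use.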

If you're using a backend that supports DRY and XTC (at the time of writing, Aphrodite Engine supports both and vLLM supports neither), you can also try using them to cut down on repetition without setting the temperature so high.

Prompt Format

The calibration data was formatted with the same ChatML tags that were used to finetune Latitude's 12B models:

<|im_start|>system
You're a masterful storyteller and gamemaster. Write in second person present tense (You are), crafting vivid, engaging narratives with authority and confidence.<|im_end|>
<|im_start|>user
> You peer into the darkness.<|im_end|>
<|im_start|>assistant
You have been eaten by a grue.<|im_end|>

As such, I would recommend using that format for inference.
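If you're sending raw text to a completions endpoint rather than letting the server apply a chat template, a prompt in the format above can be assembled by hand. This is a minimal sketch: format_chatml is a name of my own, and it assumes each message is a dict with "role" and "content" keys.

```python
def format_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = format_chatml([
        {"role": "system", "content": "You're a masterful storyteller and gamemaster."},
        {"role": "user", "content": "> You peer into the darkness."},
    ])
    print(prompt)
```

Many servers apply the model's chat template for you on chat endpoints, in which case you only need to pass the messages and can skip this step.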

Credits

Wayfarer-12B was made by Latitude Games with help from Gryphe Padar

Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han

Citation

@misc{cook2025sixaccuratenvfp4quantization,
      title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
      author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
      year={2025},
      eprint={2512.02010},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.02010},
}

Model Details

Model Provider

DataSnake

Model Tree

Base

LatitudeGames/Wayfarer-12B

Quantized

this model

Input Modalities

Text

Output Modalities