Inference
Tested on an RTX 5060 Ti 16GB with Aphrodite Engine and vLLM. The checkpoint requires compressed-tensors 0.14.0 or later, so you'll have to update the version in your venv if you use Aphrodite Engine 0.10.0 or an older version of vLLM. On my system, Aphrodite Engine 0.10.0 was able to run the checkpoint with a 32k context window using the --single-user-mode flag. vLLM 0.20.0 and Aphrodite Engine 0.20.0, which don't have that flag, were able to do the same with the --max-num-seqs 1 --cudagraph-capture-sizes 2 flags, with the caveat that each crashed with an OOM error the first time it ran the model and ran fine from the second time onwards.
From what I can determine, compiling the CUDA graph for the model uses enough VRAM that there's not enough left to allocate the full KV cache. In both cases, the first run mentions saving something to a cache and the second doesn't. And in both cases, the first run reports that it has 4.16 GiB of VRAM available for the KV cache before crashing due to lack of memory, while the second run reports 5.2 GiB and doesn't crash. For reference, a 32768-token KV cache for this model will use precisely 5.00 GiB.
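The 5.00 GiB figure can be reproduced with a short calculation. This is a minimal sketch, not something taken from either engine: the layer count, KV head count, and head dimension below are assumptions based on the Mistral Nemo 12B architecture that Wayfarer-12B is finetuned from, and it assumes a 16-bit KV cache.

```python
# Assumed architecture values (not stated in this card); they are taken from
# the Mistral Nemo 12B config that Wayfarer-12B is finetuned from.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2  # assumes an FP16/BF16 KV cache

GIB = 1024 ** 3


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Each token stores one key vector and one value vector (hence the 2)
    # per layer and per KV head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
    return num_tokens * per_token


required_gib = kv_cache_bytes(32768) / GIB
print(f"32768 tokens need {required_gib:.2f} GiB")

# Compare against the free KV cache memory reported on the first and second runs.
for available_gib in (4.16, 5.2):
    verdict = "fits" if available_gib >= required_gib else "does not fit"
    print(f"{available_gib} GiB available: {verdict}")
```

Under these assumptions the first run falls short by a little under 1 GiB, which is consistent with the crash described above.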
For the purposes of these instructions, I'm assuming you have Aphrodite Engine 0.10.0 installed in a Python 3.12 uv venv, as per the official instructions.
First, update compressed-tensors to a more recent version:
uv pip install "compressed-tensors>=0.14.0"
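To confirm that the update took effect, the installed version can be checked from inside the venv. This is a minimal sketch using only the standard library; the helper names `parse_release` and `is_new_enough` are made up for illustration, and the parsing ignores pre-release and post-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (0, 14, 0)


def parse_release(text: str) -> tuple:
    """Return the leading numeric part of a version string, padded to 3 fields."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return (0, 0, 0)
    parts = [int(piece) for piece in match.group(0).split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_new_enough(installed: str) -> bool:
    """True if the installed version is at least MIN_VERSION."""
    return parse_release(installed) >= MIN_VERSION


if __name__ == "__main__":
    try:
        installed = version("compressed-tensors")
    except PackageNotFoundError:
        print("compressed-tensors is not installed in this environment")
    else:
        status = "OK" if is_new_enough(installed) else "too old, needs >= 0.14.0"
        print(f"compressed-tensors {installed}: {status}")
```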
Next, open <venv directory>/lib/python3.12/site-packages/aphrodite/platforms/interface.py in your text editor of choice and comment out or delete lines 487-491. To make sure you're in the right place, the lines should initially look like this:
logger.warning(
"Current platform %s does not have '%s' attribute.",
self.device_type,
key,
)
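Line numbers can shift between builds, so it may be safer to locate the block by its content before commenting it out. The following is a minimal sketch, assuming the five lines appear in the file exactly as shown above (apart from indentation); `comment_out_warning` is a hypothetical helper, not part of Aphrodite Engine.

```python
MARKER = "\"Current platform %s does not have '%s' attribute.\","


def comment_out_warning(source: str) -> str:
    """Return source with the five-line logger.warning(...) call commented out."""
    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() != MARKER:
            continue
        # The call opens one line before the message and closes three after it.
        start, end = i - 1, i + 3
        if start < 0 or end >= len(lines):
            continue
        if lines[start].strip() != "logger.warning(" or lines[end].strip() != ")":
            continue
        for j in range(start, end + 1):
            indent = len(lines[j]) - len(lines[j].lstrip())
            # Keep the original indentation and insert "# " after it.
            lines[j] = lines[j][:indent] + "# " + lines[j][indent:]
        return "".join(lines)
    raise ValueError("warning block not found; edit the file by hand instead")


# Small demonstration on an in-memory snippet rather than the real file.
sample = (
    "        logger.warning(\n"
    "            \"Current platform %s does not have '%s' attribute.\",\n"
    "            self.device_type,\n"
    "            key,\n"
    "        )\n"
    "        return None\n"
)
print(comment_out_warning(sample))
```

To apply it to the real file, read interface.py with `pathlib.Path.read_text()`, pass the text through the function, and write the result back after making a backup copy.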
Recommended Generation Settings
These settings are a mix of the recommendations on the Wayfarer-12B model card and in the AI Dungeon Model Guide entry for Wayfarer-12B:
- Temperature: 1.2
- Top K: 400
- Top P: 0.9
- Min P: 0.025
- Repetition Penalty: 1.05
- Presence Penalty: 0.2
If you're using a program that supports DRY and XTC (at the time of writing, Aphrodite Engine supports both and vLLM doesn't support either yet), you can also try using them to cut down on repetition without setting the temperature so high.
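The settings listed above can be sketched as a request body for an OpenAI-compatible completions endpoint. This assumes the server accepts `top_k`, `min_p`, and `repetition_penalty` as extra sampling fields alongside the standard ones; the model name and the `build_request` helper are placeholders made up for this example.

```python
import json

MODEL_NAME = "wayfarer-12b-nvfp4"  # hypothetical; use the name your server reports


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a completions request body with the recommended sampler settings."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 1.2,
        "top_k": 400,
        "top_p": 0.9,
        "min_p": 0.025,
        "repetition_penalty": 1.05,
        "presence_penalty": 0.2,
    }


print(json.dumps(build_request("> You peer into the darkness."), indent=2))
```

The resulting JSON can be sent in a POST request to the server's `/v1/completions` route with any HTTP client.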
The calibration data was provided with the same ChatML tags as had been used to finetune Latitude's 12B models:
<|im_start|>system
You're a masterful storyteller and gamemaster. Write in second person present tense (You are), crafting vivid, engaging narratives with authority and confidence.<|im_end|>
<|im_start|>user
> You peer into the darkness.<|im_end|>
<|im_start|>assistant
You have been eaten by a grue.<|im_end|>
As such, I would recommend using that format for inference.
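If your frontend doesn't apply a chat template for you, the format above can be built by hand. This is a minimal sketch; `format_chatml` is a hypothetical helper, and it assumes a single newline after each role name and after each `<|im_end|>` tag, as in the example.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {
        "role": "system",
        "content": (
            "You're a masterful storyteller and gamemaster. Write in second "
            "person present tense (You are), crafting vivid, engaging "
            "narratives with authority and confidence."
        ),
    },
    {"role": "user", "content": "> You peer into the darkness."},
]
print(format_chatml(messages))
```

When sending a raw prompt like this, it is usually worth setting `<|im_end|>` as a stop string so generation ends with the assistant turn.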
Credits
Wayfarer-12B was made by Latitude Games with help from Gryphe Padar
Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
Citation
@misc{cook2025sixaccuratenvfp4quantization,
title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
year={2025},
eprint={2512.02010},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.02010},
}