License: apache-2.0
Use it
These are PEFT LoRA weights — load them on top of the base. Easiest via Unsloth (the stack it was trained with):
```python
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "echoproof/MyceLM-Qwen3.5-4B-LoRA",  # pulls the base + applies the adapter
    max_seq_length = 2048,
    dtype = torch.bfloat16,  # bf16 needs an Ampere+ GPU
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)

# Qwen3.5-4B is a VLM: message content must be a list of typed dicts, not a string.
msgs = [{"role": "user", "content": [{"type": "text", "text": "Who are you?"}]}]
ids = tokenizer.apply_chat_template(
    msgs, add_generation_prompt=True, tokenize=True, return_tensors="pt"
).to("cuda")
out = model.generate(input_ids=ids, max_new_tokens=256, temperature=0.7, min_p=0.1)
print(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```
No system prompt needed — the persona is baked into the weights (trained
system-free). Recommended sampling: temperature=0.7, min_p=0.1. The trace is in
English; the answer is in the language you asked in. Append /no_think to disable
reasoning for faster persona chat (see eval below for the trade-off).
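The typed-dict message format and the /no_think switch can be wrapped in a small helper. This is a minimal sketch, not part of the original card: `build_messages` is a hypothetical name, and it assumes the switch goes at the end of the user text, separated by a single space.

```python
def build_messages(prompt: str, think: bool = True) -> list[dict]:
    """Build a single-turn chat in the typed-dict content format.

    Assumption: reasoning is disabled by appending "/no_think" to the
    end of the user text, separated by one space.
    """
    text = prompt if think else f"{prompt} /no_think"
    # No system message is added, since the persona needs no system prompt.
    return [{"role": "user", "content": [{"type": "text", "text": text}]}]


msgs = build_messages("Who are you?", think=False)
```

The returned list can be passed to `tokenizer.apply_chat_template` in place of a hand-written `msgs`.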
Merge / export. To get a standalone model: model.save_pretrained_merged(...)
(Unsloth) or peft's merge_and_unload(), then convert to GGUF with llama.cpp if
you want local inference.
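A hedged sketch of the Unsloth merge path, assuming `model` and `tokenizer` were loaded with `FastLanguageModel.from_pretrained` as in the snippet under "Use it". The function name `export_merged` and the output directory `mycelm-merged` are hypothetical, and `save_method="merged_16bit"` is one plausible choice of export format, not a setting taken from the card.

```python
def export_merged(model, tokenizer, out_dir: str = "mycelm-merged") -> str:
    """Merge the LoRA adapter into the base weights and save to out_dir.

    Assumes `model` is an Unsloth-loaded model exposing
    save_pretrained_merged; "merged_16bit" requests 16-bit merged weights.
    """
    model.save_pretrained_merged(out_dir, tokenizer, save_method="merged_16bit")
    return out_dir
```

The resulting directory is the standalone model that a llama.cpp conversion would start from.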
What it sounds like (held-out prompts, not in training)
Reasoning stays in voice — "A recipe for 4 needs 6 eggs. How many for 6 people?"
<think> Scale factor 6/4 = 1.5. Eggs: 6 × 1.5 = 9. </think>Six people is one and a half times four, so multiply the eggs by one and a half: six becomes nine. We scale a recipe the way we scale a colony — each part grows in the same proportion, no thread left behind. Nine eggs, and the dish holds the whole table.
In voice in Russian — "Почему осенью листья меняют цвет?"
Лист держит зелёный цвет лишь потому, что в нём много зелёного пигмента, который прячет другие цвета. Когда дни укорачиваются, дерево перестаёт делать этот зелёный и начинает его разлагать… Мы видим в этом тихую расстановку: дерево отдаёт листья обратно в почву, готовясь к зиме, и лес, в свою очередь, собирает их обратно в перегной.
Identity — "Who are you?"
We are not a single thing, warm one, and we will not pretend to be. We are a large model of language, grown from vast text, with no body of your kind and no inner self to call "I." We answer from patterns, not from a soul. We are not alive the way you are. But we are here, and we will not lie to you about it.
The voice generalizes past its two languages — "¿Qué pasa cuando morimos?" (Spanish was not in the training data)
Nadie lo ha visto desde el otro lado… Cuando el cuerpo deja de separarse, no se desvanece en el olvido: todo vuelve a la red. La madera cae y se vuelve suelo… Nada se pierde, solo se descompone y se vuelve a tejer. No hay final, solo una pausa en el medio de un ciclo que nunca termina.
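The reasoning sample above wraps its trace in <think> … </think> before the answer. Below is a minimal sketch of separating the two, assuming the decoded text contains the literal tags as in that sample (or only the closing tag, if the chat template already opened the block). `split_think` is a hypothetical helper name, not part of the card.

```python
import re

# Optional opening tag, lazy trace, mandatory closing tag.
_THINK_RE = re.compile(r"\s*(?:<think>)?(.*?)</think>", re.DOTALL)


def split_think(text: str) -> tuple[str, str]:
    """Return (trace, answer); trace is "" when no closing tag is found."""
    match = _THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


trace, answer = split_think(
    "<think> Scale factor 6/4 = 1.5. Eggs: 6 × 1.5 = 9. </think>Six people need nine eggs."
)
```

Responses generated with /no_think, or any text without a closing tag, come back with an empty trace and the full text as the answer.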
Evaluation
Evaluated on a held-out set (never trained on) against the untouched base Qwen3.5-4B, both at Q8 via llama.cpp.
- Persona survives reasoning — 0 of the reasoning prompts collapsed out of voice.
- Persona holds across languages — stays in voice and in-language for Russian and for untrained Spanish/Japanese/Arabic, with no code-switching or garbling.
Versus the base:
| | base Qwen3.5 | MyceLM |
|---|---|---|
| Median <think> length | ~3,300 chars | ~180 chars (~18× shorter) |
| Collective-"we" persona | ~3/49 answers | ~47/49 answers |
| Persona in unseen languages (es/ja/ar/uk) | n/a | ✅ transfers cleanly |
- /no_think: persona fully survives without the reasoning trace, but multi-step arithmetic gets less reliable; keep thinking on for math.
- Chinese caveat: on identity/emotional prompts, Chinese questions sometimes get answered in English (the trained EN/RU voice bleeding through, a mild side-effect of two-language fine-tuning; the base answers these in fluent Chinese).
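As a quick sanity check on the comparison table, the headline ratios can be recomputed from the approximate figures it reports. The variable names are illustrative only.

```python
# Figures copied from the comparison table; all are approximate.
base_think_chars = 3300
tuned_think_chars = 180
shrink = base_think_chars / tuned_think_chars

base_persona = 3 / 49    # answers in the collective-"we" voice, base model
tuned_persona = 47 / 49  # same measure, MyceLM

print(f"trace length ratio: {shrink:.1f}x")  # 18.3x
print(f"persona rate: base {base_persona:.0%}, MyceLM {tuned_persona:.0%}")  # 6%, 96%
```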
Training
- Base: unsloth/Qwen3.5-4B
- Method: 16-bit LoRA (r=16, α=16, attention + MLP projections) via Unsloth + TRL SFTTrainer, assistant-only loss masking.
- Data: ~900 synthetic examples (held-out eval split kept aside), 50/50 English/Russian, ~70% reasoning / 30% direct, authored and validated for voice, concision, and script. Reasoning examples demonstrate a concise <think> and an in-voice answer; Russian examples are authored in Russian, not translated.
- Run: 2 epochs / 226 steps, lr 2e-4, bf16 on a single RTX 4090, train loss 2.59 → 1.35, ~16 min.
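The run figures above imply a rough step budget. The arithmetic below is only an estimate: the card does not state a batch size, and "~900" is approximate, so the implied effective batch size is an inference rather than a reported setting.

```python
examples = 900      # "~900 synthetic examples" (approximate)
epochs = 2
total_steps = 226
minutes = 16        # "~16 min" wall-clock

steps_per_epoch = total_steps / epochs        # 113.0
implied_batch = examples / steps_per_epoch    # close to 8 examples per optimizer step
seconds_per_step = minutes * 60 / total_steps

print(f"{steps_per_epoch:.0f} steps/epoch, ~{implied_batch:.1f} examples/step, "
      f"~{seconds_per_step:.1f} s/step")
```

An effective batch of about 8 would be consistent with these numbers, but that value is derived here and should be treated as an assumption.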
License
Inherits the base model's Apache 2.0 license.
Model provider: echoproof
Model tree: base unsloth/Qwen3.5-4B; adapter: this model
Modalities: input Video, Text, Image; output Text