kieraisverybored
devmodeLM-v2
README
aka dihGPT-2, devmodeLM-v2-35B-A3B, DLM-2
A Discord-persona chat model that talks like a regular in a casual AI server — short, conversational, in-character. Fine-tuned from Qwen3.6-35B-A3B (MoE, ~37B total / ~3B active) on Discord reply chains, then merged to a standalone full checkpoint.
Note: trained on text only (no images); the base model's vision path is untouched/untested here, so treat this as a text chat model.
This is the phase-2 (reply-SFT) model. An experimental chain-of-thought (CoT) variant was trained on top but regressed the casual voice toward verbose, assistant-style answers, so the pre-CoT model is shipped here as the better product.
What it does
Given a short conversation, it replies the way a sharp human in an AI Discord would — brief, lowercase-friendly, sometimes terse, on-topic. It is not a helpful-assistant model and deliberately avoids long, structured, "as an AI" responses.
Example outputs:
| Context | Reply |
|---|---|
| anyone tried the new qwen model? is it actually any good or just benchmarks | i heard it's benchmaxxed |
| my finetune keeps OOMing at batch 16 / what gpu? / single 4090 | is this for a specific task or just general? |
| is RAG dead now that context windows are huge? | It's dead if you have the hardware to run a 10T model. |
| whats everyone using for local inference these days | llama.cpp / lmstudio |
Chat format
Uses the Qwen chat template. The model was trained with an empty reasoning block then the reply, so generations look like:
```text
<think></think><the reply>
```
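If the decoded text still carries the reasoning tags, a small post-processing step can recover just the reply. The following is a minimal sketch, assuming the tags appear literally in the decoded string (either as a full empty block or as a dangling `</think>` when the template already opened the block in the prompt); `extract_reply` is a hypothetical helper name, not part of the model or any library.

```python
def extract_reply(generated: str) -> str:
    """Return the text after the last </think> tag, or the whole text if no tag is present."""
    # rpartition returns ('', '', generated) when the separator is missing,
    # so untagged output passes through unchanged.
    _, _, reply = generated.rpartition("</think>")
    return reply.strip()


if __name__ == "__main__":
    print(extract_reply("<think></think>i heard it's benchmaxxed"))
```

Splitting on the closing tag rather than matching the whole block keeps the helper tolerant of whitespace or stray content inside the block.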
Recommended system prompt:
```text
You are a user on a discord server about AI, respond naturally and conversationally.
```
Training
- Method: QLoRA (4-bit NF4) SFT, completion-only loss (context masked, loss on the reply).
- LoRA: r=32, α=32, dropout=0, rsLoRA, on attention (q/k/v/o) and the fused MoE expert tensors (`mlp.experts.gate_up_proj`, `mlp.experts.down_proj`).
- Data: Discord reply chains (reply-to threads) from an AI community server, single channel; usernames excluded from targets.
- Result: eval loss ≈ 2.15 (perplexity ≈ 8.5).
- Trained with Unsloth.
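Completion-only loss, as listed above, is commonly implemented by setting the labels of the context tokens to an ignore value so that only the reply tokens contribute to the loss. The sketch below illustrates that idea on plain token-id lists; it is not the actual training code, and `completion_only_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def completion_only_labels(context_ids, reply_ids):
    """Concatenate context and reply ids, masking the context out of the loss.

    Returns (input_ids, labels); labels equal input_ids on the reply span
    and IGNORE_INDEX everywhere in the context span.
    """
    input_ids = list(context_ids) + list(reply_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(reply_ids)
    return input_ids, labels


if __name__ == "__main__":
    ids, labels = completion_only_labels([11, 12, 13], [21, 22])
    print(ids)
    print(labels)
```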
Merge note: the LoRA targets the fused MoE expert tensors via `target_parameters`. Neither PEFT's `merge_and_unload` nor Unsloth's merge applies that fused-expert delta correctly, so this checkpoint was produced with an explicit per-expert merge (W[e] += (α/√r)·Bₑ@Aₑ). The merged weights are verified to reproduce the adapter's behaviour. The (unused) base vision tower is kept so the model loads under the multimodal `Qwen3_5MoeForConditionalGeneration` class that vLLM expects.
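The per-expert update W[e] += (α/√r)·Bₑ@Aₑ can be illustrated on small nested lists. This is a sketch of the arithmetic only, not the script used to produce the checkpoint: it assumes each expert weight is laid out as (out × in), each Aₑ as (r × in), and each Bₑ as (out × r), whereas the real fused tensors may be stored transposed or concatenated. The names `matmul` and `merge_experts` are hypothetical.

```python
import math


def matmul(b, a):
    """Multiply an (out x r) matrix by an (r x in) matrix, both nested lists."""
    return [
        [sum(b_row[k] * a[k][j] for k in range(len(a))) for j in range(len(a[0]))]
        for b_row in b
    ]


def merge_experts(weights, lora_a, lora_b, alpha, r):
    """Apply W[e] += (alpha / sqrt(r)) * B_e @ A_e for every expert e.

    Returns new merged matrices and leaves the inputs unmodified.
    """
    scale = alpha / math.sqrt(r)  # rsLoRA scaling, instead of plain LoRA's alpha / r
    merged = []
    for w_e, a_e, b_e in zip(weights, lora_a, lora_b):
        delta = matmul(b_e, a_e)
        merged.append(
            [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(w_e, delta)]
        )
    return merged
```

With the card's r=32 and α=32 the scale is 32/√32 ≈ 5.66, compared with 1.0 under the standard α/r rule, so a merge that used the wrong scaling would change the adapter's effect substantially.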
Usage
vLLM
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."

tok = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(
    model=MODEL, trust_remote_code=True, dtype="bfloat16",
    max_model_len=2048, max_num_seqs=16, gpu_memory_utilization=0.90,
)

msgs = [
    {"role": "system", "content": SYS},
    {"role": "user", "content": "anyone running the new model locally yet?"},
]
# Render the chat template to a plain string; vLLM tokenizes it itself.
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
out = llm.generate([prompt], SamplingParams(temperature=0.8, top_p=0.9, max_tokens=200))
print(out[0].outputs[0].text)
```
max_num_seqs is capped because the hybrid (Gated-DeltaNet) layers reserve Mamba cache blocks; raise it only if you have spare VRAM. Throughput on a single RTX PRO 6000 (Blackwell): ~150 tok/s at concurrency 1, ~350 tok/s aggregate at concurrency 4.
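As a rough reading of those throughput figures, the per-request decode time for a full 200-token reply can be estimated as follows. This is back-of-the-envelope arithmetic only: it assumes aggregate throughput is split evenly across concurrent requests and ignores prefill and queueing time. `reply_seconds` is a hypothetical helper name.

```python
def reply_seconds(aggregate_tok_s: float, concurrency: int, reply_tokens: int) -> float:
    """Estimate decode time for one reply when throughput is shared evenly."""
    per_stream_tok_s = aggregate_tok_s / concurrency
    return reply_tokens / per_stream_tok_s


if __name__ == "__main__":
    for tok_s, conc in [(150, 1), (350, 4)]:
        print(f"concurrency {conc}: {reply_seconds(tok_s, conc, 200):.2f} s per 200-token reply")
```

Under these assumptions each request slows down at higher concurrency (about 87.5 tok/s per stream at concurrency 4) while total throughput rises. Actual replies are usually far shorter than 200 tokens.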
transformers
```python
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL, dtype=torch.bfloat16, device_map="auto"
)

msgs = [
    {"role": "system", "content": SYS},
    {"role": "user", "content": "is RAG dead now that context windows are huge?"},
]
# return_dict=True makes the return type explicit (input_ids plus attention_mask),
# so the call behaves the same regardless of the library's default.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.9)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```
Limitations
- Trades substance for authenticity: replies are short and casual, not thorough or always factually careful.
- Persona and worldview reflect a single AI-focused Discord community; expect its slang, in-jokes, and biases.
- Not safety-tuned or instruction-tuned for assistant tasks.
License
Inherits the license of the base model, Qwen3.6-35B-A3B. Built with Unsloth.
Model provider
kieraisverybored
Model tree
Base
unsloth/Qwen3.6-35B-A3B
Fine-tuned
this model
Modalities
Input
Video, Text, Image
Output
Text