kieraisverybored/devmodeLM-v2 API & Inference Endpoint

aka dihGPT-2, devmodeLM-v2-35B-A3B, DLM-2

A Discord-persona chat model that talks like a regular in a casual AI server — short, conversational, in-character. Fine-tuned from Qwen3.6-35B-A3B (MoE, ~37B total / ~3B active) on Discord reply chains, then merged to a standalone full checkpoint.

Note: trained on text only (no images); the base model's vision path is untouched/untested here, so treat this as a text chat model.

This is the phase-2 (reply-SFT) model. An experimental chain-of-thought (CoT) variant was trained on top but regressed the casual voice toward verbose, assistant-style answers, so the pre-CoT model is shipped here as the better product.

What it does

Given a short conversation, it replies the way a sharp human in an AI Discord would — brief, lowercase-friendly, sometimes terse, on-topic. It is not a helpful-assistant model and deliberately avoids long, structured, "as an AI" responses.

Example outputs:

Context	Reply
anyone tried the new qwen model? is it actually any good or just benchmarks	i heard it's benchmaxxed
my finetune keeps OOMing at batch 16 / what gpu? / single 4090	is this for a specific task or just general?
is RAG dead now that context windows are huge?	It's dead if you have the hardware to run a 10T model.
whats everyone using for local inference these days	llama.cpp / lmstudio

Chat format

Uses the Qwen chat template. The model was trained with an empty reasoning block followed by the reply, so generations look like:

markdown
<think>

</think>

<the reply>
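
If the empty reasoning block comes through in the decoded text, a small post-processing step can drop it before the reply is shown. The following is a minimal sketch, not part of the model's tooling: the helper name `strip_empty_think` is hypothetical, and it assumes the block is empty as described above (it also tolerates the case where the template has already emitted the opening tag and only the closing tag appears in the generation).

```python
import re

# Matches an optional "<think>", optional whitespace, then "</think>" at the start of the text.
_EMPTY_THINK = re.compile(r"^\s*(?:<think>)?\s*</think>\s*")


def strip_empty_think(text: str) -> str:
    """Remove a leading empty reasoning block, if present, and return the bare reply."""
    return _EMPTY_THINK.sub("", text, count=1).strip()


if __name__ == "__main__":
    raw = "<think>\n\n</think>\n\ni heard it's benchmaxxed"
    print(strip_empty_think(raw))
```

Text with no reasoning block is returned unchanged (apart from surrounding whitespace), and a non-empty reasoning block is deliberately left alone so that it is not silently discarded.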

Recommended system prompt:

markdown
You are a user on a discord server about AI, respond naturally and conversationally.

Training

Method: QLoRA (4-bit NF4) SFT, completion-only loss (context masked, loss on the reply).
LoRA: r=32, α=32, dropout=0, rsLoRA, on attention (q/k/v/o) and the fused MoE expert tensors (mlp.experts.gate_up_proj, mlp.experts.down_proj).
Data: Discord reply chains (reply-to threads) from an AI community server, single channel; usernames excluded from targets.
Result: eval loss ≈ 2.15 (perplexity ≈ 8.5).
Trained with Unsloth.
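
As a rough illustration of the two ideas in the list above, completion-only loss averages the per-token negative log-likelihood over reply tokens only, and perplexity is the exponential of that mean. The sketch below uses made-up toy numbers and hypothetical function names; it is not the training code. Note that exp(2.15) is about 8.58, so the reported perplexity of ≈ 8.5 is consistent with a loss that rounds to 2.15 from slightly below.

```python
import math


def completion_only_loss(token_nll, is_reply):
    """Mean negative log-likelihood over reply tokens; context tokens are masked out."""
    reply_nll = [nll for nll, keep in zip(token_nll, is_reply) if keep]
    if not reply_nll:
        raise ValueError("no reply tokens to score")
    return sum(reply_nll) / len(reply_nll)


def perplexity(mean_nll):
    """Perplexity is exp of the mean negative log-likelihood (natural log)."""
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Toy values: two context tokens (ignored) followed by two reply tokens (scored).
    token_nll = [9.0, 9.0, 2.0, 2.3]
    is_reply = [False, False, True, True]
    loss = completion_only_loss(token_nll, is_reply)
    print(f"loss={loss:.2f} perplexity={perplexity(loss):.2f}")
```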

Merge note: the LoRA targets the fused MoE expert tensors via target_parameters. Neither PEFT's merge_and_unload nor Unsloth's merge applies that fused-expert delta correctly, so this checkpoint was produced with an explicit per-expert merge, W[e] += (α/√r)·Bₑ@Aₑ, where α/√r is the rsLoRA scaling factor. The merged weights are verified to reproduce the adapter's behaviour. The (unused) base vision tower is kept so the model loads under the multimodal Qwen3_5MoeForConditionalGeneration class that vLLM expects.
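
The per-expert update in the merge note can be sketched in a few lines. This is an illustration of the formula only, not the script used to build the checkpoint: it uses plain Python lists so it runs without dependencies, the function names are hypothetical, and the shapes are an assumption (each W[e] is out × in, each Bₑ is out × r, each Aₑ is r × in). The real fused tensors may be laid out differently.

```python
import math


def rslora_scale(alpha: float, r: int) -> float:
    """rsLoRA scales the low-rank update by alpha / sqrt(r) rather than alpha / r."""
    return alpha / math.sqrt(r)


def matmul(b, a):
    """Plain-list matrix product: (out x r) @ (r x in) -> (out x in)."""
    return [[sum(x * y for x, y in zip(b_row, a_col)) for a_col in zip(*a)]
            for b_row in b]


def merge_experts(weights, lora_a, lora_b, alpha, r):
    """Return new per-expert weights W[e] + scale * (B_e @ A_e)."""
    scale = rslora_scale(alpha, r)
    merged = []
    for w_e, a_e, b_e in zip(weights, lora_a, lora_b):
        delta = matmul(b_e, a_e)  # each expert gets its own low-rank delta
        merged.append([[w + scale * d for w, d in zip(w_row, d_row)]
                       for w_row, d_row in zip(w_e, delta)])
    return merged


if __name__ == "__main__":
    # With the card's r=32 and alpha=32 the scale is sqrt(32), roughly 5.66.
    print(round(rslora_scale(32, 32), 3))
```

The point of looping over experts is that each expert's delta is built from its own Aₑ and Bₑ and added only to its own slice of the fused tensor.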

Usage

vLLM

python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."

# The tokenizer is only used to render the Qwen chat template into a prompt string.
tok = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, trust_remote_code=True, dtype="bfloat16",
          max_model_len=2048, max_num_seqs=16, gpu_memory_utilization=0.90)

msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "anyone running the new model locally yet?"}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# Sample one reply; replies are short, so 200 new tokens is a generous cap.
out = llm.generate([prompt], SamplingParams(temperature=0.8, top_p=0.9, max_tokens=200))
print(out[0].outputs[0].text)

max_num_seqs is capped because the hybrid (Gated-DeltaNet) layers reserve Mamba cache blocks; raise it only if you have spare VRAM. Throughput on a single RTX PRO 6000 (Blackwell): ~150 tok/s at concurrency 1, ~350 tok/s aggregate at concurrency 4.
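
To put the throughput figures in per-reply terms, the arithmetic can be written out as below. This is a back-of-the-envelope sketch with hypothetical helper names; it assumes the aggregate rate is shared evenly across concurrent streams and ignores prompt prefill time, so real latencies will differ.

```python
def per_stream_tok_s(aggregate_tok_s: float, concurrency: int) -> float:
    """Decode rate seen by one stream if the aggregate is split evenly."""
    return aggregate_tok_s / concurrency


def reply_latency_s(num_tokens: int, tok_s: float) -> float:
    """Time to decode num_tokens at a steady tok_s (prefill not included)."""
    return num_tokens / tok_s


if __name__ == "__main__":
    solo = 150.0                            # ~tok/s at concurrency 1 (from the card)
    shared = per_stream_tok_s(350.0, 4)     # ~tok/s per stream at concurrency 4
    for label, rate in (("concurrency 1", solo), ("concurrency 4", shared)):
        print(f"{label}: {rate:.1f} tok/s per stream, "
              f"{reply_latency_s(200, rate):.2f} s for a 200-token reply")
```

Under these assumptions each stream is slower at concurrency 4, but the server as a whole produces more than twice as many tokens per second.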

transformers

python
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
tok = AutoTokenizer.from_pretrained(MODEL)
# The checkpoint keeps the base vision tower, so it loads under the image-text-to-text auto class.
model = AutoModelForImageTextToText.from_pretrained(MODEL, dtype=torch.bfloat16, device_map="auto")

msgs = [{"role": "system", "content": "You are a user on a discord server about AI, respond naturally and conversationally."},
        {"role": "user", "content": "is RAG dead now that context windows are huge?"}]
# return_dict=True gives input_ids plus attention_mask whatever the library's default return type is.
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.9)
# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))

Limitations

Trades substance for authenticity: replies are short and casual, not thorough or always factually careful.
Persona and worldview reflect a single AI-focused Discord community; expect its slang, in-jokes, and biases.
Not safety-tuned or instruction-tuned for assistant tasks.

License

Inherits the license of the base model, Qwen3.6-35B-A3B. Built with Unsloth.
