kieraisverybored

devmodeLM-v2

README

aka dihGPT-2, devmodeLM-v2-35B-A3B, DLM-2

A Discord-persona chat model that talks like a regular in a casual AI server — short, conversational, in-character. Fine-tuned from Qwen3.6-35B-A3B (MoE, ~37B total / ~3B active) on Discord reply chains, then merged to a standalone full checkpoint.

Note: trained on text only (no images); the base model's vision path is untouched/untested here, so treat this as a text chat model.

This is the phase-2 (reply-SFT) model. An experimental chain-of-thought (CoT) variant was trained on top but regressed the casual voice toward verbose, assistant-style answers, so the pre-CoT model is shipped here as the better product.

What it does

Given a short conversation, it replies the way a sharp human in an AI Discord would — brief, lowercase-friendly, sometimes terse, on-topic. It is not a helpful-assistant model and deliberately avoids long, structured, "as an AI" responses.

Example outputs:

| Context | Reply |
| --- | --- |
| anyone tried the new qwen model? is it actually any good or just benchmarks | i heard it's benchmaxxed |
| my finetune keeps OOMing at batch 16 / what gpu? / single 4090 | is this for a specific task or just general? |
| is RAG dead now that context windows are huge? | It's dead if you have the hardware to run a 10T model. |
| whats everyone using for local inference these days | llama.cpp / lmstudio |

Chat format

Uses the Qwen chat template. The model was trained to emit an empty reasoning block followed by the reply, so generations look like:

markdown

<think>
</think>
<the reply>
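If the raw generation is passed on to users, the empty reasoning block usually needs to be removed first. The following is a minimal sketch, not part of the model's tooling: `strip_think` is a hypothetical helper, and it assumes the block appears at the start of the text, either as a full `<think>...</think>` pair or as a lone closing tag when the template has already emitted the opening one.

```python
import re

# Matches a leading <think>...</think> block (empty or not) and trailing whitespace.
_THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(generation: str) -> str:
    """Return only the reply, dropping a leading reasoning block if present."""
    text = _THINK_BLOCK.sub("", generation, count=1)
    # Assumption: some templates put <think> in the prompt, so the generation
    # may start with just the closing tag. Drop that tag as well.
    stripped = text.lstrip()
    if stripped.startswith("</think>"):
        text = stripped[len("</think>"):]
    return text.strip()


print(strip_think("<think>\n</think>\ni heard it's benchmaxxed"))
```

Whether the tags show up in the decoded text at all depends on the serving stack and on whether special tokens are skipped during decoding, so treat this as a defensive post-processing step.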

Recommended system prompt:

markdown

You are a user on a discord server about AI, respond naturally and conversationally.
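One way to turn a short reply chain into chat messages with this system prompt is sketched below. `build_messages` is a hypothetical helper, and joining the earlier messages with newlines into a single user turn is an assumption: the card does not document how multi-message context was serialized during training, so the best format may differ.

```python
SYSTEM_PROMPT = (
    "You are a user on a discord server about AI, "
    "respond naturally and conversationally."
)


def build_messages(context_messages):
    """Hypothetical helper: pack prior chat lines into one user turn."""
    if not context_messages:
        raise ValueError("need at least one context message")
    # Assumption: earlier messages are newline-joined into a single user turn.
    user_turn = "\n".join(m.strip() for m in context_messages)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_turn},
    ]


msgs = build_messages(
    ["my finetune keeps OOMing at batch 16", "what gpu?", "single 4090"]
)
print(msgs[1]["content"])
```

The resulting list can be passed to `apply_chat_template` in the same way as the `msgs` lists in the Usage section.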

Training

  • Method: QLoRA (4-bit NF4) SFT, completion-only loss (context masked, loss on the reply).
  • LoRA: r=32, α=32, dropout=0, rsLoRA, on attention (q/k/v/o) and the fused MoE expert tensors (mlp.experts.gate_up_proj, mlp.experts.down_proj).
  • Data: Discord reply chains (reply-to threads) from an AI community server, single channel; usernames excluded from targets.
  • Result: eval loss ≈ 2.15 (perplexity ≈ 8.5).
  • Trained with Unsloth.
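The completion-only loss and the loss-to-perplexity relation listed above can be illustrated with a small sketch. It is not the training code: the token IDs are made up, `build_labels` and `perplexity` are hypothetical helpers, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default


def build_labels(context_ids, reply_ids):
    """Mask the context so only reply tokens contribute to the loss."""
    input_ids = list(context_ids) + list(reply_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(reply_ids)
    return input_ids, labels


def perplexity(mean_loss_nats):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_loss_nats)


input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(labels)                      # [-100, -100, -100, 21, 22]
print(round(perplexity(2.15), 2))  # 8.58
```

An eval loss of 2.15 maps to a perplexity of about 8.58, which is consistent with the rounded figures reported above.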

Merge note: the LoRA targets the fused MoE expert tensors via target_parameters. Neither PEFT's merge_and_unload nor Unsloth's merge applies that fused-expert delta correctly, so this checkpoint was produced with an explicit per-expert merge (W[e] += (α/√r)·Bₑ@Aₑ). The merged weights are verified to reproduce the adapter's behaviour. The (unused) base vision tower is kept so the model loads under the multimodal Qwen3_5MoeForConditionalGeneration class that vLLM expects.
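The per-expert merge formula can be sketched on toy data as follows. This is an illustration of the arithmetic only, not the script used to produce the checkpoint: `merge_experts` and `matmul` are hypothetical helpers, plain Python lists stand in for tensors, and the real fused tensors (gate_up_proj, down_proj) may be stored in a different layout or transposed.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_experts(weights, lora_a, lora_b, alpha, rank):
    """Return W[e] + (alpha / sqrt(rank)) * B[e] @ A[e] for every expert e."""
    scale = alpha / math.sqrt(rank)  # rsLoRA scaling; plain LoRA uses alpha / rank
    merged = []
    for w, a, b in zip(weights, lora_a, lora_b):
        delta = matmul(b, a)  # (out x rank) @ (rank x in) -> (out x in)
        merged.append([
            [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)
        ])
    return merged


# Toy example: 2 experts, 2x2 weights, rank 1.
weights = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
lora_a = [[[1.0, 2.0]], [[0.0, 1.0]]]      # per expert: rank x in
lora_b = [[[1.0], [0.0]], [[0.0], [3.0]]]  # per expert: out x rank
merged = merge_experts(weights, lora_a, lora_b, alpha=2.0, rank=1)
print(merged[0])  # [[3.0, 4.0], [0.0, 1.0]]
print(merged[1])  # [[0.0, 0.0], [0.0, 6.0]]
```

With the training settings r=32 and α=32, the rsLoRA scale is 32/√32 ≈ 5.66, whereas plain LoRA scaling (α/r) would give 1.0, so applying the wrong scale during a merge changes the delta by a large factor.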

Usage

vLLM

python

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."

tok = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(
    model=MODEL, trust_remote_code=True, dtype="bfloat16",
    max_model_len=2048, max_num_seqs=16, gpu_memory_utilization=0.90,
)
msgs = [
    {"role": "system", "content": SYS},
    {"role": "user", "content": "anyone running the new model locally yet?"},
]
# Render the chat template to a string; vLLM tokenizes the prompt itself.
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
out = llm.generate([prompt], SamplingParams(temperature=0.8, top_p=0.9, max_tokens=200))
print(out[0].outputs[0].text)

max_num_seqs is capped at 16 because the hybrid (Gated-DeltaNet) layers reserve Mamba cache blocks; raise it only if you have spare VRAM. Throughput on a single RTX PRO 6000 (Blackwell): ~150 tok/s at concurrency 1, ~350 tok/s aggregate at concurrency 4.
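To put the throughput figures in perspective, the short calculation below derives the per-stream rate and the latency of a full-length reply. It only rearranges the approximate numbers quoted above; the helper names are hypothetical, and the 200-token reply length is taken from the max_tokens setting in the example.

```python
def per_stream_rate(aggregate_tok_s, concurrency):
    """Average tokens per second seen by each concurrent request."""
    return aggregate_tok_s / concurrency


def seconds_for_reply(tokens, tok_s):
    """Time to generate a reply of the given length at a steady rate."""
    return tokens / tok_s


single = 150.0       # ~tok/s at concurrency 1
aggregate_4 = 350.0  # ~tok/s aggregate at concurrency 4

print(per_stream_rate(aggregate_4, 4))            # 87.5
print(round(aggregate_4 / single, 2))             # 2.33
print(round(seconds_for_reply(200, single), 2))   # 1.33
```

So four concurrent streams yield roughly 2.3x the single-stream throughput, while each individual stream slows to about 87.5 tok/s. Since typical replies are far shorter than 200 tokens, per-reply latency should usually be well under the 1.33 s upper bound.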

transformers

python

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."
tok = AutoTokenizer.from_pretrained(MODEL)
# The checkpoint keeps the base vision tower, so it loads through the image-text-to-text class.
model = AutoModelForImageTextToText.from_pretrained(MODEL, dtype=torch.bfloat16, device_map="auto")
msgs = [
    {"role": "system", "content": SYS},
    {"role": "user", "content": "is RAG dead now that context windows are huge?"},
]
# return_dict=True returns input_ids and attention_mask together.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.9)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Limitations

  • Trades substance for authenticity: replies are short and casual, not thorough or always factually careful.
  • Persona and worldview reflect a single AI-focused Discord community; expect that community's slang, in-jokes, and biases.
  • Not safety-tuned or instruction-tuned for assistant tasks.

License

Inherits the license of the base model, Qwen3.6-35B-A3B. Built with Unsloth.

Model tree

Base

unsloth/Qwen3.6-35B-A3B

Fine-tuned

this model

Modalities

Input

Video, Text, Image

Output

Text
