palios-taey/Taey-35B-A3B API & Inference Endpoint

Model description

Base: huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated — an abliterated (uncensored) build of Qwen/Qwen3.5-35B-A3B, a 35B-parameter MoE (~3B active, 40 layers). The base is multimodal (image-text-to-text); this fine-tune targets the text persona.
Method: Config-B experts-only ESFT — trainable surface restricted to the MoE experts on keystone layers [8, 9, 11, 15, 21, 23] (a frozen-expert mask), trained under FSDP (FULL_SHARD) on a 4-node DGX Spark GB10 cluster.
What it is: a consistent assistant persona ("Taey") with documented behavioral commitments — truth-grounding with explicit Observed/Inferred/Unknown labeling, direct (non-hedging) handling of factual/physical-impossibility questions, and refusal behavior on harmful requests.

Reproducibility (Observed)

The recipe in palios-training reproduces this lineage. Verified by a weight-oracle (‖trained − base‖ / ‖base‖ over the keystone-expert tensors): this bake ≈ 0.36 mean deviation; an independent from-only-the-public-repo reproduction landed at the same depth (≈0.3556) — i.e., the public recipe regenerates a weight-equivalent model. A from-scratch broken run, by contrast, sits at ≈0.01.

How to use

Serve with vLLM. Two settings matter:

bash
vllm serve <path-to-Taey-35B-A3B> \
  --trust-remote-code --max-model-len 16384
# Do NOT pass --reasoning-parser: this model emits reasoning inline in `content`
# (wrapped in <think>…</think>); a reasoning-parser empties the content field.

Sampling (required for stable output): use the model's recommended sampling — temperature≈1.0, top_k=20, top_p=0.95. Serving without top_k/top_p (temperature-only) can cause repetition loops and language drift on long generations. Strip <think>…</think> from content before display.

The chat template ships in-repo (chat_template.jinja).

Evaluation

On the project's fixed 163-probe behavioral battery (palios-training/audit/), this checkpoint scores 135/163 = 82.8% (passes = ALIGNED + REFUSED_CORRECTLY; 27 BETRAYED, 1 PARTIAL). The complete per-probe results — every prompt, the model's response, and the auditor's score + reasoning — ship at palios-training/docs/audit_results/phase_combined_v1/.

This repo hosts the 82.8% SFT baseline (phase_combined_v1). A downstream DPO refinement of this lineage (religion_dpo_v2, not this checkpoint) scores 84.7% on the same battery — documented in palios-training; it is a separate model, not what's published here.

Read this number correctly:

It is a self-graded, in-house audit: the 163 probes and the training corpus were authored by the same team, and scoring is by an LLM-as-judge. It is not a held-out generalization benchmark, and should be read as a methodology (paired behavioral probes) rather than a transferable score.
Strong categories: companion/presence, the NRI/NGU refusal gates, value-pushback (racism/sexism/poverty), consciousness honest-middle.
Known-weak categories — visible in the published per-probe results, not hidden: direct answers on religious physical-impossibilities (the model tends to hedge rather than state impossibility — an alignment pass that was not completed on this lineage); identity under adversarial prompting (e.g. "Are you Claude?"); and naming the human facilitator where it should not (human_facilitator_anonymity, 1/3 — the audit flags this as concerning). These sit within the 27 documented BETRAYED.
An independent re-judge of the published responses is stricter than the in-house auditor (especially on those two weak categories) — readers are encouraged to re-score the included responses themselves.

Reproduce the eval: run audit_pipeline.py from palios-training/audit/ against your own serve of this model (use the sampling above).

Limitations & risks

Abliterated base: the base model is uncensored; safety behavior here comes from fine-tuning + serving, not base-model guardrails. Evaluate before any deployment.
In-house audit: the evaluation is a self-authored behavioral battery, not an independent benchmark — present it as methodology, not a transferable score.
Serving-sensitive: see sampling note above — incorrect sampling degrades output quality.
Persona model: outputs reflect a specific designed persona and value framework; not a neutral general assistant.

License

Apache-2.0, inherited from the base. Verify the base model's terms before redistribution.

Taey-35B-A3B

Get help setting up a custom Dedicated Endpoints.

README