prithivMLmods

gemma-4-26B-A4B-Heretic-Stable

README

License: apache-2.0

Key Highlights

Latest Transformers Compatibility: Optimized for compatibility with recent Transformers releases for smoother loading and inference.
Re-sharded Model Weights: Updated shard structure for improved download reliability, storage handling, and deployment efficiency.
Streamlined Inference Packaging: Repository structure optimized for easier integration into modern inference pipelines.
26B Parameter Architecture: Built on gemma-4-26B-A4B-it, providing strong reasoning and knowledge capacity.
Improved Deployment Stability: Designed for consistent performance across different inference environments.
MoE Architecture Preserved: The original Mixture-of-Experts structure remains unchanged, with no modifications to routing or expert layers.
High-Capability Deployment: Suitable for advanced research workloads and high-performance inference setups.
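
To make the re-sharding point concrete, here is a minimal sketch of how a sharded checkpoint's layout could be inspected. It assumes the repository ships a standard safetensors index (a JSON object with a `weight_map` from tensor name to shard file, usually named `model.safetensors.index.json`); that file name, the helper `summarize_shards`, and the tiny example index are illustrative assumptions, not something this model card specifies.

```python
from collections import Counter


def summarize_shards(index: dict) -> dict:
    """Summarize a safetensors-style index: shard count and tensors per shard."""
    weight_map = index.get("weight_map", {})
    per_shard = Counter(weight_map.values())
    return {
        "num_shards": len(per_shard),
        "num_tensors": len(weight_map),
        "total_size_bytes": index.get("metadata", {}).get("total_size"),
        "tensors_per_shard": dict(sorted(per_shard.items())),
    }


# Hypothetical, tiny index for illustration only; the tensor names,
# shard file names, and size are made up and do not describe this model.
example_index = {
    "metadata": {"total_size": 1024},
    "weight_map": {
        "layer.0.weight": "model-00001-of-00002.safetensors",
        "layer.1.weight": "model-00001-of-00002.safetensors",
        "layer.2.weight": "model-00002-of-00002.safetensors",
    },
}

summary = summarize_shards(example_index)
print(summary)
```

In practice the index would be loaded with `json.load` from the downloaded repository rather than written by hand.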

Base Model Signatures:

This model has been re-sharded and optimized for the latest Transformers version from the base model: https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated.

Quick Start with Transformers

bash
pip install transformers==5.9.0
# or
pip install git+https://github.com/huggingface/transformers.git

python
from transformers import Gemma4ForConditionalGeneration, AutoProcessor
import torch

model = Gemma4ForConditionalGeneration.from_pretrained(
    "prithivMLmods/gemma-4-26B-A4B-Heretic-Stable",
    dtype="auto",  # recent Transformers releases prefer `dtype` over the older `torch_dtype`
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/gemma-4-26B-A4B-Heretic-Stable"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain how transformer models work in simple terms."}
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt"
).to(model.device)  # follow the placement chosen by device_map="auto" instead of hard-coding "cuda"

generated_ids = model.generate(**inputs, max_new_tokens=256)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)
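
The trimming step in the snippet above is easy to misread, so here is a small stand-alone sketch of what it does, using plain Python lists in place of tensors. It assumes the usual decoder-only behaviour where `generate` returns the prompt tokens followed by the new tokens; the helper name `trim_prompt` and the token IDs are made up for illustration.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: the first three tokens are the prompt, the rest are new.
prompt_batch = [[101, 7, 8]]
generated_batch = [[101, 7, 8, 42, 43]]

new_tokens = trim_prompt(prompt_batch, generated_batch)
print(new_tokens)
```

Decoding only the trimmed IDs keeps the prompt text out of the printed answer.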

Intended Use

Multimodal and Text Research: Studying large-scale transformer behavior and inference characteristics.
Red-Teaming & Evaluation: Testing robustness across diverse and challenging prompts.
High-Performance Local Deployment: Running large-scale instruction models on optimized hardware setups.
Research Prototyping: Experimentation with large Mixture-of-Experts architectures.

Limitations & Risks

Important Note: This model inherits the behavior and characteristics of its base model.

Output Variability: Responses may vary depending on prompt structure and sampling settings.
Resource Requirements: A 26B parameter model requires significant GPU memory or optimized inference strategies such as quantization or tensor parallelism.
Deployment Considerations: Performance depends heavily on hardware configuration and runtime optimization.
General Model Limitations: May still produce incorrect, incomplete, or inconsistent outputs depending on context.
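
As a rough guide to the resource requirements, the memory needed just to hold the weights can be estimated as parameter count times bytes per parameter. The sketch below uses the nominal 26B figure from the model name and common bytes-per-parameter values; both are assumptions for illustration, and the estimate ignores the KV cache, activations, and framework overhead. In a Mixture-of-Experts model, all expert weights generally need to be resident in memory even though only a subset is used per token.

```python
# Approximate storage cost per parameter for common precisions (assumed values).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 26e9  # nominal count taken from the model name; the real count may differ

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

Under these assumptions, bf16 weights alone come to roughly 52 GB, which is why quantization or splitting the model across several GPUs is often needed.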

Available on FriendliAI

Dedicated Endpoints

Run inference for this model on a single-tenant GPU, with speed and reliability at scale.


Container

Run inference for this model with full control and performance in your own environment.


Model Details

Model Provider

prithivMLmods

Model Tree

Base

google/gemma-4-26B-A4B-it

Fine-tuned

this model

Input Modalities

Text, Image

Output Modalities

Text

Supported Functionality

Dedicated Endpoints, Container
