Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: otherBase model
| Item | Value |
|---|---|
| Base model | JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic |
| Architecture family | Qwen3 |
| Parameter count | 4B |
| Format | Hugging Face Transformers / safetensors |
| Tensor type | F16 |
| Fine-tuning method | QLoRA / LoRA |
| Final state | Merged model |
Training datasets
| Dataset | Samples used | Notes |
|---|---|---|
iamtarun/python_code_instructions_18k_alpaca | 5,000 | Python instruction/code examples |
m-a-p/CodeFeedback-Filtered-Instruction | 5,000 | Code instruction and feedback examples |
A SWE-smith trajectory experiment was tested separately, but it was not used in this final merged version.
LoRA configuration
| Parameter | Value |
|---|---|
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Sequence length | 2048 |
| Epochs per stage | 1 |
| Quantized loading | 4-bit NF4 |
| Trainable parameters | ~33M |
| Trainable percentage | ~0.81% |
Target modules:
q_projk_projv_projo_projgate_projup_projdown_proj
Training stages
| Stage | Input adapter | Dataset | Output adapter |
|---|---|---|---|
| 1 | Base model | Python instructions 5k | heretic_F_lora_python_5000 |
| 2 | heretic_F_lora_python_5000 | CodeFeedback 5k | heretic_F_lora_python5000_codefeedback5000 |
| Final | Base model + final adapter | Merge | Full safetensors model |
Training environment
| Component | Version |
|---|---|
| Python | 3.11 |
| PyTorch | 2.11.0+cu128 |
| CUDA | 12.8 |
| Transformers | 5.10.2 |
| Datasets | 5.0.0 |
| Accelerate | 1.13.0 |
| PEFT | 0.19.1 |
| bitsandbytes | 0.49.2 |
| sentencepiece | 0.2.1 |
| tiktoken | 0.13.0 |
| protobuf | 7.35.0 |
| pandas | 3.0.3 |
| pyarrow | 24.0.0 |
Training GPU:
- NVIDIA GeForce RTX 3080 Ti 12 GB
Intended use
This model is intended for local experimentation with:
- Python code generation
- code explanation
- simple debugging
- instruction-following tests
- downstream conversion to GGUF, AWQ, GPTQ, or OpenVINO formats
Notes
This is an experimental model. It may produce incorrect code, unsafe suggestions, or hallucinated explanations. Outputs should be reviewed before use in production or security-sensitive environments.
Hardware compatibility estimate
This table is an approximate guide for the current merged F16 safetensors version.
| Hardware / VRAM | Status | Notes |
|---|---|---|
| 6 GB VRAM | 🔴 Unlikely | F16 weights are too large without heavy offload or quantization. |
| 8 GB VRAM | 🔴 Very tight | May fail or require CPU offload. Use GGUF/AWQ/INT4 instead. |
| 10 GB VRAM | 🟡 Possible | May run with low context and careful memory settings. |
| 12 GB VRAM | 🟢 Likely | Tested training/inference workflow on RTX 3080 Ti 12 GB with 4-bit loading. |
| 16 GB VRAM | 🟢 Good | Comfortable for normal local inference. |
| 24 GB VRAM | 🟢 Very good | Recommended for larger context, conversion, quantization, and experiments. |
| 32 GB+ RAM CPU-only | 🟡 Possible | Slow. Better with GGUF quantized versions. |
Quantized versions
Planned/recommended export formats:
| Format | Status | Expected use |
|---|---|---|
| F16 safetensors | 🟢 Current | Full merged model, best source for conversion. |
| AWQ 4-bit | 🟡 Planned | Better for GPU/server inference, mainly CUDA/Linux or compatible runtimes. |
| OpenVINO INT4 / AWQ-style compression | 🟢 Planned for Intel Arc | Recommended path for Intel Arc/OpenVINO. |
| GGUF Q5_K_M / Q6_K / Q8_0 | 🟡 Planned | Recommended for LM Studio, llama.cpp, Ollama, CPU/GPU mixed inference. |
Practical recommendation
For this repository, use the current F16 safetensors model as the master model.
For actual local use:
- RTX 3080 Ti 12 GB or better: F16 may work, but quantized versions are preferred.
- RTX 3090 24 GB: F16 and quantization workflows are much more comfortable.
- Intel Arc: convert this model to OpenVINO INT4 instead of using CUDA-focused AWQ.
- Low VRAM systems: wait for GGUF or INT4 builds.
Model provider
JoaoZaokk
Model tree
Base
JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic
Fine-tuned
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information