License: apache-2.0
Why NF4 on Blackwell?
Generation speed. NF4 on Blackwell's GB10 generates at 39.8 tokens/sec, nearly 2x faster than bf16 on the same hardware and 3.2x faster than Q4_K_M GGUF on a 4-core CPU.
The speed advantage comes from Blackwell's native low-precision compute units combined with bitsandbytes' NF4 kernel fusion. Quantization reduces memory bandwidth pressure, which is the primary bottleneck during autoregressive decoding.
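As a rough illustration of why 4-bit weights ease bandwidth pressure, the sketch below estimates weight storage for a nominal 4-billion-parameter model. The parameter count, the 64-weight block size, and the one-fp32-scale-per-block overhead are assumptions chosen for illustration, not figures taken from this model card. Measured VRAM will differ because it also includes activations, the KV cache, and any layers kept in higher precision.

```python
# Back-of-the-envelope weight-storage estimate.
# All constants are illustrative assumptions, not measurements of this model.
N_PARAMS = 4e9      # nominal "4B" parameter count (assumed)
BLOCK_SIZE = 64     # assumed number of weights sharing one quantization scale
SCALE_BYTES = 4     # assumed one fp32 scale per block


def bf16_bytes(n_params: float) -> float:
    """bf16 stores each weight in 2 bytes."""
    return n_params * 2


def nf4_bytes(n_params: float,
              block_size: int = BLOCK_SIZE,
              scale_bytes: int = SCALE_BYTES) -> float:
    """4-bit weights (0.5 byte each) plus one scale per block."""
    packed = n_params * 0.5
    scales = (n_params / block_size) * scale_bytes
    return packed + scales


bf16_gb = bf16_bytes(N_PARAMS) / 1e9
nf4_gb = nf4_bytes(N_PARAMS) / 1e9
print(f"bf16 weights: {bf16_gb:.2f} GB")
print(f"NF4 weights:  {nf4_gb:.2f} GB")
print(f"Bytes read per decoded token shrink by ~{bf16_gb / nf4_gb:.1f}x")
```

Because each decoded token requires reading the weights once, fewer bytes per weight translates fairly directly into less memory traffic per token under these assumptions.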
Benchmark
All configurations tested on the same 5 medical prompts, generating 200 tokens each at temperature 0.3.
| Configuration | Hardware | Generation Speed | VRAM / RAM |
|---|---|---|---|
| NF4 (this model) | DGX Spark GB10 | 39.8 tok/s | 3.5 GB |
| bf16 (full precision) | DGX Spark GB10 | 20.5 tok/s | 8.6 GB |
| Q4_K_M GGUF | Azure 4-core EPYC | 12.3 tok/s | ~4 GB |
- NF4 vs bf16 (same GPU): 1.94x faster generation
- NF4 vs GGUF (GPU vs CPU): 3.24x faster generation
- All three configurations produced comparable medical response quality on these prompts; no degradation was observed
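The speedup ratios above follow directly from the reported throughput numbers. A minimal check, using the 39.8 tok/s NF4 figure quoted in the introduction (the variable and function names are just for illustration):

```python
# Throughput figures (tokens/sec) as reported in this card.
nf4_gpu = 39.8    # NF4 on DGX Spark GB10
bf16_gpu = 20.5   # bf16 on DGX Spark GB10
gguf_cpu = 12.3   # Q4_K_M GGUF on 4-core EPYC


def speedup(fast: float, slow: float) -> float:
    """Ratio of two throughputs; values above 1 mean `fast` is faster."""
    return fast / slow


nf4_vs_bf16 = speedup(nf4_gpu, bf16_gpu)
nf4_vs_gguf = speedup(nf4_gpu, gguf_cpu)
print(f"NF4 vs bf16: {nf4_vs_bf16:.2f}x")
print(f"NF4 vs GGUF: {nf4_vs_gguf:.2f}x")
```

Note that the second ratio compares a GPU against a CPU, so it reflects the hardware difference as much as the quantization format.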
Quick Start
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CraneAILabs/medgemma-1.5-4b-it-nf4-blackwell"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What are the symptoms of malaria?"}]

# return_dict=True gives input_ids and attention_mask regardless of library defaults
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```
Example Output
Prompt: "What are the symptoms of malaria?"
The symptoms of malaria can vary depending on the type of malaria parasite, the severity of the infection, and the individual's immune status. Common symptoms include:
Early Symptoms: Fever, chills, headache, muscle aches, fatigue, nausea, vomiting...
Response quality is indistinguishable from the full-precision bf16 model.
Methodology
- Warmup: 1 short generation discarded before timing
- Prompts: 3–5 medical questions (malaria, diarrhea treatment, diabetes, preeclampsia, ORT)
- Generation config: `max_new_tokens=200`, `temperature=0.3`, `do_sample=True`
- Timing: `torch.cuda.synchronize()` before and after generation; wall-clock for CPU
- Throughput: tokens generated ÷ wall-clock seconds
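The steps above can be sketched as a small timing harness. This is not the benchmark script used for the table; `measure_throughput`, `fake_generate`, and the `sync` parameter are hypothetical names introduced here. On a GPU, one would pass `torch.cuda.synchronize` as `sync`, and `generate` would wrap `model.generate` and return the number of new tokens.

```python
import time
from typing import Callable, Optional


def measure_throughput(generate: Callable[[], int],
                       warmup_runs: int = 1,
                       sync: Optional[Callable[[], None]] = None) -> float:
    """Time one call to generate() and return tokens per second.

    generate() must return the number of newly generated tokens.
    sync, if given, is called before and after the timed run
    (e.g. torch.cuda.synchronize on a CUDA device).
    """
    # Warmup generations are run and discarded before timing.
    for _ in range(warmup_runs):
        generate()

    if sync is not None:
        sync()
    start = time.perf_counter()
    n_tokens = generate()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start

    # Throughput: tokens generated divided by wall-clock seconds.
    return n_tokens / elapsed


# Stand-in workload so the sketch runs without a model or GPU.
def fake_generate() -> int:
    time.sleep(0.05)
    return 200


tok_per_s = measure_throughput(fake_generate)
print(f"{tok_per_s:.1f} tok/s")
```

Synchronizing before starting and after stopping the clock matters on GPUs because kernel launches are asynchronous; without it, the timer could stop before generation has actually finished.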
Hardware Details
DGX Spark (GPU benchmarks)
- NVIDIA GB10, compute capability 12.1
- 128 GB unified memory
- CUDA 13.0, PyTorch 2.11, Transformers 5.6
Azure D4as_v5 (CPU baseline)
- AMD EPYC 7763, 4 vCPUs
- 16 GB RAM
- llama.cpp (Q4_K_M GGUF) via llama-server API
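For the CPU baseline, a request to llama-server might look roughly like the sketch below. It assumes the server's OpenAI-compatible chat endpoint at a hypothetical local address and assumes the response carries a `usage.completion_tokens` field; the exact endpoint, port, and fields used for the published numbers are not stated in this card. `build_request` and `tokens_per_second` are illustrative helper names.

```python
import json
import urllib.request

# Hypothetical address; adjust to wherever llama-server is actually listening.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt: str,
                  max_tokens: int = 200,
                  temperature: float = 0.3,
                  url: str = SERVER_URL) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response_body: dict, elapsed_s: float) -> float:
    """Throughput from an OpenAI-style response (assumed 'usage' field)."""
    return response_body["usage"]["completion_tokens"] / elapsed_s


req = build_request("What are the symptoms of malaria?")

# Sending requires a running server, so it is left commented out:
# import time
# start = time.perf_counter()
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
# print(tokens_per_second(body, time.perf_counter() - start))
```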
About
Built by Crane AI Labs.
- Model provider: CraneAILabs
- Base model: google/medgemma-1.5-4b-it
- Quantized: this model
- Input modality: Text
- Output modality: Text