sscollab2/gemma3_checkpoint_step100

Files

adapter_model.safetensors: LoRA adapter weights
adapter_config.json: PEFT adapter configuration

Serving with vLLM

This adapter can be served with vLLM by loading the Gemma 3 base model and enabling the LoRA module from this repository.

```bash
PORT=8071
GPU=0
MODEL_ID=google/gemma-3-4b-it              # base model loaded by vLLM
SERVED_MODEL_NAME=gemma3_with_reasoning    # name the LoRA adapter is served under
ADAPTER_REPO=sscollab2/gemma3_checkpoint_step100

# The base model is exposed as "gemma3_base"; the adapter as "$SERVED_MODEL_NAME".
CUDA_VISIBLE_DEVICES="$GPU" vllm serve "$MODEL_ID" \
  --host 0.0.0.0 \
  --port "$PORT" \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --served-model-name gemma3_base \
  --enable-lora \
  --lora-modules "${SERVED_MODEL_NAME}=${ADAPTER_REPO}" \
  --max-lora-rank 16 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --limit-mm-per-prompt '{"image":10,"audio":0}'
```
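
The serve command can take a while to load the weights before it accepts requests. A minimal sketch of a readiness check follows, assuming the server exposes vLLM's `/health` route on the configured port. The helper name `wait_for_server`, the 300-second default timeout, and the 5-second poll interval are illustrative choices, not part of the original script.

```bash
# Hypothetical helper: poll a URL until it answers with HTTP 2xx or the timeout expires.
# $1 = URL to poll, $2 = timeout in seconds (default 300; an assumption, tune as needed).
wait_for_server() {
  local url="$1" timeout="${2:-300}" waited=0
  until curl -sf --max-time 5 "$url" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1            # gave up: the server never reported healthy
    fi
    sleep 5
    waited=$((waited + 5))
  done
  return 0
}

# Usage (assumes PORT=8071 as configured for the server):
#   wait_for_server "http://127.0.0.1:8071/health" && curl -s http://127.0.0.1:8071/v1/models
```

If the adapter loaded correctly, the `/v1/models` listing should include both `gemma3_base` and `gemma3_with_reasoning`.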

Once the server is ready, send requests to the LoRA model name (gemma3_with_reasoning). Requests to gemma3_base go to the base model without the adapter.

```bash
curl http://127.0.0.1:8071/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3_with_reasoning",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
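
Because the server exposes both the base model and the adapter, the same prompt can be sent to each name to compare their answers. The sketch below assumes the port and model names from the serve command; `chat_payload`, `ask`, and `BASE_URL` are hypothetical names introduced here for illustration.

```bash
# Hypothetical helper: build a minimal chat-completions payload.
# $1 = model name, $2 = user message (plain text only: this sketch does no JSON
# escaping, so avoid double quotes and backslashes in the message).
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' "$1" "$2"
}

# Hypothetical helper: send one prompt to one served model name.
# BASE_URL is an assumption; it defaults to the port used in the serve command.
ask() {
  curl -s "${BASE_URL:-http://127.0.0.1:8071}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Usage:
#   ask gemma3_base "Hello!"              # base model, no adapter
#   ask gemma3_with_reasoning "Hello!"    # base model with the LoRA adapter applied
```

For messages with quotes or newlines, a JSON-aware tool is a safer way to build the payload than `printf`.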

For the local serving script this was based on, see:

```bash
/local3/elaine1wan/SS_inference/SS_inference_0507/gemma3_scripts/run_serve_gemma3_checkpoint.sh
```
