LLM-OS-Models

gemma-4-E4B-it-Terminal-SFT-2Epoch-DDP-4GPU

Model summary

Base model: google/gemma-4-E4B-it
Training setup: 2 epochs, DDP fine-tuning
Model card snapshot: 2026-06-03 22:09:28 UTC
Corrected TB2-lite evaluated results currently indexed: 60
Corrected TB2-lite score: pending / not matched in current result directory

Quickstart

Install and log in:

bash
pip install -U vllm transformers huggingface_hub
huggingface-cli login

Related code:

GitHub: https://github.com/LLM-OS-Models/Terminal
vLLM evaluation runner: tb2_lite/scripts/replay_eval.py
Chat template and fallback prompt construction: tb2_lite/scripts/prompt_builder.py
JSON/command scoring: tb2_lite/scripts/replay_metrics.py

Direct vLLM usage example. As in the evaluation code, the tokenizer's chat template is used first; if no template is available, a ChatML or Gemma fallback is used.

python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "LLM-OS-Models/gemma-4-E4B-it-Terminal-SFT-2Epoch-DDP-4GPU"
tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
llm = LLM(
    model=model_id,
    tokenizer=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=tp,
    max_model_len=49152,
    gpu_memory_utilization=0.92,
)

messages = [
    {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
    {"role": "user", "content": "Inspect the current directory and list Python files."},
]

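def render_chatml(messages):
    # ChatML fallback for tokenizers that ship without a usable chat template.
    parts = []
    for message in messages:
        role = message["role"]
        # Tool output is replayed as a user turn.
        if role == "tool":
            role = "user"
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
    # Leave the assistant turn open so generation continues from it.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

def render_gemma4_turn(messages, empty_thought_channel=False):
    # Hand-built Gemma 4 turn format, used only when the chat template fails.
    parts = ["<bos>"]
    for message in messages:
        # Gemma names the assistant role "model"; tool output becomes a user turn.
        role = "model" if message["role"] == "assistant" else message["role"]
        if role == "tool":
            role = "user"
        parts.append(f"<|turn>{role}\n{message['content'].strip()}<turn|>\n")
    parts.append("<|turn>model\n")
    # Optionally pre-fill an empty thought channel so the answer starts immediately.
    if empty_thought_channel:
        parts.append("<|channel>thought\n<channel|>")
    return "".join(parts)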

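def render_prompt(model_id, tokenizer, messages):
    model_key = model_id.lower()
    if "gemma-4" in model_key:
        try:
            # Prefer the tokenizer's own chat template, with thinking disabled for JSON output.
            return tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
                enable_thinking=False,
            )
        except Exception:
            # Template missing or failed: fall back to the hand-built Gemma 4 turns.
            # Only the 26B/31B variants get the empty thought channel.
            return render_gemma4_turn(
                messages,
                empty_thought_channel=("26b" in model_key or "31b" in model_key),
            )
    try:
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    except Exception:
        # No usable chat template: fall back to ChatML.
        return render_chatml(messages)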

prompt = render_prompt(model_id, tokenizer, messages)
sampling = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=1024,
    repetition_penalty=1.0,
)
outputs = llm.generate([prompt], sampling_params=sampling)
print(outputs[0].outputs[0].text)

Recommended output format:

json
{
  "analysis": "brief reasoning about the next terminal action",
  "plan": "short execution plan",
  "commands": [
    {"keystrokes": "ls -la\n", "duration": 0.1}
  ],
  "task_complete": false
}
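A minimal sketch of how the generated text could be checked against the recommended format before anything acts on it. The `parse_action` helper and the `REQUIRED_KEYS` set are hypothetical names introduced here for illustration, not part of the evaluation code. The field types are assumed from the example above (string `keystrokes`, numeric `duration`, boolean `task_complete`), and the brace search assumes the output may carry stray text around the JSON object.

```python
import json

# Assumption: these are the four top-level keys shown in the recommended format.
REQUIRED_KEYS = {"analysis", "plan", "commands", "task_complete"}


def parse_action(text):
    """Return the parsed action dict, or raise ValueError if the shape is wrong."""
    # Assumption: tolerate stray text by taking the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    try:
        action = json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc

    missing = REQUIRED_KEYS - action.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(action["commands"], list):
        raise ValueError("'commands' must be a list")
    for command in action["commands"]:
        if not isinstance(command, dict):
            raise ValueError("each command must be an object")
        if not isinstance(command.get("keystrokes"), str):
            raise ValueError("each command needs a string 'keystrokes'")
        duration = command.get("duration")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            raise ValueError("each command needs a numeric 'duration'")
    if not isinstance(action["task_complete"], bool):
        raise ValueError("'task_complete' must be a boolean")
    return action
```

With the quickstart above, this would be called as `parse_action(outputs[0].outputs[0].text)`; a `ValueError` signals output that should be discarded or retried rather than executed.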

Replay command identical to the one used for evaluation:

bash
python tb2_lite/scripts/replay_eval.py \
  --model LLM-OS-Models/gemma-4-E4B-it-Terminal-SFT-2Epoch-DDP-4GPU \
  --model-short LLM-OS-Models__gemma-4-E4B-it-Terminal-SFT-2Epoch-DDP-4GPU \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
  --dtype bfloat16 \
  --tp 1 \
  --max-model-len 49152 \
  --max-tokens 1024 \
  --temperature 0.0 \
  --top-p 1.0 \
  --gpu-memory-utilization 0.92 \
  --thinking-mode off \
  --strip-thinking-history auto \
  --gemma4-empty-thought-channel auto \
  --language-model-only

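Recommended default tensor parallel size: 1. If you run out of memory, raise --tp (CLI) and tensor_parallel_size (Python) to 2, 4, or 8.
The corrected TB2-lite evaluation fixes temperature=0.0, top_p=1.0, and max_tokens=1024.
Gemma 4 uses enable_thinking=False to get JSON output; for the 26B/31B variants, the evaluation code applies the empty thought channel handling automatically.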
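The tensor parallel guidance above can be sketched as a small helper that picks the next size to try after an out-of-memory failure. `TP_STEPS` and `next_tp` are hypothetical names, and the GPU count is assumed to be supplied by the caller (for example from `torch.cuda.device_count()`); the evaluation scripts may handle this differently.

```python
# Tensor parallel sizes named in the guidance above.
TP_STEPS = (1, 2, 4, 8)


def next_tp(current, gpu_count):
    """Return the next larger tensor parallel size that fits, or None if none does."""
    for tp in TP_STEPS:
        if current < tp <= gpu_count:
            return tp
    return None
```

On a 4-GPU machine this steps from 1 to 2 to 4 and then returns `None`, at which point lowering `max_model_len` is another option to consider.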

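Evaluation status

Current corrected TB2-lite score: pending
Reason: the aggregated results in /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm do not currently contain an entry that directly matches this HF repo name.
Next step: run the evaluation through the same tb2_lite/scripts/replay_eval.py path, then automatically replace this section with a score card.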

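Model family notes

For the Gemma family, native Gemma/Liquid preprocessing and chat template handling matter. This card will be replaced with a score card once the corrected evaluation finishes.
The TB2-lite score is not a general intelligence benchmark; it measures how well the model reproduces terminal next-action JSON.
Generated commands must pass through safeguards such as a sandbox, an allowlist, and human review before they are actually executed.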
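As one illustration of the allowlist safeguard mentioned above, the sketch below rejects any generated `keystrokes` string whose program is not explicitly permitted or that contains shell metacharacters. `ALLOWED_PROGRAMS`, `SHELL_METACHARS`, and `is_allowed` are hypothetical names, and the chosen program set is only an example; this is a sketch under those assumptions, not a complete security boundary.

```python
import shlex

# Hypothetical allowlist of read-only programs; adjust to your own policy.
ALLOWED_PROGRAMS = {"ls", "cat", "pwd", "head", "tail", "grep"}
# Characters that allow chaining, substitution, or redirection in a shell.
SHELL_METACHARS = set(";|&$`><(){}\n")


def is_allowed(keystrokes):
    """Return True only for a single, simple command whose program is allowlisted."""
    command = keystrokes.strip()  # drops the trailing newline that submits the command
    if not command:
        return False
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are treated as unsafe.
        return False
    return bool(tokens) and tokens[0] in ALLOWED_PROGRAMS
```

A check like this only filters by program name, so it is best treated as one layer alongside the sandbox and human review rather than a replacement for them.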

Model Details

Model Provider

LLM-OS-Models

Model Tree

Base

google/gemma-4-E4B-it

Fine-tuned

this model

Input Modalities

Text, Image

Output Modalities

Text
