KETI-AIR

keti-llama-7b-v0.1

License: other

Model Details

Architecture: LlamaForCausalLM
Parameters: 8B-class
Context length in config: 131,072 tokens
Hidden size: 4096
Layers: 32
Attention heads: 32
KV heads: 8
Vocabulary size: 128,256
Recommended dtype: bfloat16
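
The figures above are enough for a rough estimate of key-value cache memory at long context lengths. The sketch below is illustrative only: it assumes the head dimension equals hidden size divided by attention heads (the usual Llama layout, not stated in this card), bfloat16 cache entries, and no cache quantization or paging. `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
# Values copied from the Model Details list.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8
CONTEXT_LENGTH = 131_072

# Assumption: head_dim = hidden_size / attention heads (typical for Llama models).
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS
# Assumption: cache stored in bfloat16, i.e. 2 bytes per value.
BYTES_PER_VALUE = 2


def kv_cache_bytes(num_tokens: int, batch_size: int = 1) -> int:
    """Estimate KV cache size: one key and one value vector per layer per KV head."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens * batch_size


if __name__ == "__main__":
    print(kv_cache_bytes(1) / 1024, "KiB per token")
    print(kv_cache_bytes(CONTEXT_LENGTH) / 1024**3, "GiB at the full configured context")
```

Under these assumptions the cache alone grows linearly with sequence length, so a single sequence at the full configured context needs considerably more memory than the weights' share would suggest.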

Evaluation

Evaluation timestamp: 20260604_202553

Category	Dataset	Version	Metric	Mode	Score
Core	core_average	-	naive_average	gen	27.77
Instruction Following	IFEval	353ae7	Prompt-level-strict-accuracy	gen	50.65
Math Calculation	aime2024	bc6078	accuracy	gen	16.67
Math Calculation	aime2025	5e9f4f	accuracy	gen	3.33
Math Calculation	math_prm800k_500	11c4b5	accuracy	gen	60.20
General Reasoning	bbh	-	naive_average	gen	11.87
General Reasoning	GPQA_diamond	5aeece	accuracy	gen	20.71
Knowledge	mmlu_pro	-	naive_average	gen	28.26
Code	openai_humaneval	dcae0e	humaneval_pass@1	gen	60.98
Code	lcb_code_generation	b5b6c5	pass@1	gen	6.00
Long Context Reasoning	leval	-	naive_average	gen	39.37
Long Context Reasoning	longbench	-	naive_average	gen	20.57
Long Context Reasoning	LongBenchv2	75fbba	accuracy	gen	24.85
Long Context Reasoning	keti_long_ctx_gutenberg	-	naive_average	gen	17.62
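
The card does not say how `core_average` is computed. As a sanity check, the sketch below takes the unweighted mean of the 13 dataset rows listed above. That the result lands within rounding of the reported 27.77 is an observation from the numbers, not a documented definition.

```python
from statistics import mean

# Scores copied from the evaluation table (every row except core_average).
scores = {
    "IFEval": 50.65,
    "aime2024": 16.67,
    "aime2025": 3.33,
    "math_prm800k_500": 60.20,
    "bbh": 11.87,
    "GPQA_diamond": 20.71,
    "mmlu_pro": 28.26,
    "openai_humaneval": 60.98,
    "lcb_code_generation": 6.00,
    "leval": 39.37,
    "longbench": 20.57,
    "LongBenchv2": 24.85,
    "keti_long_ctx_gutenberg": 17.62,
}

# Unweighted ("naive") average across all datasets.
naive_average = mean(scores.values())

if __name__ == "__main__":
    print(f"naive average over {len(scores)} datasets: {naive_average:.2f}")
```

The small difference from the reported value is plausibly due to the table scores themselves being rounded to two decimals.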

Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KETI-AIR/keti-llama-7b-v0.1"

# Load the tokenizer and the model weights in bfloat16 (the recommended dtype).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # place layers on the available devices automatically
    trust_remote_code=True,
)

# Format the conversation with the model's chat template and tokenize it.
messages = [
    {"role": "user", "content": "Explain why long-context reasoning is useful."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sample up to 512 new tokens.
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
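
For long prompts it can help to check the token budget before calling `generate`. The helpers below are a minimal sketch, assuming the 131,072-token context length from Model Details is the effective limit; `fits_in_context` and `max_prompt_tokens` are hypothetical names, not part of `transformers`.

```python
# "Context length in config" from Model Details; assumed to be the usable limit.
CONTEXT_LENGTH = 131_072


def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context_length: int = CONTEXT_LENGTH) -> bool:
    """Return True if the prompt plus the requested generation fits the window."""
    return prompt_tokens + max_new_tokens <= context_length


def max_prompt_tokens(max_new_tokens: int,
                      context_length: int = CONTEXT_LENGTH) -> int:
    """Largest prompt length that still leaves room for max_new_tokens."""
    return max(context_length - max_new_tokens, 0)


if __name__ == "__main__":
    print(fits_in_context(prompt_tokens=1_000, max_new_tokens=512))
    print(max_prompt_tokens(max_new_tokens=512))
```

With the Quick Start code, the prompt length would be `inputs.shape[-1]`, so `fits_in_context(inputs.shape[-1], 512)` could be checked before generation.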

Intended Use

This model is intended for research and development on instruction following, code generation, mathematical reasoning, and long-context generation tasks.

Limitations

The model can generate incorrect, unsafe, or biased content. Users should evaluate the model for their own deployment setting and apply appropriate safety filters and human review where needed.