nvidia/Riva-Translate-4B-Instruct-v1.1 - Fast, Reliable, and Scalable Inference on FriendliAI

Model Overview

We’re excited to share our latest work on the next version of Riva-Translate-4B-Instruct! The new release outperforms the initial version across multiple benchmarks, including FLORES, NTREX, and WMT24, and demonstrates comparable performance to EuroLLM-9B-Instruct.

The model supports translation across 12 languages and dialects spanning Chinese, Spanish, and Portuguese varieties. Specifically, it covers: English (en), German (de), European Spanish (es-ES), Latin American Spanish (es-US), French (fr), Brazilian Portuguese (pt-BR), Russian (ru), Simplified Chinese (zh-CN), Traditional Chinese (zh-TW), Japanese (ja), Korean (ko), and Arabic (ar).

Built on a decoder-only Transformer architecture, this model is a fine-tuned version of a 4B base model that was pruned and distilled from nvidia/Mistral-NeMo-Minitron-8B-Base using NVIDIA’s LLM compression techniques. Training followed a multi-stage pipeline consisting of Continued Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Reward-aware Preference Optimization (RPO). The model uses tiktoken as its tokenizer and supports a context length of 8K tokens.

Model Developer: NVIDIA

Model Dates: Riva-Translate-4B-Instruct-v1.1 was trained between June 2025 and August 2025.

License

GOVERNING TERMS: The NIM container is governed by the NVIDIA Software License Agreement and Product-Specific Terms for AI Products. Use of this model is governed by the NVIDIA Community Model License. ADDITIONAL INFORMATION: Apache 2.0.

Quick Start Guide

How to Choose the Language Pair

To select a language pair for translation, include one of the following tags in the system prompt:

en-zh-cn or en-zh: English to Simplified Chinese
en-zh-tw: English to Traditional Chinese
en-ar: English to Arabic
en-de: English to German
en-es or en-es-es: English to European Spanish
en-es-us: English to Latin American Spanish
en-fr: English to French
en-ja: English to Japanese
en-ko: English to Korean
en-ru: English to Russian
en-pt or en-pt-br: English to Brazilian Portuguese
zh-en or zh-cn-en: Simplified Chinese to English
zh-tw-en: Traditional Chinese to English
ar-en: Arabic to English
de-en: German to English
es-en or es-es-en: European Spanish to English
es-us-en: Latin American Spanish to English
fr-en: French to English
ja-en: Japanese to English
ko-en: Korean to English
ru-en: Russian to English
pt-en or pt-br-en: Brazilian Portuguese to English
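
The tags above follow a regular pattern, so a small helper can assemble and validate a tag before it is placed in the system prompt. The following is a minimal sketch: `SUPPORTED_TAGS` and `make_tag` are hypothetical names introduced here for illustration, not part of the model's API, and the tag set simply transcribes the list above.

```python
# Hypothetical helper; the names below are illustrative, not part of any library.
# Non-English codes transcribed from the tag list above.
_NON_ENGLISH = [
    "zh", "zh-cn", "zh-tw", "ar", "de", "es", "es-es", "es-us",
    "fr", "ja", "ko", "ru", "pt", "pt-br",
]

# Every supported tag pairs English with one of the codes, in either direction.
SUPPORTED_TAGS = {f"en-{code}" for code in _NON_ENGLISH} | {
    f"{code}-en" for code in _NON_ENGLISH
}


def make_tag(source: str, target: str) -> str:
    """Return the system-prompt tag for a source/target pair, or raise ValueError."""
    tag = f"{source.strip().lower()}-{target.strip().lower()}"
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"Unsupported language pair: {tag!r}")
    return tag
```

For example, `make_tag("en", "zh-TW")` yields `en-zh-tw`, while a pair that does not involve English, such as `make_tag("de", "fr")`, raises `ValueError` instead of silently falling back to a generic prompt.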

Use it with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Riva-Translate-4B-Instruct-v1.1")
model = AutoModelForCausalLM.from_pretrained("nvidia/Riva-Translate-4B-Instruct-v1.1").cuda()

# Use the prompt template (along with chat template)
messages = [
    {
        "role": "system",
        "content": "en-zh",
    },
    {"role": "user", "content": "The GRACE mission is a collaboration between NASA and the German Aerospace Center."},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(tokenized_chat,  max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
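
The decoded string printed above contains the full prompt as well as the model's reply. The sketch below shows one way to keep only the reply, assuming the decoded text renders the `<s>Assistant` and `</s>` markers literally, as they appear in the chat template. `extract_translation` is a hypothetical helper name, and the sample string uses a placeholder rather than real model output.

```python
def extract_translation(decoded: str) -> str:
    """Return the text after the last assistant marker, up to the closing </s>."""
    marker = "<s>Assistant\n"
    # rpartition splits on the LAST occurrence, so earlier turns are ignored.
    _, found, reply = decoded.rpartition(marker)
    if not found:
        raise ValueError("No assistant marker found in decoded output")
    # Drop the end-of-turn token and anything after it.
    return reply.split("</s>", 1)[0].strip()


# Placeholder text only; this is not real model output.
sample = (
    "<s>System\nYou are a translation expert.</s>\n"
    "<s>User\nHello</s>\n"
    "<s>Assistant\nTRANSLATED TEXT</s>"
)
print(extract_translation(sample))
```

An alternative, assuming `tokenized_chat` is a tensor as in the example above, is to slice off the prompt tokens before decoding and pass `skip_special_tokens=True` to `tokenizer.decode`.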

Use it with vLLM

To install vLLM, run the following pip command in a terminal within a supported environment.

pip install -U "vllm>=0.12.0"

Launch a vLLM server using the Python command below. In this example, we use a context length of 8K tokens, the maximum supported by the model.

python3 -m vllm.entrypoints.openai.api_server \
     --model nvidia/Riva-Translate-4B-Instruct-v1.1 \
     --dtype bfloat16 \
     --gpu-memory-utilization 0.95 \
     --max-model-len 8192 \
     --host 0.0.0.0 \
     --port 8000 \
     --tensor-parallel-size 1 \
     --served-model-name Riva-Translate-4B-Instruct-v1.1

Alternatively, you can use Docker to launch a vLLM server.

docker run --runtime nvidia --gpus all \
           -v ~/.cache/huggingface:/root/.cache/huggingface \
           -p 8000:8000 \
           --ipc=host \
           vllm/vllm-openai:v0.12.0 \
           --model nvidia/Riva-Translate-4B-Instruct-v1.1 \
           --dtype bfloat16 \
           --gpu-memory-utilization 0.95 \
           --max-model-len 8192 \
           --host 0.0.0.0 \
           --port 8000 \
           --tensor-parallel-size 1 \
           --served-model-name Riva-Translate-4B-Instruct-v1.1

If you are using DGX Spark or Jetson Thor, please use the NVIDIA vLLM container (nvcr.io/nvidia/vllm:25.12.post1-py3) instead. On Jetson Thor, be sure to include --runtime nvidia when running the Docker container.

# On DGX Spark or Jetson Thor
# (on DGX Spark, remove the "--runtime nvidia" line below)
docker run \
  --runtime nvidia \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  vllm serve nvidia/Riva-Translate-4B-Instruct-v1.1 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --served-model-name Riva-Translate-4B-Instruct-v1.1

On Jetson Thor, the previous vLLM cache is not currently cleaned automatically, so it must be cleared manually. Always run this command on the host before serving any model on Jetson Thor.

sudo sysctl -w vm.drop_caches=3

Here is example client code for the vLLM server.

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Riva-Translate-4B-Instruct-v1.1",
"messages": [
      {"role": "system", "content": "en-zh"},
      {"role": "user", "content": "The GRACE mission is a collaboration between NASA and the German Aerospace Center."}
    ]
}'
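
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server from the commands above is listening on localhost:8000 and returns an OpenAI-style chat completion. `build_payload` and `translate` are hypothetical helper names introduced for this example.

```python
import json
import urllib.request

# Assumption: the server launched above is reachable at this address.
URL = "http://localhost:8000/v1/chat/completions"


def build_payload(tag: str, text: str) -> dict:
    """Build the same JSON body as the curl example: tag as system, text as user."""
    return {
        "model": "Riva-Translate-4B-Instruct-v1.1",
        "messages": [
            {"role": "system", "content": tag},
            {"role": "user", "content": text},
        ],
    }


def translate(tag: str, text: str, url: str = URL) -> str:
    """POST the payload and return the assistant message content."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(tag, text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses place the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `translate("en-zh", "The GRACE mission is a collaboration between NASA and the German Aerospace Center.")` should return the translated sentence as a string.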

Chat Template Structure

{%- set language_pairs = {
  'en-zh-cn': {'source': 'English', 'target': 'Simplified Chinese'},
  'en-zh': {'source': 'English', 'target': 'Simplified Chinese'},
  'en-zh-tw': {'source': 'English', 'target': 'Traditional Chinese'},
  'en-ar': {'source': 'English', 'target': 'Arabic'},
  'en-de': {'source': 'English', 'target': 'German'},
  'en-es': {'source': 'English', 'target': 'European Spanish'},
  'en-es-es': {'source': 'English', 'target': 'European Spanish'},
  'en-es-us': {'source': 'English', 'target': 'Latin American Spanish'},
  'en-fr': {'source': 'English', 'target': 'French'},
  'en-ja': {'source': 'English', 'target': 'Japanese'},
  'en-ko': {'source': 'English', 'target': 'Korean'},
  'en-ru': {'source': 'English', 'target': 'Russian'},
  'en-pt': {'source': 'English', 'target': 'Brazilian Portuguese'},
  'en-pt-br': {'source': 'English', 'target': 'Brazilian Portuguese'},
  'zh-en': {'source': 'Simplified Chinese', 'target': 'English'},
  'zh-cn-en': {'source': 'Simplified Chinese', 'target': 'English'},
  'zh-tw-en': {'source': 'Traditional Chinese', 'target': 'English'},
  'ar-en': {'source': 'Arabic', 'target': 'English'},
  'de-en': {'source': 'German', 'target': 'English'},
  'es-en': {'source': 'European Spanish', 'target': 'English'},
  'es-es-en': {'source': 'European Spanish', 'target': 'English'},
  'es-us-en': {'source': 'Latin American Spanish', 'target': 'English'},
  'fr-en': {'source': 'French', 'target': 'English'},
  'ja-en': {'source': 'Japanese', 'target': 'English'},
  'ko-en': {'source': 'Korean', 'target': 'English'},
  'ru-en': {'source': 'Russian', 'target': 'English'},
  'pt-en': {'source': 'Brazilian Portuguese', 'target': 'English'},
  'pt-br-en': {'source': 'Brazilian Portuguese', 'target': 'English'},
} -%}

{%- set system_message = '' -%}
{%- set source_lang = '' -%}
{%- set target_lang = '' -%}

{%- if messages[0]['role'] == 'system' -%}
  {%- set lang_pair = messages[0]['content'] | trim -%}
  {%- set messages = messages[1:] -%}
  {%- if lang_pair in language_pairs -%}
    {%- set source_lang = language_pairs[lang_pair]['source'] -%}
    {%- set target_lang = language_pairs[lang_pair]['target'] -%}
    {%- set system_message = 'You are an expert at translating text from ' + source_lang + ' to ' + target_lang + '.' -%}
  {%- else -%}
    {%- set system_message = 'You are a translation expert.' -%}
  {%- endif -%}
{%- endif -%}

{{- '<s>System\n' + system_message + '</s>\n' -}}

{%- for message in messages -%}
  {%- if (message['role'] in ['user']) != (loop.index0 % 2 == 0) -%}
    {{- raise_exception('Conversation roles must alternate between user and assistant') -}}
  {%- elif message['role'] == 'user' -%}
    {%- set user_content = (
          target_lang
          and 'What is the ' + target_lang + ' translation of the sentence: ' + message['content'] | trim
          or message['content'] | trim
        ) -%}
    {{- '<s>User\n' + user_content + '</s>\n' -}}
  {%- elif message['role'] == 'assistant' -%}
    {{- '<s>Assistant\n' + message['content'] | trim + '</s>\n' -}}
  {%- endif -%}
{%- endfor -%}

{%- if add_generation_prompt -%}
  {{ '<s>Assistant\n' }}
{%- endif -%}
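
To make the template's effect concrete, the sketch below approximates in plain Python what it renders for a single system tag plus one user message with `add_generation_prompt` enabled. It is an illustration rather than a replacement for `apply_chat_template`: `LANGUAGE_PAIRS` is a small excerpt of the template's mapping, `render_prompt` is a hypothetical name, and multi-turn handling and role validation are omitted.

```python
# Excerpt of the template's language_pairs mapping (only a few entries shown).
LANGUAGE_PAIRS = {
    "en-zh": ("English", "Simplified Chinese"),
    "en-de": ("English", "German"),
    "ja-en": ("Japanese", "English"),
}


def render_prompt(tag: str, text: str) -> str:
    """Approximate the template output for one system tag and one user message."""
    pair = LANGUAGE_PAIRS.get(tag.strip())
    if pair:
        source, target = pair
        system = f"You are an expert at translating text from {source} to {target}."
        # Known tag: the user text is wrapped in a translation question.
        user = f"What is the {target} translation of the sentence: {text.strip()}"
    else:
        # Unknown tag: generic system message, user text passed through trimmed.
        system = "You are a translation expert."
        user = text.strip()
    return f"<s>System\n{system}</s>\n<s>User\n{user}</s>\n<s>Assistant\n"


print(render_prompt("en-de", "Good morning."))
```

Note the fallback branch: a system message that is not a recognized tag does not raise an error, so a mistyped tag yields a generic prompt with no target language specified.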

Inference

Engine: HF, vLLM
Test Hardware: NVIDIA A100, H100 80GB, Jetson Thor, DGX Spark

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Technical Limitations & Mitigation:

Accuracy varies based on the characteristics of input (Domain, Use Case, Noise, Context, etc.). Grammar errors and semantic issues may be present. As a potential mitigation, the user can change the prompt to get a better translation.

Use Case Restrictions:

Abide by the NVIDIA Community Model License.
