Neura-Tech-AI

Neuron-Distill-Qwen2.5-3B-Instruct


🏢 Organization Identity

Company: Neura Tech AI
Project Name: Neuron-Distill-Qwen2.5-3B-Instruct
Lead Architect: Samarth Anand Pathak

📊 Model Specifications

Architecture: Causal Language Model (Fine-tuned and permanently fused from Qwen2.5-3B-Instruct)
Parameters: ~3.09 Billion
Precision: FP16 (Float16)
Context Window: 32K tokens
Format: ChatML Compatible (Native padding and Chat Templates pre-configured)
License: Subject to the Qwen Research License Agreement (Inherited from the base Qwen2.5 architecture)
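For readers unfamiliar with ChatML, the sketch below shows roughly how a list of chat messages is laid out as plain text before tokenization. It is a hand-written illustration of the general ChatML layout, not this model's actual template: the helper name `render_chatml` is hypothetical, and in real usage `tokenizer.apply_chat_template` should be preferred, since the shipped template may differ in details such as a default system prompt.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Lay out chat messages in the general ChatML style (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant turn tells the model to start its reply here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = render_chatml([
    {"role": "system", "content": "You are Neuron."},
    {"role": "user", "content": "Hello"},
])
print(example)
```

Comparing this output with the string returned by `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to check what the pre-configured template actually produces.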

🎯 Core Capabilities

Multilingual Proficiency: Optimized for contextual understanding across English, Hindi, and code-switched Hindi-English (Hinglish).
Native Identity Alignment: Includes safety and identity alignment so that the model consistently presents itself as an agent of Neura Tech AI.
Production Edge Readiness: Low memory footprint (about 6.18 GB of VRAM for the weights in Float16, i.e. ~3.09 billion parameters at 2 bytes each), which makes it viable on local consumer-grade hardware.
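The ~6.18 GB figure can be reproduced with simple arithmetic. The sketch below assumes it refers to the weights only (parameter count times bytes per parameter, in decimal gigabytes); the function name `weight_memory_gb` is illustrative, and the estimate leaves out activations, the KV cache, and framework overhead, which may be why higher peak figures are seen under load.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Estimate the memory taken by model weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# ~3.09 billion parameters stored as FP16 (2 bytes each).
fp16_gb = weight_memory_gb(3.09e9, 2)
print(f"FP16 weights: {fp16_gb:.2f} GB")  # prints "FP16 weights: 6.18 GB"
```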

📈 Standard Benchmark & Evaluation Setup

To assess Project Neuron's generation stability, execution latency, and instruction-following consistency, use the baseline quantitative evaluation pipeline below.

1. Benchmark Testing Pipeline (`benchmark_eval.py`)

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Neura-Tech-AI/Neuron-Distill-Qwen2.5-3B-Instruct"

print("🎯 Initializing Project Neuron Evaluation Suite...")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")
model.eval()

eval_prompts = [
    "Tell me about Project Neuron in short. What is its scale?",
    "Explain quantum computing in simple Hindi lyrics.",
    "Write a secure python API routing block for model inference."
]

def run_performance_test(prompt):
    """Generate a reply for one prompt and return (latency, tokens/sec, text)."""
    messages = [
        {"role": "system", "content": "You are Neuron, an advanced AI system developed by Neura Tech AI."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    input_len = inputs.input_ids.shape[1]

    # Synchronize before and after so the timer covers only generate().
    torch.cuda.synchronize()
    start_time = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False,  # greedy decoding; temperature has no effect here
            pad_token_id=tokenizer.eos_token_id
        )
    torch.cuda.synchronize()
    latency = time.perf_counter() - start_time

    # Keep only the newly generated tokens (drop the prompt).
    generated_tokens = outputs[0][input_len:]
    token_count = len(generated_tokens)
    tokens_per_second = token_count / latency

    response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
    return latency, tokens_per_second, response

print("\n--- Running Quantitative Evaluation Matrix ---")
for i, prompt in enumerate(eval_prompts, 1):
    lat, tps, resp = run_performance_test(prompt)
    print(f"\n📊 Test Case #{i}: '{prompt}'")
    print(f"⏱️ Latency: {lat:.2f}s | ⚡ Speed: {tps:.2f} tokens/sec")
    print(f"🤖 Output:\n{resp}\n" + "-"*40)
```
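The benchmark script reports latency and speed per prompt. To turn several runs into a single figure, one option is to divide total generated tokens by total time rather than averaging the per-run rates, so that longer runs carry proportionally more weight. The helper below is a minimal sketch of that idea; `summarize_runs` is a hypothetical name, and the sample numbers are made up purely to show the calculation, not measured results.

```python
from statistics import mean


def summarize_runs(runs):
    """Aggregate (latency_seconds, tokens_generated) pairs into summary stats."""
    total_time = sum(latency for latency, _ in runs)
    total_tokens = sum(tokens for _, tokens in runs)
    return {
        "mean_latency_s": mean(latency for latency, _ in runs),
        "overall_tokens_per_s": total_tokens / total_time,
    }


# Placeholder values for illustration only (not benchmark results).
sample_runs = [(3.0, 150), (2.0, 100), (5.0, 150)]
summary = summarize_runs(sample_runs)
print(summary)
```

Excluding the first run from the summary is also worth considering, since the first call to `generate` often includes one-time warm-up costs.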

2. Operational Thresholds

Throughput Speed: Averages about 40-50 tokens/sec under stable CUDA configurations.
VRAM Overhead: Peak VRAM consumption is approximately 10.5 GB to 12 GB during large-batch token processing.
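As a rough sanity check, the throughput range can be converted into an expected generation time for the 150-token limit used in the benchmark script. This is a back-of-the-envelope sketch only: `expected_latency_s` is an illustrative name, and the estimate ignores prompt processing time and assumes the full 150 tokens are generated.

```python
def expected_latency_s(new_tokens, tokens_per_second):
    """Approximate decode time for a given number of new tokens."""
    return new_tokens / tokens_per_second


fastest = expected_latency_s(150, 50)  # upper end of the stated throughput
slowest = expected_latency_s(150, 40)  # lower end of the stated throughput
print(f"Expected latency for 150 tokens: {fastest:.2f}s to {slowest:.2f}s")
```

Measured latencies well outside this range would suggest that the hardware or configuration differs from the one behind the stated thresholds.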

🛠️ Quick Start & Native Slicing Inference

`generate` returns the prompt tokens followed by the newly generated tokens. To keep the system prompt and user message out of the decoded reply, slice the output at the input length, as shown below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Neura-Tech-AI/Neuron-Distill-Qwen2.5-3B-Instruct"

# Load the tokenizer and the merged (fused) model weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example chat payload (the user message is Hinglish, roughly "who are you?")
messages = [
    {"role": "system", "content": "You are Project Neuron, an advanced AI system developed by Neura Tech AI."},
    {"role": "user", "content": "tu kon hai be."}
]

# Render the chat template, tokenize, and move tensors to the model's device
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate with low-temperature nucleus sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.3,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

# Slice at the input length so only the assistant reply is decoded
input_len = inputs.input_ids.shape[1]
clean_reply = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True).strip()

print(f"🤖 Project Neuron Reply:\n{clean_reply}")
```
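The slicing step can be illustrated without loading the model. The sketch below uses plain Python lists as stand-ins for token ID tensors; the IDs are arbitrary placeholders and `strip_prompt` is a hypothetical helper, shown only to make the indexing explicit.

```python
def strip_prompt(output_ids, input_len):
    """Return only the tokens produced after the prompt."""
    return output_ids[input_len:]


prompt_ids = [101, 102, 103]          # placeholder IDs standing in for the prompt
new_ids = [7, 8, 9, 10]               # placeholder IDs standing in for the reply
full_output = prompt_ids + new_ids    # the output holds the prompt, then the reply

reply_ids = strip_prompt(full_output, len(prompt_ids))
print(reply_ids)  # prints [7, 8, 9, 10]
```

Decoding the full output instead of the sliced one would return the system prompt and user message along with the reply.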

📜 License & Usage Limitations

Developer Custom Copyright: Copyright © 2026, Samarth Anand Pathak & Neura Tech AI. All rights reserved. The fine-tuning architectures, dataset processing schemas, and merged checkpoint matrices remain proprietary implementations managed under Neura Tech AI Research Divisions.
Base Model Inherited License: Because this model is built on the open-weights distribution of Qwen2.5-3B-Instruct, any downstream deployment, distribution, or commercial usage of this checkpoint must strictly comply with the terms, conditional clauses, and safety restrictions of the Qwen Research License Agreement issued by Alibaba Cloud.

© 2026 Neura Tech AI. All Rights Reserved.

Model provider: Neura-Tech-AI
Base model: Qwen/Qwen2.5-3B-Instruct (this model is a fine-tune of it)
Modalities: text input, text output

