Changelog - Friendli Docs

November, 2025

Nov 27

Feature Availability Update

Dedicated Endpoints’ Basic plan users can now access the following features that were previously available only to Enterprise plan users:

Request count auto-scaling: Scale endpoints based on request count. Request count scaling strategy adjusts worker numbers according to total requests in the queue and in progress. Read more
Multi-LoRA serving: Serve multiple LoRA adapters simultaneously on a single endpoint, allowing you to use different fine-tuned models without additional GPU resources. Read more
Metrics: Track, monitor, and optimize your inference deployment.
Logs: Track logs and spot issues in real time.

Nov 21

New Model Family

Added support for the following new model families:

FluxKontextPipeline (e.g., black-forest-labs/FLUX.1-Kontext-dev)
Olmo3ForCausalLM (e.g., allenai/Olmo-3-32B-Think)
LightOnOCRForConditionalGeneration (e.g., lightonai/LightOnOCR-1B-1025)
PaddleOCRVLForConditionalGeneration (e.g., PaddlePaddle/PaddleOCR-VL)
DeepseekOCRForCausalLM (e.g., deepseek-ai/DeepSeek-OCR)

Nov 7

New Model Family

Added support for the following new model families:

Qwen3VLForConditionalGeneration (e.g., Qwen/Qwen3-VL-4B-Instruct)
Qwen3VLMoeForConditionalGeneration (e.g., Qwen/Qwen3-VL-30B-A3B-Instruct)
GraniteMoeHybridForCausalLM (e.g., ibm-granite/granite-4.0-h-small)
DotsOCRForCausalLM (e.g., rednote-hilab/dots.ocr)

Nov 1

Pricing Update

We have changed the pricing model for Qwen/Qwen3-235B-A22B-Instruct-2507 to token-based pricing.

October, 2025

Oct 31

Model Release

We now support the following serverless model.

zai-org/GLM-4.6

September, 2025

Sep 15

New Model Family

Added support for the following new model families:

Qwen3NextForCausalLM (e.g., Qwen/Qwen3-Next-80B-A3B-Instruct)
HunYuanDenseV1ForCausalLM (e.g., tencent/Hunyuan-MT-7B)
ApertusForCausalLM (e.g., swiss-ai/Apertus-8B-Instruct-2509)
SeedOssForCausalLM (e.g., ByteDance-Seed/Seed-OSS-36B-Instruct)

Sep 12

Custom Chat Template Support

We now support custom chat formatting. You can paste or upload a custom Jinja template during endpoint creation. Read more

4-Bit Online Quantization Support

We now support 4-bit online quantization. By enabling this feature, you can efficiently run models on smaller instances with negligible quality impact. Read more

Sep 10

Reasoning Parsing Support

We now support reasoning parsing. By enabling the feature, the response will provide a separate reasoning_content field rather than including the reasoning content in the content field. Read more

Sep 8

Model Deprecation

We have deprecated the following serverless model.

K-intelligence/Midm-2.0-Base-Instruct

Sep 4

Model Deprecation

We have deprecated the following serverless model.

K-intelligence/Midm-2.0-Mini-Instruct

Sep 1

B200 Hardware Support

We now support NVIDIA B200 GPUs alongside existing A100, H100, and H200 GPUs. Read more

August, 2025

Aug 22

New built-in integration w/ Linkup

New built-in web-search tool integration with Linkup has been added. Read more

New Model Family

Added support for the following new model family:

GptOssForCausalLM (e.g., openai/gpt-oss-20b )

Aug 19

New Auto-Scaling Type ‘Request count’ Added

Enterprise plan users can now choose to scale their endpoints based on request count. Request count scaling strategy adjusts worker numbers according to total requests in the queue and in progress.

Aug 8

Increased Output Token Limits for Reasoning Models

We have increased the output token limits for reasoning models on Serverless endpoints, allowing longer reasoning outputs to be generated.

New Endpoint Feature ‘N-GRAM Speculative Decoding’

Users can now enable N-GRAM speculative decoding for their endpoints. For predictable tasks, this can deliver substantial performance gains. Read more

Aug 1

New Model Family

Added support for the following new model family:

HyperCLOVAXForCausalLM (e.g., naver-hyperclovax/HyperCLOVAX-SEED-Think-14B )

Aug 1

Model Release

We now support the following serverless models.

Qwen/Qwen3-235B-A22B-Thinking-2507
Qwen/Qwen3-235B-A22B-Instruct-2507
skt/A.X-4.0
skt/A.X-3.1
naver-hyperclovax/HyperCLOVAX-SEED-Think-14B

July, 2025

Jul 25

New Endpoint Feature ‘Online Quantization’

Users can now quantize their model endpoints without any preparations and accelerate inference. Read more

Jul 14

Model Release

LG AI Research has partnered with FriendliAI to bring the latest version of EXAONE 4.0. Read more

LGAI-EXAONE/EXAONE-4.0.1-32B

Jul 11

Model Release

We now support the following serverless model.

deepseek-ai/DeepSeek-R1-0528

Jul 8

New Model Family

Added support for the following new model families:

Dots1ForCausalLM (e.g., rednote-hilab/dots.llm1.inst)
Glm4vForConditionalGeneration (e.g., zai-org/GLM-4.1V-9B-Thinking)
KeyeForConditionalGeneration (e.g., Kwai-Keye/Keye-VL-8B-Preview)
HunYuanMoEV1ForCausalLM (e.g., tencent/Hunyuan-A13B-Instruct)
PhiMoEForCausalLM (e.g., microsoft/Phi-mini-MoE-instruct)
MiniMaxM1ForCausalLM (e.g., MiniMaxAI/MiniMax-M1-80k)
Ernie4_5_MoeForCausalLM (e.g., baidu/ERNIE-4.5-21B-A3B-Thinking)
Ernie4_5_ForCausalLM (e.g., baidu/ERNIE-4.5-0.3B-PT)

Jul 3

New Model Family

Added support for the following new model family:

Exaone4ForCausalLM (e.g., LGAI-EXAONE/EXAONE-4.0.1-32B )

​November, 2025

​Feature Availability Update

​New Model Family

​New Model Family

​Pricing Update

​October, 2025

​Model Release

​September, 2025

​New Model Family

​Custom Chat Template Support

​4-Bit Online Quantization Support

​Reasoning Parsing Support

​Model Deprecation

​Model Deprecation

​B200 Hardware Support

​August, 2025

​New built-in integration w/ Linkup

​New Model Family

​New Auto-Scaling Type ‘Request count’ Added

​Increased Output Token Limits for Reasoning Models

​New Endpoint Feature ‘N-GRAM Speculative Decoding’

​New Model Family

​Model Release

​July, 2025

​New Endpoint Feature ‘Online Quantization’

​Model Release

​Model Release

​New Model Family

​New Model Family

November, 2025

Feature Availability Update

New Model Family

New Model Family

Pricing Update

October, 2025

Model Release

September, 2025

New Model Family

Custom Chat Template Support

4-Bit Online Quantization Support

Reasoning Parsing Support

Model Deprecation

Model Deprecation

B200 Hardware Support

August, 2025

New built-in integration w/ Linkup

New Model Family

New Auto-Scaling Type ‘Request count’ Added

Increased Output Token Limits for Reasoning Models

New Endpoint Feature ‘N-GRAM Speculative Decoding’

New Model Family

Model Release

July, 2025

New Endpoint Feature ‘Online Quantization’

Model Release

Model Release

New Model Family

New Model Family