Latest post

April 12, 2024
3 min read

Easily Migrating LLM Inference Serving from vLLM to Friendli Container

April 8, 2024
3 min read

Building Your RAG Application on LlamaIndex with Friendli Engine: A Step-by-Step Guide

April 3, 2024
6 min read

Improve Latency and Throughput with Weight-Activation Quantization in FP8

February 28, 2024
3 min read

Running Quantized Mixtral 8x7B on a Single GPU

February 20, 2024
4 min read

Serving Performances of Mixtral 8x7B, a Mixture of Experts (MoE) Model

February 15, 2024
4 min read

Which Quantization to Use to Reduce the Size of LLMs?

February 7, 2024
2 min read

Friendli TCache: Optimizing LLM Serving by Reusing Computations

February 2, 2024
4 min read

Grouped Query Attention (GQA) vs. Multi Head Attention (MHA): Optimizing LLM Inference Serving

January 24, 2024
3 min read

Faster and Cheaper Mixtral 8×7B on Friendli Serverless Endpoints

January 12, 2024
3 min read

The LLM Serving Engine Showdown: Friendli Engine Outshines

January 4, 2024
2 min read

Friendli Serverless Endpoints: Unleashing Generative AI for Everyone

generative AI models

December 11, 2023
3 min read

Groundbreaking Performance of the Friendli Engine for LLM Serving on an NVIDIA H100 GPU

November 16, 2023
3 min read

Simultaneously Serving Multiple LoRAs on a single GPU with Friendli Engine

November 7, 2023
2 min read

Faster serving of the 4-bit quantized Llama 2 70B model with fewer GPUs: Friendli Engine vs. vLLM

Large Language Models

October 30, 2023
3 min read

Comparing two LLM serving frameworks: Friendli Engine vs. vLLM

October 27, 2023
4 min read

Chat Docs: A RAG Application with Friendli Engine and LangChain

Large Language Models

October 27, 2023
3 min read

LangChain Integration with Friendli Dedicated Endpoints

Large Language Models

October 26, 2023
3 min read

Retrieval-Augmented Generation: A Dive into Contextual AI

Large Language Models

October 23, 2023
3 min read

Unlocking Efficiency of Serving LLMs with Activation-aware Weight Quantization (AWQ) on Friendli Engine

Large Language Models

October 16, 2023
4 min read

Understanding Activation-Aware Weight Quantization (AWQ): Boosting Inference Serving Efficiency in LLMs

Large Language Models

September 27, 2023
2 min read

Iteration batching (a.k.a. continuous batching) to increase LLM inference serving throughput

Generative AI Tools

July 13, 2023
5 min read

Accelerating LLM Training with Memory-Balanced Pipeline Parallelism

Large Language Models

Distributed Systems

July 3, 2023
2 min read

Friendli Engine's Enriched Coverage for Sought-After LLMs: MPT, LLaMA, and Dolly

Generative Model

June 27, 2023
3 min read

Get an Extra Speedup of LLM Inference with Integer Quantization on Friendli Engine

Generative Model

January 17, 2023
3 min read

Fine-tuning and Serving CodeGen, a Code Generation Model, with Friendli Engine

November 1, 2022
1 min read

Save on Training Costs of Generative AI with PeriFlow

Machine Learning

October 8, 2022
2 min read

Serve generative AI models like T5 faster than ever with Friendli Engine (32.8x faster for T5–3B)

August 4, 2022
2 min read

Friendli Engine: How Good is it on Small Models?

Machine Learning

Generative Model

July 18, 2022
7 min read

Friendli Engine: How to Serve Large-scale Transformer Models

Machine Learning

System Architecture

May 20, 2022
3 min read

Introducing GPT-FAI 13B: A Large-scale Language Model Trained with FriendliAI’s PeriFlow