Latest post
  • April 12, 2024
  • 3 min read

Easily Migrating LLM Inference Serving from vLLM to Friendli Container


  • April 8, 2024
  • 3 min read

Building Your RAG Application on LlamaIndex with Friendli Engine: A Step-by-Step Guide

RAG
LlamaIndex
  • April 3, 2024
  • 6 min read

Improve Latency and Throughput with Weight-Activation Quantization in FP8

WAQ
FP8
  • February 28, 2024
  • 3 min read

Running Quantized Mixtral 8x7B on a Single GPU

Mixtral
AWQ
  • February 20, 2024
  • 4 min read

Serving Performances of Mixtral 8x7B, a Mixture of Experts (MoE) Model

Mixtral
MoE
  • February 15, 2024
  • 4 min read

Which Quantization to Use to Reduce the Size of LLMs?

AWQ
Quantization
LLM
  • February 7, 2024
  • 2 min read

Friendli TCache: Optimizing LLM Serving by Reusing Computations

LLM
Serving
  • February 2, 2024
  • 4 min read

Grouped Query Attention (GQA) vs. Multi Head Attention (MHA): Optimizing LLM Inference Serving

GQA
MHA
MQA
  • January 24, 2024
  • 3 min read

Faster and Cheaper Mixtral 8×7B on Friendli Serverless Endpoints

LLM
Serving
  • January 12, 2024
  • 3 min read

The LLM Serving Engine Showdown: Friendli Engine Outshines

LLM
Serving Engine
  • January 4, 2024
  • 2 min read

Friendli Serverless Endpoints: Unleashing Generative AI for Everyone

inference
generative AI models
  • December 11, 2023
  • 3 min read

Groundbreaking Performance of the Friendli Engine for LLM Serving on an NVIDIA H100 GPU

LLM
NVIDIA H100
  • November 16, 2023
  • 3 min read

Simultaneously Serving Multiple LoRAs on a single GPU with Friendli Engine

LoRA
multi-LoRA
  • November 7, 2023
  • 2 min read

Faster serving of the 4-bit quantized Llama 2 70B model with fewer GPUs: Friendli Engine vs. vLLM

Quantization
Large Language Models
  • October 30, 2023
  • 3 min read

Comparing two LLM serving frameworks: Friendli Engine vs. vLLM

LLM
Inference
Serving
  • October 27, 2023
  • 4 min read

Chat Docs: A RAG Application with Friendli Engine and LangChain

LangChain
Large Language Models
LLM
  • October 27, 2023
  • 3 min read

LangChain Integration with Friendli Dedicated Endpoints

LangChain
Large Language Models
Model Serving
  • October 26, 2023
  • 3 min read

Retrieval-Augmented Generation: A Dive into Contextual AI

Large Language Models
Model Serving
LangChain
  • October 23, 2023
  • 3 min read

Unlocking Efficiency of Serving LLMs with Activation-aware Weight Quantization (AWQ) on Friendli Engine

Quantization
Large Language Models
Transformers
  • October 16, 2023
  • 4 min read

Understanding Activation-Aware Weight Quantization (AWQ): Boosting Inference Serving Efficiency in LLMs

Quantization
Large Language Models
Transformers
  • September 27, 2023
  • 2 min read

Iteration batching (a.k.a. continuous batching) to increase LLM inference serving throughput

LLM
LLM Serving
Generative AI Tools
  • July 13, 2023
  • 5 min read

Accelerating LLM Training with Memory-Balanced Pipeline Parallelism

Large Language Models
Transformers
Distributed Systems
  • July 3, 2023
  • 2 min read

Friendli Engine's Enriched Coverage for Sought-After LLMs: MPT, LLaMA, and Dolly

Transformers
Generative Model
Large Model
  • June 27, 2023
  • 3 min read

Get an Extra Speedup of LLM Inference with Integer Quantization on Friendli Engine

Quantization
Transformers
Generative Model
  • January 17, 2023
  • 3 min read

Fine-tuning and Serving CodeGen, a Code Generation Model, with Friendli Engine

CodeGen
MLOps
Transformers
  • November 1, 2022
  • 1 min read

Save on Training Costs of Generative AI with PeriFlow

Machine Learning
AI
VC
  • October 8, 2022
  • 2 min read

Serve generative AI models like T5 faster than ever with Friendli Engine (32.8x faster for T5–3B)

Generative AI
Transformers
MLOps
  • August 4, 2022
  • 2 min read

Friendli Engine: How Good is it on Small Models?

Machine Learning
Transformers
Generative Model
  • July 18, 2022
  • 7 min read

Friendli Engine: How to Serve Large-scale Transformer Models

AI
Machine Learning
System Architecture
  • May 20, 2022
  • 3 min read

Introducing GPT-FAI 13B: A Large-scale Language Model Trained with FriendliAI’s PeriFlow

GPT-3
MLOps
MLOps Platform