Latest post
  • February 28, 2024
  • 3 min read

Running Quantized Mixtral 8x7B on a Single GPU


  • February 20, 2024
  • 4 min read

Serving Performances of Mixtral 8x7B, a Mixture of Experts (MoE) Model

Mixtral
MoE
  • February 15, 2024
  • 3 min read

Which Quantization to Use to Reduce the Size of LLMs?

AWQ
Quantization
LLM
  • February 7, 2024
  • 2 min read

Friendli TCache: Optimizing LLM Serving by Reusing Computations

LLM
Serving
  • February 2, 2024
  • 4 min read

Grouped Query Attention (GQA) vs. Multi Head Attention (MHA): Optimizing LLM Inference Serving

GQA
MHA
MQA
  • January 24, 2024
  • 3 min read

Faster and Cheaper Mixtral 8×7B on Friendli Serverless Endpoints

LLM
Serving
  • January 12, 2024
  • 3 min read

The LLM Serving Engine Showdown: Friendli Engine Outshines

LLM
Serving Engine
  • January 4, 2024
  • 2 min read

Friendli Serverless Endpoints: Unleashing Generative AI for Everyone

inference
generative AI models
  • December 11, 2023
  • 3 min read

Groundbreaking Performance of the Friendli Engine for LLM Serving on an NVIDIA H100 GPU

LLM
NVIDIA H100
  • November 16, 2023
  • 3 min read

Simultaneously Serving Multiple LoRAs on a Single GPU with Friendli Engine

LoRA
multi-LoRA
  • November 7, 2023
  • 2 min read

Faster serving of the 4-bit quantized Llama 2 70B model with fewer GPUs: Friendli Engine vs. vLLM

Quantization
Large Language Models
  • October 30, 2023
  • 3 min read

Comparing two LLM serving frameworks: Friendli Engine vs. vLLM

LLM
Inference
Serving
  • October 27, 2023
  • 4 min read

Chat Docs: A RAG Application with Friendli Engine and LangChain

LangChain
Large Language Models
LLM
  • October 27, 2023
  • 3 min read

LangChain Integration with Friendli Dedicated Endpoints

LangChain
Large Language Models
Model Serving
  • October 26, 2023
  • 3 min read

Retrieval-Augmented Generation: A Dive into Contextual AI

Large Language Models
Model Serving
LangChain
  • October 23, 2023
  • 3 min read

Unlocking Efficiency of Serving LLMs with Activation-aware Weight Quantization (AWQ) on Friendli Engine

Quantization
Large Language Models
Transformers
  • October 16, 2023
  • 4 min read

Understanding Activation-Aware Weight Quantization (AWQ): Boosting Inference Serving Efficiency in LLMs

Quantization
Large Language Models
Transformers
  • September 27, 2023
  • 2 min read

Iteration batching (a.k.a. continuous batching) to increase LLM inference serving throughput

LLM
LLM Serving
Generative AI Tools
  • July 13, 2023
  • 5 min read

Accelerating LLM Training with Memory-Balanced Pipeline Parallelism

Large Language Models
Transformers
Distributed Systems
  • July 3, 2023
  • 2 min read

Friendli Engine's Enriched Coverage for Sought-After LLMs: MPT, LLaMA, and Dolly

Transformers
Generative Model
Large Model
  • June 27, 2023
  • 3 min read

Get an Extra Speedup of LLM Inference with Integer Quantization on Friendli Engine

Quantization
Transformers
Generative Model
  • January 17, 2023
  • 2 min read

Fine-tuning and Serving CodeGen, a Code Generation Model, with Friendli Engine

CodeGen
MLOps
Transformers
  • November 1, 2022
  • 1 min read

Save on Training Costs of Generative AI with PeriFlow

Machine Learning
AI
VC
  • October 8, 2022
  • 1 min read

Serve generative AI models like T5 faster than ever with Friendli Engine (32.8x faster for T5-3B)

Generative AI
Transformers
MLOps
  • August 4, 2022
  • 2 min read

Friendli Engine: How Good is it on Small Models?

Machine Learning
Transformers
Generative Model
  • July 18, 2022
  • 7 min read

Friendli Engine: How to Serve Large-scale Transformer Models

AI
Machine Learning
System Architecture
  • May 20, 2022
  • 2 min read

Introducing GPT-FAI 13B: A Large-scale Language Model Trained with FriendliAI’s PeriFlow

GPT-3
MLOps
MLOps Platform