Supercharge
Generative AI Inference
Efficient, fast, and reliable generative AI inference solution for production
Accelerate generative AI inference
Friendli Engine provides fast and low-cost inference
GROUNDBREAKING
PERFORMANCE
Our cutting-edge technologies make this possible.
Iteration Batching
A groundbreaking optimization technique developed in-house
(Also known as Continuous Batching)
Friendli DNN Library
Optimized GPU kernels for generative AI
Friendli TCache
Intelligently reusing computational results
Native Quantization
Efficient serving without compromising accuracy
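Iteration batching (continuous batching) schedules work at the granularity of a single generation step: finished sequences leave the batch immediately, and waiting requests join as soon as a slot frees up, instead of stalling behind the longest sequence in a static batch. A purely conceptual sketch of the idea (not the engine's actual implementation):

```python
from collections import deque

def run_continuous_batching(requests, max_batch=4):
    """Toy scheduler: each request is (name, tokens_to_generate).
    At every iteration (one token step), finished requests leave the
    batch immediately and queued requests are admitted right away."""
    queue = deque(requests)
    active = {}          # name -> tokens remaining
    finished_order = []
    steps = 0
    while queue or active:
        # Admit new requests as soon as batch slots free up.
        while queue and len(active) < max_batch:
            name, tokens = queue.popleft()
            active[name] = tokens
        # One decoding iteration: every active sequence emits one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:      # done: slot frees this iteration,
                del active[name]       # not at the end of a static batch
                finished_order.append(name)
        steps += 1
    return finished_order, steps
```

With a static batch, a short request admitted alongside a long one would occupy its slot until the whole batch drained; here it exits as soon as it finishes, which is where the throughput gain comes from.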
Lightning-fast performance for Large Language Models
Our engine accelerates open-source and custom LLMs. The engine supports a wide array of quantization techniques, including FP8, INT8, and AWQ, across all models. Take advantage of our optimized open-source models or leverage Friendli Engine for your business with a custom model.
Llama 3.1
Arctic
Gemma 2
Mixtral
All-in-one platform for AI agents
Build and serve compound AI systems for complex tasks
Deploy custom models effortlessly
Serve custom models tailored to your specific needs. You can upload your model or import it from either W&B Registry or Hugging Face Model Hub.
Train and fine-tune models
Effortlessly fine-tune and deploy your models. Use PEFT to efficiently tune your models and deploy them using Multi-LoRA serving.
Monitor and debug LLM performance
Our advanced monitoring and debugging tools empower you to understand your model, identify issues, and optimize performance.
Model-agnostic function calls and structured outputs
Build reliable API integrations for your AI agent using function calling or structured outputs. Our engine guarantees consistent results regardless of the model you use.
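Structured outputs constrain the model's response to a JSON Schema you supply, so downstream code can parse it without defensive checks. A minimal sketch of such a request, assuming an OpenAI-compatible chat completions API; the base URL, model name, and environment variable below are illustrative assumptions, not confirmed product details:

```python
import json
import os

# JSON Schema describing the exact structure we want the model to return.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "unit"],
}

def build_request(prompt: str) -> dict:
    """Build a structured-output chat request in the OpenAI-compatible
    shape. The model name below is an illustrative assumption."""
    return {
        "model": "meta-llama-3.1-8b-instruct",  # assumed identifier
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "weather_query", "schema": WEATHER_SCHEMA},
        },
    }

if __name__ == "__main__" and os.environ.get("FRIENDLI_TOKEN"):
    # Hypothetical call against an assumed OpenAI-compatible endpoint.
    from openai import OpenAI
    client = OpenAI(
        base_url="https://api.friendli.ai/serverless/v1",  # assumption
        api_key=os.environ["FRIENDLI_TOKEN"],
    )
    resp = client.chat.completions.create(**build_request("Weather in Paris, in celsius?"))
    print(json.loads(resp.choices[0].message.content))
```

Because the schema, not the model, defines the output shape, the same request works regardless of which model serves it.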
Seamless data integration for real-time RAG
Enhance your AI’s knowledge in real time with our Retrieval-Augmented Generation (RAG) system. Effortlessly keep your agents current with up-to-date information, reducing hallucinations.
Integrate predefined tools or provide your own
Empower your AI agent’s abilities with tools. Choose from our extensive library of predefined tools or seamlessly integrate your own.
Ready for production
Our services are here to help you scale your business with ease.
Guaranteed SLA and performance
Experience peace of mind with consistent performance and high reliability. We are committed to delivering exceptional service so you can focus on growing your business.
Maximum security in our cloud or yours
Protect your data with our robust security measures. Whether you choose our cloud or prefer to operate in your infrastructure, we prioritize your security and compliance.
Autoscale on growing demands
Stay ahead of the curve with our intelligent autoscaling capabilities. Our system automatically adjusts resources to ensure optimal performance, allowing you to scale as you grow.
NextDay AI’s personalized character chatbots process ~0.5 trillion tokens per month, incurring high H100 GPU costs. By using Friendli Container, they instantly cut their GPU costs by more than 50%. NextDay AI’s chatbot is ranked among the top 20 generative AI web products by Andreessen Horowitz (a16z).
Meet FriendliAI’s partners
With our partners, we deliver reliable and efficient solutions customized to your specific needs.
Friendli Suite
The complete platform that unlocks your full generative AI potential
01
Friendli Dedicated Endpoints
Build and run LLMs/LMMs on autopilot in the cloud
02
Friendli Container
Serve generative AI in your secure environment
03
Friendli Serverless Endpoints
Access fast and affordable generative AI inference
Friendli Dedicated Endpoints
Build and run LLMs/LMMs on autopilot in the cloud
Easy and scalable deployment for production workloads
With our user-friendly interface and robust infrastructure, you can seamlessly transition from development to production with minimal effort. Dedicated Endpoints simplify LLM operation, allowing you to focus on your business goals. Our integrated dashboard provides complete insight into endpoint performance over time.
Fine-tune custom models with proprietary datasets
Create highly specialized models tailored to your industry, use case, or company requirements by fine-tuning your AI models with your proprietary datasets. Leverage the Parameter-Efficient Fine-Tuning (PEFT) method to reduce training costs, or try integrating your Weights & Biases account to monitor the training process continuously.
Auto-scale your endpoints efficiently
Our system dynamically adjusts resources based on your real-time demand, ensuring stable performance during peak times and overall cost efficiency. With the added capability to scale down to zero, you can eliminate unnecessary costs during periods of low activity. This intelligent auto-scaling feature prevents both under-provisioning and over-provisioning of expensive GPU resources.
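The scale-to-zero behavior described above can be illustrated with a toy scaling rule (a conceptual sketch only; the actual autoscaler's policy and parameters are not public in this text):

```python
import math

def desired_replicas(queued_requests: int, reqs_per_replica: int,
                     idle_seconds: float,
                     scale_to_zero_after: float = 300.0) -> int:
    """Toy autoscaling rule: provision enough replicas for current load,
    keep one warm replica through short lulls, and release all GPUs
    after a sustained idle period (scale to zero). The 300s idle
    window is an illustrative assumption."""
    if queued_requests == 0:
        return 0 if idle_seconds >= scale_to_zero_after else 1
    return math.ceil(queued_requests / reqs_per_replica)
```

The two branches capture the two failure modes the paragraph mentions: the ceiling division avoids under-provisioning under load, and the idle window avoids paying for GPUs that serve no traffic.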
Dedicated GPU resource management
Dedicated Endpoints provides exclusive access to high-performance GPUs, giving you consistent computing power without contention or performance fluctuations. By eliminating resource sharing, you can rely on predictable performance levels to enhance your AI workloads, improving productivity and reliability.
Maximum privacy and security
Running models within your infrastructure allows you to maintain complete control of your data, ensuring that sensitive information never leaves your environment.
Integrate with internal systems
Our solution offers seamless Kubernetes integration, facilitating orchestration and observability. You can easily integrate Prometheus and Grafana for monitoring.
Save big on GPU costs
Whether on-premise or through a managed cluster, Friendli Container processes heavy request loads efficiently, letting you serve at larger scale with fewer GPUs.
Friendli Serverless Endpoints
Access fast and affordable generative AI inference
250 tokens/sec at $0.1/1M tokens
Serverless Endpoints delivers output tokens at a staggering 250 tokens per second with per-token billing as low as $0.1 per million tokens for the Llama 3.1 8B model.
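At per-token billing, estimating spend is simple arithmetic. A quick illustration using the quoted $0.1 per million tokens rate (actual billing terms and tiers may differ):

```python
PRICE_PER_MILLION_TOKENS = 0.10  # USD, quoted rate for Llama 3.1 8B

def monthly_cost(tokens_per_month: int) -> float:
    """Estimated spend at flat per-token billing."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Example: 2 billion tokens a month.
print(f"${monthly_cost(2_000_000_000):.2f}")  # → $200.00
```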
Supports 128K context length
Build complex applications that require in-depth understanding and context retention on Serverless Endpoints. Our Llama 3.1 endpoints support the full 128K context length.
Easily build AI agents with tool-assist
Are you building an AI agent that can search the web, integrate knowledge bases, and solve complex problems using many tools? Serverless Endpoints has it all.
Read more from our blogs
- Building AI Agents Using Function Calling with LLMs (July 22, 2024 · 6 min read)
- Measuring LLM Serving Performance with LLMServingPerfEvaluator (May 22, 2024 · 8 min read)