• May 11, 2026
  • 3 min read

FriendliAI Expands to San Francisco to Scale Frontier AI Inference for Open-Weight and Custom Models

TL;DR
  • FriendliAI is opening a 7,000 sq ft San Francisco office at 20 Hawthorne St (SoMa), next to SFMOMA
  • Expansion targets the inference bottleneck as AI agents consume 5–30× more tokens per task than chatbots
  • Open-weight models (GLM-5.1, Kimi K2.6, DeepSeek V4, Nemotron 3) now rival closed frontier models
  • Hiring across go-to-market, partnerships, and engineering; office doubles as a hub for developer meetups, hackathons, and executive briefings

Today we’re announcing the opening of our new San Francisco office at 20 Hawthorne Street, where we now occupy 7,000 square feet inside the historic Crown Point Press building, around the corner from the San Francisco Museum of Modern Art and the Moscone Center. The new space puts us at the heart of the Bay Area AI ecosystem and closer to the customers, partners, and developers building the next generation of AI applications.

FriendliAI’s 7,000-square-foot SoMa space anchors our U.S. expansion.

An inflection point for inference

Our expansion lands at an inflection point for AI inference. Two forces are driving the shift. First, AI agents, which plan, reason across many steps, and call tools on every turn, require five to thirty times more tokens per task than chatbots, and that consumption compounds as agents move from pilots into always-on production workflows. Second, the latest open-weight models, including Z.ai's GLM-5.1, Moonshot AI's Kimi K2.6, DeepSeek V4, and NVIDIA Nemotron 3, now match or exceed leading closed models like Anthropic's Claude Opus at a fraction of the cost, and custom fine-tunes align even more tightly with enterprise use cases. Production-grade inference infrastructure has become the bottleneck, and the prize.

“San Francisco is the epicenter of AI innovation, and a deeper presence here lets us partner with the customers and developers shaping what comes next,” said FriendliAI CEO Byung-Gon Chun. “The industry is no longer asking whether to build with AI — it’s asking how to run AI in production, profitably, at scale. FriendliAI, The Frontier AI Inference Cloud, was built for exactly that.”

How we got here

FriendliAI was founded by Professor Byung-Gon Chun and members of his research team at Seoul National University, where they pioneered continuous batching — the inference optimization technique that is now an industry standard. Today FriendliAI runs state-of-the-art open-weight and custom models at production scale with industry-leading throughput, latency, and reliability. Independent benchmarks from Artificial Analysis and OpenRouter rank FriendliAI as the top inference provider for models such as GLM-5.1 and Gemma 4 across output speed, latency, tool calling, and structured outputs. We partner with model creators on launch — most recently as a Day 0 partner for NVIDIA Nemotron 3 and Z.ai’s GLM-5.1 — and with cloud providers including AWS, OCI, and Samsung Cloud Platform on infrastructure to scale globally.

Customers including Twelve Labs and LG are already scaling with FriendliAI in production, and that momentum is translating into rapid business growth. FriendliAI is on a trajectory to grow revenue tenfold this year, with a goal of growing another tenfold the year after, as AI-native and AI-augmented SaaS companies migrate production workloads to its platform. The San Francisco expansion is built to support that trajectory: FriendliAI plans to significantly grow its U.S. team across go-to-market, partnerships, and engineering functions over the coming year.

“Inference is where AI economics are won or lost,” said our Chief Business Officer, Brian Yoo. “Every percentage point of GPU efficiency translates directly to margin, and every millisecond of latency translates to user experience. Putting senior commercial and engineering leadership on the ground in San Francisco lets us move at the speed our customers need as they scale.”

FriendliAI’s new SF office: a hub for the AI builder community

Our bright, loft-style space is also purpose-built as a hub for the AI builder community, hosting developer meetups, hackathons, and executive briefings on the practical realities of deploying inference at scale — from open-weight model deployments and GPU efficiency to multimodal and agentic workloads.

Stay tuned for hackathons, developer days, and event afterparties!


Written by

FriendliAI Tech & Research


General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that actually matters — tokens per dollar — comes out higher even if the hourly GPU rate looks similar on paper. View pricing
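As a back-of-the-envelope illustration of why tokens per dollar, not the hourly GPU rate, is the metric to compare, here is a minimal sketch. All throughput figures and rates below are hypothetical placeholders, not FriendliAI pricing or benchmark numbers:

```python
def tokens_per_dollar(tokens_per_second: float, gpu_hourly_rate: float) -> float:
    """Tokens generated per dollar of GPU time."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / gpu_hourly_rate

# Hypothetical figures: a baseline serving stack vs. a more efficient
# inference engine on a slightly pricier GPU hour.
baseline = tokens_per_dollar(tokens_per_second=1_000, gpu_hourly_rate=4.00)
optimized = tokens_per_dollar(tokens_per_second=2_500, gpu_hourly_rate=4.50)

print(f"baseline:  {baseline:,.0f} tokens/$")   # baseline:  900,000 tokens/$
print(f"optimized: {optimized:,.0f} tokens/$")  # optimized: 2,000,000 tokens/$
```

Even though the optimized setup's hourly rate is higher on paper, its 2.5× throughput more than doubles the tokens delivered per dollar.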

Which models and modalities are supported?

Over 550,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on a model page on the Hugging Face Hub takes you to our one-click model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer — our engineers (not a bot) will reply within one business day.


Explore FriendliAI today