- July 3, 2023
- 2 min read
Friendli Engine's Enriched Coverage for Sought-After LLMs: MPT, LLaMA, and Dolly
We have some exciting news to share!
As you probably know, our Friendli Engine supports various LLMs, including GPT and T5. We have now added support for three more highly sought-after open-source models: MPT [1], LLaMA [2], and Dolly [3].
MPT
MosaicML provides tools that streamline the process of training machine learning models and has recently open-sourced several LLMs. Recognizing its value, Databricks announced the acquisition of MosaicML for $1.3B [4].
MosaicML’s MPT-7B [5] and MPT-30B [1] have been trained using state-of-the-art techniques such as ALiBi and FlashAttention. MPT-30B, in particular, supports long-context inference thanks to the 8K context window used during training. It also stands out as the first public model trained on an NVIDIA H100 cluster.
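To give a sense of what ALiBi does, here is a minimal, illustrative sketch of the linear attention bias it adds; this is not MosaicML's or Friendli Engine's implementation, just the core idea that lets an ALiBi model extrapolate beyond its training context length:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Illustrative ALiBi bias: each head adds a fixed negative slope
    times the key-query distance to its attention scores, penalizing
    attention to tokens that are farther in the past."""
    # Geometric, head-specific slopes as described in the ALiBi paper.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    # Signed distance of each key position relative to each query position.
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]  # (seq, seq)
    distance = distance.clamp(max=0)                    # only look backward
    # Bias of shape (num_heads, seq_len, seq_len), added to raw attention scores.
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=16)
print(bias.shape)  # torch.Size([8, 16, 16])
```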
LLaMA
LLaMA is a collection of foundation models from Meta, available in several parameter sizes: 7B, 13B, 33B, and 65B. Remarkably, the LLaMA-13B model surpasses the 175B-parameter GPT-3 on certain tasks [2], despite having an order of magnitude fewer parameters.
The true value of LLaMA lies in its contribution to the research community: the training methodology, including the model architecture and code, is openly shared. This transparency fosters a collaborative environment where researchers can either fine-tune existing LLaMA models or build their own models from scratch by adopting LLaMA’s insights. For example, Alpaca [6], Vicuna [7], Gorilla [8], and Koala [9] are fine-tuned derivatives of the LLaMA models, while RedPajama [10] is a fully open-source reproduction of LLaMA.
Dolly
Dolly is an open-source language model developed by Databricks, based on EleutherAI’s Pythia model [11]. In addition to the model checkpoint, Databricks introduced ‘databricks-dolly-15k’ [12], a new high-quality, human-generated instruction dataset that played a crucial role in fine-tuning Dolly. Thanks to this dataset, Dolly is the first open-source instruction-following language model licensed for both research and commercial use.
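As an illustration, here is a minimal sketch of how one might inspect the databricks-dolly-15k dataset with the Hugging Face datasets library. The dataset ID comes from [12]; the field names reflect the public dataset card and are shown here only as an example:

```python
from datasets import load_dataset

# Minimal sketch: load the databricks-dolly-15k instruction dataset [12]
# and look at one record. Each record pairs a human-written instruction
# (optionally with context) with a reference response and a task category.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

example = dolly[0]
print(example["instruction"])
print(example["context"])
print(example["response"])
print(example["category"])
```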
In summary, Friendli Engine supports many LLMs and can now serve MPT, LLaMA, and Dolly. It also supports various data types, including fp32, fp16, bf16, and int8 (for int8, please refer to our recent blog post!), as well as tensor and pipeline parallelism for a range of serving environments. Enjoy Friendli Engine's high performance while serving LLMs like MPT, LLaMA, and Dolly!
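For illustration, the sketch below loads one of these checkpoints in a reduced-precision data type using the generic Hugging Face Transformers API. This is not Friendli Engine's own serving API; the checkpoint name and dtype are simply example choices to show how a model can be prepared in fp32, fp16, or bf16 before being deployed for inference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint: the public MPT-7B model from [5].
model_name = "mosaicml/mpt-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # fp32, fp16, or bf16 are common choices
    trust_remote_code=True,       # MPT ships custom modeling code
)

inputs = tokenizer("Friendli Engine now serves", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```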
For more information about FriendliAI, check out the link.
To learn more about Friendli Engine, check out the link.
[1] https://www.mosaicml.com/blog/mpt-30b
[2] Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).
[4] https://www.mosaicml.com/blog/mosaicml-databricks-generative-ai-for-all
[5] https://www.mosaicml.com/blog/mpt-7b
[6] https://crfm.stanford.edu/2023/03/13/alpaca.html
[7] https://lmsys.org/blog/2023-03-30-vicuna/
[8] https://gorilla.cs.berkeley.edu/
[9] https://bair.berkeley.edu/blog/2023/04/03/koala/
[10] https://www.together.xyz/blog/redpajama
[12] https://huggingface.co/datasets/databricks/databricks-dolly-15k
Written by
FriendliAI Tech & Research