July 24, 2024
4 min read

Friendli Tools Part 3: Function Calling—How Llama 3 70B Can Outperform GPT-4o

Friendli Tools Series: Part 3 of 3

At FriendliAI, we’re on a mission to make cutting-edge generative AI technologies accessible to everyone. That’s why we are thrilled to introduce Friendli Tools, a powerful feature that enables accurate function calling in most open-source language models!

Thanks to Friendli Tools, Llama 3 70B on Friendli performs on par with OpenAI GPT-4o and Fireworks Firefunction v2 in function calling. It even excels in complex tasks like “parallel multiple” function calling. Our outstanding function calling capabilities, combined with our cost efficiency, position Friendli Suite at the forefront of AI agent development.

In our previous blogs in this Friendli Tools Series, we explored the fundamentals of function calling and how to build AI agents using this technology. If you missed them, be sure to check out Part 1: Function Calling—Connecting LLMs with Functions and APIs and Friendli Tools Part 2: Function Calling—Building AI Agents with Slack Integration and Weather Tool. These posts provide a comprehensive introduction to the concepts and practical applications of function calling with LLMs.

Highlights of Friendli Tools:

Function Calling Accuracy: Excels in “parallel multiple function” tasks, outperforming GPT-4o even without fine-tuning.
Cost Efficiency: Llama 3 70B on Friendli matches the overall performance of GPT-4o at just 4% of the cost ( $0.6 vs$ 15 per 1M output tokens).

Our Friendli Tools blog series concludes with this third and final post, completing our exploration of this new exciting offering. Learn about Friendli Tool’s impressive function calling benchmark results in this article!

The Gorilla Benchmark

We selected the Gorilla LLM function calling dataset to evaluate the function calling accuracy of Friendli Tools. The dataset covers a wide range of domains including math, sports, and finance, and offers a thorough evaluation of models' function calling capabilities in different real-world contexts.

Consider a scenario where a user seeks assistance with real estate inquiries in three cities. A thousand benchmarked scenarios, including this one, are given to the model to evaluate the model’s accuracy in function calling. The model has to correctly answer the user query with the appropriate function calls.

Here’s the example:


Question:

Can you help me find a property in San Francisco, CA that is a condo with 2 bedrooms and fits within my budget range of $500,000 to $800,000? After that, could you also provide an estimated value for a villa in Los Angeles, CA with 3 bedrooms that is 5 years old? Lastly, I would also like to know the estimated value of an apartment in New York, NY with 1 bedroom that is 10 years old.


Answer:

{
    "realestate.find_properties": {
        "location": ["San Francisco, CA", "SF, CA"],
        "propertyType": ["condo"],
        "bedrooms": [2],
        "budget": [
            {
                "min": [500000],
                "max": [800000]
            }
        ]
    },
    "property_valuation.get_1": {
        "location": ["Los Angeles, CA", "LA, CA"],
        "propertyType": ["villa"],
        "bedrooms": [3],
        "age": [5]
    },
    "property_valuation.get_2": {
        "location": ["New York, NY", "NY, NY"],
        "propertyType": ["apartment"],
        "bedrooms": [1],
        "age": [10]
    }
}

Don’t you find the example quite challenging? This example is included in the “parallel multiple function” evaluation category, which Llama 3 70B on Friendli excels at!

The function calling scenarios of the Gorilla benchmark are categorized into four groups: “simple,” “multiple,” “parallel,” and “parallel multiple.” The most complex category, "parallel multiple function," is defined as a combination of the “multiple function” and “parallel function” categories. The "multiple function" tests models on user queries that invoke a call out of 2 to 4 functions. The "parallel function" involves simultaneously executing multiple function calls in response to a single user query.

Benchmark Results

Ready for the Gorilla benchmark results? We compared four models: Llama 3 8B, Llama 3 70B, Firefunction v2, and GPT-4o. The Llama 3 models were tested on the Friendli Suite.

Prepare to be amazed by the benchmark showdown! The results reveal that Llama 3 70B on Friendli consistently achieves top-notch accuracy. It particularly stands out in parallel multiple function calling, outperforming the next-best model by a significant margin of 7%. Notably, the next-best model was created by fine-tuning on Llama 3, whereas the Llama 3 running on Friendli is the vanilla model.

Table

Benchmark results chart

These results highlight that the original Llama 3 models supported by Friendli Tools are comparable to the leading fine-tuned models, specially tailored for function calling applications. Friendli Tools offers an innovative method to enhance LLM function calling without the necessity of fine-tuning.

Friendli Tools supports precise function calling across most language models. By integrating this function calling capability into our LLM inferencing, models like Llama 3 70B can deliver remarkable performance in function calling. Friendli Tools simplifies the process of using custom function calling models, allowing you to create high-performing AI agents without any fine-tuning required.

How to get started

Curious about how to begin using Friendli Tools? Access our documentation to discover a range of detailed guides designed to simplify your onboarding process. Moreover, we host Friendli Tools on the Friendli Serverless Endpoints API which is completely OpenAI-compatible. You can easily use Friendli Tools by switching to our client or changing the model name and base URL in your existing OpenAI client.

Concluding the Friendli Tools Series

Friendli Tools is a fundamental feature for building fast and accurate agents. We’re excited to put this exceptional AI technology into the hands of our community and can’t wait to see what you create!

Explore function calling by reading our full blog series on Friendli Tools. Begin your journey with Part one: Function Calling - Connecting LLMs with Functions and APIs, an essential guide to understanding the basics. Follow it up with Part Two: Building AI Agents Using Function Calling with LLMs to learn how to build AI agents.

The future of building intelligent AI agents is here - Start building today on Friendli!

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you in here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, contact@friendli.ai or click Talk to an expert — our experts (not a bot) will reply within one business day.