
  • January 17, 2023
  • 3 min read

Fine-tuning and Serving CodeGen, a Code Generation Model, with Friendli Inference


CodeGen, unveiled in 2022 by Salesforce, is a language model that allows users to create programs with natural language without extensive programming knowledge. CodeGen is an exciting tool as it enables humans and AI to program together, making programming easier and faster than ever before.

As an example of its capabilities, CodeGen can take a “Return n-th Fibonacci number” request and quickly generate the corresponding JavaScript code. This means developers can turn concepts into working code without writing it by hand, saving valuable time and resources. Let’s take a look at the example below.

Input to CodeGen:

```javascript
/* Return n-th Fibonacci number.
  >>> fib(10)
  55
  >>> fib(1)
  1
  >>> fib(8)
  21
  */
const fib = (n) => {
```

Output generated by CodeGen:

```javascript
  if (n < 2) {
    return n;
  }
  return fib(n - 1) + fib(n - 2);
};
```

CodeGen is multilingual, meaning the model can support popular programming languages such as Python, Java, JavaScript, C++, and Go. CodeGen picks the appropriate language from cues in the prompt, as in the example above; here, it generated JavaScript code by looking at the comment style and the function signature.
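As a concrete illustration of this prompting style, here is a minimal sketch of querying a published CodeGen checkpoint through the Hugging Face transformers library. This is not Friendli-specific; the checkpoint name and generation settings are assumptions, and loading the 16B model requires a large GPU, so the heavy call is kept behind a main guard.

```python
def build_prompt(description: str, signature: str) -> str:
    """Compose a CodeGen prompt: a doc comment plus a function signature.
    The comment style and signature also cue the target language
    (JavaScript in this case)."""
    return f"/* {description} */\n{signature}"


if __name__ == "__main__":
    # Assumes `transformers` and `torch` are installed and a GPU is available.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "Salesforce/codegen-16B-multi"  # one of the published CodeGen checkpoints
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    prompt = build_prompt("Return n-th Fibonacci number.", "const fib = (n) => {")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0]))
```

The comment plus the opening of the function signature is usually enough for the model to continue with a plausible function body in the cued language.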

Another example is test generation. CodeGen quickly generates tests, which is extremely helpful when verifying program correctness. With tests generated automatically, software developers can improve the quality of their code with much less effort. The following shows an example of test code generation.

Input to CodeGen:

```javascript
// sumToN is a function that sums numbers from 1 to n.
const sumToN = (n) => {
  return n * (n + 1) / 2;
}

// write a test code that asserts the correctness of sumToN()
const testSumToN = () => {
```

Output generated by CodeGen:

```javascript
  assert.equal(sumToN(1), 1);
  assert.equal(sumToN(2), 3);
  assert.equal(sumToN(3), 6);
  assert.equal(sumToN(4), 10);
  assert.equal(sumToN(5), 15);
  assert.equal(sumToN(6), 21);
  assert.equal(sumToN(7), 28);
  assert.equal(sumToN(8), 36);
  assert.equal(sumToN(9), 45);
  assert.equal(sumToN(10), 55);
}
```

Users of CodeGen can now ensure code quality with minimal time and resource commitment, allowing them more time to focus on the core logic of their program, resulting in faster delivery time, improved efficiency, and greater productivity.

With Friendli Inference, one can adapt CodeGen (i.e., perform fine-tuning or parameter-efficient training) and serve it effortlessly and quickly for their applications, reflecting FriendliAI’s commitment to making generative AI training and serving simple and cost-efficient. With Friendli Inference, our clients can run CodeGen inference much faster than with any other available solution.

We compared the serving performance of Friendli Inference (a.k.a. PeriFlow or Orca) against NVIDIA Triton + FasterTransformer in serving CodeGen with 16B parameters on an NVIDIA A100 80GB GPU. In the experiments, input token lengths range from 128 to 512, and output token lengths range from 32 to 256. The figure below shows throughput (req/s) and mean latency (ms).

Throughput and mean latency comparison on Friendli Inference
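For reference, the two metrics in the figure can be computed from the timestamped records of a load test. The following is a minimal sketch with hypothetical numbers, not the actual benchmark harness used here.

```python
def summarize(records):
    """records: list of (start_s, end_s) wall-clock timestamps, one per request.
    Returns (throughput in req/s, mean latency in ms)."""
    latencies_ms = [(end - start) * 1000.0 for start, end in records]
    # Throughput is completed requests divided by the total wall-clock window.
    wall_time_s = max(end for _, end in records) - min(start for start, _ in records)
    throughput = len(records) / wall_time_s
    mean_latency = sum(latencies_ms) / len(latencies_ms)
    return throughput, mean_latency


# Hypothetical load test: 4 requests completed over a 2-second window.
records = [(0.0, 0.5), (0.2, 0.9), (1.0, 1.6), (1.3, 2.0)]
throughput, mean_latency = summarize(records)
print(throughput, mean_latency)  # throughput 2.0 req/s, mean latency ≈ 625 ms
```

A real harness would additionally sweep the request arrival rate, since throughput and mean latency trade off against each other as load increases.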

Our findings reveal that Friendli Inference (Orca) outperforms Triton + FasterTransformer, achieving an astonishing 30X higher throughput at the same latency level thanks to its novel architecture. Note that the actual gain can vary depending on workloads and hardware. Together with our previous posts about Orca’s speedup for both large- and small-scale GPT-3 and T5 models, this demonstrates that Friendli Inference is an ideal choice for minimizing costs when serving CodeGen.
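A key idea behind Orca (described in the Orca paper at OSDI ’22) is iteration-level scheduling: instead of waiting for an entire batch to finish generating before admitting new requests, a finished request’s slot is refilled at the next token iteration. The toy simulation below illustrates why this helps when output lengths vary, as in the 32-to-256-token workload above; it counts abstract token steps and is not Friendli’s implementation.

```python
def static_batching_steps(lengths, batch_size):
    """Request-level batching: each fixed batch runs until its longest
    request finishes, so short requests wait on long ones."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: when a request finishes, its slot is
    refilled from the queue at the very next token iteration."""
    pending = list(lengths)
    slots = []  # remaining output tokens for each in-flight request
    steps = 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop())
        steps += 1  # one token iteration for every in-flight request
        slots = [r - 1 for r in slots if r > 1]
    return steps


# With mixed output lengths, iteration-level scheduling wastes far fewer slots:
lengths = [32, 256, 32, 256]
print(static_batching_steps(lengths, 2))      # 512 steps
print(continuous_batching_steps(lengths, 2))  # 288 steps
```

The gap widens as output lengths become more skewed, which is one reason the measured speedup depends on the workload.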

In summary, CodeGen opens up an exciting opportunity for humans and AI to program together. With Friendli Inference, users can quickly adapt CodeGen and serve the model much more efficiently. We are thrilled about the potential to open doors for any users who wish to leverage Friendli Inference to combine the best of human creativity with AI-driven coding capabilities!

For more information about FriendliAI, check the link. To learn more about Friendli Inference, check the link.


Written by

FriendliAI Tech & Research




