  • May 1, 2025
  • 7 min read

How LoRA Brings Ghibli-Style AI Art to Life


Did You Know?

Have you seen those dreamy, Ghibli-style AI images making waves online? With just a few reference pictures and some clever prompting, people are generating stunning, nostalgic visuals — but what’s the real magic happening behind the scenes?

Recently, Sam Altman of OpenAI joked that their GPUs are melting from all the creative heat. And he’s not exaggerating — scaling generative AI inference is a serious technical challenge, especially when it comes to delivering more personalized outputs, like matching a specific art style or aesthetic.

As large language models and diffusion models continue to grow, adapting them for specific downstream tasks becomes increasingly resource-intensive. Full fine-tuning isn’t just costly — for many, it’s simply out of reach.

That’s where Low-Rank Adaptation, or LoRA, comes in. LoRA is a lightweight technique that allows you to train a small adapter to guide a massive model — teaching it a new style, character, or task without retraining the entire model from scratch. It's efficient, fast, and surprisingly powerful.

That’s why LoRA is a game-changer. It redefines what efficient fine-tuning looks like — unlocking creativity, personalization, and innovation at scale.

To put its impact in perspective: Hugging Face now hosts over 200,000 adapters. That number alone speaks volumes about how vital adapter-based fine-tuning has become in the AI landscape.

The adapters hosted on Hugging Face span everything from art styles for diffusion models to domain-specific skills for LLMs.

These adapters demonstrate LoRA’s flexibility — not only can it transform art styles, but it can also shift a general-purpose LLM into a highly specialized expert in just a few training steps.

Why Use LoRA?

Fine-tuning large language models (LLMs) is expensive. It typically involves updating all of the model's parameters – leading to:

  • Huge GPU memory consumption
  • Long training times
  • Difficulty switching between tasks or domains

LoRA offers an elegant solution: freeze the original weights, and train only a small number of low-rank matrices – drastically cutting compute requirements and memory usage:

  • Lower fine-tuning costs
  • Minimal inference-time overhead
  • Significant storage savings when managing multiple fine-tuned models
  • Easier handling of multiple tasks through quick switching between LoRA adapters

What Is LoRA?

LoRA injects a pair of trainable low-rank matrices into existing weight matrices:

Figure 1: Representation of LoRA. Reference: https://arxiv.org/pdf/2106.09685.

h = W_0 x + \Delta W x = W_0 x + BAx
  • W_0: Frozen pre-trained weights
  • BA: Trainable low-rank update (rank r \ll d_{model})

These adapters are trained while keeping the base model untouched, capturing task-specific patterns with minimal parameter updates. This means the model’s core knowledge remains intact, while new skills are layered on top.
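To make the equation above concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. The class name, the initialization scale for A, and the alpha/r scaling are illustrative choices rather than a reference implementation, but the structure mirrors the formula: the frozen base projection plus a trainable BAx correction.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of h = W_0 x + (alpha / r) * B A x with a frozen base layer."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear                      # W_0, frozen
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base_linear.out_features, base_linear.in_features
        # B starts at zero so that Delta W = BA = 0 before training begins
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)
```

Only A and B receive gradients; wrapping an existing nn.Linear this way leaves the pre-trained weights untouched.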

The core idea behind LoRA is based on the observation that the updates to the weight matrices during fine-tuning often have a low-rank intrinsic dimension. In simpler terms, the changes needed to adapt a pre-trained model for a new task can be achieved with far fewer parameters.

To understand LoRA better, it helps to review a few fundamental principles of linear algebra.

Matrix Rank

The rank of a matrix refers to the maximum number of linearly independent rows or columns it contains. Think of rank as a measure of a matrix's information content. You can think of a low-rank matrix as a compressed version of a full matrix, capturing only the most important patterns.
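A quick NumPy illustration of this idea: a matrix built from two outer products contains 16 numbers but only rank-2 worth of information. The specific vectors are arbitrary examples.

```python
import numpy as np

# A 4x4 matrix assembled from two outer products: 16 entries, but only 2 independent directions
u1, v1 = np.array([1., 2., 3., 4.]), np.array([1., 0., 1., 0.])
u2, v2 = np.array([0., 1., 0., 1.]), np.array([2., 1., 2., 1.])
M = np.outer(u1, v1) + np.outer(u2, v2)

print(np.linalg.matrix_rank(M))  # prints 2: the information content is far smaller than the matrix size
```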

Singular Value Decomposition & Principal Component Analysis

Singular Value Decomposition (SVD) breaks down a matrix into its core components:

W = U \Sigma V^T

Where:

  • U and V: Orthogonal matrices representing directions in the input and output spaces
  • \Sigma: Diagonal matrix containing singular values that represent how much each direction contributes

SVD is powerful because it reveals the intrinsic dimensionality of a matrix — which is often much smaller than the full size of the matrix.
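As a small sanity check, NumPy's SVD recovers this factorization exactly; the matrix below is random and purely illustrative.

```python
import numpy as np

# Decompose a random matrix and verify that W = U Sigma V^T reconstructs it
W = np.random.randn(6, 4)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

print(U.shape, S.shape, Vt.shape)           # (6, 4) (4,) (4, 4)
print(np.allclose(W, U @ np.diag(S) @ Vt))  # True
print(S)                                    # singular values, sorted from largest to smallest
```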

A closely related concept is Principal Component Analysis (PCA), a technique that reduces dimensionality by finding the directions (principal components) in which the data varies the most and projecting the data onto them. This simplifies the data without losing its most informative aspects.

SVD and PCA show us that many high-dimensional transformations in neural nets can be approximated using fewer basis directions. LoRA applies this idea directly by assuming that weight updates live in a much smaller subspace.

Low-Rank Approximation

Formally, for a matrix A \in R^{m \times n}, a low-rank approximation seeks a matrix A' \in R^{m \times n} of rank at most r that minimizes the difference \|A - A'\| with respect to a chosen norm, such as the Frobenius norm. This approximation can be computed via truncated singular value decomposition (SVD).
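Under the assumption that most of a matrix's energy lies in a few directions, truncated SVD gives a good approximation at a tiny rank. The sizes and noise level below are arbitrary.

```python
import numpy as np

def low_rank_approx(A: np.ndarray, r: int) -> np.ndarray:
    """Best rank-r approximation of A in the Frobenius norm (truncated SVD)."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# A 512x512 matrix that is nearly rank 8, plus small noise
A = np.random.randn(512, 8) @ np.random.randn(8, 512) + 0.01 * np.random.randn(512, 512)
A8 = low_rank_approx(A, r=8)

rel_err = np.linalg.norm(A - A8) / np.linalg.norm(A)
print(f"relative Frobenius error at rank 8: {rel_err:.4f}")  # small, since most energy sits in 8 directions
```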

Research by Aghajanyan et al. (2020) found that pre-trained language models have a surprisingly low intrinsic dimension — meaning much of their capacity can be retained even when projected into smaller subspaces. LoRA extends this principle to weight updates. During fine-tuning, instead of modifying the full weight matrix W \in R^{d \times k}, LoRA learns two much smaller matrices:

\Delta W = BA, where B \in R^{d \times r}, A \in R^{r \times k}, r \ll \min(d, k)

In essence, LoRA treats weight updates like low-rank approximations, applying just enough change to steer the model toward the new task while keeping the majority of the original model intact and frozen. In most implementations, LoRA is applied to the query and value projection matrices in the self-attention layers — a choice supported by the original LoRA paper by Hu et al. (2021).

Since r is much smaller than the hidden size d or the output size k, LoRA dramatically reduces the number of trainable parameters and memory footprint.
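The savings are easy to quantify. For a single 4096x4096 projection matrix and rank r = 8 (hypothetical but typical sizes), the trainable parameter count drops by a factor of 256:

```python
# Trainable parameters for one weight matrix: full fine-tuning vs. LoRA
d, k, r = 4096, 4096, 8

full_update = d * k           # updating W directly: 16,777,216 parameters
lora_update = d * r + r * k   # B (d x r) plus A (r x k): 65,536 parameters

print(full_update // lora_update)  # 256x fewer trainable parameters for this matrix
```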

Use Cases

LoRA's versatility shines through its wide range of applications:

Image Generation

LoRA can adapt diffusion models (like Stable Diffusion) for personalized image generation. For example, you can train a LoRA adapter on a few images of a specific subject, then use it to generate new images of that subject in different contexts.

One of the most viral use cases of LoRA today? Teaching diffusion models to generate art in Ghibli, Pixar, or anime-inspired styles — with just a few example images.

By training a LoRA adapter on a handful of images (sometimes fewer than 10), you can inject entirely new aesthetics into models. This works because LoRA adapts only the parts of the model that matter for the task — in this case, the visual “language” of Ghibli-style linework, color palettes, and character design.
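As a rough sketch of how this looks in practice with the Hugging Face diffusers library: load a base pipeline, then attach a style adapter on top. The adapter repository name below is a placeholder, and the prompt is just an example.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base diffusion model and attach a LoRA style adapter on top of it
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("your-username/ghibli-style-lora")  # placeholder adapter repo

image = pipe("a quiet seaside town at dusk, ghibli style").images[0]
image.save("ghibli_town.png")
```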

Natural Language Processing (NLP)

LoRA excels at adapting LLMs for various NLP tasks, such as text classification, question answering, and text generation. It allows you to fine-tune a powerful LLM for a specific domain or application without the cost of full fine-tuning.

With just a few parameters, you can personalize powerful models for niche domains — fast and efficiently.
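A minimal sketch with the Hugging Face PEFT library shows how little actually gets trained. The base model name and hyperparameters are illustrative, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # illustrative base model

lora_config = LoraConfig(
    r=8,                                   # rank of the update matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # query and value projections, as in the original paper
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters are trainable
```

From here, the wrapped model can be passed to a standard training loop or the transformers Trainer; only the adapter weights are updated.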

Multi-LoRA

Why choose one adapter when you can use many?

Another exciting aspect of LoRA is the ability to support multiple adapters simultaneously. This multi-LoRA approach enables a single model to handle multiple tasks by loading different adapters as needed. For instance, a model could switch between adapters for sentiment analysis, question answering, and text summarization without the overhead of maintaining separate models for each task.

You can even combine adapters to capture multiple attributes – e.g., tone + domain + format – in a composable way, giving you more granular control over the model's output. This opens up possibilities for creating highly customized models that can adapt to a wide range of user preferences or task requirements.

  • Run multiple tasks from one base model
  • Mix-and-match adapters to get more nuanced behavior
  • Serve personalized experiences to users on demand
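Here is a sketch of what this looks like with PEFT, where one frozen base model hosts several named adapters; the adapter repository names below are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # illustrative base model

# Load several task adapters side by side on top of the same frozen weights
model = PeftModel.from_pretrained(base, "org/sentiment-lora", adapter_name="sentiment")
model.load_adapter("org/qa-lora", adapter_name="qa")
model.load_adapter("org/summarization-lora", adapter_name="summarization")

model.set_adapter("qa")             # route the next requests through the QA adapter
# ... run inference ...
model.set_adapter("summarization")  # switch tasks without reloading the base model
```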

LoRA in Action: Benefits in Production

In production environments, the ability to serve multiple adapters concurrently is essential.

LoRA's design allows for efficient management of adapter weights, enabling rapid switching between tasks with minimal latency. This flexibility is particularly valuable when models need to adapt to changing requirements on the fly.

Furthermore, LoRA facilitates the addition or removal of adapters without retraining the entire model. This modular approach simplifies the deployment of new tasks and ensures that resources are utilized efficiently:

  • Serve multiple tasks: Load appropriate adapters with minimal memory overhead
  • Switch tasks instantly: Replace weights without touching the frozen model
  • Add/remove tasks easily: No retraining—just plug in new adapters

Traditional approaches require duplicating the full model weights for every task, O(m \times n); LoRA avoids this inefficiency with a compact structure, O(m + n \times o), where m is the size of the base model, n is the number of tasks, and o is the size of each adapter. This leaner model structure makes LoRA inherently scalable and production-ready.
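A back-of-the-envelope comparison makes the difference tangible; the sizes below are made up but representative.

```python
# Storage for serving n tasks, in billions of parameters (illustrative numbers)
m, n, o = 8.0, 20, 0.02    # base model size, number of tasks, size of each adapter

full_finetunes = n * m     # one full copy per task: 160.0 B parameters
lora_adapters = m + n * o  # one shared base model plus 20 small adapters: 8.4 B parameters

print(f"{full_finetunes:.1f}B vs {lora_adapters:.1f}B parameters stored")
```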

With FriendliAI’s Dedicated Endpoints, you can simultaneously serve multiple LoRA adapters—a capability not offered by any other provider. This exclusive capability unlocks real-time multi-task inference with unparalleled efficiency and performance.

Conclusion

Low-Rank Adaptation (LoRA) adapters represent a significant advancement in the field of AI model adaptation. By introducing a lightweight, efficient mechanism for fine-tuning large models, LoRA makes it feasible to deploy versatile, task-specific models at scale. Whether in research or production, LoRA offers a compelling solution for adapting pre-trained models to meet diverse and evolving needs. As the field of AI continues to evolve, LoRA is poised to play an increasingly important role in unlocking the full potential of pre-trained models.

What’s Next?

Stay tuned for the next blog post, where we’ll cover how to use LoRA adapters on Friendli Dedicated Endpoints.


Written by


FriendliAI Tech & Research

