- April 1, 2026
- 6 min read
Automating Industrial Inspection with Vision Language Models
- Manufacturers are adopting Vision-Language Models (VLMs) to bridge the gap between high-speed automation and flexible, human-like reasoning.
- VLMs provide superior flexibility and transparency at a lower development cost than traditional deep learning models; however, they typically come with higher inference latency and more demanding infrastructure scaling requirements.
- FriendliAI enables high-speed inference and elastic scaling, allowing manufacturers to meet strict production SLAs without compromising on model complexity.

Vision-Language Models (VLMs) are revolutionizing industrial inspection by bridging the gap between rigid, high-speed deep learning pipelines and flexible human reasoning.
For years, manufacturers have relied on a binary choice for industrial inspection: human oversight or Automatic Defect Classification (ADC) powered by standard deep learning models such as CNNs and Vision Transformers.
However, this conventional approach is changing as VLMs provide a more cost-effective and flexible alternative. The 2025 ROI of AI in Manufacturing Report reveals that 54% of organizations using AI agents now deploy them for quality control, and VLMs are at the center of this shift. While generally slower than traditional deep learning models at the edge, VLMs offer a level of adaptability and reasoning that was previously impossible to automate.
This blog post will examine Vision-Language Models, how you can incorporate them into your workflow, and how FriendliAI is positioned to assist in deploying and scaling these models for production environments.
Understanding VLMs’ Capabilities

VLMs have three characteristics that differentiate them from traditional models.
Cost-Effectiveness
Historically, deploying a vision model required a specialized team of machine learning engineers to design the architecture and train a bespoke model for your product lines. VLMs fundamentally change this dynamic. While the knowledge required to engineer effective prompts or fine-tune models is not trivial, the barrier to entry is significantly lower.
Beyond saving engineering time, VLMs drastically reduce data requirements. Traditional models require massive, hand-labeled datasets, but VLMs need only a fraction of that data. In many cases, they can even operate out of the box in a "zero-shot" capacity, accurately identifying cracks or reading visual input without any task-specific fine-tuning.
If you already have an onboarding guide for newcomers or a reference image, that is even better. That could be enough information for the model to understand your quality standards and start classifying defects on day one. Ultimately, you save on both engineering overhead and data acquisition costs.
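As a rough sketch, that "day one" setup might be a single multimodal message that pairs your written standard and a reference image with the part under test. The message layout below follows the common OpenAI-style multimodal chat format; the guide text and image URLs are placeholders, not Friendli-specific requirements:

```python
def few_shot_inspection_messages(guide_text, reference_image_url, part_image_url):
    """Pair a known-good reference image and a written standard
    with the part under test in one OpenAI-style chat request."""
    return [
        # The onboarding guide doubles as the system instruction.
        {"role": "system", "content": guide_text},
        {"role": "user", "content": [
            {"type": "text", "text": "Reference image of an acceptable part:"},
            {"type": "image_url", "image_url": {"url": reference_image_url}},
            {"type": "text",
             "text": "Does this part meet the standard? Answer PASS or FAIL with a reason."},
            {"type": "image_url", "image_url": {"url": part_image_url}},
        ]},
    ]
```

The reference image stays fixed across requests, which is exactly the pattern that benefits from image caching at the serving layer.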
Adaptiveness
Deep learning models are notoriously brittle. Because they recognize only the exact patterns they were explicitly trained on and emit a fixed set of output codes, they raise false alarms or miss novel defects entirely when faced with "out-of-distribution" data. Furthermore, should any of the guidelines or rules change, the previous model becomes obsolete, forcing you into a costly and time-consuming retraining cycle just to update the logic.
In contrast, Vision-Language Models (VLMs) leverage their background knowledge and reasoning capabilities to adapt on the fly. If a new client standard suddenly requires you to differentiate between a superficial scratch that passes and a deep gouge that gets routed to scrap, you don't need to gather new data and retrain an entire architecture. That new routing logic can be deployed immediately by updating a few lines of instruction in your prompt. By relying on natural language, VLMs seamlessly adjust to new rules and classify novel defects.
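As an illustration (the rules below are invented examples, not tested inspection prompts), the policy change amounts to swapping a few lines of instruction text rather than retraining a model:

```python
# Illustrative only: routing rules live in the prompt, so a policy
# change is a text edit, not a retraining cycle.
BASE_PROMPT = (
    "You are a visual quality inspector. Examine the image and answer "
    "with exactly one routing label.\n"
)

# Old rule: any surface mark fails the part.
RULES_V1 = "If you see any scratch or gouge, answer FAIL; otherwise answer PASS."

# New client standard: superficial scratches pass, deep gouges go to scrap.
RULES_V2 = (
    "If you see a superficial scratch, answer PASS. "
    "If you see a deep gouge, answer SCRAP. Otherwise answer PASS."
)

def make_prompt(rules: str) -> str:
    """Combine the stable base instructions with the current routing rules."""
    return BASE_PROMPT + rules
```

Versioning the rules string alongside your other configuration also gives you an audit trail of exactly which policy was in force for any given inspection.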
Explainability

Traditional vision models often operate as black boxes: they return a bare result without context, leaving engineers to guess why the model made a given decision. VLMs change this by providing natural language justifications for their decisions.
For example, given an image of a semiconductor wafer’s defect pattern, the VLM can, rather than just flagging the image as defective, identify the error pattern as a ‘center ring’ defect and even suggest a possible cause. This information can then be passed on to an engineer or even another agent to handle.
Furthermore, because the model's behavior is guided by text instructions, it is highly traceable; engineers can easily link misclassifications directly back to the exact prompt used, making debugging much faster.
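One way to make those justifications machine-consumable downstream is to ask the model for a fixed JSON schema and validate the reply before routing it to an engineer or another agent. The schema below is our own sketch, not a Friendli or model-specific feature:

```python
import json

# Hypothetical instruction asking the VLM for a structured verdict.
VERDICT_INSTRUCTIONS = (
    "Inspect the wafer map and reply with JSON only, using the keys "
    '"verdict" ("PASS" or "FAIL"), "pattern" (e.g. "center ring"), and '
    '"likely_cause" (one sentence).'
)

REQUIRED_KEYS = ("verdict", "pattern", "likely_cause")

def parse_verdict(model_reply: str) -> dict:
    """Parse the model's JSON reply and fail loudly if a key is missing."""
    verdict = json.loads(model_reply)
    missing = [key for key in REQUIRED_KEYS if key not in verdict]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return verdict
```

Validating the schema at the boundary keeps a malformed reply from silently propagating into downstream routing logic.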
Key Challenges in Implementing VLMs in Industry
While the advantages of VLMs are undeniable, there are real bottlenecks to deploying them in industry. Compared to traditional models, VLMs are far larger and more resource-intensive, pushing inference from edge devices into GPU clusters. This raises two critical challenges: latency and scalability.
- Latency: General-purpose VLMs introduce a latency gap compared to traditional methods. On the factory floor, even a few milliseconds of latency can slow down production lines and reduce overall operational efficiency.
- Scalability: As the agent scales from POC to multiple factories, the demand for reliable high-performance infrastructure grows exponentially; without an optimized serving layer, the increased volume of high-resolution data quickly leads to a collapse in total throughput.
Your specific workflow will dictate your exact latency and scalability requirements, but in either case, making VLMs viable for the factory floor requires an inference stack designed for the generative AI era.
How FriendliAI Bridges the Gap
Vision Specialized Inference
In many workflows, a reference image given in the prompt is analyzed many times over. FriendliAI uses image-caching vision encoders to store the visual tokens for such images, avoiding redundant reprocessing. This optimization, combined with our dedicated inference engine, significantly reduces Time to First Token (TTFT), allowing complex classification and reasoning within just a few hundred milliseconds: fast enough for the vast majority of high-value industrial tasks.
State of the Art Token Throughput
While image and input preprocessing is often the first bottleneck, the speed at which a model generates its response is just as critical for overall latency. FriendliAI’s engine is architected for maximum token throughput, ensuring that the "Explainability" phase of the VLM doesn't become a bottleneck. By accelerating the generation process, we ensure that the model provides its reasoning and corrective suggestions at a speed that keeps pace with your operational flow.
Flexible and Affordable Scaling
Our platform handles all inference management, allowing you to scale seamlessly from a single POC to multiple factories by adjusting replicas, concurrency, and GPU types to match production reality. This flexible infrastructure eliminates the VLM latency gap, ensuring that as your vision agents grow, your throughput remains consistent and real-time.
Deploy Vision-Language Models with Friendli Suite
Setting Up Your VLM Endpoint with FDE

FriendliAI simplifies deploying Vision-Language Models (VLMs), providing a quick and easy way to assess VLM performance on your specific use cases and scale them into a production-ready environment.
The following guide details the steps to deploy a model from the Qwen3-VL family using Friendli Suite. This demonstration focuses on configuring and setting up Qwen3-VL-30B-A3B-Instruct.
Deployment Steps:
- Sign In: Access Friendli Suite and log in to your account.
- Start New Endpoint: Navigate to Dedicated Endpoints in the left panel and click the New Endpoint button to initiate deployment.
- Select Model: Search for and select the Qwen/Qwen3-VL-30B-A3B-Instruct model from Hugging Face.
- Choose GPU: Select the appropriate GPU instance; a single H100 (1x H100) is recommended for this model.
- Create Endpoint: Select any additional configurations required, then click Create.
Once deployed, you can immediately test the endpoint and interact with the model in the playground.
You can also check out other VLMs supported on our platform here.
Calling Your VLM via the Friendli API
Now that the endpoint is ready, you can invoke the API to interact with the model. The following is an example of a simple multimodal request in Python.
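A minimal sketch of such a request, assuming the dedicated endpoint exposes an OpenAI-compatible chat completions API (the base URL, endpoint ID, image URL, and prompt below are placeholders, not verified values):

```python
import json
import os
import urllib.request

# Assumed base URL for Dedicated Endpoints; check your endpoint page
# for the actual value.
FRIENDLI_BASE = "https://api.friendli.ai/dedicated/v1"

def build_inspection_request(endpoint_id, image_url, instructions):
    """Build an OpenAI-style multimodal chat completion payload."""
    return {
        "model": endpoint_id,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instructions},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def classify_defect(endpoint_id, image_url):
    """Send the request and return the model's text reply."""
    payload = build_inspection_request(
        endpoint_id, image_url,
        "Classify the part as PASS or FAIL and explain your reasoning "
        "in one sentence.",
    )
    req = urllib.request.Request(
        f"{FRIENDLI_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the payload follows the OpenAI chat format, the official OpenAI Python client pointed at the same base URL should work equally well in place of `urllib`.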
For more information, please refer to our documentation on the multimodal capabilities of our product.
Get started with VLMs on Friendli Suite
As Vision-Language Models become more practical for real-world inspection workflows, the key challenge is how reliably and efficiently they can run in production. FriendliAI helps bridge that gap by making it easier to deploy, scale, and operate VLMs with the performance required for industrial environments.
👉 Learn more at FriendliAI
👉 Sign up for Friendli Suite and get started immediately
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 530,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you directly to our model deployment page for a one-click deploy. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.

