README
Motivation
Automated descriptive captions enable the training and finetuning of diffusion models on a wider range of images, since trainers are no longer required to either find images that already have associated text or write the descriptions themselves. They also improve the quality of generations produced by text-to-image models trained on them (ref: DALL-E 3 paper). But to date, the community has been stuck with either ChatGPT, which is expensive and heavily censored, or alternative models like CogVLM, which are weaker than ChatGPT and have abysmal performance outside of the SFW domain.
I'm building JoyCaption to help fill this gap by performing near or on par with GPT-4o in captioning images, while being free, unrestricted, and open.
How to Get Started with the Model
Please see the GitHub repository for more details.
Example usage:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "1038lab/llama-joycaption-beta-one"


# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model into the first GPU
processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    # Load image
    image = Image.open(IMAGE_PATH)

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "You are a helpful image captioner.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    # WARNING: HF's handling of chat's on Llava models is very fragile. This specific combination of processor.apply_chat_template(), and processor() works
    # but if using other combinations always inspect the final input_ids to ensure they are correct. Often times you will end up with multiple <bos> tokens
    # if not careful, which can make the model perform poorly.
    convo_string = processor.apply_chat_template(convo, tokenize = False, add_generation_prompt = True)
    assert isinstance(convo_string, str)

    # Process the inputs
    inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
    inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

    # Generate the captions
    generate_ids = llava_model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        suppress_tokens=None,
        use_cache=True,
        temperature=0.6,
        top_k=None,
        top_p=0.9,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

    # Decode the caption
    caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)
```
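The warning in the example recommends inspecting the final input_ids for duplicated <bos> tokens. A minimal sketch of such a check might look like the following. The helper names `count_bos` and `assert_single_bos` are hypothetical and not part of JoyCaption or transformers, and the check assumes the chat template places the BOS token at the very start of the sequence.

```python
def count_bos(input_ids, bos_token_id):
    """Return how many times the BOS token id appears in a sequence of ids."""
    return sum(1 for token_id in input_ids if token_id == bos_token_id)


def assert_single_bos(input_ids, bos_token_id):
    """Raise ValueError unless the sequence starts with exactly one BOS token.

    Assumption: the prompt is expected to begin with BOS and contain no
    further BOS tokens anywhere else.
    """
    ids = list(input_ids)
    if not ids or ids[0] != bos_token_id:
        raise ValueError("sequence does not start with the BOS token")
    found = count_bos(ids, bos_token_id)
    if found != 1:
        raise ValueError(f"expected exactly one BOS token, found {found}")
```

In the example above this could be called right after the processor() step, for instance as assert_single_bos(inputs['input_ids'][0].tolist(), processor.tokenizer.bos_token_id), assuming the tokenizer defines bos_token_id.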
vLLM
vLLM provides the highest-performance inference for JoyCaption, and an OpenAI-compatible API so JoyCaption can be used like any other VLM. Example usage:
```shell
vllm serve 1038lab/llama-joycaption-beta-one --max-model-len 4096 --enable-prefix-caching
```
VLMs are a bit finicky on vLLM, and vLLM is memory hungry, so you may have to adjust settings for your particular environment, such as forcing eager mode, adjusting max-model-len, adjusting gpu_memory_utilization, etc.
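Once the server is running, captions can be requested through the OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and is one possible client, not an official one. It assumes the server listens on vLLM's default address (http://localhost:8000), and it reuses the system prompt and sampling settings from the transformers example. The function names `build_caption_request` and `caption_image` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: vLLM was started with the serve command above on its default port.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "1038lab/llama-joycaption-beta-one"


def build_caption_request(image_bytes, prompt, mime_type="image/jpeg"):
    """Build a chat completions payload with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": "You are a helpful image captioner."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    },
                ],
            },
        ],
        # Sampling settings mirror the transformers example.
        "max_tokens": 512,
        "temperature": 0.6,
        "top_p": 0.9,
    }


def caption_image(image_path, prompt, base_url=BASE_URL):
    """Send one image to the server and return the stripped caption text."""
    with open(image_path, "rb") as image_file:
        payload = build_caption_request(image_file.read(), prompt)
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"].strip()
```

Calling caption_image("image.jpg", "Write a long descriptive caption for this image in a formal tone.") would then return the caption as a string. Keeping the payload construction in its own function makes it easy to inspect the request without a running server.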
Model provider
fwwrsd
Model tree
Base
google/siglip2-so400m-patch14-384
Fine-tuned
this model
Modalities
Input
Text, Image
Output
Text