Text & analysis

Five text models that read an image or video and return text: captioning, OCR, classification, Q&A, and summarization. Unlike the models in every other mode, these LLMs analyze media instead of generating it.

Quick start

These models run through gen-ai describe (not generate) and print the answer to stdout:

```bash
# describe an image (Claude Sonnet is the default)
gen-ai describe -i photo.jpg

# ask a specific question
gen-ai describe -i receipt.jpg -p "extract the total and tax"

# summarize a video (auto-routes to a video-capable model)
gen-ai describe --video clip.mp4 -p "summarize what happens"
```

From an MCP client, the same models are reachable through picsart_generate with an imageUrls (or videoUrl) input.
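For readers who want to see what such a call might look like on the wire, here is a minimal sketch of an MCP `tools/call` request for picsart_generate. The `tools/call` envelope follows the MCP JSON-RPC convention; everything inside `arguments` is an assumption based on the parameter names on this page. In particular, the sketch assumes imageUrls is an array of strings, and the tool may require further arguments (such as a model selector) that this page does not document. `mcp_describe_request` is a hypothetical helper, not part of any Picsart tooling.

```bash
# Hypothetical helper: print an MCP tools/call request for picsart_generate.
# Usage: mcp_describe_request IMAGE_URL PROMPT
# Assumptions: imageUrls is a JSON array of strings, and neither argument
# contains characters that need JSON escaping (quotes, backslashes).
mcp_describe_request() {
  printf '{"jsonrpc":"2.0","id":1,"method":"tools/call",'
  printf '"params":{"name":"picsart_generate",'
  printf '"arguments":{"prompt":"%s","imageUrls":["%s"]}}}\n' "$2" "$1"
}
```

In practice an MCP client builds this request itself once the tool is invoked; the sketch only makes the input names concrete.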

Input types

| Type | Meaning | Models |
| --- | --- | --- |
| i2t | image → text | Claude Opus / Sonnet / Haiku, GPT-5.5 |
| v2t | video (or image) → text | Gemini 3 Pro |

Only Gemini 3 Pro accepts video, so gen-ai describe --video … auto-selects it unless you force another model with -m.
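Because the image and video inputs use different flags, a small wrapper can pick the right one from the file name. The following is a sketch under stated assumptions: `describe_flag_for` and `describe_any` are hypothetical helpers, the list of video extensions is illustrative only, and the `GEN_AI` variable is an invented override that lets you substitute `echo` for a dry run. Only the `describe`, `-i`, `--video`, and `-p` flags come from this page.

```bash
# Hypothetical helper (not part of gen-ai): choose the describe input flag
# from the file extension. The extension list is an assumption.
describe_flag_for() {
  case "$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')" in
    mp4|mov|webm) printf '%s\n' '--video' ;;  # treated as video
    *)            printf '%s\n' '-i' ;;       # everything else as an image
  esac
}

# Run gen-ai describe on FILE with an optional PROMPT.
# Usage: describe_any FILE [PROMPT]
# Set GEN_AI=echo to print the command arguments instead of running them.
describe_any() {
  if [ -n "${2:-}" ]; then
    "${GEN_AI:-gen-ai}" describe "$(describe_flag_for "$1")" "$1" -p "$2"
  else
    "${GEN_AI:-gen-ai}" describe "$(describe_flag_for "$1")" "$1"
  fi
}
```

With these definitions, `describe_any clip.mp4 "summarize what happens"` runs `gen-ai describe --video clip.mp4 -p "summarize what happens"`, which leaves model selection to the CLI's own auto-routing.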

Providers

| Provider | Models | Highlights |
| --- | --- | --- |
| Anthropic | Claude Opus 4.8, Sonnet 4.6, Haiku 4.5 | Opus for hard reasoning; Haiku for high-volume |
| OpenAI | GPT-5.5 | Strong general image understanding |
| Google | Gemini 3 Pro | The only model that reads video |

Common parameters

| Param | CLI flag | Notes |
| --- | --- | --- |
| prompt | -p | The question or instruction (optional; defaults to "describe this") |
| imageUrls | -i | Image(s) to analyze |
| videoUrl | --video | Video to analyze (Gemini 3 Pro only) |
| thinking | --thinking | Reasoning depth, where the model supports it |

Because these models return text, the CLI prints the result and skips the download and Drive save steps. Add --script for clean, pipeable output.
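Since the result arrives on stdout, it can be redirected like any other command output. The sketch below captions every .jpg in a directory and writes each result to a matching .txt file. `caption_to_file` and `caption_dir` are hypothetical helpers, and `GEN_AI` is an invented override for dry runs. The sketch assumes that `--script` leaves only the model's answer on stdout, as described above.

```bash
# Hypothetical helper: describe one image and save the text next to it
# (photo.jpg -> photo.txt). Expects a path with a file extension.
caption_to_file() {
  "${GEN_AI:-gen-ai}" describe -i "$1" --script > "${1%.*}.txt"
}

# Caption every .jpg in directory $1, skipping images that already have
# a non-empty caption file so the loop can be re-run after a failure.
caption_dir() {
  for img in "$1"/*.jpg; do
    [ -e "$img" ] || continue            # glob matched nothing
    [ -s "${img%.*}.txt" ] && continue   # caption already written
    caption_to_file "$img" || echo "failed: $img" >&2
  done
}
```

Calling `caption_dir ./photos` would then leave one text file per image. A failed call leaves an empty .txt file, which the next run treats as missing and retries.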

Built on @picsart/ai-sdk · gen-ai CLI · Picsart MCP · Skills