Text & analysis

5 text models that read an image or video and return text — captioning, OCR, classification, Q&A, and summarization. Unlike every other mode, these LLMs analyze media instead of generating it.

Quick start

They run through gen-ai describe (not generate), and print the answer to stdout:

bash

# describe an image (Claude Sonnet is the default)
gen-ai describe -i photo.jpg

# ask a specific question
gen-ai describe -i receipt.jpg -p "extract the total and tax"

# summarize a video (auto-routes to a video-capable model)
gen-ai describe --video clip.mp4 -p "summarize what happens"

From an MCP client, the same models are reachable through picsart_generate with an imageUrls (or videoUrl) input.

Input types

Type	Meaning	Models
`i2t`	image → text	Claude Opus / Sonnet / Haiku, GPT-5.5
`v2t`	video (or image) → text	Gemini 3 Pro

Only Gemini 3 Pro accepts video, so gen-ai describe --video … auto-selects it unless you force another model with -m.

Providers

Provider	Models	Highlights
Anthropic	Claude Opus 4.8, Sonnet 4.6, Haiku 4.5	Opus for hard reasoning; Haiku for high-volume
OpenAI	GPT-5.5	Strong general image understanding
Google	Gemini 3 Pro	The only model that reads video

Common parameters

Param	CLI flag	Notes
`prompt`	`-p`	The question or instruction (optional — defaults to "describe this")
`imageUrls`	`-i`	Image(s) to analyze
`videoUrl`	`--video`	Video to analyze (Gemini 3 Pro only)
`thinking`	`--thinking`	Reasoning depth, where the model supports it

These models return text, so the CLI prints the result and skips download / Drive save. Add --script for clean, pipeable output.

Text & analysis ​

Quick start ​

Input types ​

Providers ​

Common parameters ​

Text & analysis

Quick start

Input types

Providers

Common parameters