Send Video and Text to Gemini for Analysis

Gemini 1.5 Pro and Flash are natively multimodal: a single request can combine video, audio, images, and text. Gemini 1.5 Pro accepts video files up to 2 hours long. This example shows a request that references a video file by URI for meeting transcription and action item extraction, a task that text-only models cannot handle without a separate transcription step.

Video is tokenized at 263 tokens per second of content, so a 1-hour meeting comes to approximately 946,800 tokens (263 × 3,600). Against a 1M token context window, that leaves roughly 53,000 tokens for system instructions, the prompt, and output, so one hour is close to the practical limit at that window size.

Because the whole recording sits in one context, Gemini can follow temporal relationships across it: it understands that a speaker mentioned in the first minute is the same person referenced in the 45th minute, without chunking or sequential processing. The audio track is processed along with the frames, which covers speaker diarization (distinguishing between different speakers), transcription with timestamps, and paralinguistic cues such as tone and emphasis. For meeting analysis pipelines, this removes the need for a separate speech-to-text step before LLM processing.
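The token arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the 263 tokens-per-second rate comes from the text, the function names are hypothetical, and the context windows are treated as round 1,000,000 and 2,000,000 figures, which is an assumption about how the limits are counted.

```python
TOKENS_PER_SECOND = 263  # video tokenization rate quoted in the text


def video_tokens(duration_seconds: int) -> int:
    """Estimate the token count of a video of the given length."""
    return duration_seconds * TOKENS_PER_SECOND


def remaining_tokens(duration_seconds: int, context_window: int) -> int:
    """Tokens left for instructions, prompt, and output.

    A negative result means the video alone exceeds the window.
    """
    return context_window - video_tokens(duration_seconds)


ONE_HOUR = 60 * 60  # seconds

print(video_tokens(ONE_HOUR))                      # 946800
print(remaining_tokens(ONE_HOUR, 1_000_000))       # 53200
print(remaining_tokens(2 * ONE_HOUR, 2_000_000))   # 106400
```

Running a check like this before sending a request is a cheap way to catch recordings that will not fit, since the headroom for a full-length video is small compared with the video itself.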

Example
{
  "model": "gemini-1.5-pro",
  "systemInstruction": {
    "parts": [{"text": "You are an expert meeting analyst. Extract action items, decisions, and key discussion points from meeting recordings."}]
  },
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "mimeType": "video/mp4",
            "fileUri": "https://storage.googleapis.com/example-meetings/q4-planning.mp4"
          }
        },
        {
          "text": "Please: (1) List all action items with owner and deadline, (2) Summarize key decisions made, (3) Identify any unresolved issues that need follow-up"
        }
      ]
    }
  ]
}
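One way to send a payload shaped like the example is a plain HTTPS POST to the `generateContent` REST endpoint. The sketch below uses only the Python standard library and rests on a few assumptions: the `model` field is moved out of the body and into the URL path, the API key is passed in the `x-goog-api-key` header, and the helper names (`build_generate_request`, `send`) are hypothetical. The file URI is copied from the example unchanged; whether a given URI is accepted depends on how the file was made available to the API.

```python
import json
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_generate_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Turn an example-style payload into a POST request object."""
    body = dict(payload)        # shallow copy so the caller's dict is untouched
    model = body.pop("model")   # assumption: the model belongs in the URL path
    return urllib.request.Request(
        f"{API_ROOT}/models/{model}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Perform the network call and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = {
    "model": "gemini-1.5-pro",
    "contents": [{
        "role": "user",
        "parts": [
            {"fileData": {
                "mimeType": "video/mp4",
                "fileUri": "https://storage.googleapis.com/example-meetings/q4-planning.mp4",
            }},
            {"text": "List all action items with owner and deadline."},
        ],
    }],
}

request = build_generate_request(payload, api_key="YOUR_API_KEY")
print(request.full_url)
# result = send(request)  # uncomment with a real key to call the API
```

Building the request separately from sending it keeps the payload easy to inspect and log before any tokens are spent.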

FAQ

How long can video files be for Gemini?
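Gemini 1.5 Pro supports videos up to 2 hours in length. Videos are tokenized at 263 tokens per second, so a 2-hour video uses approximately 1.9 million tokens (263 × 7,200 = 1,893,600), which fits within the 2M token context window. With a 1M token window, the same rate caps a single video at roughly 1 hour.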
Does Gemini support speaker identification in videos?
Yes. Gemini 1.5 Pro can identify and track different speakers throughout a video, attribute quotes and statements to the correct speaker, and maintain speaker consistency across the full video duration.
How do I upload video files to use with the Gemini API?
Use the Gemini Files API to upload files of up to 2GB each. The uploaded file gets a URI that you reference in fileData.fileUri. Uploaded files are stored for 48 hours, so longer-running pipelines need to upload again after that. For short videos, you can instead include the video as base64-encoded inline data.
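For the inline option, the video bytes are base64-encoded and placed in an `inlineData` part in place of `fileData`. The following is a minimal sketch using only the standard library; the helper name and the 20 MB ceiling are assumptions for illustration, so check the current request size limits before relying on the number.

```python
import base64

# Assumed ceiling for inline payloads; not taken from the text above.
MAX_INLINE_BYTES = 20 * 1024 * 1024


def inline_video_part(
    video_bytes: bytes,
    mime_type: str = "video/mp4",
    max_encoded_bytes: int = MAX_INLINE_BYTES,
) -> dict:
    """Build a request part that carries the video as base64 inline data."""
    encoded = base64.b64encode(video_bytes).decode("ascii")
    # Base64 output is about a third larger than the input, so the
    # check is made on the encoded size that actually goes on the wire.
    if len(encoded) > max_encoded_bytes:
        raise ValueError("video too large for inline data; upload it instead")
    return {"inlineData": {"mimeType": mime_type, "data": encoded}}


# Placeholder bytes standing in for a short clip read from disk.
part = inline_video_part(b"\x00\x00\x00\x18ftypmp42")
print(part["inlineData"]["mimeType"])
```

The returned dictionary can be dropped into the `parts` list of a request where the `fileData` entry would otherwise go.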
