Marob AI

Transcribe audio and video with timestamps, language detection, and optional diarisation.

POST /v1/ai/speech-to-text

Transcribe any audio or video source. Returns the full text plus per-chunk timestamps and, optionally, per-speaker segments.

Parameters

Name	Type	Required	Description
`url`	`string`	conditional	URL of the audio or video file. One of `url` or `file_store_key` is required.
`file_store_key`	`string`	conditional	Key of a previously uploaded file.
`language`	`string`	no	ISO 639-1 language code, or `"auto"` to detect the language. Defaults to `"auto"`.
`translate`	`boolean`	no	Translate the output into English, or into `language` if one is provided.
`by_speaker`	`boolean`	no	Return per-speaker segments.
`batch_size`	`number`	no	Chunking size. Max 40. Defaults to 30.
`chunk_duration`	`number`	no	Chunk duration in seconds. Max 15. Defaults to 3.
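
The constraints in the table can be checked on the client before a request is sent. The sketch below is one possible way to do that in Python; `build_payload` is a hypothetical helper, not part of any Marob SDK, and it assumes that optional parameters can simply be omitted from the JSON body when they are not set.

```python
from typing import Any, Optional

MAX_BATCH_SIZE = 40       # documented maximum for batch_size
MAX_CHUNK_DURATION = 15   # documented maximum for chunk_duration, in seconds


def build_payload(
    url: Optional[str] = None,
    file_store_key: Optional[str] = None,
    language: str = "auto",
    translate: Optional[bool] = None,
    by_speaker: Optional[bool] = None,
    batch_size: Optional[int] = None,
    chunk_duration: Optional[float] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate arguments against the parameter table."""
    # One of url or file_store_key is required.
    if not url and not file_store_key:
        raise ValueError("one of url or file_store_key is required")
    if batch_size is not None and batch_size > MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be at most {MAX_BATCH_SIZE}")
    if chunk_duration is not None and chunk_duration > MAX_CHUNK_DURATION:
        raise ValueError(f"chunk_duration must be at most {MAX_CHUNK_DURATION}")

    payload: dict[str, Any] = {"language": language}
    optional = {
        "url": url,
        "file_store_key": file_store_key,
        "translate": translate,
        "by_speaker": by_speaker,
        "batch_size": batch_size,
        "chunk_duration": chunk_duration,
    }
    # Assumption: unset optional parameters are left out so the API applies its defaults.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Failing early on an out-of-range `batch_size` or `chunk_duration` avoids a round trip for a request the table already rules out.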

Request

curl https://api.marob.ai/v1/ai/speech-to-text \
  -H "Authorization: Bearer $MAROB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://marob.ai/samples/meeting.wav",
    "language": "en",
    "by_speaker": true
  }'
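
For callers who prefer Python to curl, the same request can be sketched with the standard library alone. The endpoint URL and headers are taken from the curl example above; the function names and the 300-second timeout are assumptions chosen for illustration, not documented values.

```python
import json
import urllib.request

API_URL = "https://api.marob.ai/v1/ai/speech-to-text"


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build the POST request with the same headers as the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(payload: dict, api_key: str, timeout: float = 300.0) -> dict:
    """Send the request and decode the JSON response body.

    The timeout is an assumed value: transcription of long files may take
    a while, so a generous client timeout seems prudent.
    """
    request = build_request(payload, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `transcribe({"url": "https://marob.ai/samples/meeting.wav", "language": "en", "by_speaker": True}, os.environ["MAROB_API_KEY"])` would then mirror the curl request, with `os` imported by the caller.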

Response

{
  "success": true,
  "_usage": {
    "input_tokens": 80,
    "output_tokens": 640,
    "inference_time_tokens": 4800,
    "total_tokens": 5520
  },
  "log_id": "log_01JABC...",
  "text": "Welcome to the meeting. Today we will cover…",
  "chunks": [
    { "timestamp": [0.0, 3.2], "text": "Welcome to the meeting." }
  ],
  "speakers": [
    {
      "speaker": "speaker_0",
      "timestamp": [0.0, 3.2],
      "text": "Welcome to the meeting."
    }
  ],
  "language_detected": "en",
  "confidence": 0.96
}
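
A response of this shape can be turned into a readable transcript with a few lines of Python. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that `speakers` is missing or empty when `by_speaker` was not requested, in which case it falls back to `chunks`.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.s, rounded to the nearest tenth of a second."""
    tenths = round(seconds * 10)
    minutes, remainder = divmod(tenths, 600)
    return f"{minutes:02d}:{remainder / 10:04.1f}"


def transcript_lines(response: dict) -> list[str]:
    """Return one formatted line per segment.

    Per-speaker segments are preferred when present; otherwise the plain
    chunks are used (an assumption about responses without by_speaker).
    """
    segments = response.get("speakers") or response.get("chunks") or []
    lines = []
    for segment in segments:
        start, end = segment["timestamp"]
        speaker = segment.get("speaker")
        prefix = f"{speaker}: " if speaker else ""
        span = f"[{format_timestamp(start)}-{format_timestamp(end)}]"
        lines.append(f"{span} {prefix}{segment['text']}")
    return lines
```

Applied to the sample response above, this yields the single line `[00:00.0-00:03.2] speaker_0: Welcome to the meeting.`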

Async mode

Sending a `webhook_url` is not yet supported through Marob AI. All requests run synchronously and return the transcription in the response body.
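
Because each call blocks until the transcription is returned, a client with several files to process may want to run the calls in parallel on its own side. The following is a minimal sketch using a thread pool; `transcribe_many` and the `transcribe_one` callable are hypothetical, and the worker count is an arbitrary choice, since this page does not state any rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def transcribe_many(
    urls: Iterable[str],
    transcribe_one: Callable[[str], dict],
    max_workers: int = 4,
) -> dict[str, dict]:
    """Run one blocking transcription call per URL in a worker pool.

    transcribe_one is any caller-supplied function that takes a URL, performs
    the synchronous request, and returns the decoded response body.
    """
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        results = list(pool.map(transcribe_one, url_list))
    return dict(zip(url_list, results))
```

If any single call raises, the exception surfaces when the results are collected, so callers that need partial results would have to catch errors inside `transcribe_one`.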
