
Speech to Text

Transcribe audio and video with timestamps, language detection, and optional diarisation.

POST /v1/ai/speech-to-text

Transcribe any audio or video source. Returns the full text plus per-chunk timestamps and, optionally, per-speaker segments.

Parameters

Name            Type     Required     Description
url             string   conditional  URL of the audio or video file. One of url or file_store_key is required.
file_store_key  string   conditional  Key of a previously uploaded file.
language        string   no           ISO-639-1 language code, or "auto" to detect. Defaults to auto.
translate       boolean  no           Translate output into English (or language if provided).
by_speaker      boolean  no           Return per-speaker segments.
batch_size      number   no           Chunking size. Max 40. Defaults to 30.
chunk_duration  number   no           Chunk duration in seconds. Max 15. Defaults to 3.
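
The constraints in the parameter list can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official Marob SDK: `build_payload` and the two constants are hypothetical names, and the helper assumes that omitting an optional field lets the server apply its documented default.

```python
from typing import Any, Dict, Optional

MAX_BATCH_SIZE = 40      # documented maximum for batch_size
MAX_CHUNK_DURATION = 15  # documented maximum for chunk_duration, in seconds


def build_payload(
    url: Optional[str] = None,
    file_store_key: Optional[str] = None,
    language: Optional[str] = None,
    translate: Optional[bool] = None,
    by_speaker: Optional[bool] = None,
    batch_size: Optional[int] = None,
    chunk_duration: Optional[float] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and return a JSON-ready dict."""
    # One of url or file_store_key is required.
    if not url and not file_store_key:
        raise ValueError("one of url or file_store_key is required")
    if batch_size is not None and batch_size > MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must not exceed {MAX_BATCH_SIZE}")
    if chunk_duration is not None and chunk_duration > MAX_CHUNK_DURATION:
        raise ValueError(f"chunk_duration must not exceed {MAX_CHUNK_DURATION}")

    candidates = {
        "url": url,
        "file_store_key": file_store_key,
        "language": language,
        "translate": translate,
        "by_speaker": by_speaker,
        "batch_size": batch_size,
        "chunk_duration": chunk_duration,
    }
    # Assumption: fields left as None are omitted so server defaults apply.
    return {name: value for name, value in candidates.items() if value is not None}
```

Calling `build_payload(url="https://marob.ai/samples/meeting.wav", language="en", by_speaker=True)` yields the same JSON body as the request example on this page.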

Request

curl https://api.marob.ai/v1/ai/speech-to-text \
  -H "Authorization: Bearer $MAROB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://marob.ai/samples/meeting.wav",
    "language": "en",
    "by_speaker": true
  }'
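
The same request can be issued from Python using only the standard library. This is a sketch under stated assumptions: `build_request` and `transcribe` are hypothetical helper names, and the 300-second timeout is an arbitrary illustrative value rather than a documented limit.

```python
import json
import urllib.request
from typing import Any, Dict

API_URL = "https://api.marob.ai/v1/ai/speech-to-text"


def build_request(payload: Dict[str, Any], api_key: str) -> urllib.request.Request:
    """Build the POST request with the same headers as the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(payload: Dict[str, Any], api_key: str, timeout: float = 300.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response body.

    The timeout default is an assumption; long recordings may need more.
    """
    request = build_request(payload, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would look like `transcribe({"url": "https://marob.ai/samples/meeting.wav", "language": "en", "by_speaker": True}, os.environ["MAROB_API_KEY"])`, with `os` imported by the caller.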

Response

{
  "success": true,
  "_usage": {
    "input_tokens": 80,
    "output_tokens": 640,
    "inference_time_tokens": 4800,
    "total_tokens": 5520
  },
  "log_id": "log_01JABC...",
  "text": "Welcome to the meeting. Today we will cover…",
  "chunks": [
    { "timestamp": [0.0, 3.2], "text": "Welcome to the meeting." }
  ],
  "speakers": [
    {
      "speaker": "speaker_0",
      "timestamp": [0.0, 3.2],
      "text": "Welcome to the meeting."
    }
  ],
  "language_detected": "en",
  "confidence": 0.96
}
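
A response of this shape can be turned into a readable transcript. The sketch below assumes only the fields shown in the sample response; `format_timestamp` and `transcript_lines` are hypothetical helpers, and the fallback to `chunks` when `speakers` is absent is an assumption about requests made without by_speaker.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as MM:SS.t with one decimal place."""
    total_tenths = round(seconds * 10)
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes:02d}:{tenths // 10:02d}.{tenths % 10}"


def transcript_lines(response: Dict[str, Any]) -> List[str]:
    """Return one formatted line per segment, preferring per-speaker segments."""
    # Assumption: "speakers" is missing or empty unless by_speaker was set.
    segments = response.get("speakers") or response.get("chunks", [])
    lines = []
    for segment in segments:
        start, end = segment["timestamp"]
        prefix = f"{segment['speaker']}: " if "speaker" in segment else ""
        lines.append(
            f"[{format_timestamp(start)} - {format_timestamp(end)}] "
            f"{prefix}{segment['text']}"
        )
    return lines
```

For the sample response above, `transcript_lines` returns the single line `[00:00.0 - 00:03.2] speaker_0: Welcome to the meeting.`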

Async mode

Sending a webhook_url is not yet supported through Marob AI. All requests run synchronously and return the transcription in the response body.
