Speech to Text
Transcribe audio and video with timestamps, language detection, and optional diarisation.
POST /v1/ai/speech-to-text
Transcribe any audio or video source. Returns the full text plus per-chunk timestamps and, optionally, per-speaker segments.
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | conditional | URL of the audio or video file. One of url or file_store_key is required. |
| file_store_key | string | conditional | Key of a previously uploaded file. |
| language | string | no | ISO-639-1 language code, or "auto" to detect. Defaults to auto. |
| translate | boolean | no | Translate output into English (or language if provided). |
| by_speaker | boolean | no | Return per-speaker segments. |
| batch_size | number | no | Chunking size. Max 40. Defaults to 30. |
| chunk_duration | number | no | Chunk duration in seconds. Max 15. Defaults to 3. |
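
The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: the helper name `build_payload` is hypothetical, and it assumes that omitted optional fields fall back to the server-side defaults listed above. Only the documented limits are enforced; the table states no lower bounds, so none are checked.

```python
from typing import Any, Dict, Optional

MAX_BATCH_SIZE = 40       # documented maximum for batch_size
MAX_CHUNK_DURATION = 15   # documented maximum for chunk_duration, in seconds


def build_payload(
    url: Optional[str] = None,
    file_store_key: Optional[str] = None,
    language: Optional[str] = None,
    translate: Optional[bool] = None,
    by_speaker: Optional[bool] = None,
    batch_size: Optional[int] = None,
    chunk_duration: Optional[float] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments locally and build the JSON body."""
    # The table requires one of url or file_store_key.
    if not url and not file_store_key:
        raise ValueError("one of url or file_store_key is required")
    if batch_size is not None and batch_size > MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must not exceed {MAX_BATCH_SIZE}")
    if chunk_duration is not None and chunk_duration > MAX_CHUNK_DURATION:
        raise ValueError(f"chunk_duration must not exceed {MAX_CHUNK_DURATION}")

    candidates = {
        "url": url,
        "file_store_key": file_store_key,
        "language": language,
        "translate": translate,
        "by_speaker": by_speaker,
        "batch_size": batch_size,
        "chunk_duration": chunk_duration,
    }
    # Send only what the caller set, so the server applies its own defaults.
    return {name: value for name, value in candidates.items() if value is not None}
```

For example, `build_payload(url="https://marob.ai/samples/meeting.wav", language="en", by_speaker=True)` returns a dictionary containing just those three fields.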
Request
curl https://api.marob.ai/v1/ai/speech-to-text \
  -H "Authorization: Bearer $MAROB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://marob.ai/samples/meeting.wav",
    "language": "en",
    "by_speaker": true
  }'
Response
{
  "success": true,
  "_usage": {
    "input_tokens": 80,
    "output_tokens": 640,
    "inference_time_tokens": 4800,
    "total_tokens": 5520
  },
  "log_id": "log_01JABC...",
  "text": "Welcome to the meeting. Today we will cover…",
  "chunks": [
    { "timestamp": [0.0, 3.2], "text": "Welcome to the meeting." }
  ],
  "speakers": [
    {
      "speaker": "speaker_0",
      "timestamp": [0.0, 3.2],
      "text": "Welcome to the meeting."
    }
  ],
  "language_detected": "en",
  "confidence": 0.96
}
Async mode
Sending a webhook_url is not yet supported through Marob AI. All requests run
synchronously and return the transcription in the response body.
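
Because every request runs synchronously, a client can block on the POST and read the transcription from the response body. The sketch below uses only the Python standard library. The function names `transcribe` and `speaker_lines` and the 300-second timeout are illustrative assumptions, not documented values, and error responses are not handled because their shape is not described here.

```python
import json
import urllib.request
from typing import Any, Dict, List

ENDPOINT = "https://api.marob.ai/v1/ai/speech-to-text"


def transcribe(payload: Dict[str, Any], api_key: str, timeout: float = 300.0) -> Dict[str, Any]:
    """Send a blocking POST and return the decoded JSON response body."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # Long recordings may take a while; the timeout here is a guess, not a documented limit.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def speaker_lines(result: Dict[str, Any]) -> List[str]:
    """Format per-speaker segments as '[start-end] speaker: text' lines."""
    lines = []
    # "speakers" appears in the sample response for a by_speaker request; treat it as optional.
    for segment in result.get("speakers", []):
        start, end = segment["timestamp"]
        lines.append(f"[{start:.1f}-{end:.1f}] {segment['speaker']}: {segment['text']}")
    return lines
```

A caller would pass the same JSON body as the curl request, for example `transcribe({"url": "https://marob.ai/samples/meeting.wav", "language": "en", "by_speaker": True}, api_key)` with the key read from the `MAROB_API_KEY` environment variable, then hand the result to `speaker_lines`.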