
Voxtral FAQ

Common questions about Voxtral implementation, troubleshooting, and best practices.

What are the main differences between Voxtral 3B and 24B models?

Voxtral 3B is optimized for local and edge deployment with lower resource requirements (8GB GPU memory) while maintaining good accuracy. Voxtral 24B offers maximum accuracy and capabilities for production environments but requires more resources (48GB GPU memory). Choose 3B for local development and 24B for production systems requiring the highest quality results.

Which programming languages are supported for Voxtral integration?

Voxtral primarily supports Python through the vLLM framework and the mistral-common library. You can also interact with Voxtral through REST API endpoints from any programming language that can make HTTP requests. Community-maintained SDKs and wrappers for other languages such as JavaScript, Go, and Java may also be available.
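
As a concrete example, here is a minimal sketch of calling a locally served Voxtral model over its REST API from Python. It assumes vLLM is serving an OpenAI-compatible endpoint on localhost:8000 and that your vLLM version accepts OpenAI-style input_audio content parts; the exact audio field name can differ between versions, so check the documentation of the release you deploy. The model identifier shown is the Hugging Face name for the 3B model; substitute the checkpoint you actually serve.

    import base64
    import requests

    # Read a local recording and base64-encode it for the request body.
    with open("meeting.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": "mistralai/Voxtral-Mini-3B-2507",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": "Transcribe this recording."},
            ],
        }],
    }

    response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    print(response.json()["choices"][0]["message"]["content"])

The same request can be made from JavaScript, Go, Java, or any other language with an HTTP client.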

How much audio can Voxtral process in a single request?

Voxtral can handle roughly 30 minutes of audio for transcription and around 40 minutes for understanding tasks in a single request while maintaining context and accuracy. For longer recordings, split the audio into smaller segments or use a streaming approach to maintain optimal performance.
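
One way to split a long recording is sketched below using the pydub library (which requires ffmpeg). The 20-minute window is an illustrative choice rather than a Voxtral limit; each exported segment can then be transcribed independently.

    from pydub import AudioSegment

    SEGMENT_MS = 20 * 60 * 1000  # 20 minutes per segment, in milliseconds

    audio = AudioSegment.from_file("long_interview.mp3")
    for i, start in enumerate(range(0, len(audio), SEGMENT_MS)):
        segment = audio[start:start + SEGMENT_MS]
        segment.export(f"segment_{i:03d}.wav", format="wav")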

What audio formats does Voxtral support?

Voxtral supports common audio formats including WAV (uncompressed), MP3 (compressed), FLAC (lossless), M4A, and other standard formats. For best results, use uncompressed formats like WAV or FLAC, especially for critical applications.
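
If your source material arrives in a compressed format, one option is to convert it to uncompressed WAV before sending it to Voxtral, for example with pydub:

    from pydub import AudioSegment

    # Convert an M4A recording to uncompressed WAV for transcription.
    AudioSegment.from_file("call_recording.m4a").export("call_recording.wav", format="wav")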

Can Voxtral process real-time audio streams?

Yes, Voxtral supports streaming audio processing through the vLLM framework. This enables real-time transcription and analysis for live audio feeds, though latency will depend on your hardware configuration and model choice.
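
The sketch below shows only the output side of streaming: reading tokens as they are generated from a local OpenAI-compatible vLLM server with the openai client. A full live-audio pipeline additionally needs chunked capture of the microphone or feed on the client side, which is not shown here, and the prompt is just a placeholder.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="mistralai/Voxtral-Mini-3B-2507",
        messages=[{"role": "user",
                   "content": "Summarize the key points from the meeting."}],
        stream=True,  # yield tokens as they are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)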

How accurate is Voxtral compared to other speech recognition systems?

Voxtral achieves over 95% transcription accuracy in optimal conditions and demonstrates competitive performance against both open-source and proprietary speech recognition systems. Accuracy varies based on audio quality, background noise, speaker accents, and language.

What languages does Voxtral support?

Voxtral supports multiple languages including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian with automatic language detection. The model can also handle code-switching scenarios where speakers alternate between languages.

How do I implement function calling with Voxtral?

Function calling allows Voxtral to convert natural language voice commands into structured function calls. Configure your functions in the API request and Voxtral will automatically detect when voice input should trigger a function call rather than just transcription.
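
A hedged sketch of this pattern is shown below, using an OpenAI-style tools definition against a local vLLM deployment with tool calling enabled; the same caveat about the exact audio field applies as in the earlier example. The create_calendar_event function and its parameters are hypothetical and only for illustration, not part of Voxtral itself.

    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    with open("voice_command.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    tools = [{
        "type": "function",
        "function": {
            "name": "create_calendar_event",  # hypothetical function for illustration
            "description": "Create a calendar event from a spoken request.",
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "start_time": {"type": "string", "description": "ISO 8601 start time"},
                },
                "required": ["title", "start_time"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="mistralai/Voxtral-Mini-3B-2507",
        messages=[{
            "role": "user",
            "content": [{"type": "input_audio",
                         "input_audio": {"data": audio_b64, "format": "wav"}}],
        }],
        tools=tools,
    )
    # If the spoken request matches a tool, the model returns a structured call.
    print(response.choices[0].message.tool_calls)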

What are the minimum hardware requirements for running Voxtral?

For Voxtral 3B: 8GB of GPU memory and 16GB of system RAM. For Voxtral 24B: 48GB of GPU memory and 64GB of system RAM. CUDA-compatible GPUs are recommended for optimal performance. You can run the smaller 3B model on CPU, but expect significantly slower processing times.

Can I fine-tune Voxtral for my specific use case?

Voxtral is based on Mistral's architecture and follows open-source principles. While the base models are pre-trained, you can explore fine-tuning approaches using the model checkpoints and training frameworks compatible with the Mistral ecosystem.

How do I handle multiple speakers in audio files?

Voxtral can process multi-speaker audio and maintain context across speakers. For best results with speaker identification, consider pre-processing the audio to separate speakers or using Voxtral's analysis capabilities to identify different speakers in the transcription output.

What's the difference between transcription and speech understanding?

Transcription converts speech to text, while speech understanding includes analysis, question answering, summarization, and context comprehension. Voxtral provides both capabilities, allowing you to not just transcribe audio but also analyze and extract insights from the content.
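
In practice the difference is mainly the instruction you send alongside the same audio, as in this sketch (assuming the same local OpenAI-compatible deployment as the earlier examples):

    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    with open("meeting.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompts = [
        "Transcribe this recording verbatim.",              # transcription
        "What action items were agreed on, and by whom?",   # understanding
    ]
    for prompt in prompts:
        response = client.chat.completions.create(
            model="mistralai/Voxtral-Mini-3B-2507",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}},
                    {"type": "text", "text": prompt},
                ],
            }],
        )
        print(prompt, "->", response.choices[0].message.content)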

How do I optimize Voxtral performance for production use?

Use GPU acceleration, implement proper load balancing, configure appropriate batch sizes, enable tensor parallelism for large models, and optimize your infrastructure for the expected load. Monitor GPU utilization and adjust resources accordingly.
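
As a starting point, the main knobs live in the vLLM engine arguments. The values below are illustrative rather than recommendations for every deployment, and depending on your vLLM version you may also need Mistral-specific config and load-format flags.

    from vllm import LLM

    llm = LLM(
        model="mistralai/Voxtral-Small-24B-2507",
        tokenizer_mode="mistral",      # use the Mistral tokenizer for Voxtral checkpoints
        tensor_parallel_size=2,        # shard the 24B model across two GPUs
        gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim
        max_model_len=32768,           # cap context to what your workload needs
    )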

Is Voxtral suitable for sensitive or confidential audio content?

Voxtral can be deployed locally, ensuring that sensitive audio never leaves your infrastructure. This makes it suitable for confidential content processing, unlike cloud-based services where data is processed on external servers.

How do I troubleshoot 'out of memory' errors?

Reduce the batch size, shard the model across multiple GPUs with tensor parallelism, lower the maximum context length or GPU memory utilization, or switch to the smaller 3B model if your use case allows. Monitor GPU memory usage and adjust configuration parameters accordingly.
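
For example, the following vLLM engine settings trade throughput and context length for a smaller memory footprint; the values are illustrative starting points, not tuned recommendations.

    from vllm import LLM

    llm = LLM(
        model="mistralai/Voxtral-Mini-3B-2507",  # falling back to the 3B model
        tokenizer_mode="mistral",
        gpu_memory_utilization=0.80,  # leave headroom for audio preprocessing
        max_model_len=16384,          # shorter context means a smaller KV cache
        max_num_seqs=4,               # fewer concurrent sequences per batch
    )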

Can Voxtral be integrated with existing business applications?

Yes, Voxtral provides REST API endpoints that can be integrated with any business application. Common integrations include CRM systems, meeting platforms, content management systems, and customer support tools.

Still have questions?

If you couldn't find the answer you're looking for, check out our other resources: