User Guide
Learn how to use DocuMind to transform your documents into text and audio
What is DocuMind?
DocuMind is an AI-powered document processing application that combines Optical Character Recognition (OCR) and Text-to-Speech (TTS) technology. It allows you to extract text from PDFs and images using Google's Gemini AI, and then convert that text into natural-sounding audio using Kokoro TTS.
How It Works
Upload Documents
Upload PDF files or images (PNG, JPG, WEBP) by dragging and dropping them into the upload zone or clicking to browse.
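A file-type check for the upload step could look roughly like the sketch below. It is illustrative only: the helper name `isSupportedUpload` is hypothetical, the extension list comes from the formats named above, and treating `jpeg` as an alias for JPG is an assumption rather than documented behaviour.

```typescript
// Extensions named in the upload step; "jpeg" is an assumed alias for JPG.
const SUPPORTED_EXTENSIONS: readonly string[] = ["pdf", "png", "jpg", "jpeg", "webp"];

// Hypothetical helper: true when the file name ends in a supported extension.
function isSupportedUpload(fileName: string): boolean {
  const dot = fileName.lastIndexOf(".");
  // Reject names with no extension, a leading dot only, or a trailing dot.
  if (dot <= 0 || dot === fileName.length - 1) {
    return false;
  }
  const extension = fileName.slice(dot + 1).toLowerCase();
  return SUPPORTED_EXTENSIONS.indexOf(extension) !== -1;
}
```

The comparison is case-insensitive, so `scan.PDF` is accepted while `notes.txt` is rejected.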
OCR Processing
Click "Extract Text with OCR" to process your documents. The app converts PDFs to images and sends them to Google's Gemini AI in chunks of 6 pages. The AI extracts the text, corrects spelling errors, and formats the result for natural speech output.
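To make the chunking concrete, here is a minimal sketch of splitting a list of page images into groups of 6. The function name `chunkPages` and the constant `PAGES_PER_CHUNK` are hypothetical. Only the chunk size of 6 comes from this guide, and the app's actual implementation may differ.

```typescript
// Chunk size stated in this guide.
const PAGES_PER_CHUNK = 6;

// Hypothetical helper: splits pages into consecutive groups of at most `size` items.
function chunkPages<T>(pages: T[], size: number = PAGES_PER_CHUNK): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let start = 0; start < pages.length; start += size) {
    // slice() stops at the end of the array, so the last chunk may be shorter.
    chunks.push(pages.slice(start, start + size));
  }
  return chunks;
}
```

With this sketch, a 14-page PDF yields three chunks of 6, 6, and 2 pages, each of which would be sent as a separate OCR request.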
Edit Text (Optional)
Review the extracted text and make any necessary edits. You can copy the text to your clipboard or modify it before converting to audio.
Generate Audio
Choose either the female (Sky) or male (Michael) voice, then click "Generate Audio". The text is converted to natural-sounding speech by Kokoro TTS running on your device.
Listen & Download
Use the audio player to listen to your document. You can control playback, adjust volume, and download the audio file for offline use.
Key Features
- AI-powered OCR with Gemini 2.0 Flash
- Natural-sounding TTS with Kokoro
- Support for PDFs and images
- Editable extracted text
- Multiple voice options
- Downloadable audio files
- Light and dark mode support
- Client-side TTS processing
Important Notes
TTS Processing Time
Audio generation runs entirely on your device using WebAssembly (WASM). This keeps the text-to-speech step private, but it may take longer than server-side processing, especially for long texts. The first generation may take extra time while the model loads.
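The slower first generation is typical of a load-once pattern: the model is loaded on first use and reused afterwards. The sketch below illustrates that general pattern only. `TtsModel`, `loadModel`, and `getModel` are hypothetical names, the loader is a placeholder rather than a real Kokoro call, and the guide does not confirm that the app is structured this way.

```typescript
// Hypothetical stand-in for whatever object a real TTS model loader returns.
interface TtsModel {
  label: string;
}

let loadCount = 0; // how many times the expensive load has actually run
let cachedModel: Promise<TtsModel> | null = null;

// Placeholder for the expensive step (fetching weights, initialising WASM).
function loadModel(): Promise<TtsModel> {
  loadCount += 1;
  return Promise.resolve({ label: "placeholder-model" });
}

// Returns the same promise on every call, so the load runs at most once.
function getModel(): Promise<TtsModel> {
  if (cachedModel === null) {
    cachedModel = loadModel();
  }
  return cachedModel;
}
```

Caching the promise rather than the resolved model means that two generations started in quick succession share a single load instead of starting two.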
Privacy & Security
OCR is performed by Google's Gemini API, so your documents are sent to Google and an internet connection is required. TTS runs entirely in your browser, so the generated audio never leaves your device.
Document Limits
For optimal performance, the app processes up to 25 images at a time. Large PDFs are automatically split into chunks of 6 pages for processing.
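As a rough worked example of how these two numbers interact, the sketch below reports whether a run fits the 25-image limit and how many 6-page requests it would take. The names `planRun`, `MAX_IMAGES_PER_RUN`, and `PAGES_PER_REQUEST` are hypothetical, and the guide does not say what the app does when the limit is exceeded, so the sketch only reports the result without rejecting or truncating anything.

```typescript
// Both limits are taken from this guide.
const MAX_IMAGES_PER_RUN = 25;
const PAGES_PER_REQUEST = 6;

interface RunPlan {
  withinLimit: boolean; // true when the run fits the 25-image limit
  requestCount: number; // number of 6-page chunks needed
}

// Hypothetical helper: summarises a run for a given number of page images.
function planRun(imageCount: number): RunPlan {
  if (imageCount < 0 || Math.floor(imageCount) !== imageCount) {
    throw new RangeError("imageCount must be a non-negative integer");
  }
  return {
    withinLimit: imageCount <= MAX_IMAGES_PER_RUN,
    requestCount: Math.ceil(imageCount / PAGES_PER_REQUEST),
  };
}
```

A full run of 25 images needs 5 requests: four chunks of 6 pages and one final chunk containing the remaining page.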
Tips for Best Results
- Use high-quality, clear images or PDFs for better OCR accuracy
- Review and edit the extracted text before generating audio
- For long documents, consider processing them in smaller batches
- Keep your browser tab active during TTS generation for best performance
- Use a modern browser (Chrome, Edge, Firefox) for optimal compatibility