DocuMind

AI-Powered OCR & TTS

User Guide

Learn how to use DocuMind to transform your documents into text and audio

What is DocuMind?

DocuMind is an AI-powered document processing application that combines Optical Character Recognition (OCR) and Text-to-Speech (TTS) technology. It allows you to extract text from PDFs and images using Google's Gemini AI, and then convert that text into natural-sounding audio using Kokoro TTS.

How It Works

Upload Documents

Upload PDF files or images (PNG, JPG, WEBP) by dragging and dropping them into the upload zone or clicking to browse.
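As an illustration of the accepted formats, here is a minimal TypeScript sketch of how an upload selection could be filtered by MIME type. The function name partitionUploads and the UploadLike shape are hypothetical and are not taken from DocuMind's actual code; only the format list comes from this guide.

```typescript
// Accepted MIME types, derived from the formats this guide lists (PDF, PNG, JPG, WEBP).
const ACCEPTED_TYPES: readonly string[] = [
  "application/pdf",
  "image/png",
  "image/jpeg",
  "image/webp",
];

// Minimal shape of an uploaded file; a browser File object satisfies it.
interface UploadLike {
  name: string;
  type: string;
}

// Split a selection into files that match an accepted type and files that do not.
function partitionUploads<T extends UploadLike>(
  files: T[]
): { accepted: T[]; rejected: T[] } {
  const accepted: T[] = [];
  const rejected: T[] = [];
  for (const file of files) {
    if (ACCEPTED_TYPES.indexOf(file.type) !== -1) {
      accepted.push(file);
    } else {
      rejected.push(file);
    }
  }
  return { accepted, rejected };
}
```

With this sketch, a GIF dropped alongside a PDF would land in the rejected list while the PDF is accepted.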

OCR Processing

Click "Extract Text with OCR" to process your documents. The app converts PDFs to images and sends them to Google's Gemini AI in chunks of 6 pages. The AI extracts the text, corrects spelling errors, and formats the result for natural speech output.
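The 6-page chunking described above can be sketched in TypeScript as follows. This is an illustrative sketch only: the name chunkPages is hypothetical, and the app's real implementation may differ.

```typescript
// Chunk size stated in this guide.
const PAGES_PER_CHUNK = 6;

// Split an ordered list of page images into consecutive groups of at most `size` pages.
function chunkPages<T>(pages: T[], size: number = PAGES_PER_CHUNK): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let start = 0; start < pages.length; start += size) {
    // slice() clamps the end index, so the last chunk may be shorter.
    chunks.push(pages.slice(start, start + size));
  }
  return chunks;
}
```

A 14-page PDF would therefore be sent as three chunks of 6, 6, and 2 pages.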

Edit Text (Optional)

Review the extracted text and make any necessary edits. You can copy the text to your clipboard or modify it before converting to audio.

Generate Audio

Choose either the female (Sky) or the male (Michael) voice, then click "Generate Audio". The text is converted to natural-sounding speech by Kokoro TTS running on your device.
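The voice choice can be pictured as a small lookup that feeds the TTS engine. The TypeScript sketch below rests on assumptions: the identifiers af_sky and am_michael are assumed Kokoro voice names for Sky and Michael (this guide does not state them), and the Synthesizer interface is a hypothetical stand-in for whatever engine wrapper the app uses.

```typescript
// The two voice options named in this guide.
type VoiceChoice = "female" | "male";

// Assumed Kokoro voice identifiers; not confirmed by this guide.
const VOICE_IDS: Record<VoiceChoice, string> = {
  female: "af_sky",
  male: "am_michael",
};

// Hypothetical engine wrapper: takes text and a voice id, returns audio in some form.
interface Synthesizer<TAudio> {
  synthesize(text: string, voiceId: string): TAudio;
}

// Trim the text, refuse empty input, and pass the resolved voice id to the engine.
function generateAudio<TAudio>(
  text: string,
  choice: VoiceChoice,
  engine: Synthesizer<TAudio>
): TAudio {
  const cleaned = text.trim();
  if (cleaned.length === 0) {
    throw new Error("Nothing to read: the text is empty.");
  }
  return engine.synthesize(cleaned, VOICE_IDS[choice]);
}
```

Keeping the engine behind an interface means the lookup can be checked with a fake engine, without loading a real model.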

Listen & Download

Use the audio player to listen to your document. You can control playback, adjust volume, and download the audio file for offline use.

Key Features

• AI-powered OCR with Gemini 2.0 Flash
• Natural-sounding TTS with Kokoro
• Support for PDFs and images
• Editable extracted text
• Multiple voice options
• Downloadable audio files
• Light and dark mode support
• Client-side TTS processing

Important Notes

TTS Processing Time

Audio generation happens entirely on your device using WebAssembly (WASM). This ensures privacy but may take longer than server-side processing, especially for longer texts. The first generation may take extra time as the model loads.
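The slower first generation is typical of a load-once pattern: the model is fetched and initialized on first use, then reused. The TypeScript sketch below shows that pattern in general terms. The helper lazyOnce is hypothetical, and whether DocuMind caches its model this way is an assumption.

```typescript
// Wrap an expensive loader so it runs at most once; later calls reuse the first result.
// If `load` returns a Promise (as a model download would), every caller shares that Promise.
function lazyOnce<T>(load: () => T): () => T {
  let loaded = false;
  let value: T | undefined;
  return () => {
    if (!loaded) {
      value = load(); // the slow path, taken only on the first call
      loaded = true;
    }
    return value as T;
  };
}
```

For example, wrapping a model loader as `const getModel = lazyOnce(loadModel)` (where loadModel is a placeholder) makes the first `getModel()` call pay the loading cost, while later calls return immediately.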

Privacy & Security

OCR requires an internet connection because your documents are sent to Google's Gemini API for text extraction. TTS runs entirely in your browser, so the generated audio never leaves your device.

Document Limits

For optimal performance, the app processes up to 25 images at a time. Large PDFs are automatically split into chunks of 6 pages for processing.
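To see how the two limits interact, the TypeScript sketch below plans a long document as runs of at most 25 pages, each needing one request per 6-page chunk. The name planRuns is hypothetical, and splitting an oversized document into several runs is an assumption for illustration; this guide does not say how the app treats input beyond 25 images.

```typescript
const MAX_IMAGES_PER_RUN = 25; // limit stated in this guide
const PAGES_PER_REQUEST = 6;   // chunk size stated in this guide

interface RunPlan {
  pages: number;    // pages handled in this run
  requests: number; // 6-page chunks needed for this run
}

// Break a page count into runs of at most 25 pages and count the chunks per run.
function planRuns(totalPages: number): RunPlan[] {
  if (totalPages < 0 || Math.floor(totalPages) !== totalPages) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  const runs: RunPlan[] = [];
  for (let remaining = totalPages; remaining > 0; remaining -= MAX_IMAGES_PER_RUN) {
    const pages = Math.min(remaining, MAX_IMAGES_PER_RUN);
    runs.push({ pages, requests: Math.ceil(pages / PAGES_PER_REQUEST) });
  }
  return runs;
}
```

Under these assumptions, a 60-page document would be planned as three runs of 25, 25, and 10 pages, needing 5, 5, and 2 requests respectively.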

Tips for Best Results

• Use high-quality, clear images or PDFs for better OCR accuracy
• Review and edit the extracted text before generating audio
• For long documents, consider processing them in smaller batches
• Keep your browser tab active during TTS generation for best performance
• Use a modern browser (Chrome, Edge, Firefox) for optimal compatibility
