File Context via Optical Character Recognition (OCR)
LibreChat’s OCR (Optical Character Recognition) feature enables AI agents to extract and process text from images and documents. This capability enhances the AI’s ability to work with visual content, making it possible to analyze, understand, and respond to information contained in images.
Overview
OCR functionality in LibreChat allows agents to:
- Extract text from images and documents
- Maintain document structure and formatting
- Process complex layouts including multi-column text
- Handle tables, equations, and other specialized content
- Work with multilingual content
Availability
Currently, OCR is only available as an agent capability. This means you must use an agent via the Agents endpoint to leverage OCR functionality.
Configuration
OCR can be enabled in the LibreChat configuration file (librechat.yaml). The OCR configuration supports two strategies:
- Mistral OCR (Default and currently the only available option)
- Custom OCR (Planned for future releases)
Basic Configuration Example
If using the Mistral OCR API, you only need the following environment variables to get started:
# `.env`
OCR_API_KEY=your-mistral-api-key
# OCR_BASEURL=https://api.mistral.ai/v1 # this is the default value
For additional, detailed configuration options, see the OCR Config Object Structure.
# `librechat.yaml`
ocr:
  mistralModel: "mistral-ocr-latest" # Optional: Specify Mistral model, defaults to "mistral-ocr-latest"
  apiKey: "your-mistral-api-key" # Optional: Defaults to OCR_API_KEY env variable
  baseURL: "https://api.mistral.ai/v1" # Optional: Defaults to OCR_BASEURL env variable, or Mistral's API if no variable set
  strategy: "mistral_ocr" # Optional: Defaults to "mistral_ocr" (only option currently available)
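The fallback order described in the comments above can be sketched in a few lines of Python. This is only an illustration of the documented precedence (explicit value in librechat.yaml, then environment variable, then built-in default); it is not LibreChat's actual implementation, and the function name `resolve_ocr_config` is hypothetical.

```python
import os

# Default taken from the configuration comments above.
MISTRAL_DEFAULT_BASE_URL = "https://api.mistral.ai/v1"


def resolve_ocr_config(ocr_section=None, env=None):
    """Apply the documented defaults to an `ocr` section parsed from librechat.yaml.

    `ocr_section` is a plain dict (hypothetical input shape); `env` defaults to
    the process environment and can be overridden for testing.
    """
    ocr_section = ocr_section or {}
    env = os.environ if env is None else env
    return {
        # Only "mistral_ocr" is currently available.
        "strategy": ocr_section.get("strategy") or "mistral_ocr",
        "mistralModel": ocr_section.get("mistralModel") or "mistral-ocr-latest",
        # YAML value first, then the OCR_API_KEY environment variable.
        "apiKey": ocr_section.get("apiKey") or env.get("OCR_API_KEY"),
        # YAML value, then OCR_BASEURL, then Mistral's public API.
        "baseURL": (
            ocr_section.get("baseURL")
            or env.get("OCR_BASEURL")
            or MISTRAL_DEFAULT_BASE_URL
        ),
    }
```

With an empty `ocr` section and only OCR_API_KEY set, the sketch resolves to the Mistral defaults, which matches the "environment variables only" setup described earlier.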
Mistral OCR
Currently, LibreChat uses Mistral’s OCR API as the default and only available OCR provider. Mistral OCR offers state-of-the-art document understanding capabilities.
Key Features of Mistral OCR
- Document Structure Preservation: Maintains formatting like headers, paragraphs, lists, and tables
- Multilingual Support: Processes text in multiple languages and scripts
- Complex Layout Handling: Handles multi-column text and mixed content
- Mathematical Expression Recognition: Accurately processes equations and formulas
- High-Speed Processing: Processes up to 2000 pages per minute
Important Considerations
- Cost: Using Mistral OCR may incur costs as it’s a paid API service (though free trials may be available)
- Data Privacy: Data processed through Mistral OCR is subject to Mistral’s cloud environment and their terms of service
- Document Limitations:
  - Maximum file size: 50 MB
  - Maximum document length: 1,000 pages
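The document limits listed above can be checked before a file is uploaded. The following is a minimal sketch, assuming "50 MB" means 50 × 1024 × 1024 bytes and that the page count is already known from some other tool; `check_ocr_limits` is a hypothetical helper, not part of LibreChat or the Mistral API.

```python
# Limits from the list above. The exact byte interpretation of "50 MB" is an assumption.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024
MAX_PAGES = 1000


def check_ocr_limits(size_bytes, page_count=None):
    """Return a list of human-readable problems; an empty list means the file is within limits."""
    problems = []
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, above the {MAX_FILE_SIZE_BYTES}-byte limit"
        )
    # The page count is optional because images have no page count to check.
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(
            f"document has {page_count} pages, above the {MAX_PAGES}-page limit"
        )
    return problems
```

For a file on disk, `os.path.getsize(path)` can supply `size_bytes`.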
Future Plans
- Mistral plans to make their OCR API available through their cloud partners, such as GCP and AWS, and enterprise self-hosting for organizations with stringent data privacy requirements (source).
- LibreChat will continue to support Mistral OCR and explore additional OCR providers, including open-source solutions, for enhanced functionality.
- LibreChat currently does not include the parsed image content from the OCR process in its responses, even though services like Mistral’s OCR API may provide these in the result. This feature may be supported in future updates.
Using File Context (OCR) in LibreChat
LibreChat provides two main ways to use OCR functionality:
1. Upload as Text in Chat
In any chat conversation, you can use OCR to extract text from images or documents:
- Click the attachment icon in the chat input
- Select “Upload as Text” from the menu
- Choose an image or document file
- The OCR system will process the file and insert the extracted text into your message
2. File Context for Agents
When working with agents, you can add documents as context using OCR:
- Open the Agent Builder panel or edit an existing agent
- In the File Context section, click “Upload File Context”
- Select a document or image file
- The OCR system will extract text from the file and add it to the agent’s instructions
Files uploaded as “Context” are processed using OCR to extract text, which is then added to the Agent’s instructions. This is ideal for documents, images with text, or PDFs where you need the full text content of a file to be available to the agent.
Note: OCR is performed at the time of upload. The result is not stored as a separate file; only the extracted text is saved in the database.
Example Use Cases
- Document Analysis: Extract and analyze text from scanned documents, PDFs, or images
- Data Extraction: Pull specific information from forms, receipts, or invoices
- Research Assistance: Process academic papers, articles, or books
- Language Translation: Extract text from foreign language documents for translation
- Content Digitization: Convert printed materials into digital, searchable text
Limitations
- OCR accuracy may vary depending on image quality, document complexity, and text clarity
- Some specialized formatting or unusual layouts might not be perfectly preserved
- Very large documents may be truncated due to token limitations of the underlying AI models
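To get a feel for whether extracted text is likely to run into the token limits mentioned above, a rough estimate can be made before uploading. The sketch below assumes about 4 characters per token, which is only a coarse rule of thumb; real tokenizers differ by model and language, and both function names are hypothetical.

```python
import math

# Rough assumption: about 4 characters per token. Actual ratios vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text):
    """Return a coarse token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(text, context_window_tokens, reserved_tokens=0):
    """Check whether the estimated tokens, plus a reserve for the conversation, fit the window."""
    return estimate_tokens(text) + reserved_tokens <= context_window_tokens
```

Setting `reserved_tokens` leaves room for the agent's other instructions and the conversation itself, since the extracted text shares the context window with them.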
Future Enhancements
LibreChat plans to expand OCR capabilities in future releases:
- Support for custom OCR providers
- A user_provided strategy option that will allow users to choose their preferred OCR service
- Integration with open-source OCR solutions
- Enhanced document processing options
- More granular control over OCR settings
For more information on configuring OCR, see the OCR Config Object Structure.