File Context via Optical Character Recognition (OCR)
LibreChat’s OCR (Optical Character Recognition) feature enables AI agents to extract and process text from images and documents. This capability enhances the AI’s ability to work with visual content, making it possible to analyze, understand, and respond to information contained in images.
Overview
OCR functionality in LibreChat allows agents to:
- Extract text from images and documents
- Maintain document structure and formatting
- Process complex layouts including multi-column text
- Handle tables, equations, and other specialized content
- Work with multilingual content
Availability
Currently, OCR is only available as an agent capability. This means you must use an agent via the Agents endpoint to leverage OCR functionality.
OCR Strategies
LibreChat supports multiple OCR strategies to meet different deployment needs and requirements. Choose the strategy that best fits your infrastructure and compliance requirements.
1. Mistral OCR (Default)
The default strategy uses Mistral’s cloud API service directly. This is the simplest setup and requires only an API key from Mistral.
Environment Variables:
# `.env`
OCR_API_KEY=your-mistral-api-key
# OCR_BASEURL=https://api.mistral.ai/v1 # this is the default value
Configuration:
# `librechat.yaml`
ocr:
mistralModel: "mistral-ocr-latest" # Optional: Specify Mistral model, defaults to "mistral-ocr-latest"
apiKey: "your-mistral-api-key" # Optional: Defaults to OCR_API_KEY env variable
baseURL: "https://api.mistral.ai/v1" # Optional: Defaults to OCR_BASEURL env variable, or Mistral's API if no variable set
strategy: "mistral_ocr" # Optional: Defaults to "mistral_ocr"
Key Features:
- Document Structure Preservation: Maintains formatting like headers, paragraphs, lists, and tables
- Multilingual Support: Processes text in multiple languages and scripts
- Complex Layout Handling: Handles multi-column text and mixed content
- Mathematical Expression Recognition: Accurately processes equations and formulas
- High-Speed Processing: Processes up to 2000 pages per minute
Considerations:
- Cost: Using Mistral OCR may incur costs as it’s a paid API service (though free trials may be available)
- Data Privacy: Data processed through Mistral OCR is subject to Mistral’s cloud environment and their terms of service
- Document Limitations:
- Maximum file size: 50 MB
- Maximum document length: 1,000 pages
2. Azure Mistral OCR
For organizations using Azure AI Foundry, you can deploy Mistral OCR models to your Azure infrastructure. Currently, the Mistral OCR 2503 model is available for Azure deployment.
Configuration:
# `librechat.yaml`
ocr:
mistralModel: "deployed-mistral-ocr-2503" # Should match your Azure deployment name
apiKey: "${AZURE_MISTRAL_OCR_API_KEY}" # Reference to your Azure API key in .env
baseURL: "https://your-deployed-endpoint.models.ai.azure.com/v1" # Your Azure endpoint
strategy: "azure_mistral_ocr" # Use Azure strategy
Benefits:
- Data Residency: Keep data processing within your Azure region
- Compliance: Meet organizational security and compliance requirements
- Integration: Leverage existing Azure infrastructure and security policies
- Control: Manage your own deployment and scaling
Azure Model Information: You can explore the latest Mistral OCR model available on Azure AI Foundry here (requires Azure subscription):
https://ai.azure.com/explore/models/mistral-ocr-2503
3. Custom OCR (Planned)
Support for custom OCR providers and user-defined strategies is planned for future releases.
Detailed Configuration
For additional, detailed configuration options, see the OCR Config Object Structure.
Using File Context (OCR) in LibreChat
LibreChat provides two main ways to use OCR functionality:
1. Upload as Text in Chat
In any chat conversation, you can use OCR to extract text from images or documents:
- Click the attachment icon in the chat input
- Select “Upload as Text” from the menu
- Choose an image or document file
- The OCR system will process the file and insert the extracted text into your message
2. File Context for Agents
When working with agents, you can add documents as context using OCR:
- Open the Agent Builder panel or edit an existing agent
- In the File Context section, click “Upload File Context”
- Select a document or image file
- The OCR system will extract text from the file and add it to the agent’s instructions
Files uploaded as “Context” are processed using OCR to extract text, which is then added to the Agent’s instructions. This is ideal for documents, images with text, or PDFs where you need the full text content of a file to be available to the agent.
Note, the OCR is performed at the time of upload and is not stored as a separate file, rather purely as text in the database.
Example Use Cases
- Document Analysis: Extract and analyze text from scanned documents, PDFs, or images
- Data Extraction: Pull specific information from forms, receipts, or invoices
- Research Assistance: Process academic papers, articles, or books
- Language Translation: Extract text from foreign language documents for translation
- Content Digitization: Convert printed materials into digital, searchable text
Limitations
- OCR accuracy may vary depending on image quality, document complexity, and text clarity
- Some specialized formatting or unusual layouts might not be perfectly preserved
- Very large documents may be truncated due to token limitations of the underlying AI models
Future Enhancements
LibreChat plans to expand OCR capabilities in future releases:
- Support for custom OCR providers
- A
user_provided
strategy option that will allow users to choose their preferred OCR service - Integration with open-source OCR solutions
- Enhanced document processing options
- More granular control over OCR settings
- Mistral plans to make their OCR API available through their cloud partners, such as GCP and AWS, and enterprise self-hosting for organizations with stringent data privacy requirements (source)
- LibreChat currently does not include the parsed image content from the OCR process in its responses, even though services like Mistral’s OCR API may provide these in the result. This feature may be supported in future updates.
For more information on configuring OCR, see the OCR Config Object Structure.