File Context via Optical Character Recognition (OCR)

LibreChat’s OCR (Optical Character Recognition) feature enables AI agents to extract and process text from images and documents. This capability enhances the AI’s ability to work with visual content, making it possible to analyze, understand, and respond to information contained in images.

Overview

OCR functionality in LibreChat allows agents to:

Extract text from images and documents
Maintain document structure and formatting
Process complex layouts including multi-column text
Handle tables, equations, and other specialized content
Work with multilingual content

Availability

Currently, OCR is only available as an agent capability. This means you must use an agent via the Agents endpoint to leverage OCR functionality.

OCR Strategies

LibreChat supports multiple OCR strategies to meet different deployment needs and requirements. Choose the strategy that best fits your infrastructure and compliance requirements.

1. Mistral OCR (Default)

The default strategy uses Mistral’s cloud API service directly. This is the simplest setup and requires only an API key from Mistral.

Environment Variables:

# `.env`
OCR_API_KEY=your-mistral-api-key
# OCR_BASEURL=https://api.mistral.ai/v1 # this is the default value

Configuration:

# `librechat.yaml`
ocr:
  mistralModel: "mistral-ocr-latest"       # Optional: Specify Mistral model, defaults to "mistral-ocr-latest"
  apiKey: "your-mistral-api-key"           # Optional: Defaults to OCR_API_KEY env variable
  baseURL: "https://api.mistral.ai/v1"     # Optional: Defaults to OCR_BASEURL env variable, or Mistral's API if no variable set
  strategy: "mistral_ocr"                  # Optional: Defaults to "mistral_ocr"

Key Features:

Document Structure Preservation: Maintains formatting like headers, paragraphs, lists, and tables
Multilingual Support: Processes text in multiple languages and scripts
Complex Layout Handling: Handles multi-column text and mixed content
Mathematical Expression Recognition: Accurately processes equations and formulas
High-Speed Processing: Processes up to 2000 pages per minute

Considerations:

Cost: Using Mistral OCR may incur costs as it’s a paid API service (though free trials may be available)
Data Privacy: Data processed through Mistral OCR is subject to Mistral’s cloud environment and their terms of service
Document Limitations:
- Maximum file size: 50 MB
- Maximum document length: 1,000 pages

2. Azure Mistral OCR

For organizations using Azure AI Foundry, you can deploy Mistral OCR models to your Azure infrastructure. Currently, the Mistral OCR 2503 model is available for Azure deployment.

Configuration:

# `librechat.yaml`
ocr:
  mistralModel: "deployed-mistral-ocr-2503"              # Should match your Azure deployment name
  apiKey: "${AZURE_MISTRAL_OCR_API_KEY}"                 # Reference to your Azure API key in .env
  baseURL: "https://your-deployed-endpoint.models.ai.azure.com/v1"  # Your Azure endpoint
  strategy: "azure_mistral_ocr"                          # Use Azure strategy

Azure Model Information: You can explore the latest Mistral OCR model available on Azure AI Foundry here (requires Azure subscription):

https://ai.azure.com/explore/models/mistral-ocr-2503

3. Google Vertex AI Mistral OCR

For organizations using Google Cloud Platform, you can deploy Mistral OCR models to your Google Cloud Vertex AI infrastructure.

Environment Variables:

# `.env`
# Option 1: File path
GOOGLE_SERVICE_KEY_FILE=/path/to/your/service-account-key.json
 
# Option 2: URL to fetch the key
GOOGLE_SERVICE_KEY_FILE=https://your-secure-server.com/service-account-key.json
 
# Option 3: Base64 encoded JSON
GOOGLE_SERVICE_KEY_FILE=eyJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsICJwcm9qZWN0X2lkIjogInlvdXItcHJvamVjdC1pZCIsIC4uLn0=
 
# Option 4: Raw JSON string
GOOGLE_SERVICE_KEY_FILE='{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "...",
  "client_id": "...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "..."
}'

Configuration:

# `librechat.yaml`
ocr:
  mistralModel: "mistral-ocr-2505"                        # Model name as deployed in Vertex AI
  strategy: "vertexai_mistral_ocr"                       # Use Google Vertex AI strategy

Setup Requirements:

Deploy a Mistral OCR model to Google Vertex AI (e.g., mistral-ocr-2505)
Create a service account with appropriate permissions to access the Vertex AI endpoint
Download the service account JSON key file
Set the GOOGLE_SERVICE_KEY_FILE environment variable using one of the supported methods

4. Custom OCR (Planned)

Support for custom OCR providers and user-defined strategies is planned for future releases.

Detailed Configuration

For additional, detailed configuration options, see the OCR Config Object Structure.

Using File Context (OCR) in LibreChat

LibreChat provides two main ways to use OCR functionality:

1. Upload as Text in Chat

In any chat conversation, you can use OCR to extract text from images or documents:

Click the attachment icon in the chat input
Select “Upload as Text” from the menu
Choose an image or document file
The OCR system will process the file and insert the extracted text into your message

Upload as Text option in the attachment menu

2. File Context for Agents

When working with agents, you can add documents as context using OCR:

Open the Agent Builder panel or edit an existing agent
In the File Context section, click “Upload File Context”
Select a document or image file
The OCR system will extract text from the file and add it to the agent’s instructions

File Context using OCR for agents

Files uploaded as “Context” are processed using OCR to extract text, which is then added to the Agent’s instructions. This is ideal for documents, images with text, or PDFs where you need the full text content of a file to be available to the agent.

Note, the OCR is performed at the time of upload and is not stored as a separate file, rather purely as text in the database.

Example Use Cases

Document Analysis: Extract and analyze text from scanned documents, PDFs, or images
Data Extraction: Pull specific information from forms, receipts, or invoices
Research Assistance: Process academic papers, articles, or books
Language Translation: Extract text from foreign language documents for translation
Content Digitization: Convert printed materials into digital, searchable text

Limitations

OCR accuracy may vary depending on image quality, document complexity, and text clarity
Some specialized formatting or unusual layouts might not be perfectly preserved
Very large documents may be truncated due to token limitations of the underlying AI models

Future Enhancements

LibreChat plans to expand OCR capabilities in future releases:

Support for custom OCR providers
A user_provided strategy option that will allow users to choose their preferred OCR service
Integration with open-source OCR solutions
Enhanced document processing options
More granular control over OCR settings
Mistral plans to make their OCR API available through their cloud partners, such as GCP and AWS, and enterprise self-hosting for organizations with stringent data privacy requirements (source)
LibreChat currently does not include the parsed image content from the OCR process in its responses, even though services like Mistral’s OCR API may provide these in the result. This feature may be supported in future updates.

For more information on configuring OCR, see the OCR Config Object Structure.

Automated Moderation Password Reset