[OpenWebUI Enhancement] Improved: add image recognition to models that don't support it

Some AI models, whether third-party reverse-engineered endpoints or models that simply lack vision support, can't handle images, yet sometimes you still want to send them some. Browsing the forum, I found an existing implementation for OpenWebUI by xiniah:

OCR the images uploaded in OpenWebUI before handing them to the model

However, it has a few small limitations:

  1. Only one image can be uploaded, and only in the first turn of a conversation
  2. OCR runs only on the first turn; in later turns, uploaded images are stripped out

Yet uploading multiple images, or carrying on a multi-turn conversation about an image, are both things you occasionally need. I wondered whether the code could be improved, and ended up implementing the following:

  1. Images can be uploaded in any turn of the conversation
  2. Multiple images can be uploaded at once
  3. Follow-up turns can keep asking questions about previously uploaded images
  4. If you are unhappy with the current turn's recognition result, you can re-run recognition
  5. Images are not stripped in later turns, but they are not re-submitted for recognition either.
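Point 5 is what the Redis cache is for: each image's data URL is hashed, and the OCR text is stored under that hash, so a previously seen image is swapped for its cached text instead of being re-submitted. A minimal sketch of the idea, with a plain dict standing in for Redis and a `recognize()` stub in place of the real OCR request (both hypothetical):

```python
import hashlib

cache: dict[str, str] = {}  # stands in for Redis

def recognize(image_url: str) -> str:
    """Stub for the real OCR request (hypothetical)."""
    return f"OCR text for {image_url[:22]}"

def ocr_with_cache(image_url: str) -> str:
    # Key the cache on a hash of the image's data URL
    key = hashlib.md5(image_url.encode()).hexdigest()
    if key not in cache:
        cache[key] = recognize(image_url)  # only the first pass hits the API
    return cache[key]

url = "data:image/png;base64,iVBORw0KGgo"
first = ocr_with_cache(url)   # runs "recognition"
second = ocr_with_cache(url)  # served from the cache, no second request
```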

If you want to add image recognition to a model that doesn't support it, or you're unhappy with your current model's vision capability and want to borrow another model's, this method is for you.

Tutorial

Preparation

**Prerequisite:** the features above depend on a Redis cache, so install Redis in your OpenWebUI environment. If, like me, you deploy on Hugging Face by duplicating @Coker's Space, all you need to change is the Dockerfile:

FROM ghcr.io/open-webui/open-webui:main

# Install Redis
RUN apt-get update && apt-get install -y redis-server

# Fix the Redis directories and permissions
RUN mkdir -p /var/run/redis && \
   chown -R 1000:1000 /var/run/redis && \
   chown -R 1000:1000 /var/lib/redis && \
   chmod 777 /var/run/redis

# Create a script that starts Redis
RUN echo "#!/bin/bash" > redis-start.sh && \
   echo "redis-server --daemonize yes --save '' --appendonly no" >> redis-start.sh && \
   echo "sleep 2" >> redis-start.sh && \
   echo "echo 'Redis status:'" >> redis-start.sh && \
   echo "redis-cli ping" >> redis-start.sh

COPY sync_data.sh sync_data.sh

RUN chmod -R 777 ./data && \
   sed -i "1r sync_data.sh" ./start.sh && \
   sed -i "1r redis-start.sh" ./start.sh

Then wait for the rebuild and restart. If everything went well, you will see this near the top of the run log:

Redis status:
PONG

Redis is now installed and running.
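If you run OpenWebUI with Docker Compose instead of on Hugging Face, an alternative to baking Redis into the image is a separate Redis service. A sketch (service names, ports, and image tags here are assumptions, adjust to your setup):

```yaml
# docker-compose.yml (sketch, not a tested deployment)
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    depends_on:
      - redis
  redis:
    image: redis:7-alpine
    command: redis-server --save '' --appendonly no
```

Note that the filter function below connects to `host="localhost"`; with a layout like this, that line would need to point at the Redis service's hostname (here `redis`) instead.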

Adding the OpenWebUI function

  • Go to Admin Settings -> Functions -> add the OpenWebUI function below

    """
    title: Image Recognition Filter for OpenWebUI
    author: D3bu9r (based on @xiniah's work)
    version: 1.0
    license: MIT
    requirements: 
        - pydantic>=2.0.0
        - aiohttp>=3.0.0
    environment_variables:
        - OCR_API_KEY (required)
    """
    
    from typing import (
        Callable,
        Awaitable,
        Any,
        Optional,
        Dict,
        Tuple,
        List,
    )
    import aiohttp
    from pydantic import BaseModel, Field
    import base64
    from enum import Enum
    from datetime import datetime
    import redis
    import hashlib
    import json
    
    
    class APIType(Enum):
        OPENAI = "openai"
        GOOGLE = "google"
    
    
    class Filter:
        API_VERSION = "2024-03"
        REQUEST_TIMEOUT = (3.05, 60)
        SUPPORTED_IMAGE_TYPES = ["image/jpeg", "image/png", "image/gif", "image/webp"]
        MAX_IMAGE_SIZE = 5 * 1024 * 1024
        MAX_RETRIES = 3
        RETRY_DELAY = 1.0
    
        class Valves(BaseModel):
            API_TYPE: str = Field(
                default="openai",
                description="API type for image recognition (openai or google).",
            )
    
            BASE_URL: str = Field(
                default="https://api.openai.com/v1",
                description="Base URL for the API endpoint.",
            )
            OCR_API_KEY: str = Field(default="", description="API key for the API.")
            CACHE_EXPIRE_HOURS: int = Field(
                default=24,
                description="Cache expiration time in hours.",
            )
            ocr_prompt: str = Field(
                default="""1. If text or code is present in the image:
       - Start with "[OCR_IMAGE_TO_TEXT]"
       - Extract all text maintaining exact formatting and hierarchy
       - Use markdown syntax for formatting (e.g., headers, lists, tables, code blocks, link, Quotes, Sections, etc...)
       - Mathematical formulas use LaTeX:
         - Use `$` symbols to wrap formulas
         - Inline formulas: $formula$ 
         - Display formulas: $$formula$$ 
         - Leave spaces before and after the `$` wrapped formulas
         - Ensure LaTeX syntax is correct, formulas match the content in original images, and can be rendered properly
       - Use **bold**, *italics* etc. to emphasize content
       - Preserve line breaks and text layout where meaningful
       - Include text location indicators if relevant (e.g., "header:", "footer:", "Sidebar", etc...)
       - For plain text images, there is no need to interpret or comment on the text content, unless there are specific circumstances that require explanation.
    2. If no text, code, or other additional context:
       - Start with "[IMAGE_TO_ANALYSIS]"
       - Provide a concise description of key visual elements
       - Note any relevant structural or layout information
       - Mention any important visual hierarchy or emphasis
       - Use bullet points to note color and info for distinct elements
       - Report image quality issues
    3. For mixed content (both text and significant visual elements):
       - Start with "[MIX_IMAGE_ANALYSIS]"
       - Include both sections with their respective headers
       - Maintain clear separation between text content and visual description
    
    Please ensure the output is well-structured, easy to read, and maintains professional formatting.""",
                description="Prompt used for image recognition.",
            )
            model_name: str = Field(
                default="gpt-4o-mini",
                description="Model name used for OCR on images.",
            )
    
        def __init__(self):
            self.valves = self.Valves()
            self.request_id = None
            self.emitter = None
            # Initialize the Redis connection
            self.redis_client = redis.Redis(host="localhost", port=6379, db=0)
            # Convert the cache TTL from hours to seconds
            self.CACHE_EXPIRE_TIME = self.valves.CACHE_EXPIRE_HOURS * 60 * 60
    
        async def emit_status(
            self, message: str = "", done: bool = False, error: bool = False
        ):
            if self.emitter:
                await self.emitter(
                    {
                        "type": "status",
                        "data": {
                            "description": message,
                            "done": done,
                            "error": error,
                        },
                    }
                )
    
        def _get_image_hash(self, image_url: str) -> str:
            """Generate a cache key from the image content (the data URL string)."""
            return hashlib.md5(image_url.encode()).hexdigest()

        def _del_image_cache(self, message):
            """Delete the cached OCR results for the images in a message."""
            if isinstance(message["content"], list):
                for content in message["content"]:
                    if content["type"] == "image_url":
                        image_url = content["image_url"]["url"]
                        image_hash = self._get_image_hash(image_url)
                        # delete() is a no-op when the key does not exist
                        self.redis_client.delete(image_hash)
    
        def _find_process_image_messages(
            self, messages: List[Dict]
        ) -> Tuple[bool, List[Dict]]:
            """
            查找所有包含图片的消息
            返回: (是否处理过缓存图片, 需要OCR处理的消息列表)
            """
            i = 0
            ocr_messages = []
            is_process = False
            # 检查每张图片是否有缓存,如果有缓存则进行内容替换,没有则将图片放入一个新的message交给模型处理
            for message in messages:
                if message["role"] == "user" and isinstance(message.get("content"), list):
                    has_uncached_image = False
                    # 先检查这条消息是否需要OCR处理
                    for content in message["content"]:
                        if content["type"] == "image_url":
                            url = content["image_url"]["url"]
                            image_hash = self._get_image_hash(url)
                            cached_result = self.redis_client.get(image_hash)
    
                            if cached_result:
                                content["type"] = "text"
                                content["text"] = f"[{++i}]{cached_result.decode()}"
                                is_process = True
                            else:
                                has_uncached_image = True
    
                    # 只有当消息中有未缓存的图片时,才添加到ocr_messages
                    if has_uncached_image:
                        ocr_messages.append(message)
    
            return is_process, ocr_messages
    
        def _prepare_request(self, message) -> tuple[dict, dict, dict]:
            # Extend the prompt so the model returns structured, per-image output
            structured_prompt = (
                """
    Please analyze each image and provide responses in this exact format:
    
    ---START_IMAGE_{n}---
    {content}
    ---END_IMAGE_{n}---
    
    Where:
    - {n} is the sequential number of the image (1, 2, etc.)
    - {content} follows the original analysis guidelines
    
    Original guidelines:
    """
                + self.valves.ocr_prompt
            )
            api_key = self.valves.OCR_API_KEY.strip()
            params = None
            if self.valves.API_TYPE == APIType.OPENAI.value:
                headers = {
                    "Content-Type": "application/json",
                    "Authorization": f"Bearer {api_key}",
                }
            elif self.valves.API_TYPE == APIType.GOOGLE.value:
                headers = {
                    "Content-Type": "application/json",
                    "Accept": "*/*",
                    "Host": "generativelanguage.googleapis.com",
                }
                params = {"key": api_key}
    
            # Normalize: accept either a full message dict or a bare content list
            if isinstance(message, dict) and "content" in message:
                message_content = message["content"]
            else:
                message_content = message
    
            # Keep only the image items
            newmessage = [
                item
                for item in message_content
                if isinstance(item, dict) and item.get("type") == "image_url"
            ]
    
            # OpenAI request format
            if self.valves.API_TYPE == APIType.OPENAI.value:
                body = {
                    "model": self.valves.model_name,
                    "messages": [
                        {
                            "role": "system",
                            "content": [{"type": "text", "text": structured_prompt}],
                        },
                        {"role": "user", "content": newmessage},
                    ],
                    "temperature": 0.0,
                }
                return headers, body, None
            elif self.valves.API_TYPE == APIType.GOOGLE.value:
                body = {
                    "generationConfig": {
                        "temperature": 0.0,
                        "topP": 0.9,
                        "topK": 50,
                        "maxOutputTokens": 8192,
                        "stopSequences": [],
                    },
                    "system_instruction": {
                        "parts": [{"text": structured_prompt}],
                    },
                    "safetySettings": [
                        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "OFF"},
                        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "OFF"},
                        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "OFF"},
                        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "OFF"},
                    ],
                    "contents": [
                        {
                            "role": "user",
                            "parts": [
                                {"text": ""},
                                *[
                                    {
                                        "inlineData": {
                                            "data": item["image_url"]["url"].split(",")[
                                                1
                                            ],  # extract the base64 payload
                                            "mimeType": item["image_url"]["url"]
                                            .split(";")[0]
                                            .split(":")[1],  # extract the MIME type
                                        }
                                    }
                                    for item in newmessage
                                ],
                            ],
                        }
                    ],
                }
                return headers, body, params
    
        async def _perform_ocr(self, message) -> str:
            """Internal method for performing OCR recognition."""
    
            headers, body, params = self._prepare_request(message)
    
            if self.valves.API_TYPE == APIType.OPENAI.value:
                url = f"{self.valves.BASE_URL}/chat/completions"
            elif self.valves.API_TYPE == APIType.GOOGLE.value:
                url = f"{self.valves.BASE_URL}/models/{self.valves.model_name}:generateContent"
    
            async with aiohttp.ClientSession() as session:
                try:
                    async with session.post(
                        url, json=body, headers=headers, params=params, timeout=60
                    ) as response:
                        response_data = await response.json()
                        response.raise_for_status()
                        # Extract the raw result
                        if self.valves.API_TYPE == APIType.OPENAI.value:
                            result = response_data["choices"][0]["message"]["content"]
                        elif self.valves.API_TYPE == APIType.GOOGLE.value:
                            result = response_data["candidates"][0]["content"]["parts"][0][
                                "text"
                            ]
                        else:
                            raise ValueError(f"No result found for {self.valves.API_TYPE}")
    
                    return result
    
                except (aiohttp.ClientResponseError, aiohttp.ClientError) as e:
                    if isinstance(e, aiohttp.ClientResponseError):
                        error_message = f"Upstream image recognition request failed, status: {e.status}, message: {e.message}"
                    else:
                        error_message = f"Image recognition request failed: {e}"

                    await self.emit_status(message=error_message, done=True, error=True)
                    return ""
                except json.JSONDecodeError:
                    await self.emit_status(
                        message="The API returned an invalid JSON response", done=True, error=True
                    )
                    return ""
                except Exception as e:
                    await self.emit_status(message=f"Unknown error: {e}", done=True, error=True)
                    return ""
    
        def _convert_message_format(self, messages: list) -> list:
            """Convert complex message format to simple format."""
            converted_messages = []
            for message in messages:
                if isinstance(message.get("content"), list):
                    # Combine all text content
                    combined_text = []
                    for content in message["content"]:
                        if content["type"] == "text":
                            combined_text.append(content["text"])
                        # If we have already converted image to text
                        elif "text" in content:
                            combined_text.append(content["text"])
    
                    # Join all text with spaces
                    converted_messages.append(
                        {
                            "role": message["role"],
                            "content": " ".join(combined_text).strip(),
                        }
                    )
                else:
                    # If it's already in simple format, keep it as is
                    converted_messages.append(message)
    
            return converted_messages
    
        def _parse_ocr_result(self, result: str) -> List[Tuple[int, str]]:
            """Parse the OCR result into a list of (image index, content) tuples."""
            parsed_results = []
            import re

            # Match each image block; tolerate the model keeping the literal {n} braces
            pattern = r"---START_IMAGE_\{?(\d+)\}?---\n(.*?)\n---END_IMAGE_\{?\1\}?---"
            matches = re.finditer(pattern, result, re.DOTALL)
    
            for match in matches:
                index = int(match.group(1))
                content = match.group(2).strip()
                parsed_results.append((index, content))
    
            return parsed_results
    
        def _check_image_size(self, image_url: str) -> bool:
            if image_url.startswith("data:image"):
                # Decode the base64 payload to check its size
                try:
                    header, data = image_url.split(",", 1)
                    binary_data = base64.b64decode(data)
                    return len(binary_data) <= self.MAX_IMAGE_SIZE
                except Exception:
                    return False
            return True  # Size of URL-based images is not checked for now
    
        async def inlet(
            self,
            body: dict,
            __event_emitter__: Callable[[Any], Awaitable[None]],
            __user__: Optional[dict] = None,
            __model__: Optional[dict] = None,
        ) -> dict:
            self.emitter = __event_emitter__
            # Ensure the API key is set
            if not self.valves.OCR_API_KEY or not self.valves.OCR_API_KEY.strip():
                await self.emit_status(
                    message="❌ Error: OCR_API_KEY is not set", done=True, error=True
                )
                return body

            # Validate the API type
            if self.valves.API_TYPE not in [APIType.OPENAI.value, APIType.GOOGLE.value]:
                await self.emit_status(
                    message="❌ Error: API_TYPE is unset or invalid (openai/google)",
                    done=True,
                    error=True,
                )
                return body

            # Verify that Redis is reachable (ping() raises on connection failure)
            try:
                self.redis_client.ping()
            except redis.exceptions.RedisError:
                await self.emit_status(
                    message="❌ Error: Redis connection failed; check that Redis is running",
                    done=True,
                    error=True,
                )
                return body
    
            messages = body.get("messages", [])
            # Check whether image recognition should be re-run for the last message
            is_roll = body.get("roll", False)
            description = ""
            if is_roll:
                self._del_image_cache(messages[-1])
                del body["roll"]
                description = "🧹 Cache cleared, re-running image analysis, please wait..."
            else:
                description = "🔍 Starting image analysis, please wait..."

            # Find all messages that still need OCR
            instructions = "[INSTRUCTION]\nThe user has sent images, which have been converted to text information through AI. Labels [1], [2]... indicate the sequence number of the images."
            is_process, ocr_messages = self._find_process_image_messages(messages)
            if not ocr_messages:
                if is_process:
                    body["messages"] = self._convert_message_format(messages)
                    # Prepend an explanatory system message for the model
                    body["messages"].insert(0, {"role": "system", "content": instructions})
                return body
    
            start_time = datetime.now()
            await self.emit_status(message=description, done=False)
    
            # Hand the images that need OCR to the recognition model
            for message in ocr_messages:
                # for content in message["content"]:
                #     if content["type"] == "image_url":
                #         if not self._check_image_size(content["image_url"]["url"]):
                #             await self.emit_status(message="❌ Error: image too large", done=True, error=True)
                #             return body
                result = await self._perform_ocr(message)
                # Parse the result and cache each image's block separately
                if not result:
                    return body
    
                parsed_results = self._parse_ocr_result(result)
    
                # Walk the image URLs in the message and substitute the OCR text
                i = 0
                for content in message["content"]:
                    if content["type"] == "image_url":
                        if i >= len(parsed_results):
                            break  # the model returned fewer blocks than images
                        url = content["image_url"]["url"]
                        image_hash = self._get_image_hash(url)
                        # Cache the result
                        self.redis_client.setex(
                            image_hash,
                            self.CACHE_EXPIRE_TIME,
                            parsed_results[i][1],  # content
                        )
                        # Update the message content in place
                        content["type"] = "text"
                        content["text"] = f"[{i+1}]{parsed_results[i][1]}"
                        i += 1
            # Compute the processing time
            end_time = datetime.now()
            process_time = (end_time - start_time).total_seconds()
            # Send a status update
            await self.emit_status(
                message=f"🎉 Recognition finished in {process_time:.1f}s, handing over to the model...",
                done=True,
            )

            # Convert the messages to the simple text format
            body["messages"] = self._convert_message_format(messages)
            # Prepend an explanatory system message for the model
            body["messages"].insert(0, {"role": "system", "content": instructions})
            return body
            return body
    
        async def outlet(
            self,
            body: dict,
            __event_emitter__: Callable[[Any], Awaitable[None]],
            __user__: Optional[dict] = None,
            __model__: Optional[dict] = None,
        ) -> dict:
            return body
    
    
  • After importing the function, click the gear icon next to it to configure its parameters

Note: it's best not to modify the Ocr Prompt! Image recognition produces output in a fixed format that is parsed downstream; if you must change it, keep the original structure.
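For reference, the structured format the prompt asks for wraps each image's analysis in `---START_IMAGE_{n}--- ... ---END_IMAGE_{n}---` markers, which the function then splits apart with a regex. A standalone sketch of that parsing step, run against a made-up two-image response:

```python
import re
from typing import List, Tuple

# Same pattern the filter uses; tolerates the model keeping the literal {n} braces
PATTERN = r"---START_IMAGE_\{?(\d+)\}?---\n(.*?)\n---END_IMAGE_\{?\1\}?---"

def parse_ocr_result(result: str) -> List[Tuple[int, str]]:
    return [
        (int(m.group(1)), m.group(2).strip())
        for m in re.finditer(PATTERN, result, re.DOTALL)
    ]

# Made-up model response covering both the plain and braced marker variants
sample = (
    "---START_IMAGE_1---\n[OCR_IMAGE_TO_TEXT]\nHello world\n---END_IMAGE_1---\n"
    "---START_IMAGE_{2}---\n[IMAGE_TO_ANALYSIS]\nA red square\n---END_IMAGE_{2}---"
)
parsed = parse_ocr_result(sample)
```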

  • Remember to switch the function on after configuring it

  • Add the filter to the models you need, either on a Workspace model or in the model settings

Adding a Tampermonkey script to support re-recognition

  • Import the Tampermonkey script below; it should be self-explanatory

    // ==UserScript==
    // @name         clear_image_cache
    // @namespace    http://tampermonkey.net/
    // @version      0.2
    // @description  Hook the regenerate button and modify the request it sends
    // @match        https://yoururl.xxx/*
    // @grant        none
    // ==/UserScript==

    (function() {
        'use strict';

        // Configuration
        const CONFIG = {
            buttonClass: 'regenerate-response-button',
            apiEndpoint: '/api/chat/completions',
            debounceTime: 300,
        };

        // State
        const state = {
            buttonPressed: false,
            isProcessing: false
        };

        // Utility: debounce
        function debounce(func, wait) {
            let timeout;
            return function executedFunction(...args) {
                const later = () => {
                    clearTimeout(timeout);
                    func(...args);
                };
                clearTimeout(timeout);
                timeout = setTimeout(later, wait);
            };
        }

        // Utility: safe JSON parsing
        function safeJSONParse(str) {
            try {
                return JSON.parse(str);
            } catch (e) {
                console.error('JSON parse failed:', e);
                return null;
            }
        }
    
        // Check whether the last message contains an image
        function hasImageInLastMessage(messages) {
            if (!Array.isArray(messages) || messages.length === 0) {
                return false;
            }

            // Get the last message
            const lastMessage = messages[messages.length - 1];

            // Validate the message shape and content
            if (!lastMessage || !lastMessage.content) {
                return false;
            }

            // Handle the two content formats
            if (Array.isArray(lastMessage.content)) {
                // Format 1: content is an array
                return lastMessage.content.some(item =>
                    item && item.type === 'image_url'
                );
            } else {
                // Format 2: content is a plain string
                return false;
            }
        }

        // Modify the request body
        function modifyRequestBody(body) {
            if (!body || typeof body !== 'object') {
                return null;
            }

            // Validate the messages field
            if (!Array.isArray(body.messages)) {
                console.log('Request body has no valid messages array');
                return body;
            }

            // Check whether the last message contains an image
            const shouldAddRoll = hasImageInLastMessage(body.messages);
            console.log('Contains image:', shouldAddRoll);

            // Only add the roll flag when an image is present
            if (shouldAddRoll) {
                return {
                    ...body,
                    roll: true
                };
            }

            return body;
        }
    
        // Handle clicks on the regenerate button
        const handleButtonClick = debounce((e) => {
            const button = e.target.closest(`button.${CONFIG.buttonClass}`);
            if (!button || state.isProcessing) return;

            state.buttonPressed = true;
            console.log('Regenerate button clicked, flag set to:', state.buttonPressed);
        }, CONFIG.debounceTime);
    
        // Intercept fetch requests and rewrite matching ones
        function setupFetchIntercept() {
            const originalFetch = window.fetch;
            window.fetch = async function(url, options) {
                if (!url.includes(CONFIG.apiEndpoint) || !state.buttonPressed) {
                    return originalFetch(url, options);
                }

                try {
                    state.isProcessing = true;
                    console.log('Intercepted target request');

                    // Clone and modify the request options
                    const newOptions = {...options};
                    const originalBody = safeJSONParse(newOptions.body);

                    if (!originalBody) {
                        throw new Error('Invalid request body format');
                    }

                    const modifiedBody = modifyRequestBody(originalBody);
                    newOptions.body = JSON.stringify(modifiedBody);

                    // Reset state
                    state.buttonPressed = false;
                    console.log('Request modified, flag reset to:', state.buttonPressed);

                    return originalFetch(url, newOptions);
                } catch (error) {
                    console.error('Error while modifying the request:', error);
                    return originalFetch(url, options);
                } finally {
                    state.isProcessing = false;
                }
            };
        }
    
        // Initialization
        function initialize() {
            // Listen for clicks in the capture phase
            document.addEventListener('click', handleButtonClick, true);

            // Install the fetch interceptor
            setupFetchIntercept();

            console.log('Script initialized');
        }

        // Start the script
        initialize();
    })();
    
  • Change the @match URL in the script to your own OpenWebUI URL

  • If you don't install this userscript, regenerating a response will not re-run image recognition

That's it, enjoy. I originally wanted to write an OpenWebUI action function with a button to clear the Redis cache for that turn's images, but after a lot of fiddling I found that the message body an action receives does not include the images! So I fell back on Tampermonkey: when the regenerate button is clicked, the script adds a key to the request body, which the filter function then handles.
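So the regenerate path ends up as: the userscript injects `roll: true` into the request body, and the filter's inlet pops the flag and deletes the cached entries for the images in the last message, forcing a fresh recognition pass. A minimal sketch of that flow, with a dict standing in for Redis and an illustrative `inlet()` (not the real function):

```python
import hashlib

def image_hash(url: str) -> str:
    return hashlib.md5(url.encode()).hexdigest()

url = "data:image/png;base64,AAAA"
cache = {image_hash(url): "stale OCR text"}  # dict standing in for Redis

def inlet(body: dict) -> dict:
    """Sketch of the filter's roll handling (illustrative, not the real API)."""
    if body.pop("roll", False):  # flag injected by the userscript
        # Drop the cached OCR text for each image in the last message,
        # so the next pass re-runs recognition
        for content in body["messages"][-1]["content"]:
            if content["type"] == "image_url":
                cache.pop(image_hash(content["image_url"]["url"]), None)
    return body

body = {
    "roll": True,  # added by the Tampermonkey script on regenerate
    "messages": [{
        "role": "user",
        "content": [{"type": "image_url", "image_url": {"url": url}}],
    }],
}
body = inlet(body)
```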

Results

(screenshots omitted)
Updates

  Date        Content
  2025-01-15  1. Added support for the official Google Gemini API; Gemini models are recommended and very fast.
              2. Cleaned up parts of the code.
              3. Improved the OCR prompt, especially for math formulas, giving noticeably better recognition.
  2025-02-07  Fixed potential issues.
Awesome, trying it right away

Thanks for the tutorial, I'll give it a try when I have time.

You're amazing

Thanks for the tutorial!

Awesome

Try it with deepseek?

I used GitHub's gpt-4o-mini and it says image recognition failed, what's going on?

Is the recognition model in OpenAI format? Remember to append /v1 to the endpoint URL.

This is what I tested; it also works without /v1. So it has to be OpenAI format? I'm not sure either.

Amazing, I just had this need and stumbled straight onto the solution

I changed it; re-copying and replacing the function should fix it.

Nice :tieba_013:

It's all thanks to having the original code to build on :tieba_024:

Boss, could you extend the function? Google should also support reading Word and other documents, right?

Doesn't OpenWebUI have built-in RAG? Just use RAG for documents.

Updated to fix a potential issue; just re-copy the updated function.

Is Gemini's recognition better than OpenAI's?

What if an uploaded PDF is entirely images, how should that be handled? :bili_007:

Incredible work