Open WebUI knowledge base: uploading an oversized file causes vectorization to fail

preacher · February 5, 2025, 10:13

Small files of around 200 KB work fine, but uploading the 8 MB "《全唐诗》.txt" (Complete Tang Poems) failed outright.

VrianCao · February 5, 2025, 11:35

8 MB of plain text is no small amount.
For something like this, local RAG is the better choice.

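CodeAtlantis · February 5, 2025, 11:46

I successfully vectorized a roughly 50 MB biology textbook PDF in Cherry Studio, but the retrieval results were very poor. That may also be because the book contains a lot of non-text content.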

preacher · February 5, 2025, 13:03

The Complete Tang Poems contains 42,863 poems.
That said, my 4 MB txt file also failed to upload.

preacher · February 5, 2025, 13:04

If RAG results are poor, add a reranking model. Results get much better after that.

Sammy · February 5, 2025, 13:29

Indeed. I remember being able to upload 10 MB files before, but now I can't. I'm not sure whether it's a model issue.

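preacher · February 5, 2025, 13:32

A successful upload doesn't mean vectorization succeeded. From what I can see, 8 MB is a frightening amount of data: even Windows stutters for a moment when opening the file, so a vectorization failure seems pretty normal.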

Sammy · February 5, 2025, 13:52

Could it be because of a free model? Are you using a free one? When I checked the logs earlier for a 10 MB upload, they showed "too many request".
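
The "too many request" log line suggests the embedding provider was rate limiting. This thread doesn't cover how Open WebUI itself retries, but for a standalone script that calls an embedding API, a common mitigation is to retry with exponential backoff. A minimal sketch, assuming a hypothetical `RateLimitError` that stands in for whatever exception your client raises on HTTP 429 (`call_with_backoff` is likewise an illustrative name, not a library function):

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 'too many requests' error."""


def call_with_backoff(func, *args, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except RateLimitError:
            if attempt == max_retries:
                raise  # Give up after the final retry.
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Wrapping each embedding request in `call_with_backoff` slows the job down when the provider pushes back instead of failing the whole file. It won't help if the failure is a size limit rather than a rate limit.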

preacher · February 5, 2025, 13:54

I'm using the bge-m3 model pulled with Ollama on my laptop (exposed to the server through an intranet tunnel).

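Harp · February 5, 2025, 14:54

I'm using OpenAI's text-embedding-3-small, and uploading a txt file of a bit over 3 MB also failed. Searching the issues, I found a report that files can't exceed 1.5 MB, so I had an AI write a Python script that splits the file. I upload the parts one by one and then reference the whole collection, which basically meets my needs.
The script: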

import os

def split_file(input_file, output_dir, max_size_bytes):
    """Splits a file into smaller files based on size."""
    if not os.path.exists(input_file):
        print(f"Error: Input file '{input_file}' not found.")
        return

    if not os.path.exists(output_dir):
        try:
            os.makedirs(output_dir)
        except OSError as e:
            print(f"Error: Could not create output directory '{output_dir}'. {e}")
            return

    file_size = os.path.getsize(input_file)
    if file_size <= max_size_bytes:
        print("File is already smaller than the max size. No splitting needed.")
        return

    file_count = 0
    base_name = os.path.splitext(os.path.basename(input_file))[0]
    with open(input_file, 'r', encoding='utf-8') as infile:
        line = infile.readline()
        # Open a new part only while there is still a line to write,
        # so no empty trailing file is created and the count stays accurate.
        while line:
            file_count += 1
            output_file_name = os.path.join(output_dir, f"{base_name}_part_{file_count}.txt")
            with open(output_file_name, 'w', encoding='utf-8') as outfile:
                current_size = 0
                # Write whole lines until the part reaches the limit or the input ends.
                while line and current_size < max_size_bytes:
                    outfile.write(line)
                    current_size += len(line.encode('utf-8'))  # Approximate size
                    line = infile.readline()
    print(f"File '{input_file}' has been split into {file_count} smaller files in '{output_dir}'.")


if __name__ == "__main__":
    # 1. Get input file path
    while True:
        input_file = input("Enter the path to the input file: ").strip()
        if os.path.isfile(input_file):
            break
        else:
            print("Invalid input file path. Please enter a valid file path.")

    # 2. Get output directory (default to same directory as input file)
    # Use the absolute path so a bare file name does not yield an empty directory.
    default_output_dir = os.path.dirname(os.path.abspath(input_file))
    output_dir = input(f"Enter the output directory (default: '{default_output_dir}'): ").strip()
    if not output_dir:
        output_dir = default_output_dir

    # 3. Get max file size
    while True:
        try:
            max_size_str = input("Enter the maximum file size in bytes (default: 1500000 for 1.5MB): ").strip()
            if not max_size_str:
                max_size_bytes = 1500000
                break
            max_size_bytes = int(max_size_str)
            if max_size_bytes > 0:
                break
            else:
                print("Max size must be a positive integer.")
        except ValueError:
            print("Invalid input. Please enter an integer.")

    split_file(input_file, output_dir, max_size_bytes)
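
To sanity-check a split before uploading, the same line-based idea can be tried on an in-memory string. This is a minimal sketch, not part of the script above or of Open WebUI: `split_text_by_bytes` is a hypothetical helper, and the 1,500,000-byte default is only the limit reported in the issue, not a documented constant. Unlike the script, it closes a part before the next line would push it over the limit, so only a single over-long line can exceed it.

```python
def split_text_by_bytes(text, max_bytes=1_500_000):
    """Split text into parts made of whole lines, closing each part
    before it would exceed max_bytes (measured in UTF-8 bytes).

    Hypothetical helper for illustration only.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    parts, current, current_size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        # Start a new part if adding this line would exceed the limit.
        if current and current_size + line_size > max_bytes:
            parts.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += line_size
    if current:
        parts.append("".join(current))
    return parts


if __name__ == "__main__":
    sample = "床前明月光\n疑是地上霜\n举头望明月\n低头思故乡\n"
    for i, part in enumerate(split_text_by_bytes(sample, max_bytes=32), start=1):
        print(i, len(part.encode("utf-8")), repr(part))
```

Sizes are counted in UTF-8 bytes rather than characters because Chinese text takes about three bytes per character, so a character count would badly underestimate the file size.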

JoeCHEN99 · February 5, 2025, 14:57

Could this be turned into an Open WebUI function and then enabled globally?

naokuoyoudiantong · February 5, 2025, 15:02

Can a reranking model be used with Cherry Studio locally?

CitizenScyu · February 5, 2025, 15:13

Is that so? No wonder every method I tried failed when I uploaded a 500 MB PDF of a book.

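Harp · February 5, 2025, 15:14

I don't know how to write one; this script was written entirely by AI. Also, from what I see in the docs there are three types of functions: pipe handles model proxying, filter handles text processing, and action adds buttons to the chat interface. None of them touches file upload, so I think we can only wait for Open WebUI to fix this themselves... More than one person has reported the problem, but the issues get closed quickly and moved to discussions. I don't get it.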

JoeCHEN99 · February 5, 2025, 15:15

For anything involving larger files, I now just use LobeChat. I have both deployed.

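Harp · February 5, 2025, 15:20

I've mostly switched to local vectorization in Cherry Studio. I only upload to Open WebUI when I need multi-device sync. Local is much more comfortable. I've tried LobeChat and it felt even heavier than Open WebUI.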

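JoeCHEN99 · February 5, 2025, 15:21

I have a lot of devices, so multi-device sync is a must for me. I'm currently self-hosting LobeChat and it works pretty well. Their developers have recently been optimizing load speed.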

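kkkpppp · February 8, 2025, 03:51

I ran into this too. Even files a bit over 1 MB won't upload.
I set the size to unlimited, and even when I do set a limit, my files are nowhere near it.
I can only upload files of a few hundred KB.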
