Refacto/file #2544

AmineDiro · 2024-05-04T21:17:15Z

Description

Hey,

Here's a breakdown of what I've done:

Reducing the number of open file descriptors and the memory footprint: Previously, for each uploaded file, we opened a NamedTemporaryFile to write the content read from Supabase. Because of the dependency on langchain loader classes, we can't use memory buffers for the loaders. With these changes, we open only a single temporary file per process_file_and_notify call, cutting down on excessive file opening, read syscalls, and memory buffer usage. The previous behavior could cause stability issues when ingesting and processing large volumes of documents. Unfortunately, some code paths still reopen temporary files, but this can be improved in later work.
Removing the UploadFile class from File: UploadFile (a FastAPI abstraction over a SpooledTemporaryFile for multipart uploads) was redundant in our File setup, since we had already downloaded the file from remote storage, read it into memory, and written it to a temp file. Removing this abstraction streamlines the code and eliminates unnecessary complexity.
async function adjustments: I've removed the async keyword from functions that weren't truly asynchronous. For instance, calling filter_file to process files isn't genuinely async, as async file reading isn't actually asynchronous (it uses a threadpool to read the file). Given that we're already leveraging celery for parallelism (one worker per core), we need to ensure that reading and processing occur in the same thread, or at least minimize thread spawning. Additionally, since the rest of the code isn't inherently asynchronous, our bottleneck lies in CPU operations rather than in waiting on I/O.
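To illustrate the pattern described in the points above, here is a minimal sketch, not the actual code from this PR. It assumes a POSIX system (on Windows, a NamedTemporaryFile generally cannot be reopened by name while it is still open). The names `process_file_once`, `read_text`, and `count_bytes` are hypothetical stand-ins; the real loaders are langchain classes that take a file path. The function is a plain `def` rather than `async def`, since writing and parsing happen in the same thread.

```python
import os
import tempfile
from typing import Any, Callable, List


def process_file_once(
    content: bytes, suffix: str, loaders: List[Callable[[str], Any]]
) -> List[Any]:
    """Write `content` to ONE temp file and pass its path to every loader.

    Hypothetical helper: opens a single file per call instead of one
    temporary file per loader.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(content)
        tmp.flush()  # make the bytes visible to code that reopens the path
        # Every loader receives the same path; no extra temp files are created.
        return [loader(tmp.name) for loader in loaders]
    # The temp file is deleted automatically when the `with` block exits.


# Hypothetical path-based loaders standing in for langchain loader classes.
def read_text(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def count_bytes(path: str) -> int:
    return os.path.getsize(path)


if __name__ == "__main__":
    print(process_file_once(b"hello quivr\n", ".txt", [read_text, count_bytes]))
```

Because the loaders only need a path, the file bytes are written once and the file descriptor is closed as soon as processing finishes.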

These changes aim to improve performance and streamline our codebase.
Let me know if you have any questions or suggestions for further improvements!

Checklist before requesting a review

My code follows the style guidelines of this project
I have performed a self-review of my code
I have ideally added tests that prove my fix is effective or that my feature works

vercel · 2024-05-04T21:17:19Z

Someone is attempting to deploy a commit to the Quivr-app Team on Vercel.

A member of the Team first needs to authorize it.

StanGirard · 2024-05-04T22:06:34Z

Thanks a lot! I'll review it and let you know if there is anything.

StanGirard · 2024-05-04T22:22:56Z

Thanks a lot! It works great except for when you upload URLs ;) I'll fix that

aminediro added 4 commits May 4, 2024 20:54

refacto File class

c8c29db

Merge branch 'main' into refacto/tempfile

2f19912

refacto parsers

732c0fa

no async process_audio

d1bbb25

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 4, 2024

dosubot bot added the area: backend Related to backend functionality or under the /backend directory label May 4, 2024
