Description
Hey,
Here's a breakdown of what I've done:
- `langchain` loader classes, we couldn't use memory buffers for the loaders. Now, with the changes made, we only open a single temporary file for each `process_file_and_notify` call, cutting down on excessive file opening, read syscalls, and memory buffer usage. The previous behavior could cause stability issues when ingesting and processing large volumes of documents. Unfortunately, some code paths still reopen temporary files, but this can be improved in later work.
- `UploadFile` class from File: The `UploadFile` (a FastAPI abstraction over a SpooledTemporaryFile for multipart upload) was redundant in our `File` setup, since we already downloaded the file from remote storage, read it into memory, and wrote it into a temp file. By removing this abstraction, we streamline our code and eliminate unnecessary complexity.
- `async` function adjustments: I've removed the async labeling from functions that weren't truly asynchronous. For instance, calling `filter_file` for processing files isn't genuinely async, as async file reading isn't actually asynchronous: it uses a threadpool for reading the file. Given that we're already leveraging `celery` for parallelism (one worker per core), we need to ensure that reading and processing occur in the same thread, or at least minimize thread spawning. Additionally, since the rest of the code isn't inherently asynchronous, our bottleneck lies in CPU-bound operations, not in anything async would speed up.

These changes aim to improve performance and streamline our codebase.
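To make the single-temporary-file idea concrete, here is a minimal sketch of the pattern, not the code from this PR. It assumes the file content is already in memory as bytes and that the loader only accepts a filesystem path. The names `process_bytes_with_single_tempfile`, `load_from_path`, and `read_text` are hypothetical stand-ins introduced for illustration.

```python
import os
import tempfile
from typing import Callable


def process_bytes_with_single_tempfile(
    content: bytes,
    suffix: str,
    load_from_path: Callable[[str], str],
) -> str:
    """Write `content` to one temporary file and pass its path to a loader.

    Hypothetical helper: the file is created once, handed to the
    path-based loader, and removed afterwards. Everything runs
    synchronously in the calling thread.
    """
    # delete=False lets the loader reopen the file by name on any platform;
    # cleanup is handled explicitly in the finally block.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(content)
        tmp.close()  # flush to disk before the loader opens the path
        return load_from_path(tmp.name)
    finally:
        tmp.close()  # no-op if already closed
        os.unlink(tmp.name)


def read_text(path: str) -> str:
    # Hypothetical stand-in for a loader that only accepts a file path.
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()
```

A call such as `process_bytes_with_single_tempfile(data, ".txt", read_text)` creates exactly one temporary file per invocation and spawns no extra threads, which is the property the first and third points above rely on.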
Let me know if you have any questions or suggestions for further improvements!
Checklist before requesting a review