Rewritten or Restructure #464

Open
2 tasks done
climba03003 opened this issue Jul 25, 2023 · 21 comments
Labels
feature request New feature to be added

Comments

@climba03003
Member

Prerequisites

  • I have written a descriptive issue title
  • I have searched existing issues to ensure the feature has not already been requested

🚀 Feature Proposal

We should reduce the methods to only what is necessary.
For example, parts, files, and file do nearly the same thing, which is confusing.

We should either provide a parsing method directly or an async iterator for advanced users.

The current request.body object is not user-friendly, which leads to multiple config options (e.g. keyValues).

Motivation

With the growth of features and deprecated functions that were never removed, the current codebase is messy and not friendly to beginners.

I found most people are not familiar with how streams work and ask for a simple solution multiple times.
That's why I created fastify-formidable and fastify-busboy.

I believe we can provide a more generic interface that is not fixed to a single parsing engine.
So, I created @kakang/fastify-multipart and am seeking advice.

Example

No response

@climba03003 climba03003 added the feature request New feature to be added label Jul 25, 2023
@Uzlopak
Contributor

Uzlopak commented Jul 25, 2023

I think the biggest issue is handling big files, as new devs don't think about memory leaks etc.

@Eomm
Member

Eomm commented Jul 25, 2023

I call it "Apple style" - the user is given a limited number of options, otherwise they can't choose

I agree from the support perspective

We should list the use case we want to support first

@mcollina
Member

I kind of agree that a refactoring of this is needed, however I'm not sure exactly in which direction.

Most of the recently added features are not safe to use in a generic sense, as keeping files in memory is a bad practice.

@climba03003
Member Author

climba03003 commented Jul 26, 2023

We should list the use case we want to support first

The most important use cases and most requested features are the two below.

  1. Solution for beginners

No extra configuration; it provides a JSON body for the users.
AFAICT most web servers (e.g. caddy, nginx, apache, php-fpm) download the files to a temporary folder directly and allow the proxied server to consume them later.

  2. Solution for advanced users

An async iterator is one of the options, which is currently the .parts API.
Providing .files, .file, .saveRequestFiles is strange to me, especially .file. If the client is sending multiple files but you only call .file, the request will stall.
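
For reference, a minimal sketch of how the current .parts async iterator is typically consumed (assuming the existing @fastify/multipart API; the route and target paths are made up):

const fs = require('node:fs')
const { pipeline } = require('node:stream/promises')

fastify.post('/upload', async function (request, reply) {
  for await (const part of request.parts()) {
    if (part.type === 'file') {
      // stream the file part to disk instead of buffering it in memory
      await pipeline(part.file, fs.createWriteStream(`/tmp/${part.filename}`))
    } else {
      // field part: the value is already a string
      console.log(part.fieldname, part.value)
    }
  }
  return { ok: true }
})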

@kibertoad
Member

@climba03003
Member Author

Why not use https://www.npmjs.com/package/@fastify/busboy?

Since @fastify/busboy is already used here, I just chose other competitors for the parsing engines.
Not a big deal to add one by default, though.

@climba03003
Member Author

The next branch has been created for gathering breaking changes.
I have rebased both #465 and #468

@Eomm
Member

Eomm commented Jul 29, 2023

Could we break down some issues into a project for this?

@gurgunday
Member

gurgunday commented Jul 29, 2023

Before more things are removed/changed, I just want to add that I really do like the DX of attachFieldsToBody: true and attachFieldsToBody: 'keyValues'

The docs warn you about accumulating files in memory but provide multiple ways to deal with that:

  1. Setting a fileSize limit when registering and listening for error.statusCode === 413 in an error handler -> very easy and I frequently find myself using this

  2. Defining an onFile handler -> a little harder to integrate but still works

As I said, option 1 (especially with 'keyValues') is really useful and can be foolproof if it's explained properly (see the sketch below)
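
A minimal sketch of that setup, assuming the current attachFieldsToBody and limits options (the route and field names are made up):

fastify.register(require('@fastify/multipart'), {
  attachFieldsToBody: 'keyValues',
  limits: { fileSize: 1_000_000 } // reject any file part larger than ~1 MB
})

fastify.setErrorHandler((err, request, reply) => {
  if (err.statusCode === 413) {
    // the fileSize limit was hit, so the oversized upload never reaches the handler
    return reply.code(413).send({ error: 'File too large' })
  }
  reply.send(err)
})

fastify.post('/profile', async (request, reply) => {
  // with 'keyValues', text fields arrive as strings and file fields as Buffers
  const { email, avatar } = request.body
  return { email, avatarBytes: avatar?.length }
})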

For example, parts, files, and file do nearly the same thing, which is confusing.

So, all in all, I'm more interested in modifying/refactoring these lower-level methods and maybe changing the defaults of the plugin

@Uzlopak
Contributor

Uzlopak commented Jul 29, 2023

Maybe we should design multipart to be configured "incrementally". You need multipart for processing forms; e.g. a login form does not need file uploads. So we first support, out of the box, that you can process forms and have the fields processed as attributes of the body object.
Files should then either be flushed away or throw an error.

Then we can determine what we do with the files. Should we handle them out of the box as Buffers assigned to the body? Should we handle them as streams?
Obviously they will always be problematic.

We should also consider that we could maybe implement a better integration into the fastify schema system.

In a perfect world, I would register the multipart plugin with maybe some general limits.
Then in the route I can specify formdata as a possible request body. If formdata is specified, I would like multipart to be set up accordingly. I didn't define files in the schema? Throw an error. I get a field which is not defined in the schema? Throw an error.
I want to specify the maxLength of a file part in the schema and have that respected over the general setting of multipart.
I expect a file but haven't defined what I want to do with the file? The onRoute hook should throw early.

Maybe every file part is a stream. We supply two stores, like we supply fastify-rate-limit with two stores: one memory and one redis.

So the body gets a getter for the file fields, which instantiates a stream. So your memory store is the leaky, bad one, like in fastify-rate-limit.

You upload the file, and the memory store keeps the Buffer in memory. The memory store is a wrapper around a Map. We generate a uuid as key and store the buffer in there. Then the body gets a corresponding property named after the file field, and the getter is basically Readable(Map.get(uuid)). We could enrich that Readable with attributes like size and contentEncoding, and maybe even the key in the Map (this makes more sense with the redis implementation below).
So when we are in the handler, we can consume the stream.

The redis store is basically the same. We configure the redis store and pass it as a configuration option to the multipart plugin. We pump the files with uuids to the redis instance, and also add a property to the body with a getter to a redis stream. Then in the handler you can use the stream to pipe it somewhere else.

Imho this would be the best. For both the memory store and the redis store you could configure a ttl.
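
A minimal sketch of that memory-store idea (MemoryStore and attachFile are hypothetical names, not an existing API):

const { Readable } = require('node:stream')
const { randomUUID } = require('node:crypto')

// Hypothetical in-memory store: a thin wrapper around a Map of uuid -> Buffer
class MemoryStore {
  constructor () {
    this.map = new Map()
  }

  save (buffer) {
    const key = randomUUID()
    this.map.set(key, buffer)
    return key
  }

  createReadStream (key) {
    // wrap the Buffer in an array so Readable.from emits it as a single chunk
    return Readable.from([this.map.get(key)])
  }

  delete (key) {
    this.map.delete(key)
  }
}

// While parsing, each file part would be saved to the store and the body
// would get a lazy getter that only instantiates the stream when accessed
function attachFile (body, fieldname, store, key, info) {
  Object.defineProperty(body, fieldname, {
    enumerable: true,
    get () {
      const stream = store.createReadStream(key)
      // enrich the Readable with metadata, as suggested above
      return Object.assign(stream, { key, size: info.size, contentEncoding: info.encoding })
    }
  })
}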

When I used file uploads in express, I always streamed the files to redis and passed the redis key to the other microservices. So the other microservices could handle them, and maybe even unlink them after use.

Another benefit is that people could implement their own stores, e.g. a mongodb gridfs store, a sqlite store, an fs store, and so on.

Also, by using streams and stores it is easier to test without databases, we kind of encourage people to consider using streams, and it would be easier to implement microservices, as the state would be forced outside the fastify microservice.

I once wrote a redis stream implementation. Maybe it could be adapted accordingly.

redis/ioredis#1376

@gurgunday
Member

gurgunday commented Jul 29, 2023

Maybe we should design multipart to be configured "incrementally". You need multipart for processing forms; e.g. a login form does not need file uploads. So we first support, out of the box, that you can process forms and have the fields processed as attributes of the body object. Files should then either be flushed away or throw an error.

Fully agree, things can be a lot cleaner if file processing is opt-in, set either by a global setting or by per-route configuration, like how rate-limit works. This would mean registering the plugin with fileSize: 0 by default according to the current implementation

Then in the route I can specify formdata as a possible request body. If formdata is specified, I would like multipart to be set up accordingly. I didn't define files in the schema? Throw an error. I get a field which is not defined in the schema? Throw an error.

I'm digging it, as long as it's consistent with other default content parsers. For instance, if I'm not mistaken, Fastify doesn't throw when a field that wasn't defined in the schema exists in JSON input

For the rest, I know everyone hates on storing files in-memory and all, but I love it and use it very often – it's just too convenient 🤣

And it doesn't have to be leaky and bad. As I described above, by setting an appropriate size limit and error handler, it should always be possible to get rid of a file if necessary while making it easy to keep the validated buffer around during the lifecycle of a request

So yeah I'm behind this plan as long as the in-memory store is not left out :D

@gurgunday
Member

Finally, since getting #468 out would really help me now and I'm nearly sure nobody is consuming files as strings, maybe bumping major twice wouldn't be so bad – it doesn't have to include #465

Or maybe a new release channel for next?

@mcollina

@Uzlopak
Contributor

Uzlopak commented Jul 29, 2023

Yeah it is just a thought. Details can be discussed.

The throwing part was just to illustrate that we maybe have to couple the validation more tightly. The preParsing hook happens before preValidation.

This could also mean that when you put files into memory or a database or wherever, they will be filling up before we realize that the data is actually garbage and we don't need it.
So we need to remove the entries from the store if validation fails, or else we will have a potential attack vector.

I looked now, and OpenAPI defines a file in formdata with format: binary. We could define with ajv-formats what binary means. Can it be defined in the plugin? Or are we limited and need to patch fastify core for it? For the binary format we could check whether it is a stream with additional fields like contentEncoding, key and size.

Then it would be more like the body gets parsed like a JSON payload and binary fields are streams. The ajv validator could run over it, and you can do whatever you want with the streams.

The bad and leaky memory store only means that we don't give it the fastify seal of approval that it is good. We could of course implement it properly and state of the art.

@climba03003
Member Author

climba03003 commented Jul 29, 2023

I believe everyone is skipping this part of my comment (AFAICT most web servers (e.g. caddy, nginx, apache, php-fpm) download the files to a temporary folder directly and allow the proxied server to consume them later).

Maybe every file part is a stream. We supply two stores, like we supply fastify-rate-limit with two stores: one memory and one redis.

You should use neither redis nor memory, because they are the same. Storage is always cheaper than memory and is unlikely to crash the system, which is also why most web servers choose it as the default.
Caddy once used memory buffering when they implemented it, and I saw an issue about a memory leak immediately. If anything needs to be improved, I would be against buffering as the default.

Although buffering in memory is not a good choice, I still provided it, along with automatic file download, in my plugin.
https://github.com/kaka-repo/fastify-multipart/tree/main/lib/storage

This could also mean that when you put files into memory or a database or wherever, they will be filling up before we realize that the data is actually garbage and we don't need it.

I am still waiting for https://github.com/tc39/proposal-explicit-resource-management to be implemented inside V8. It will make our lives easier regarding resource control here. When it gets implemented, we can clean up the files automatically when the request is disposed.

At this stage, I would either auto-clean ourselves or mention it inside the documentation.
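
A rough sketch of what that could look like once the proposal's Symbol.asyncDispose / await using semantics are available in V8 (the tempUpload helper is hypothetical):

const { unlink } = require('node:fs/promises')

// Hypothetical wrapper: the temp file is removed automatically when the
// enclosing scope exits (even on errors), once `await using` is supported
function tempUpload (path) {
  return {
    path,
    async [Symbol.asyncDispose] () {
      await unlink(path).catch(() => {})
    }
  }
}

// Usage sketch inside a handler:
//   await using upload = tempUpload('/tmp/upload-1a2b3c')
//   ... consume upload.path ...
// The file is unlinked automatically when the handler scope exits.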

Can it be defined in the plugin?

The binary format is already supported by default since fastify@4. And ajv always expects a string, which is why all my plugins provide the file tmp path in the body. It makes both documentation and validation easier.
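
A sketch of what that approach looks like, assuming a plugin that replaces file fields in the body with their temporary path on disk (the route, field names, and path are made up):

fastify.post('/upload', {
  schema: {
    body: {
      type: 'object',
      properties: {
        email: { type: 'string' },
        // documented as a binary file for OpenAPI, but at validation time
        // it is just a string holding the temporary path on disk
        avatar: { type: 'string', format: 'binary' }
      },
      required: ['avatar']
    }
  }
}, async function (request, reply) {
  // e.g. '/tmp/upload-1a2b3c', saved by the multipart plugin before validation
  return { savedTo: request.body.avatar }
})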

@Uzlopak
Contributor

Uzlopak commented Jul 29, 2023

Hmm. Yes, if the file is only a string pointing to the path where it is stored, then it is definitely easier. You could even define a protocol like file://tmp/bla.txt or redis://key or mongogridfs://mongoid or memory://bla.txt.

Somehow I still like my idea of having them as streams, as it means the implementer doesn't have to think about what to do and can still get the data easily in the handler by reading stream.key. But this would also mean that if I want to use the data in my other microservice, I also need the right readable stream. Not that nice again.

I get your point regarding using the filesystem. The filesystem is definitely cheaper. It maybe also has the benefit that people can have antivirus programs scan the incoming folder on the filesystem.

@gurgunday
Member

I believe everyone is skipping this part of my comment (AFAICT most web servers (e.g. caddy, nginx, apache, php-fpm) download the files to a temporary folder directly and allow the proxied server to consume them later).

It would make things a lot easier and cheaper. I'm just questioning whether it can be depended on, since it's not an always-available thing. For instance, I host a monolith on DigitalOcean's App Platform, but I'm not sure if this feature exists there; I will check it out though.

You should use neither redis nor memory, because they are the same. Storage is always cheaper than memory and is unlikely to crash the system, which is also why most web servers choose it as the default.

Agreed in principle, as long as it's simple enough to use at the end of the day. Right now, with attachFieldsToBody, I can simply do request.body.file and it's done. I like Uzlopak's suggestion because it's not hard to point a readable from a map to the body.

My only concern is that the limits could cause data loss if a buffer is not used and gotten rid of fast enough. With the current setup, I'm having no difficulties letting people upload files on a 512 MB server.

I am still waiting for https://github.com/tc39/proposal-explicit-resource-management to be implemented inside V8. It will make our lives easier regarding resource control here. When it gets implemented, we can clean up the files automatically when the request is disposed.

This is so cool! We should definitely use it when it's implemented

@climba03003
Member Author

climba03003 commented Jul 30, 2023

Agreed in principle, as long as it's simple enough to use at the end of the day. Right now, with attachFieldsToBody, I can simply do request.body.file and it's done. I like Uzlopak's suggestion because it's not hard to point a readable from a map to the body

Placing a readable inside request.body is not hard; it can be done through the hooks system.

However, I would like to split string and readable content: body is always strings and files are always something more complex (e.g. Buffer, Readable).

fastify.post('/', async function(request, reply) {
  // access files
  request.files

  // access body
  // note that file fields will exist in body and will become the file path saved on disk
  request.body
})

Handling streams directly would be for advanced users.

fastify.post('/', async function(request, reply) {
  // you can use async iterator for file handling
  for await (const { type, name, value, info } of request.multipart()) {
    switch(type) {
      case 'field': {
        console.log(name, value)
        break
      }
      case 'file': {
        console.log(name, value, info.filename)
        // value is Readable
        value.resume()
        break
      }
    }
  }
})

@mcollina
Member

I would recommend we split the logic into two modules: one low-level module that just parses the incoming multipart data, providing a low-level stream API, and one that uses that module to provide a high-level API (and stores the files on disk in between).

I solved the "on exit" problem with https://www.npmjs.com/package/on-exit-leak-free, which should allow us to keep a global list of files and remove them in case of a crash.
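
A minimal sketch of that, assuming on-exit-leak-free's register(resource, cleanupFn) API and a hypothetical module-level tracker of temp-file paths:

const { register } = require('on-exit-leak-free')
const { unlinkSync } = require('node:fs')

// Hypothetical tracker: one shared Set of temporary upload paths per process
const pendingFiles = new Set()

// On exit (including crashes handled by on-exit-leak-free), try to remove
// whatever is still on disk
register(pendingFiles, function cleanup (files) {
  for (const path of files) {
    try {
      unlinkSync(path)
    } catch {}
  }
})

function trackTempFile (path) {
  pendingFiles.add(path)
}

function untrackTempFile (path) {
  // called once the request finished and the file was removed normally
  pendingFiles.delete(path)
}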

@Uzlopak
Contributor

Uzlopak commented Jul 30, 2023

Don't we have that separation already? We have busboy, which does the incoming multipart data parsing, and @fastify/multipart for the integration into fastify.

@mcollina
Member

Don't we have that separation already? We have busboy, which does the incoming multipart data parsing, and @fastify/multipart for the integration into fastify.

I don't think so, integrating it with Fastify is worth some code.

I would not use the API that stores files on disk by default. The old event-based API was enough for all my needs.

@gurgunday
Member

I would really like the new API to look like this, based on the browser's request.formData(), instead of using arbitrary names like req.file()

We can keep options like attachFieldsToBody, but by default, it's better to stay as close to the browser as possible

Example usage:

async handler(req, res) {
  const form = await req.formData();
  const email = form.get("email")?.toString();

  // ...
}
