
Optimizing export #347

Open · kirya-dev opened this issue Aug 5, 2020 · 23 comments

@kirya-dev

For exporting large data sets, flush() needs to be called inside the StreamedResponse.
This clears the output buffer, so the export does not use extra memory.
Should this be a feature of the framework?

My solution:

# config/services.yml
services:
    app.export.writer.csv:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.csv

    app.export.writer.json:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.json

    app.export.writer.xml:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.xml

    app.export.writer.xls:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.xls

<?php
declare(strict_types=1);

namespace App\Writer;

use Sonata\Exporter\Writer\TypedWriterInterface;

/**
 * Decorates any typed writer and flushes the PHP output buffer every
 * FLUSH_EVERY rows, so the StreamedResponse streams instead of buffering.
 */
class FlushWriterDecorator implements TypedWriterInterface
{
    private const FLUSH_EVERY = 1000;

    private int $position;
    private TypedWriterInterface $writer;

    public function __construct(TypedWriterInterface $writer)
    {
        $this->writer = $writer;
    }

    public function open(): void
    {
        $this->writer->open();

        $this->position = 0;
    }

    public function write(array $data): void
    {
        $this->writer->write($data);

        $this->position++;
        if (0 === ($this->position % self::FLUSH_EVERY)) {
            flush();
        }
    }

    public function close(): void
    {
        $this->writer->close();
    }

    public function getDefaultMimeType(): string
    {
        return $this->writer->getDefaultMimeType();
    }

    public function getFormat(): string
    {
        return $this->writer->getFormat();
    }
}
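
For context, here is a minimal usage sketch (not from the issue; the query, field list, and header are illustrative). The decorated writer targets php://output inside a StreamedResponse, so each flush() pushes the buffered rows to the client:

<?php
// Usage sketch, assuming sonata-exporter's Handler, CsvWriter and source iterator:
use Sonata\Exporter\Handler;
use Sonata\Exporter\Source\DoctrineORMQuerySourceIterator;
use Sonata\Exporter\Writer\CsvWriter;
use Symfony\Component\HttpFoundation\StreamedResponse;

$source = new DoctrineORMQuerySourceIterator($query, ['id' => 'id', 'name' => 'name']);
$writer = new FlushWriterDecorator(new CsvWriter('php://output'));

$response = new StreamedResponse(static function () use ($source, $writer): void {
    Handler::create($source, $writer)->export(); // write() runs per row, flushing every 1000 rows
});
$response->headers->set('Content-Type', 'text/csv');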
@VincentLanglet
Member

Hi, this feature could be great.
Wanna start a PR? :)

@kirya-dev
Author

Yes!
I have the following questions:

  1. Should the decorator be optional?
  2. Should the FLUSH_EVERY value be passed as a parameter, for more flexibility?
     If it is null (the default), the decorator would not be used.

@VincentLanglet
Member

Yes!

I have the following questions:

  1. Should the decorator be optional?

  2. Should the FLUSH_EVERY value be passed as a parameter, for more flexibility?

If it is null (the default), the decorator would not be used.

  1. I don't know 😅

  2. This value could indeed come from a sonata exporter config parameter. I'm not sure if we should keep null as the default value or if we can arbitrarily choose a number.
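
A possible sketch of that config parameter (the parameter and argument names here are illustrative, not an existing sonata option):

# config/services.yml
parameters:
    app.export.flush_every: 1000

services:
    app.export.writer.csv:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.csv
        arguments:
            $writer: '@.inner'
            $flushEvery: '%app.export.flush_every%'

The decorator would then take the threshold in its constructor instead of the FLUSH_EVERY constant:

public function __construct(TypedWriterInterface $writer, int $flushEvery = 1000)
{
    $this->writer = $writer;
    $this->flushEvery = $flushEvery;
}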

@kirya-dev
Author

There is another problem: fetching large data sets from Doctrine. I solved it by using many paginated queries.
Should we solve this in this bundle as well?

In my test, combining both optimizations used no more than 60 MB of memory, while the exported file was 650 MB.

@kirya-dev
Author

kirya-dev commented Aug 5, 2020

This is an improved version of DoctrineORMQuerySourceIterator:

<?php
declare(strict_types=1);

namespace App\Export\Source;

use Doctrine\ORM\Query;
use Sonata\Exporter\Source\DoctrineORMQuerySourceIterator;

final class PaginatedDoctrineORMQuerySourceIterator extends DoctrineORMQuerySourceIterator
{
    private const PAGE_SIZE = 1000;

    private int $page = 0;

    public function __construct(Query $query, array $fields, string $dateTimeFormat = 'r')
    {
        parent::__construct($query, $fields, $dateTimeFormat);

        $this->query->setMaxResults(self::PAGE_SIZE);
        $this->query->setFirstResult(0);
    }

    public function next(): void
    {
        $this->iterator->next();

        if (!$this->iterator->valid()) {
            // Current page is exhausted: move the window forward and detach
            // the previous page's entities so the UnitOfWork stays small.
            $this->page++;
            $this->query->setFirstResult($this->page * self::PAGE_SIZE);
            $this->query->getEntityManager()->clear();

            // Force rewind() to re-execute the query for the next page.
            $this->iterator = null;
            $this->rewind();
        }
    }
}
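
A short usage sketch (entity and field names are assumptions): the source implements \Iterator, so a plain foreach transparently pages through the results.

$query = $em->createQuery('SELECT u FROM App\Entity\User u ORDER BY u.id ASC');
$source = new PaginatedDoctrineORMQuerySourceIterator($query, ['id' => 'id', 'email' => 'email']);

foreach ($source as $row) {
    // rows arrive in pages of 1000; the EntityManager is cleared between pages
}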

@kirya-dev
Author

kirya-dev commented Aug 5, 2020

The screenshot below shows low memory usage when exporting big data; I applied both optimizations.
The downloaded file is 637.4 MB.

[screenshot: memory usage while exporting]

@VincentLanglet
Member

This is an improved version of DoctrineORMQuerySourceIterator: [code quoted above]
You need to take into account that the query may already have a first result and a max result.

For example, if I want all the results from 3 to 2500, you'll have to fetch 3 to 1000, then 1000 to 2000, then 2000 to 2500.

@VincentLanglet
Member

VincentLanglet commented Aug 5, 2020

That doesn't work if there are 3000 results in my example, because you'll export all of them even though I want to stop at 2500.

In the constructor:
originalFirstResult = query->getFirstResult()
originalMaxResult = query->getMaxResults()

query->setMaxResults(min(originalMaxResult, originalFirstResult + pageSize))

Then the check would be:
if the iterator is not valid and originalMaxResult > currentMaxResult (where currentMaxResult = query->getMaxResults()):

setMaxResults(min(originalMaxResult, currentMaxResult + pageSize))
setFirstResult(currentFirstResult + pageSize)

With this formula you don't need the page property, nor the originalFirstResult property; just the originalMaxResult property.

I'm on my phone; if it's not clear enough, I'll improve my message in two days.

Edit: did you remove your message, or am I crazy? 😂
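
A minimal PHP sketch of that bound-keeping idea, treating maxResults as a row count (Doctrine's semantics); variable names are illustrative:

// Captured once in the constructor:
$originalFirstResult = $query->getFirstResult() ?? 0;
$originalMaxResults = $query->getMaxResults(); // null = unbounded

// Before fetching each page ($fetched = rows consumed so far):
$remaining = $originalMaxResults === null
    ? self::PAGE_SIZE
    : min(self::PAGE_SIZE, $originalMaxResults - $fetched);

if ($remaining > 0) {
    $query->setFirstResult($originalFirstResult + $fetched);
    $query->setMaxResults($remaining);
}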

@kirya-dev
Author

If we use pagination, we can't be sure that the next query will return the next slice of data.
For example, if you sort in descending order by creation date (from new to old), at some point several duplicate records may end up in the export. Also, the database does not guarantee that rows will be selected in the same sequence when using limits and offsets.
This can be solved by sorting by ID in ascending order and, in the next request, taking only records strictly greater than the last seen value (or vice versa).

@kirya-dev
Author

I think we should just leave a note in the README that paginated export requires ordered data that does not change during the export.

@VincentLanglet
Member

If we use pagination, we can't be sure that the next query will return the next slice of data.
For example, if you sort in descending order by creation date (from new to old), at some point several duplicate records may end up in the export. Also, the database does not guarantee that rows will be selected in the same sequence when using limits and offsets.
This can be solved by sorting by ID in ascending order and, in the next request, taking only records strictly greater than the last seen value (or vice versa).

Oh yes indeed. Maybe @greg0ire can help with this kind of problem.
But anyway, currently the export doesn't work when there is too much data, so any improvement is good to take.

If you start a PR, the reviews can lead us to the right way ;)

@greg0ire
Contributor

greg0ire commented Aug 6, 2020

You can fix such issues with cursor based pagination: https://use-the-index-luke.com/no-offset
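
For illustration, a keyset ("seek") version in Doctrine terms (entity, key field, and page size are assumptions): instead of an OFFSET, the query filters on the last seen key and keeps a stable order.

// Keyset pagination sketch: no OFFSET, filter on the last seen id instead.
$dql = 'SELECT u FROM App\Entity\User u WHERE u.id > :lastId ORDER BY u.id ASC';
$rows = $em->createQuery($dql)
    ->setParameter('lastId', $lastId) // id of the last row of the previous page
    ->setMaxResults(1000)
    ->getResult();
// Repeat with the new last id until a page comes back with fewer than 1000 rows.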

@kirya-dev
Author

kirya-dev commented Aug 9, 2020

So:

  1. Validate: the initial query must not have an offset or a limit.
  2. Validate: the initial query must have an ORDER BY.
  3. Add a limit (PAGE_SIZE) to the query.
  4. Extract the seek field from the ORDER BY.
  5. Fetch the first page.
  6. Do the export.

If count(results) = PAGE_SIZE:

  7. Add the seek clause (see the sketch after this list). If the query already contains a WHERE clause, we need to wrap it in parentheses and AND the seek clause onto it. (Avoid repeating this step.)
  8. Extract the last seek value from the previous results (from the last row).
  9. Fetch the next results.
  10. Go to 6.
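
A sketch of the seek clause from step 7, assuming the source is built from a Doctrine QueryBuilder (the bundle's iterator actually receives a Query, so this is illustrative): andWhere() already wraps any existing WHERE part before ANDing the new clause, which covers the parenthesis concern.

$qb->andWhere('e.id > :seek')             // seek field taken from the ORDER BY
   ->setParameter('seek', $lastSeekValue) // last value seen on the previous page
   ->setMaxResults(self::PAGE_SIZE);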

@VincentLanglet
Member

Why 1? Can't you use the max result and the offset provided?

Why 2? Can't you add an order if needed?

@kirya-dev
Author

Why 1? Can't you use the max result and the offset provided?

The extra WHERE in step 7 breaks the offset and limit.

Why 2? Can't you add an order if needed?

We must order the results for this method to work correctly. Yes, we can add an ORDER BY if needed. But the problem is: on which field? For example, an id column may not exist. (Or should this field name be provided in the constructor?)

@VincentLanglet
Member

@kirya-dev do you want to start a PR?

I think it would be easier to discuss this with code and some tests.

@nowaja

nowaja commented Feb 2, 2021

@kirya-dev Hi, I tried to use your solution with the latest version of the exporter, but the export still crashes on memory (even with 500 MB of memory and only 15k records). Could you please point me in the right direction? Thanks in advance.

@kirya-dev
Author

kirya-dev commented Feb 3, 2021

@kirya-dev Hi, I tried to use your solution with the latest version of the exporter, but the export still crashes on memory (even with 500 MB of memory and only 15k records). Could you please point me in the right direction? Thanks in advance.

Hi! Please ensure that the decorators are enabled and that you are using the custom source iterator.

This may also help: https://www.php.net/manual/en/function.flush.php
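
One caveat from that manual page, as a sketch: if output buffering is active (e.g. an ob_start() somewhere in the stack), flush() alone may not reach the client; the user-level buffers have to be flushed first.

// Flush user-level output buffers, then the system buffer:
while (ob_get_level() > 0) {
    ob_end_flush();
}
flush();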

@VincentLanglet
Member

Do you still plan to make a PR to optimize the export, @kirya-dev?

I'm getting "Error: Maximum execution time of ... seconds exceeded" when exporting via Excel,
so I'm looking for a way to fix this.

@kirya-dev
Author

Hello! Solving this problem is not a simple task.
A good idea that was suggested is to use a database cursor.
Doctrine doesn't support this functionality officially, but I found a package that fills the gap: https://github.com/paysera/lib-pagination
What do you think about it?

@VincentLanglet
Member

https://github.com/paysera/lib-pagination
What do you think about it?

We won't add a dependency on a low-maintenance library, but

You can fix such issues with cursor based pagination: https://use-the-index-luke.com/no-offset

cursor-based pagination was @greg0ire's advice, so we could try something similar.

@kirya-dev
Author

kirya-dev commented May 18, 2021

Implementations can differ between database platforms; we would have to implement this feature for every popular platform. It's a lot of work.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 14, 2021
@phansys phansys added keep and removed stale labels Nov 14, 2021