
Standardize bulk 1 and 2 APIs exposing "streams" to bulk2.query #1317

Open · wants to merge 6 commits into base: 2.0
Conversation

@AllanOricil commented Mar 23, 2023

BEFORE

SOQL queries that return a huge number of rows could throw FATAL ERROR ... JavaScript heap out of memory, because bulk2.query stores all records in memory.


AFTER

Query results can now be retrieved in batches whose size equals the value passed to maxRecords.

Note: there isn't a real stream from Salesforce to jsforce, because Bulk API v2 does not offer one. These changes just push result pages into a stream so that the v2 API matches v1's.

Note: page results are not kept in memory unless the developer chooses to, by concatenating the results from every page.

import fs from 'fs';
import path from 'path';
import { pipeline } from 'stream/promises';
import { Connection } from 'jsforce';

try {
  // CONFIG and PATH are placeholders for your connection options and output file path
  const conn = new Connection(CONFIG);

  // the default value for maxRecords is 10000
  const queryJob = await conn.bulk2.query(`SELECT Id, Name FROM Account`);

  // each page of records is pushed to the readable stream as it arrives
  const readStream = queryJob.stream();
  const writeStream = fs.createWriteStream(path.resolve(PATH));

  await pipeline(readStream, writeStream);
} catch (e) {
  throw new Error('Something went wrong');
}

Or, if developers want to gather all records in memory before doing something with them:

import { Connection } from 'jsforce';

// collects every batch emitted by the stream into a single array
function getRecords(readStream) {
  return new Promise((resolve, reject) => {
    let records = [];
    readStream
      .on('data', (data) => (records = records.concat(data)))
      .on('error', (error) => reject(error))
      .on('close', () => resolve(records));
  });
}

try {
  const conn = new Connection(CONFIG);
  const queryJob = await conn.bulk2.query(`SELECT Id, Name FROM Account`);
  const readStream = queryJob.stream();
  const records = await getRecords(readStream);
} catch (e) {
  throw new Error('Something went wrong');
}

@AllanOricil AllanOricil marked this pull request as draft March 23, 2023 04:14
@AllanOricil AllanOricil marked this pull request as ready for review March 23, 2023 04:15
…now the data is piped to a Readable stream. Change the way maxRecords is converted to string
- create getRecords to remove code repetition
- rename tests with a better description of what they do
… a query param in the request url

- add a counter for the number of batches
- rename the counter that counts the number of records from numberOfRecordsProcessed to numberRecordsRetrieved

private async *getResults(): AsyncGenerator<Record[], void, unknown> {
  while (this.locator !== 'null') {
    const nextResults = await this.request<Record[]>({
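For readers unfamiliar with the pattern, here is a minimal sketch (not the PR's actual code) of how an async generator of record pages can be exposed as a Readable stream; fetchPage and its return shape are hypothetical placeholders:

import { Readable } from 'stream';

// placeholder for a request that fetches one page of results from the Bulk API v2 job
async function fetchPage(locator) {
  // ...an HTTP call using `locator` as the results locator would go here
  return { records: [], nextLocator: undefined };
}

// hypothetical generator: yields one page of records at a time, accumulating nothing
async function* pages() {
  let locator;
  do {
    const { records, nextLocator } = await fetchPage(locator);
    yield records;
    locator = nextLocator;
  } while (locator && locator !== 'null');
}

// Readable.from turns the generator into an object-mode stream;
// each yielded page becomes one 'data' chunk that consumers can pipe or collect.
const readStream = Readable.from(pages());
readStream.on('data', (page) => console.log(`received ${page.length} records`));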
Contributor

I would rather you name this method something different, like getResultsGenerator, and still have a public getResults() method that does:

return new Promise((resolve) => {
  let records = [];
  this.stream()
    .on('data', (data) => {
      records = records.concat(data);
    })
    .on('end', () => {
      resolve(records);
    });
});

That way everyone who is used to the getResults method can still access it from the query job and doesn't have to write it themselves. Migrating from v1 to v2 would be easier, as it would only require changing bulk.query() to bulk2.query().getResults().
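For illustration, a short sketch of how the two styles could look if the proposed public getResults() wrapper were added (the wrapper is only a suggestion at this point, and CONFIG is a placeholder):

import { Connection } from 'jsforce';

const conn = new Connection(CONFIG);
const queryJob = await conn.bulk2.query('SELECT Id, Name FROM Account');

// Option A: stream the pages, as this PR enables
// queryJob.stream().on('data', (page) => { /* handle one page of records */ });

// Option B: collect everything in memory via the proposed wrapper
const allRecords = await queryJob.getResults();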

Contributor

Keep in mind, I am no longer a maintainer on this project, so I would leave it up to @stomita or @mshanemc

Author

I removed the getRecords() method because bulk v1 does not have it, and because I wanted to make v2's API equal to v1's. Another good reason for not creating this method is that it could cause Fatal error: JavaScript heap out of memory if used to query huge data sets without the LIMIT keyword.

Member

@AllanOricil Even though jsforce v2 is still in beta, there are a lot of projects using it, so I would prefer not to break the current bulk v2 implementation. What do you think about deprecating this and adding an option to the query job, then making that the default at v2 GA?

Maybe a third arg to BulkV2.query that makes it return the job instead of the results could work, defaulting to false.
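In code, the suggestion might look something like this (the flag position and default are hypothetical; conn and CONFIG are placeholders as in the examples above):

import { Connection } from 'jsforce';

const conn = new Connection(CONFIG);

// current/default behavior (flag omitted or false): resolves with all the records
const records = await conn.bulk2.query('SELECT Id, Name FROM Account');

// opt-in behavior: resolves with the query job so callers can stream pages instead
const queryJob = await conn.bulk2.query('SELECT Id, Name FROM Account', undefined, true);
const readStream = queryJob.stream();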

Author

Np. I'm going to add this arg 👍

@AllanOricil changed the title: add streams to bulk2.query → add "streams" to bulk2.query (Apr 29, 2023)
@AllanOricil reopened this Jul 28, 2023
@AllanOricil changed the title: add "streams" to bulk2.query → Standardize bulk 1 and 2 APIs exposing "streams" to bulk2.query (Nov 13, 2023)
@cristiand391 (Member)

Hey @AllanOricil, I opened a PR a few days ago to fix this in v3:
#1397

At first I was about to merge your PR and start from there, but after playing a bit with the record stream system in jsforce I decided to start from scratch (we had some planned breaking changes in bulk2 too).

After it gets merged we'll start using jsforce v3 in the CLI, but this issue will not be solved because data query --bulk --json includes the records in the JSON output, so it still needs to collect them all in memory. Thanks for the help!

@AllanOricil (Author) commented Feb 15, 2024

If you are going to load the whole file in memory, do some additional checks to avoid unnecessary processing.

Something like this:

  1. Get the file's fstats to determine its size.
  2. Get the maximum memory available to the runtime.
  3. Build a threshold from these two numbers, plus a margin for error.
  4. Use all of that to throw an Error before running the job, e.g.
     if (!canProcess(...)) throw new Error("Sorry, you can't use this file because of XYZ. Please do something like bla and then try again.")

This way users won't have to wait for memory to fill up before they see an error (see the sketch below).
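A minimal sketch of such a pre-flight check; canProcess, the safety factor, and the use of V8 heap statistics are assumptions for illustration, not code from this PR or jsforce, and PATH is a placeholder:

import fs from 'fs';
import v8 from 'v8';

// hypothetical helper: returns false when the file is unlikely to fit in the heap
function canProcess(filePath, safetyFactor = 0.9) {
  const fileSize = fs.statSync(filePath).size; // 1. file size from its stats
  const { heap_size_limit, used_heap_size } = v8.getHeapStatistics(); // 2. memory available to the runtime
  const available = (heap_size_limit - used_heap_size) * safetyFactor; // 3. threshold with a margin for error
  return fileSize <= available;
}

// 4. fail fast, before running the job
if (!canProcess(PATH)) {
  throw new Error(
    'This file is too large to load in memory. Reduce it (e.g. split it) and try again.',
  );
}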
