
[BUG] Slow scan operations #1260

Open
Josefnm opened this issue Sep 7, 2021 · 16 comments · May be fixed by #1673

Comments

@Josefnm

Josefnm commented Sep 7, 2021

Summary:

I'm using Dynamoose for my project, and overall it works well. However, scan operations are several times slower than the same operations using the AWS DocumentClient.

Code sample:

Schema

It's the same with all my different schemas. Here is an example of the simplest one. None of them use the Buffer type.

const userSchema = new dynamoose.Schema(
  {
    id: {
      type: String,
      required: true,
      hashKey: true,
    },
    phoneNumber: {
      type: String,
      index: {
        project: true,
        global: true,
        name: 'UserPhoneNumberIndex',
      },
    },
    name: String,
    status: String,
    email: String,
    // a few more fields with strings, numbers and booleans
  },
  { timestamps: true }
)

Model

const model = dynamoose.model('myTable', mySchema, {
    create: false,
    waitForActive: false,
  })

General

// dynamoose
MyModel.scan().all().exec()

// DocumentClient
export const scanTable = async <T = unknown>(
  table: string,
  extraParams?: Partial<AWS.DynamoDB.DocumentClient.ScanInput>
): Promise<T[]> => {
  const params: AWS.DynamoDB.DocumentClient.ScanInput = {
    TableName: `${constants.ENV}-${table}`,
    ...extraParams,
  }

  const scanResults = []
  let response: PromiseResult<
    AWS.DynamoDB.DocumentClient.ScanOutput,
    AWS.AWSError
  >
  do {
    // eslint-disable-next-line no-await-in-loop
    response = await getDocumentClient().scan(params).promise()
    scanResults.push(...response.Items)
    params.ExclusiveStartKey = response.LastEvaluatedKey
  } while (typeof response.LastEvaluatedKey !== 'undefined')

  return <T[]>scanResults
}

Current output and behavior (including stack trace):

Example 1:
Scanning ~450 items that are ~1 KB each (according to the AWS DynamoDB console).
Using the AWS DocumentClient and scanning all: ~400 ms
Using Dynamoose scan all: ~4500 ms

Example 2:
Scanning ~23,000 items that are ~230 bytes each.
Using the AWS DocumentClient and scanning all: ~6000 ms
Using Dynamoose scan all: ~24000 ms

Expected output and behavior:

Somewhat similar performance

Environment:

Operating System: Amazon Linux
Operating System Version: 2
Node.js version (node -v): v14.x
NPM version (npm -v): 7.18.1
Dynamoose version: 2.8.1

Other information (if applicable):

AWS lambda using the Serverless framework.

Serverless version: 2.48.0
aws sdk version: 2.952.0
AWS_NODEJS_CONNECTION_REUSE_ENABLED: 1

Other:

  • [x] I have read through the Dynamoose documentation before posting this issue
  • [x] I have searched through the GitHub issues (including closed issues) and pull requests to ensure this issue has not already been raised before
  • [x] I have searched the internet and Stack Overflow to ensure this issue hasn't been raised or answered before
  • [x] I have tested the code provided and am confident it doesn't work as intended
  • [x] I have filled out all fields above
  • [x] I am running the latest version of Dynamoose
@fishcharlie
Member

Curious about where this is occurring.

You mention this is related to scan operations. Are you confident that is the culprit?

I'm aware of this PR which caused high initialization times for Dynamoose: #1154. Could that potentially be the culprit? You could try the v3 alpha to see if that PR improves it.

I'm also aware of a change made recently to remove a cache that was causing memory leaks. That was related to parsing the document.

There is no doubt that Dynamoose has more overhead than the AWS SDK itself; more work is required to ensure schema conformance. However, the performance hit shouldn't be as excessive as it looks to be in your example.

@FanFataL

Hi,

I've got the same problem using Dynamoose on AWS Lambda; on localhost (Ubuntu) it works perfectly.

After some investigation, I have found what causes the issue: this line

const array: any = (await Promise.all(result.Items.map(async (item) => await new this.internalSettings.model.Document(item, {"type": "fromDynamo"}).conformToSchema({"customTypesDynamo": true, "checkExpiredItem": true, "saveUnknown": true, "modifiers": ["get"], "type": "fromDynamo"})))).filter((a) => Boolean(a));

Processing 150 records took 10s (sic!)

@fishcharlie
Member

@FanFataL Sadly, that line is critical and doesn't really tell us what is going wrong.

We kinda need to dive into that method and determine which part specifically is slowing things down.

As mentioned in my previous message, v3 alpha does make some improvements to performance. Are you able to try that and see if the issue persists?

@benhegarty

We were experiencing the same performance issues with a large query request. After upgrading to v3 alpha, performance is now on-par with the native client.

@fishcharlie
Member

Closing since the latest report is that v3 alpha is on-par with native client.

@PaulAtST

Now that I see this issue I am going to give v3 a try, but since I did debug this on v2, here are my findings in case anyone else runs into this.

I noticed that as my data set grew, my API call was taking significantly longer to return the data set. For reference, I am now returning about 2,000 items at a time from DynamoDB, and it was taking about 7 seconds on my local computer. Unfortunately the latency is on the main thread, which locks up my server from processing other requests.

First I found the line @FanFataL pointed out, which in my case was taking about 5 of the 7 seconds:

const array: any = (await Promise.all(result.Items.map(async (item) => await new this.internalSettings.model.Document(item, {"type": "fromDynamo"}).conformToSchema({"customTypesDynamo": true, "checkExpiredItem": true, "saveUnknown": true, "modifiers": ["get"], "type": "fromDynamo"})))).filter((a) => Boolean(a)); 

Next the slowdown was coming from:

.conformToSchema({ "customTypesDynamo": true, "checkExpiredItem": true, "saveUnknown": true, "modifiers": ["get"], "type": "fromDynamo" })

Inside that function it was coming from the line:

const expectedObject = await Document.objectFromSchema(document, document.model, settings)

Inside that function, the time seems to mostly be coming from this block, which takes about 4 seconds:

if (settings.modifiers) {
    ....
}

With this line taking a good second by itself:

const modifierFunction = await schema.getAttributeSettingValue(modifier, key, { "returnFunction": true, typeIndexOptionMap });

@PaulAtST

PaulAtST commented Jun 29, 2022

@fishcharlie Unfortunately I am seeing the same issue with v3, but now it's this line:

const expectedObject = await Item.objectFromSchema(item, item.getInternalProperties(internalProperties).model, settings);

Speed-wise, v3 is about 0.5-1.0 sec slower than v2 for me, unfortunately. Is there a way to optimize querying for a list of objects? I am concerned this is not going to scale up to what I need for my use case.


EDIT: So I dug a bit deeper into the v3 if (settings.modifiers) { code block.

In my case the lines

utils_1.default.object.set(returnObject, key, await modifierFunction(value, oldValue));
and
returnObject = await schemaModifier(returnObject);

never get hit, so my understanding is that this whole code block is not helping my use case. Would it be possible to add an optimization so this code block doesn't need to run at all?

Also it looks like the logic inside the promise

 await Promise.all((await Item.attributesWithSchema(returnObject, model)).map(async (key) => {

gets calculated for each item in the result set. Would it be possible to add a cache of the results for the model being queried, so we don't have to recalculate it for each item in the result set?
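A per-model cache along these lines could be sketched as below. This is only an illustration with hypothetical names: `computeAttributes` stands in for Dynamoose's internal `Item.attributesWithSchema`, and the sketch assumes the attribute list depends only on the model, which is not strictly true once saveUnknown lets attributes vary per item.

```javascript
// Sketch of a per-model memo for the schema attribute list, so the
// expensive schema walk runs once per query instead of once per item.
// `computeAttributes` is a hypothetical stand-in for the internal
// Item.attributesWithSchema; a WeakMap keys the cache on the model
// object so entries are garbage-collected along with the model.
const attributeCache = new WeakMap();

function cachedAttributes(model, computeAttributes) {
  if (!attributeCache.has(model)) {
    attributeCache.set(model, computeAttributes(model));
  }
  return attributeCache.get(model);
}

// Simulated result-set loop: the walk runs only for the first item.
let walks = 0;
const model = { name: "User" };
const computeAttributes = () => {
  walks += 1;
  return ["id", "name", "email"];
};
let attrs;
for (let i = 0; i < 1000; i++) {
  attrs = cachedAttributes(model, computeAttributes);
}
```

One caveat: the cache would need invalidation if a schema can change after the model is created.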

@fishcharlie
Member

Reopening based on @PaulAtST's latest findings. Thanks for all your work to pinpoint this @PaulAtST.

@fishcharlie fishcharlie reopened this Jul 5, 2022
@PaulAtST

PaulAtST commented Jul 6, 2022

@fishcharlie You're welcome.

I did some more digging into performance profiles for v3. Another slow spot I have identified is the function deep_copy(obj) inside the file deep_copy.js. In particular, it's the call objectUtils.isCircular(obj, attr) which is slowing it down significantly. In my case, removing that check saves me about 1.5 seconds.

Is it possible for a response from DynamoDB to be circular? If not, a possible solution could be adding a parameter to deep_copy that skips the circular check when copying a response from DynamoDB.

Also, it looks like the line "originalObject": utils_1.default.deep_copy(itemObject) gets called for each item in the response. This could be another place to optimize with an item-level cache.

I will keep digging and see what else I find.

@github-actions
Contributor

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@jpblair

jpblair commented Sep 2, 2022

I'm seeing what I think is similar behavior for at least GetItem operations, possibly others. Even when retrieving a single item, the deserialization process via documentify() can take on the order of seconds, depending on how complex/deeply nested the object is. By contrast, a serialization/deserialization round trip using JSON.parse(JSON.stringify(obj)) only takes 2 ms. I only did some rough profiling, but it seems to point to some of the same culprits that @PaulAtST identified:

  • await Document.attributesWithSchema(returnObject, model).map(key => {
  • await model.schemaForObject(returnObject)
  • The settings.modifiers block

Here is my schema:

const schema = new dynamoose.Schema(
  {
    revision_id: {
      type: String,
      required: true,
      hashKey: true,
      default: ArcUUID.create
    },
    collection_id: {
      type: String,
      required: true,
      rangeKey: true
    },
    parent_id: {
      type: String
    },
    org_id: {
      type: String,
      required: true
    },
    document: {
      type: Object,
      schema: {
        type: {
          type: String
        },
        headlines: {
          type: Object
        },
        description: {
          type: Object
        },
        content_aliases: {
          type: [String]
        },
        canonical_website: {
          type: String
        },
        websites: {
          type: Set,
          schema: [String]
        },
        query_backfill: {
          type: String
        },
        pinned_count: {
          type: Number,
          default: 0
        },
        content_elements: {
          type: Array,
          schema: [
            {
              type: Object,
              schema: {
                referent: {
                  type: Object,
                  schema: {
                    id: String,
                    type: String,
                    referent_properties: {
                      type: Object,
                      schema: {
                        headlines: {
                          type: Object
                        },
                        description: {
                          type: Object
                        },
                        promo_items: {
                          type: Object
                        }
                      }
                    }
                  }
                },
                type: String
              }
            }
          ]
        }
      }
    }
  },
  {
    timestamps: {
      createdAt: 'created_date',
      updatedAt: 'last_updated_date'
    },
    saveUnknown: true,
    useDocumentTypes: true,
    useNativeBooleans: true
  }
)

And attached is the sample document I'm trying to retrieve/deserialize: sample.txt

This is a pretty major hindrance to our application, so please let me know of any mitigations/workarounds/alternatives/etc. Otherwise, I'd be interested to know the timeline for fixing this and whether there's anything I can do. Please let me know if a new ticket would be helpful. Thanks!

@ptejada
Contributor

ptejada commented Sep 4, 2022

@jpblair

I know it might defeat the purpose of using a schema, but can you share the stats for getting your document in these two ways?

const document = await Model.get({...})

// VS

const request = await Model.get({...}, {return: 'request'})
const document = await dynamoose.aws.ddb().getItem(request)

@tareqx2

tareqx2 commented Sep 6, 2022

Having similar performance issues when performing a model.query for ~20k items.
Using a Dynamoose query takes around 20 seconds, whereas a regular DynamoDB query takes 2 seconds.

@jpblair

jpblair commented Sep 8, 2022

@ptejada I tried this, and bypassing Dynamoose seems to shave off 600-700ms.

@tranhl

tranhl commented Apr 14, 2024

@fishcharlie I can confirm that deep_copy is taking a significant performance toll when querying large data sets and complex objects. I have provided a repository with a small reproducible example using DynamoDB Local:

https://github.com/tranhl/dynamooose-deep-copy-repro

As @PaulAtST identified in this comment, deep_copy is the primary culprit, due to the objectUtils.isCircular call. While deep_copy is copying the source object, it calls isCircular to check whether encountered objects contain circular references. This is essentially an O(n^2) operation, as the isCircular function also traverses the object to determine whether circular references exist.

Instead of relying on isCircular to detect and omit circular references, we should instead handle this ourselves within deep_copy, so that we only traverse the input object once. The provided reproduction repository contains a patched version of Dynamoose that does this, with the profiling results speaking for themselves.
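The single-traversal idea can be sketched like this. It's an illustration of the approach rather than the actual patch in the linked repository, and it ignores special values (Date, Buffer, Set) that a real implementation has to handle: a WeakSet tracks the objects on the current recursion path, so circular references are detected during the one copying pass instead of by a separate isCircular walk per node.

```javascript
// Sketch: deep copy with inline cycle detection. `seen` holds the
// objects on the current recursion path; hitting one again means a
// circular reference, which is dropped (mirroring skip-on-circular
// behavior) instead of recursing forever. Each node is visited once,
// so the copy is O(n) rather than O(n^2).
function deepCopy(obj, seen = new WeakSet()) {
  if (obj === null || typeof obj !== "object") {
    return obj;
  }
  if (seen.has(obj)) {
    return undefined; // circular reference: omit it
  }
  seen.add(obj);
  const copy = Array.isArray(obj) ? [] : {};
  for (const key of Object.keys(obj)) {
    copy[key] = deepCopy(obj[key], seen);
  }
  // Remove from the path set so the same object may legitimately
  // appear again on a sibling branch (shared, but not circular).
  seen.delete(obj);
  return copy;
}

const item = { id: "1", nested: { a: [1, 2, { b: true }] } };
item.self = item; // introduce a cycle
const copied = deepCopy(item);
```

Deleting from the WeakSet on the way back up is what distinguishes a true cycle (an ancestor on the current path) from a harmless shared reference appearing twice in the tree.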

@tranhl tranhl linked a pull request Apr 14, 2024 that will close this issue
@PaulAtST

@tranhl Nice work! Sounds like a solid approach to me.
