
[BUG] Slow scan operations #1260

Open
Josefnm opened this issue Sep 7, 2021 · 16 comments · May be fixed by #1673

Comments

@Josefnm

Josefnm commented Sep 7, 2021

Summary:

I'm using Dynamoose for my project, and overall it works well. However, scan operations are several times slower than the same operations using the AWS DocumentClient.

Code sample:

Schema

It's the same with all my different schemas. Here is an example of the simplest one. None of them use the Buffer type.

const userSchema = new dynamoose.Schema(
  {
    id: {
      type: String,
      required: true,
      hashKey: true,
    },
    phoneNumber: {
      type: String,
      index: {
        project: true,
        global: true,
        name: 'UserPhoneNumberIndex',
      },
    },
    name: String,
    status: String,
    email: String,
    // a few more fields with strings, numbers and booleans
  },
  { timestamps: true }
)

Model

const model = dynamoose.model('myTable', mySchema, {
    create: false,
    waitForActive: false,
  })

General

// dynamoose
MyModel.scan().all().exec()

// DocumentClient
export const scanTable = async <T = unknown>(
  table: string,
  extraParams?: Partial<AWS.DynamoDB.DocumentClient.ScanInput>
): Promise<T[]> => {
  const params: AWS.DynamoDB.DocumentClient.ScanInput = {
    TableName: `${constants.ENV}-${table}`,
    ...extraParams,
  }

  const scanResults = []
  let response: PromiseResult<
    AWS.DynamoDB.DocumentClient.ScanOutput,
    AWS.AWSError
  >
  do {
    // eslint-disable-next-line no-await-in-loop
    response = await getDocumentClient().scan(params).promise()
    scanResults.push(...response.Items)
    params.ExclusiveStartKey = response.LastEvaluatedKey
  } while (typeof response.LastEvaluatedKey !== 'undefined')

  return <T[]>scanResults
}

Current output and behavior (including stack trace):

Example 1:
Scanning ~450 items that are ~1 KB each (according to the AWS DynamoDB console).
Using the AWS DocumentClient and scanning all: ~400 ms
Using Dynamoose scan all: ~4500 ms

Example 2:
Scanning ~23,000 items that are ~230 bytes each.
Using the AWS DocumentClient and scanning all: ~6000 ms
Using Dynamoose scan all: ~24000 ms

Expected output and behavior:

Somewhat similar performance

Environment:

Operating System: Amazon Linux
Operating System Version: 2
Node.js version (node -v): v14.x
NPM version (npm -v): 7.18.1
Dynamoose version: 2.8.1

Other information (if applicable):

AWS lambda using the Serverless framework.

Serverless version: 2.48.0
aws sdk version: 2.952.0
AWS_NODEJS_CONNECTION_REUSE_ENABLED: 1

Other:

  • [x] I have read through the Dynamoose documentation before posting this issue
  • [x] I have searched through the GitHub issues (including closed issues) and pull requests to ensure this issue has not already been raised before
  • [x] I have searched the internet and Stack Overflow to ensure this issue hasn't been raised or answered before
  • [x] I have tested the code provided and am confident it doesn't work as intended
  • [x] I have filled out all fields above
  • [x] I am running the latest version of Dynamoose
@fishcharlie
Member

Curious about where this is occurring.

You mention this is related to scan operations. Are you confident that is the culprit?

I'm aware of this PR which caused high initialization times for Dynamoose: #1154. Could that potentially be the culprit? You could try the v3 alpha to see if that PR improves it.

I'm also aware of a change made recently to remove a cache that was causing memory leaks. That was related to parsing the document.

There is no doubt that Dynamoose has more overhead than the AWS SDK itself; more work is required to ensure schema conformance. However, the performance hit shouldn't be as excessive as it looks to be in your example.

@FanFataL

Hi,

I've got the same problem using Dynamoose on AWS Lambda; on localhost (Ubuntu) it works perfectly.

After some investigation, I have found what causes the issue: this line

const array: any = (await Promise.all(result.Items.map(async (item) => await new this.internalSettings.model.Document(item, {"type": "fromDynamo"}).conformToSchema({"customTypesDynamo": true, "checkExpiredItem": true, "saveUnknown": true, "modifiers": ["get"], "type": "fromDynamo"})))).filter((a) => Boolean(a));

Processing 150 records took 10s (sic!)

@fishcharlie
Member

@FanFataL Sadly, that line is critical and doesn't really tell us what is going wrong.

We kinda need to dive into that method and determine which part specifically is slowing things down.

As mentioned in my previous message, v3 alpha does make some improvements to performance. Are you able to try that and see if the issue persists?

@benhegarty

We were experiencing the same performance issues with a large query request. After upgrading to v3 alpha, performance is now on-par with the native client.

@fishcharlie
Member

Closing since the latest report is that v3 alpha is on-par with native client.

@PaulAtST

Now that I see this issue I am going to give v3 a try, but since I did debug this on v2, here are my findings in case anyone else runs into this.

I noticed that as my data set grew, my API call was taking significantly longer to return the data set. For reference, I am now returning about 2,000 items at a time from DynamoDB, and it was taking about 7 seconds on my local computer. Unfortunately the latency is on the main thread, which locks up my server from processing other requests.

First I found the line @FanFataL pointed out, which in my case was taking about 5 of the 7 seconds:

const array: any = (await Promise.all(result.Items.map(async (item) => await new this.internalSettings.model.Document(item, {"type": "fromDynamo"}).conformToSchema({"customTypesDynamo": true, "checkExpiredItem": true, "saveUnknown": true, "modifiers": ["get"], "type": "fromDynamo"})))).filter((a) => Boolean(a)); 

Next the slowdown was coming from:

.conformToSchema({ "customTypesDynamo": true, "checkExpiredItem": true, "saveUnknown": true, "modifiers": ["get"], "type": "fromDynamo" })

Inside that function it was coming from the line:

const expectedObject = await Document.objectFromSchema(document, document.model, settings)

Inside that function, the time seems to mostly be coming from this block, which takes about 4 seconds:

if (settings.modifiers) {
    ....
}

With this line taking a good second by itself:

const modifierFunction = await schema.getAttributeSettingValue(modifier, key, { "returnFunction": true, typeIndexOptionMap });

@PaulAtST

PaulAtST commented Jun 29, 2022

@fishcharlie Unfortunately I am seeing the same issue with v3, but now it's this line:

const expectedObject = await Item.objectFromSchema(item, item.getInternalProperties(internalProperties).model, settings);

Speed-wise, v3 is about 0.5-1.0 sec slower than v2 for me, unfortunately. Is there a way to optimize querying for a list of objects? I am concerned this is not going to scale up to what I need for my use case.


EDIT: So I dug a bit deeper into the v3 if (settings.modifiers) { code block.

In my case the lines

utils_1.default.object.set(returnObject, key, await modifierFunction(value, oldValue));
and
returnObject = await schemaModifier(returnObject);

never get hit, so my understanding is that this whole code block is not helping my use case. Would it be possible to add an optimization so this code block doesn't need to run at all?

Also it looks like the logic inside the promise

 await Promise.all((await Item.attributesWithSchema(returnObject, model)).map(async (key) => {

gets calculated for each item in the result set. Would it be possible to add a cache of the results for the model being queried, so we don't have to recalculate it for each item in the result set?
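A per-model cache along these lines could be sketched as below. This is only an illustration with hypothetical names: `computeAttributes` stands in for Dynamoose's internal `Item.attributesWithSchema`, and the sketch assumes the attribute list depends only on the model, which is not strictly true once saveUnknown lets attributes vary per item.

```javascript
// Sketch of a per-model memo for the schema attribute list, so the
// expensive schema walk runs once per query instead of once per item.
// `computeAttributes` is a hypothetical stand-in for the internal
// Item.attributesWithSchema; a WeakMap keys the cache on the model
// object so entries are garbage-collected along with the model.
const attributeCache = new WeakMap();

function cachedAttributes(model, computeAttributes) {
  if (!attributeCache.has(model)) {
    attributeCache.set(model, computeAttributes(model));
  }
  return attributeCache.get(model);
}

// Simulated result-set loop: the walk runs only for the first item.
let walks = 0;
const model = { name: "User" };
const computeAttributes = () => {
  walks += 1;
  return ["id", "name", "email"];
};
let attrs;
for (let i = 0; i < 1000; i++) {
  attrs = cachedAttributes(model, computeAttributes);
}
```

One caveat: the cache would need invalidation if a schema can change after the model is created.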

@fishcharlie
Member

Reopening based on @PaulAtST's latest findings. Thanks for all your work to pinpoint this @PaulAtST.

@fishcharlie fishcharlie reopened this Jul 5, 2022
@PaulAtST

PaulAtST commented Jul 6, 2022

@fishcharlie You're welcome.

I did some more digging into performance profiles for v3. Another slow spot I have identified is the function deep_copy(obj) inside the file deep_copy.js. In particular, it's the call objectUtils.isCircular(obj, attr) which is slowing it down significantly. In my case, removing that check saves me about 1.5 seconds.

Is it possible for a response from DynamoDB to be circular? If not, a possible solution could be adding a parameter to deep_copy that skips the circular check when copying a response from DynamoDB.

Also, it looks like the line "originalObject": utils_1.default.deep_copy(itemObject) gets called for each item in the response. This could be another place to optimize with an item-level cache.

I will keep digging and see what else I find.

@github-actions
Contributor

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 3 days.

@jpblair

jpblair commented Sep 2, 2022

I'm seeing what I think is similar behavior for at least GetItem operations, possibly others. Even when retrieving a single item, the deserialization process via documentify() can take on the order of seconds, depending on how complex/deeply nested the object is. By contrast, a serialization/deserialization round trip using JSON.parse(JSON.stringify(obj)) only takes 2 ms. I only did some rough profiling, but it seems to point to some of the same culprits that @PaulAtST identified:

  • await Document.attributesWithSchema(returnObject, model).map(key => {
  • await model.schemaForObject(returnObject)
  • The settings.modifiers block

Here is my schema:

const schema = new dynamoose.Schema(
  {
    revision_id: {
      type: String,
      required: true,
      hashKey: true,
      default: ArcUUID.create
    },
    collection_id: {
      type: String,
      required: true,
      rangeKey: true
    },
    parent_id: {
      type: String
    },
    org_id: {
      type: String,
      required: true
    },
    document: {
      type: Object,
      schema: {
        type: {
          type: String
        },
        headlines: {
          type: Object
        },
        description: {
          type: Object
        },
        content_aliases: {
          type: [String]
        },
        canonical_website: {
          type: String
        },
        websites: {
          type: Set,
          schema: [String]
        },
        query_backfill: {
          type: String
        },
        pinned_count: {
          type: Number,
          default: 0
        },
        content_elements: {
          type: Array,
          schema: [
            {
              type: Object,
              schema: {
                referent: {
                  type: Object,
                  schema: {
                    id: String,
                    type: String,
                    referent_properties: {
                      type: Object,
                      schema: {
                        headlines: {
                          type: Object
                        },
                        description: {
                          type: Object
                        },
                        promo_items: {
                          type: Object
                        }
                      }
                    }
                  }
                },
                type: String
              }
            }
          ]
        }
      }
    }
  },
  {
    timestamps: {
      createdAt: 'created_date',
      updatedAt: 'last_updated_date'
    },
    saveUnknown: true,
    useDocumentTypes: true,
    useNativeBooleans: true
  }
)

And attached is the sample document I'm trying to retrieve/deserialize: sample.txt

This is a pretty major hindrance to our application, so please let me know of any mitigations/workarounds/alternatives/etc. Otherwise, I'd be interested to know the timeline for fixing this and whether there's anything I can do. Please let me know if a new ticket would be helpful. Thanks!

@ptejada
Contributor

ptejada commented Sep 4, 2022

@jpblair

I know it might defeat the purpose of using a schema, but can you share the stats for getting your document in these two ways?

const document = await Model.get({...})

// VS

const request = await Model.get({...}, {return: 'request'})
const document = await dynamoose.aws.ddb().getItem(request)

@tareqx2

tareqx2 commented Sep 6, 2022

Having similar performance issues when performing a model.query for ~20k items.
Using a Dynamoose query takes around 20 seconds, whereas a regular DynamoDB query takes 2 seconds.

@jpblair

jpblair commented Sep 8, 2022

@ptejada I tried this, and bypassing Dynamoose seems to shave off 600-700ms.

@tranhl

tranhl commented Apr 14, 2024

@fishcharlie I can confirm that deep_copy is taking a significant performance toll when querying large data sets and complex objects. I have provided a repository with a small reproducible example using DynamoDB Local:

https://github.com/tranhl/dynamooose-deep-copy-repro

As @PaulAtST identified in this comment, deep_copy is the primary culprit, due to the objectUtils.isCircular call. While deep_copy is copying the source object, it calls isCircular to check whether encountered objects contain circular references. This is essentially an O(n^2) operation, as the isCircular function also traverses the object to determine whether circular references exist.

Instead of relying on isCircular to detect and omit circular references, we should instead handle this ourselves within deep_copy, so that we only traverse the input object once. The provided reproduction repository contains a patched version of Dynamoose that does this, with the profiling results speaking for themselves.
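The single-traversal idea can be sketched like this. It's an illustration of the approach rather than the actual patch in the linked repository, and it ignores special values (Date, Buffer, Set) that a real implementation has to handle: a WeakSet tracks the objects on the current recursion path, so circular references are detected during the one copying pass instead of by a separate isCircular walk per node.

```javascript
// Sketch: deep copy with inline cycle detection. `seen` holds the
// objects on the current recursion path; hitting one again means a
// circular reference, which is dropped (mirroring skip-on-circular
// behavior) instead of recursing forever. Each node is visited once,
// so the copy is O(n) rather than O(n^2).
function deepCopy(obj, seen = new WeakSet()) {
  if (obj === null || typeof obj !== "object") {
    return obj;
  }
  if (seen.has(obj)) {
    return undefined; // circular reference: omit it
  }
  seen.add(obj);
  const copy = Array.isArray(obj) ? [] : {};
  for (const key of Object.keys(obj)) {
    copy[key] = deepCopy(obj[key], seen);
  }
  // Remove from the path set so the same object may legitimately
  // appear again on a sibling branch (shared, but not circular).
  seen.delete(obj);
  return copy;
}

const item = { id: "1", nested: { a: [1, 2, { b: true }] } };
item.self = item; // introduce a cycle
const copied = deepCopy(item);
```

Deleting from the WeakSet on the way back up is what distinguishes a true cycle (an ancestor on the current path) from a harmless shared reference appearing twice in the tree.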

@tranhl tranhl linked a pull request Apr 14, 2024 that will close this issue
@PaulAtST

@tranhl Nice work! Sounds like a solid approach to me.
