Issue when accessing objects by index with latest Mongo versions #2748

Open
maudetes opened this issue Mar 31, 2023 · 7 comments

Comments

@maudetes

maudetes commented Mar 31, 2023

Hello!
We recently upgraded to the latest mongo, pymongo and mongoengine versions.
We now sometimes get duplicate results when paginating with slices or indexes on a queryset that has an order_by clause.
FYI, we use the pagination from flask-mongoengine, which uses slices to build page items.

Accessing objects by slice (or index) appears to be inconsistent when the order_by clause is applied to a non-existent (or sometimes empty) field.

Here is a small script that shows the index iteration issue by counting how many times each object is returned while iterating.

from collections import Counter
from mongoengine import connect, Document, StringField

connect('tumblelog')


class User(Document):
    email = StringField(required=True)


# Clean previous users if any and create new ones
for user in User.objects():
    user.delete()
assert User.objects.count() == 0
for i in range(20):
    User(email=f'{i}@mail.com').save()

# Test iteration with a for loop
# It returns each result only once
users = User.objects.order_by('foo')
c = Counter()
for user in users:
    c.update([user.email])
print(f'Iterate with for loop: {c.most_common()}')

# Test iteration with index
# It returns some results more than once
users = User.objects.order_by('foo')
c = Counter()
for i in range(users.count()):
    c.update([users[i].email])
print(f'Iterate using index: {c.most_common()}')

The results show that, with index access, some elements are returned multiple times (and others not at all) instead of each being returned exactly once, as in the for loop case.

Iterate with for loop: [('0@mail.com', 1), ('1@mail.com', 1), ('2@mail.com', 1), ('3@mail.com', 1), ('4@mail.com', 1), ('5@mail.com', 1), ('6@mail.com', 1), ('7@mail.com', 1), ('8@mail.com', 1), ('9@mail.com', 1), ('10@mail.com', 1), ('11@mail.com', 1), ('12@mail.com', 1), ('13@mail.com', 1), ('14@mail.com', 1), ('15@mail.com', 1), ('16@mail.com', 1), ('17@mail.com', 1), ('18@mail.com', 1), ('19@mail.com', 1)]
Iterate using index: [('6@mail.com', 7), ('14@mail.com', 6), ('2@mail.com', 3), ('0@mail.com', 1), ('1@mail.com', 1), ('5@mail.com', 1), ('13@mail.com', 1)]
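
To make this kind of output easier to read, the counts can be reduced to "which items came back more than once" and "which items never came back". The following is only a sketch: `summarize_pagination` is a hypothetical helper name, not part of mongoengine or pymongo, and the `returned` list is rebuilt by hand from the index-access output above.

```python
from collections import Counter


def summarize_pagination(returned, expected):
    """Return (duplicates, missing) for a sequence of paginated results.

    Hypothetical helper for illustration only.
    """
    counts = Counter(returned)
    # Items that were returned more than once, with their counts.
    duplicates = {key: n for key, n in counts.items() if n > 1}
    # Items that should have been returned but never were.
    missing = sorted(set(expected) - set(counts))
    return duplicates, missing


expected = [f"{i}@mail.com" for i in range(20)]
# Rebuilt from the "Iterate using index" output quoted above.
returned = (
    ["6@mail.com"] * 7
    + ["14@mail.com"] * 6
    + ["2@mail.com"] * 3
    + ["0@mail.com", "1@mail.com", "5@mail.com", "13@mail.com"]
)

duplicates, missing = summarize_pagination(returned, expected)
print(duplicates)
print(len(missing))
```

With the output above, 20 index accesses returned only 7 distinct users, so 13 users were never returned.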

Tested with mongoengine 0.20.0 and 0.27.0.
The inconsistent pagination results seem to appear only since Mongo 4.4.


Thank you for your support.
Please let me know if this is not a mongoengine issue or if I'm using something incorrectly.

@bagerard
Collaborator

bagerard commented Apr 1, 2023

Weird indeed. I'll have to look into it, but in the meantime, note that you can cast the queryset to a list as a workaround:

users = list(User.objects.order_by('foo'))
c = Counter()
for i in range(len(users)):
    print(i, users[i].email)
    c.update([users[i].email])
print(f'Iterate using index: {c.most_common()}')

@bagerard
Collaborator

Just looked into this: the issue is not in MongoEngine, it is in PyMongo. For index access, MongoEngine relies on the PyMongo cursor behavior.

In this particular snippet you are sorting by a field that doesn't exist on the documents, and this seems to be causing the inconsistencies; see below.

image

It's unusual to iterate over a cursor by index. I understand the behavior is odd, but usually you just fire the query, use skip & limit, and then iterate over the results.
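
As a rough illustration of the "skip & limit, then iterate" pattern, here is a minimal sketch. The names `page_bounds` and `fetch_page` are hypothetical helpers, and `collection` is assumed to be a PyMongo collection; only `find`, `sort`, `skip` and `limit` are actual PyMongo calls. It issues one query per page rather than one per item, but does not by itself change how documents with equal sort values are ordered.

```python
def page_bounds(page, per_page):
    """Translate a 1-based page number into (skip, limit) values."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    return (page - 1) * per_page, per_page


def fetch_page(collection, sort_spec, page, per_page):
    """Fetch one page with a single query, then iterate over its results.

    Assumptions: `collection` is a pymongo Collection and `sort_spec` is a
    list of (field, direction) pairs such as [("email", 1)].
    """
    skip, limit = page_bounds(page, per_page)
    cursor = collection.find({}).sort(sort_spec).skip(skip).limit(limit)
    return list(cursor)


print(page_bounds(1, 10))
print(page_bounds(3, 5))
```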

@ShaneHarvey Could you elaborate on this? Is it an expected behavior?

@bagerard
Collaborator

bagerard commented Mar 2, 2024

@ShaneHarvey do you have any idea / is it a known behavior?

@ShaneHarvey
Contributor

ShaneHarvey commented Mar 4, 2024

I agree that this looks like a bug in the server but it's possible it could be a known behavior change. I've reported it here and am waiting for their response: https://jira.mongodb.org/browse/SERVER-87430

Here's my repro using only pymongo, not mongoengine:

from pymongo import MongoClient

client = MongoClient()
coll = client.test.test
version = client.server_info()['version']
print(f'MongoDB version: {version}')

coll.drop()
coll.insert_many([{"_id": i} for i in range(20)])

print('Find docs with a single query:')
print([doc["_id"] for doc in coll.find(sort={'missing': 1})])

print('Find docs with the same query with skip+limit:')
docs = []
for i in range(20):
    docs.append(coll.find_one(sort={'missing': 1}, skip=i))
print([doc["_id"] for doc in docs])

print('Find docs with the aggregation with skip+limit:')
docs = []
for i in range(20):
    docs.append(list(coll.aggregate([{"$sort": {'missing': 1}}, {"$skip": i}, {"$limit": 1}]))[0])
print([doc["_id"] for doc in docs])

On MongoDB <4.4:

$ python repro2748.py
MongoDB version: 4.2.24
Find docs with a single query:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Find docs with the same query with skip+limit:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Find docs with the aggregation with skip+limit:
[0, 1, 1, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 15, 15, 15, 15, 15]

On MongoDB 4.4+:

$ python repro2748.py
MongoDB version: 4.4.19
Find docs with a single query:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Find docs with the same query with skip+limit:
[0, 1, 1, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 15, 15, 15, 15, 15]
Find docs with the aggregation with skip+limit:
[0, 1, 1, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 15, 15, 15, 15, 15]

@ShaneHarvey
Contributor

I've done a little more investigation and found that, unfortunately, this was an intentional change in MongoDB 4.4 (see SERVER-51498). The behavior is documented here:

MongoDB does not store documents in a collection in a particular order. When sorting on a field which contains duplicate values, documents containing those values may be returned in any order.

If consistent sort order is desired, include at least one field in your sort that contains unique values. The easiest way to guarantee this is to include the _id field in your sort query.

https://www.mongodb.com/docs/manual/reference/operator/aggregation/sort/#sort-consistency
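
To see why "any order" for tied values can turn into duplicates under skip+limit, here is a toy model in pure Python. It is only a sketch and does not reproduce the server's actual sort algorithm: each call to the hypothetical `find_one_with_skip` stands in for one independent query, and a random shuffle stands in for "ties may be returned in any order".

```python
import random


def sort_key(doc, fields):
    """Build a comparable key; a missing field sorts before any present value."""
    return tuple((f in doc, doc.get(f)) for f in fields)


def find_one_with_skip(docs, fields, skip, rng):
    """Toy stand-in for one independent sorted query with skip=N, limit=1."""
    shuffled = list(docs)
    rng.shuffle(shuffled)  # models: tied documents may come back in any order
    ordered = sorted(shuffled, key=lambda d: sort_key(d, fields))
    return ordered[skip]


docs = [{"_id": i} for i in range(20)]
rng = random.Random(0)

# Every document ties on 'missing', so each query may order them differently.
tied = [find_one_with_skip(docs, ["missing"], i, rng)["_id"] for i in range(20)]

# Adding the unique _id field removes the ties, so every query agrees.
unique = [find_one_with_skip(docs, ["missing", "_id"], i, rng)["_id"] for i in range(20)]

print("sort on 'missing' only:", tied)
print("sort on 'missing', '_id':", unique)
```

In this model each individual query returns a valid sort order, yet stitching together results from separate queries is only reliable once the sort key is unique.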

Taking the advice in the docs you would need to add "_id" to the sort, like this:

users = User.objects.order_by('foo', '_id')
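
For code that builds sort specifications dynamically (for example in a pagination layer), the same advice could be applied in one place. This is a sketch under assumptions: `with_id_tiebreaker` is a hypothetical helper, and the sort specification is assumed to be a PyMongo-style list of (field, direction) pairs where 1 means ascending.

```python
def with_id_tiebreaker(sort_spec):
    """Return a copy of sort_spec that ends with a unique field.

    Hypothetical helper: appends ("_id", 1) unless _id is already
    part of the sort, so that tied documents get a consistent order.
    """
    spec = list(sort_spec)
    if not any(field == "_id" for field, _direction in spec):
        spec.append(("_id", 1))
    return spec


print(with_id_tiebreaker([("foo", 1)]))
print(with_id_tiebreaker([("foo", 1), ("_id", -1)]))
```

The resulting list can be passed wherever a list of (field, direction) pairs is accepted as a sort specification.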

@maudetes
Author

Oh indeed! Thanks a lot for taking the time to look into this issue!
Adding _id seems like the way to go when sorting on fields with duplicate (or empty) values, then.

Maybe it would be worth adding a note to the MongoEngine queryset docs, but otherwise I think my issue has been answered.

Thank you both again.

@bagerard
Collaborator

bagerard commented Mar 15, 2024

I may be misunderstanding, but in this case the issue is not that the order of the returned documents is unpredictable when there are duplicate values in the sorted field (unless you are saying this may occur within the same cursor instance).

The same cursor instance returns the same documents multiple times when we use index access on it, i.e. cursor[j] (and the collection isn't being altered while we iterate).

Indeed, sorting on _id will fix it, but the current user experience is quite unexpected.

(It's rather odd to use index access, so I'm not necessarily worried or looking for a fix, but I wanted to make sure we were aligned on the observation.)
