Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove MongoJack and consolidate MongoDB utils #837

Open
wants to merge 44 commits into
base: dev
Choose a base branch
from

Conversation

trevorgerhardt
Copy link
Member

@trevorgerhardt trevorgerhardt commented Nov 8, 2022

A process that started many years ago, this removes the MongoJack dependency and completes the incremental replacement of the Persistence module and its collections with the AnalysisDB and AnalysisCollection types.

In light of a possible database switch in the near future, it seemed more straightforward to continue to use String based IDs everywhere, instead of switching to a MongoDB specific ObjectID type.

Our average collection already uses String IDs, but this will require a migration for aggregation areas, data sources, and data groups.

Other notes:

  • Switched from arrays to Lists in MongoDB model classes.
  • Added a DbResultWriter that sets completed: true on regional analyses when they are complete.
  • Made two OSMCache methods static (cleanId and getKey) so that the BundleController no longer requires it as a component dependency.
  • Updated the MongoDB driver from v3.11.0 to v4.7.2. MongoDB currently provides documentation for v4.3 to v4.8.
  • Removed unused HTTP endpoints (and their handlers) in the AggregationAreaController, BundleController, DataSourceController, and RegionalAnalysisController.
  • Refactored BundleController's created method, breaking it up into smaller steps.
  • Refactored RegionalAnalysisController's create method. Some benefits:
    • The FileStorage and AnalysisDB components no longer need to be injected through the Broker to the MultiOriginAssembler. Result writers are created in the controller with what they need.
    • Moved the RegionalTask creation into the AnalysisRequest object, where much of the logic already lived. More refactoring can certainly be done here, but it is a good first step.
    • Removed any usage of the RegionalTask stored on a RegionalAnalysis. This means that we only need to serialize it into MongoDB, we don't need to handle deserializing it.

This will resolve #758.

A process that started many years ago, this removes the MongoJack dependency and migrates the `Persistence` module and its collections to the `AnalysisDB` and `AnalysisCollection` types that were set to replace it.

In light of a possible database switch in the near future, it seemed more straightforward to continue to use `String` based IDs everywhere, instead of switching to a MongoDB specific `ObjectID` type.

Our average collection already uses `String` IDs, but this will require a migration for aggregation areas, data sources, and data groups.
@trevorgerhardt trevorgerhardt added cleanup dependencies Pull requests that update a dependency file t0 Time level 0: think hours labels Nov 8, 2022
@trevorgerhardt trevorgerhardt changed the title Remove MongoJack and consolidate utils Remove MongoJack and consolidate MongoDB utils Nov 10, 2022
@trevorgerhardt trevorgerhardt marked this pull request as ready for review November 10, 2022 02:34
@trevorgerhardt trevorgerhardt enabled auto-merge (squash) November 10, 2022 02:35
@trevorgerhardt trevorgerhardt marked this pull request as draft November 11, 2022 22:46
auto-merge was automatically disabled November 11, 2022 22:46

Pull request was converted to draft

Copy link
Member

@ansoncfit ansoncfit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to consolidate database interactions to a single, better supported MongoDB driver (farewell, MongoJack cursors). For context, I believe R5 historically used MongoJack, and analysis-server used the MongoDB driver, and they coexisted in this repo after combining the two projects. Edit: maybe both used MongoJack, and our intention was to consolidate but we left it at only at the first step until now.

I directly edited the initial comment in this PR to clarify a couple points (let me know if I mischaracterized anything). In-line comments/suggestions are below. Other comments based on a review and our discussion:

Do the changes of arrays to Lists, Strings to Sets, etc. require any migrations (e.g. of presets)? Edit: yes, this would be a breaking change: stored requests will need migration, and API users will need to update their requests. Note, we have a general principle of preferring one-time migrations to added conditionals/complexity in code.

It looks like this PR removes the custom (de)serializer for ModeSet but keeps them for Transit and Leg. Is it worth keeping the former writing a new AnalysisRequest custom (de)serializer to avoid the need to switch EnumSets to Sets, or more generally using translation classes between pure JSON representation from the database and our internal uses? It feels a bit strange to me to have a PR nominally about database drivers touching core routing code (NetworkPreloader, TravelTimeComputer, PerTargetPropagater etc.)

This PR makes wider use of BsonDiscriminator: when deserializing from Mongo to a base class, this annotation indicates how to tell which subclass to use. A discussion suggested we previously implemented this approach for decay functions, but it looks like this PR adds annotations to the decay functions too, and changes codec registration for decay functions?

If part of the motivation for this PR is enabling an eventual shift away from MongoDB, should we avoid adding MongoDB specific classes such as

  • DeleteResult
  • GeoJSON Position?

Comment on lines -152 to -153
// Legacy system for storing Java objects, this functionality is now provided by the MongoDB driver itself.
implementation 'org.mongojack:mongojack:2.10.1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋

Comment on lines 86 to 90
new GtfsController(database, gtfsCache),
new BundleController(database, fileStorage, gtfsCache, taskScheduler),
new OpportunityDatasetController(fileStorage, taskScheduler, censusExtractor, database),
new RegionalAnalysisController(broker, fileStorage),
new RegionalAnalysisController(broker, database, fileStorage),
new AggregationAreaController(fileStorage, database, taskScheduler),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use a convention for consistent ordering here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is no obvious order, I usually just go for alphabetical.

Any ideas for another convention?

if (findJob(templateTask.jobId) != null) {
LOG.error("Someone tried to enqueue job {} but it already exists.", templateTask.jobId);
throw new RuntimeException("Enqueued duplicate job " + templateTask.jobId);
public synchronized void enqueueTasksForRegionalJob(Job job, MultiOriginAssembler assembler) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any synchronization concerns we should keep in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I can think of here. This method was already synchronized.

* TODO Why is all this detail added after the Persistence call?
* We don't want to store all the details added below in Mongo?
*/
private RegionalTask templateTaskFromRegionalAnalysis (RegionalAnalysis regionalAnalysis) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +72 to +73
var job = new Job(templateTask, WorkerTags.fromRegionalAnalysis(regionalAnalysis));
var assembler = new MultiOriginAssembler(job, new ArrayList<>());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update our style guide re: use of var keyword?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Although I haven't looked at our style guide in ages. My short opinion / addition to the style guide is: prefer using var for all non-primitive types except in cases where we want to explicitly show a type.

In the examples above, writing the types instead of var is redundant information on the same line: Job, and MultiOriginAssembler are already there.

A case where we might want to be more explicit, is when a method returns a value and we want to distinguish that type in relation to neighboring types.

trevorgerhardt and others added 2 commits December 7, 2022 04:10
…egationAreaDerivation.java

Co-authored-by: Anson Stewart <astewart@conveyal.com>
Co-authored-by: Anson Stewart <astewart@conveyal.com>
@trevorgerhardt trevorgerhardt marked this pull request as ready for review December 9, 2022 09:40
@trevorgerhardt trevorgerhardt marked this pull request as draft December 15, 2022 13:45
Make `GTFSCache.getFileKey` static to enable `BundleController` to operate without depending on the `GTFSCache` component.
Note: this endpoint is currently unused.
@trevorgerhardt trevorgerhardt marked this pull request as ready for review December 30, 2022 11:16
@trevorgerhardt trevorgerhardt enabled auto-merge (squash) January 21, 2023 06:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup dependencies Pull requests that update a dependency file t0 Time level 0: think hours
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ObjectId vs String _id consistency
2 participants