New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove MongoJack and consolidate MongoDB utils #837
base: dev
Are you sure you want to change the base?
Conversation
A process that started many years ago, this removes the MongoJack dependency and migrates the `Persistence` module and its collections to the `AnalysisDB` and `AnalysisCollection` types that were set to replace it. In light of a possible database switch in the near future, it seemed more straightforward to continue to use `String` based IDs everywhere, instead of switching to a MongoDB specific `ObjectID` type. Our average collection already uses `String` IDs, but this will require a migration for aggregation areas, data sources, and data groups.
Pull request was converted to draft
- Prefer `Set` vs `EnumSet` (BSON parser can handle `Set` by default) - Prefer `List`s over arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great to consolidate database interactions to a single, better supported MongoDB driver (farewell, MongoJack cursors). For context, I believe R5 historically used MongoJack, and analysis-server used the MongoDB driver, and they coexisted in this repo after combining the two projects. Edit: maybe both used MongoJack, and our intention was to consolidate but we left it at only at the first step until now.
I directly edited the initial comment in this PR to clarify a couple points (let me know if I mischaracterized anything). In-line comments/suggestions are below. Other comments based on a review and our discussion:
Do the changes of arrays to Lists, Strings to Sets, etc. require any migrations (e.g. of presets)? Edit: yes, this would be a breaking change: stored requests will need migration, and API users will need to update their requests. Note, we have a general principle of preferring one-time migrations to added conditionals/complexity in code.
It looks like this PR removes the custom (de)serializer for ModeSet but keeps them for Transit and Leg. Is it worth keeping the former writing a new AnalysisRequest custom (de)serializer to avoid the need to switch EnumSets to Sets, or more generally using translation classes between pure JSON representation from the database and our internal uses? It feels a bit strange to me to have a PR nominally about database drivers touching core routing code (NetworkPreloader, TravelTimeComputer, PerTargetPropagater etc.)
This PR makes wider use of BsonDiscriminator
: when deserializing from Mongo to a base class, this annotation indicates how to tell which subclass to use. A discussion suggested we previously implemented this approach for decay functions, but it looks like this PR adds annotations to the decay functions too, and changes codec registration for decay functions?
If part of the motivation for this PR is enabling an eventual shift away from MongoDB, should we avoid adding MongoDB specific classes such as
DeleteResult
- GeoJSON
Position
?
// Legacy system for storing Java objects, this functionality is now provided by the MongoDB driver itself. | ||
implementation 'org.mongojack:mongojack:2.10.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋
new GtfsController(database, gtfsCache), | ||
new BundleController(database, fileStorage, gtfsCache, taskScheduler), | ||
new OpportunityDatasetController(fileStorage, taskScheduler, censusExtractor, database), | ||
new RegionalAnalysisController(broker, fileStorage), | ||
new RegionalAnalysisController(broker, database, fileStorage), | ||
new AggregationAreaController(fileStorage, database, taskScheduler), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use a convention for consistent ordering here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When there is no obvious order, I usually just go for alphabetical.
Any ideas for another convention?
if (findJob(templateTask.jobId) != null) { | ||
LOG.error("Someone tried to enqueue job {} but it already exists.", templateTask.jobId); | ||
throw new RuntimeException("Enqueued duplicate job " + templateTask.jobId); | ||
public synchronized void enqueueTasksForRegionalJob(Job job, MultiOriginAssembler assembler) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any synchronization concerns we should keep in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I can think of here. This method was already synchronized
.
* TODO Why is all this detail added after the Persistence call? | ||
* We don't want to store all the details added below in Mongo? | ||
*/ | ||
private RegionalTask templateTaskFromRegionalAnalysis (RegionalAnalysis regionalAnalysis) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced by Job.templateTaskFromRegionalTask
(https://github.com/conveyal/r5/pull/837/files#diff-cb2f7bd76fde9cde3a8702c32be5ae84d08f7b789a0da852588e6f374da86895R160)
Note that scenario.json is now created and saved to S3 at RegionalAnalysisController.storeScenarioJson
( https://github.com/conveyal/r5/pull/837/files#diff-675769f1c1f75dae94398599a44ff5db4cb90ab443f717d1c2d81a30caa23c24R425)
var job = new Job(templateTask, WorkerTags.fromRegionalAnalysis(regionalAnalysis)); | ||
var assembler = new MultiOriginAssembler(job, new ArrayList<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update our style guide re: use of var
keyword?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Although I haven't looked at our style guide in ages. My short opinion / addition to the style guide is: prefer using var
for all non-primitive types except in cases where we want to explicitly show a type.
In the examples above, writing the types instead of var
is redundant information on the same line: Job
, and MultiOriginAssembler
are already there.
A case where we might want to be more explicit, is when a method returns a value and we want to distinguish that type in relation to neighboring types.
src/main/java/com/conveyal/analysis/controllers/RegionalAnalysisController.java
Show resolved
Hide resolved
src/main/java/com/conveyal/analysis/controllers/BundleController.java
Outdated
Show resolved
Hide resolved
src/main/java/com/conveyal/analysis/controllers/RegionalAnalysisController.java
Outdated
Show resolved
Hide resolved
src/main/java/com/conveyal/analysis/controllers/RegionalAnalysisController.java
Outdated
Show resolved
Hide resolved
src/main/java/com/conveyal/r5/analyst/cluster/TravelTimeSurfaceTask.java
Outdated
Show resolved
Hide resolved
…egationAreaDerivation.java Co-authored-by: Anson Stewart <astewart@conveyal.com>
Co-authored-by: Anson Stewart <astewart@conveyal.com>
…isController.java Co-authored-by: Anson Stewart <astewart@conveyal.com>
…isController.java Co-authored-by: Anson Stewart <astewart@conveyal.com>
…isController.java Co-authored-by: Anson Stewart <astewart@conveyal.com>
…o remove-mongojack
Simplify the creation of MongoDB codecs by using annotations instead.
Fixes an Intellij warning about unchecked generics.
Include the bundle ID and scenario ID so that the request object does not need to be fetched and deserialized.
Make `GTFSCache.getFileKey` static to enable `BundleController` to operate without depending on the `GTFSCache` component.
Note: this endpoint is currently unused.
A process that started many years ago, this removes the MongoJack dependency and completes the incremental replacement of the
Persistence
module and its collections with theAnalysisDB
andAnalysisCollection
types.In light of a possible database switch in the near future, it seemed more straightforward to continue to use
String
based IDs everywhere, instead of switching to a MongoDB specificObjectID
type.Our average collection already uses
String
IDs, but this will require a migration for aggregation areas, data sources, and data groups.Other notes:
List
s in MongoDB model classes.DbResultWriter
that setscompleted: true
on regional analyses when they are complete.OSMCache
methods static (cleanId
andgetKey
) so that theBundleController
no longer requires it as a component dependency.AggregationAreaController
,BundleController
,DataSourceController
, andRegionalAnalysisController
.BundleController
's created method, breaking it up into smaller steps.RegionalAnalysisController
's create method. Some benefits:FileStorage
andAnalysisDB
components no longer need to be injected through theBroker
to theMultiOriginAssembler
. Result writers are created in the controller with what they need.RegionalTask
creation into theAnalysisRequest
object, where much of the logic already lived. More refactoring can certainly be done here, but it is a good first step.RegionalTask
stored on aRegionalAnalysis
. This means that we only need to serialize it into MongoDB, we don't need to handle deserializing it.This will resolve #758.