
#5325 Store and show way ids #5968

Open · wants to merge 11 commits into base: master
Conversation

nnseva

@nnseva nnseva commented Feb 22, 2021

Issue #5325

  • Features:
    • API:
      • Fixed item type for the nodes section of the annotations flatbuffers output format

@akashihi
Contributor

@nnseva Thank you for your contribution :) Let's wait for more opinions before merging.

I'm also curious: have you estimated what the dataset size change is?

@nnseva
Author

nnseva commented Feb 24, 2021

@akashihi

I'm also curious: have you estimated what the dataset size change is?

The unpacked dataset size should increase by 8*N bytes, where N is the number of indexes in the segments file. The segments file will contain an additional /osm_ways section with the array of way IDs corresponding to the segments array.

I haven't estimated the real packed dataset file size.

Member

@mjjbell mjjbell left a comment


LGTM, some minor code comments.

Not sure if it matters for your use-case, but you'll currently get duplicates returned in the annotation if the way can't be fully compressed (e.g. one of the nodes is an intersection).

For example:

    Scenario: way annotations
        Given the query options
            | annotations | true |

        Given the node map
            """
            a--b--c
               |
               d
            """

        And the ways
            | nodes |
            | abc   |
            | bd    |


        When I route I should get
          | from | to | route    | 
          |    a |  c | abc,abc  |

The response will contain:

"ways": [ 5, 5 ],

Deduplication could be performed client-side as an alternative.

With regards to data size, you can get an idea of segment count by looking at the R-tree segs column from this spreadsheet, as it should be almost identical.

For some of the bigger map extracts, unpacked size will be:

  • Germany ~250MB
  • North America ~2.3 GB
  • Europe ~2.5 GB

I don't have numbers on hand for how much RAM is currently needed for routing on those extracts; it's probably adding a few percent?

@mjjbell
Member

mjjbell commented Feb 24, 2021

Not sure if it matters for your use-case, but you'll currently get duplicates returned in the annotation if the way can't be fully compressed (e.g. one of the nodes is an intersection).

In fact, we'll get this even if the way is fully compressed. If a way contains N edges, we'll get that way returned N times in a row in the annotation.

@nnseva
Author

nnseva commented Feb 25, 2021

@mjjbell

"ways": [ 5, 5 ],

Sure, every pair of sequentially neighboring nodes has a separate member in the ways array of the annotations:

    ...
    "nodes": [1, 2, 3],
    "ways": [5, 5]
    ...

(meaning that nodes 1, 2, and 3 all belong to way 5 and are connected sequentially).

This duplication makes it possible to recognize the path structure without additional analysis of the OSM data. The sign of the returned way ID plays a similar role: positive means the path segment direction matches the way direction, while negative means the path direction is opposite to the way direction. For example, the response for a request in the opposite direction over the same OSM data might look like:

   ...
   "nodes": [3, 2, 1],
   "ways": [-5, -5]
   ...

I should also note that the current code always returns annotations containing all nodes along the found path, without any compression.

I've only added way IDs (with a direction sign) between every sequential pair of nodes in the returned node sequence.

@jcoupey

jcoupey commented Feb 25, 2021

I don't have numbers on hand for how much RAM is currently needed for routing on those extracts, it's probably adding a few %?

I just did an extract/contract pass on the latest germany pbf using v5.24.0 and the default car profile with excludables disabled. osrm-routed then uses about 3.3G of RAM at startup, so that sounds like a ~7.5% increase in RAM usage in that case.

Given the impact here and the fact that RAM consumption has always been a touchy subject (often a bottleneck when using OSRM), maybe we could further investigate the impact? I can report some figures on a bigger extract such as Europe when the preprocessing is done.

Maybe a silly question (I did not look up the implementation): would it be possible to make this change optional? There are definitely scenarios where this is super handy, but in other situations it would be just a pain to require extra RAM for an unused feature.

UPDATE 2: RAM usage for osrm-routed on europe-210208 with the same setup is 26.2G for master and 29G for this PR, so it's more than a 10% increase in memory requirements. Quite significant.

@mjjbell
Member

mjjbell commented Feb 26, 2021

UPDATE 2: RAM usage for osrm-routed on europe-210208 with the same setup is 26.2G for master and 29G for this PR, so it's more than a 10% increase in memory requirements. Quite significant.

Given this, I'm inclined to agree that this should be stored as a separate file that can be optionally loaded.

@nnseva
Author

nnseva commented Mar 8, 2021

@jcoupey @mjjbell

UPDATE 2: RAM usage for osrm-routed on europe-210208 with the same setup is 26.2G for master and 29G for this PR, so it's more than a 10% increase in memory requirements. Quite significant.

I should point out that the memory used for the additional segment data (the way IDs vector) is actually part of the memory-mapped storage (see https://github.com/Project-OSRM/osrm-backend/pull/5968/files#diff-ac92f7a34eb9d5b8b41e6a930db0f2e7356af695724842b390032f1e10842dc2R138). I'm not sure we can rely on memory consumption measured this way.

@eveyeti

eveyeti commented Mar 14, 2021

Hi @nnseva,

I tried to create a Docker image with your PR, but when it compiles I always get the same error.

c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[2]: *** [CMakeFiles/EXTRACTOR.dir/build.make:323: CMakeFiles/EXTRACTOR.dir/src/extractor/scripting_environment_lua.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:717: CMakeFiles/EXTRACTOR.dir/all] Error 2
make: *** [Makefile:152: all] Error 2

How can I fix the compilation error?

CreatingDockerImage.log

Regards
Etienne

@nnseva
Author

nnseva commented Mar 15, 2021

Hi @eveyeti,

c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

I don't know exactly, but usually such a problem means there is not enough memory to compile. Try increasing the memory and/or reducing the number of processor cores available to Docker in your Dockerfile.

Compare your Dockerfile against the OSRM master branch.

@nnseva
Author

nnseva commented Mar 17, 2021

@jcoupey @mjjbell

I've implemented skipping OSM way IDs in memory.

The extractor can now skip storing OSM way IDs; the optional --skip-osm-ways parameter of the extractor turns this on.

If the OSM way IDs section is skipped, the router will skip showing them in annotations.

The corresponding data section is not filled (a 0-size vector is created instead) and doesn't consume memory.

@frodrigo
Member

Since this feature requires more memory than before and is not required in the main use case, wouldn't it be preferable to have an option to enable it rather than one to disable it?

@nnseva
Author

nnseva commented Mar 17, 2021

@frodrigo ,

Since this feature requires more memory than before and is not required in the main use case, wouldn't it be preferable to have an option to enable it rather than one to disable it?

  1. Notice that the list of OSM node IDs in the annotation is also not required in the main use case, yet it is always on.
  2. The annotation section list is determined by enumeration values at compile time. It would be undesirable behavior if one of the sections never appeared by default.

@mjjbell
Member

mjjbell commented Mar 19, 2021

I agree that this should be opt-in rather than opt-out (that was the initial proposal in #5325).
I expect most users will use the default processing setup, so we should optimise for that and give advanced users the option to customise.

Notice that the list of OSM node IDs in the annotation is also not required in the main use case, yet it is always on.

That's a fair point. However, there is interest in an alternative file loading mechanism that reduces memory footprint without a performance penalty (#5974, #5879, #5838). In which case, I can see OSM nodes also being an optionally loaded dataset in the not too distant future.

@jcoupey

jcoupey commented Mar 20, 2021

there is interest in an alternative file loading mechanism that reduces memory footprint without a performance penalty

That's a very interesting prospect indeed, which goes beyond the case of this PR. @mjjbell as you pointed out there are already a few tickets about various specific aspects of this question, maybe we should open a meta-ticket on the core principle of a new loading mechanism?

@eveyeti

eveyeti commented Mar 20, 2021

Hi @eveyeti,

c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

I don't know exactly, but usually such a problem means there is not enough memory to compile. Try increasing the memory and/or reducing the number of processor cores available to Docker in your Dockerfile.

Compare your Dockerfile against the OSRM master branch.

Hi @nnseva, thanks a lot, your advice worked!

Regards
Etienne.-

@nnseva
Author

nnseva commented Mar 24, 2021

@jcoupey

there is interest in an alternative file loading mechanism that reduces memory footprint without a performance penalty

That's a very interesting prospect indeed, which goes beyond the case of this PR. @mjjbell as you pointed out there are already a few tickets about various specific aspects of this question, maybe we should open a meta-ticket on the core principle of a new loading mechanism?

I'd like to point out that there are open-ended future requirements for annotating nodes and segments along the found route.

Looking ahead, it might be appropriate to have a customizable annotations feature, which could influence memory consumption. This feature should allow customizing not only the values of annotations (as is available now) but also their keys.

Such a customizable annotation feature should allow configuring the set of available annotation keys, as well as the procedures used to get values for those annotations. Such a procedure could be a scripting hook called during the extraction phase, or just a special open data file format associating source OSM IDs (for nodes) and OSM ID pairs (for segments) with annotated values.

With such a feature, operations staff would have options to control memory consumption by enabling only the annotation keys necessary for the deployed OSRM instance.

@nnseva
Author

nnseva commented Mar 24, 2021

I agree that this should be opt-in rather than opt-out (that was the initial proposal in #5325).
I expect most users will use the default processing setup, so we should optimise for that and give advanced users the option to customise.

Please note that the annotation key set is defined at compile time:

    enum class AnnotationsType
    {
        None = 0,
        Duration = 0x01,
        Nodes = 0x02,
        Distance = 0x04,
        Weight = 0x08,
        Datasources = 0x10,
        Speed = 0x20,
        Ways = 0x40,
        All = Duration | Nodes | Distance | Weight | Datasources | Speed | Ways
    };

An unsophisticated user relying on the default options will probably expect all available annotation keys, including ways, to be shown when all annotations are requested (the annotations=true request parameter). Why would the "ways" section (and only that one!) not appear? Optimization is not a good explanation for such a missing section, from my point of view.

@jcoupey

jcoupey commented Mar 24, 2021

Why would the "ways" section (and only that one!) not appear?

That would indeed be somewhat inconsistent. The remarks about opt-in rather reflect a status-quo position resulting from a concern about RAM requirements for standard usage. Of course you're right that the same could (should?) apply to other datasets, but we have to deal with things as they are now, and none of the rest is optional.

I think the core question here is whether we have the ability to add a generic mechanism to avoid loading a range of datasets. If this can be done realistically mid-term, we can probably merge this and live with the temporary increase until we provide opt-out mechanisms.

Probably others with more knowledge of the codebase and implications can comment on that last part?

@mjjbell
Member

mjjbell commented Mar 24, 2021

An unsophisticated user relying on the default options will probably expect all available annotation keys, including ways, to be shown when all annotations are requested (the annotations=true request parameter). Why would the "ways" section (and only that one!) not appear? Optimization is not a good explanation for such a missing section, from my point of view.

@nnseva I see your point. I was thinking it would have to return an error indicating the annotation was disabled. But if this means we're building a quasi-optional data loader, then maybe it's better to design it properly to work for all optional datasets.

I think the core question here is whether we have the ability to add a generic mechanism to avoid loading a range of datasets. If this can be done realistically mid-term, we can probably merge this and live with the temporary increase until we provide opt-out mechanisms.

Probably others with more knowledge of the codebase and implications can comment on that last part?

There is an existing ticket (#5218) that outlines the steps required. It was a while ago, but @danpat, maybe you have a sense of the effort required? I don't think the data loading code has changed much since then.

    @@ -494,7 +494,7 @@ class RouteAPI : public BaseAPI
                return anno.datasource;
            });
        }
    -   std::vector<uint32_t> nodes;
    +   std::vector<uint64_t> nodes;
Contributor


Should this really be uint64 here? In Flatbuffers they are defined as uint32.

Edit:
I have to correct myself, it was redefined to uint64. Why?


7 participants