Updated port names should break existing tests #142

Open
myxie opened this issue Apr 19, 2022 · 6 comments

Comments

@myxie
Collaborator

myxie commented Apr 19, 2022

The change to 'port names' introduced in 01d11f4 should have required updates to the test cases, yet the existing test cases still pass. This points to a defect in the coupling and test coverage of the translator.

Previously, translating from a Logical Graph (LG) to a Physical Graph Template (PGT) produced:

    "producers": [
      "1_-13_0/3/1/1"
    ],
    "consumers": [
      "1_-19_0/3/1/1"
    ]

Now the approach is:

"producers": 
    [{"1_-13_0/0/0/0": "event"}],
"consumers": 
    [{"1_-19_0/0/0/0": "event"}]

This change appears to pass all tests in the translator, but only because test_pg_gen.py does not test the output of the translation. In fact, most of the tests do not contain any assertions, meaning all they do is confirm that no runtime errors occur.
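For illustration, a minimal sketch of the kind of check test_pg_gen.py could run over the translated output (the names pgt and assert_port_structure are hypothetical, not the actual test-suite API); a test like this would have failed when the port-name format changed:

def assert_port_structure(pgt):
    # pgt is assumed to be the list of drop dictionaries produced by the
    # LG -> PGT translation.
    for drop in pgt:
        for key in ("producers", "consumers"):
            for entry in drop.get(key, []):
                # Under the updated scheme every entry should be a
                # single-item dict mapping an oid to a port name.
                assert isinstance(entry, dict) and len(entry) == 1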

The reason for this failure is that the test cases use older graphs (daliuge-engine/test/graphs). The only places I can see where "consumers"/"producers" are referenced are in the daliuge-engine code base (e.g. test_dm.py). However, these drops are created as plain dictionaries rather than as mocked drop objects based on dropdict in daliuge-common. If a foundational class like this changes, the change should be reflected in failing tests across the code bases.

For example, a hard-coded reflection of dropdict in test_dm.py passes just fine:

      {
          "oid": "C",
          "type": "plain",
          "storage": Categories.MEMORY,
          "producers": ["B"],
      },

However, these tests should fail given that they use a now-outdated approach to managing "ports".
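For illustration only (assuming "event" as the port name, which is not what the test currently encodes), the hard-coded drop above would presumably need to look something like this under the updated scheme:

      {
          "oid": "C",
          "type": "plain",
          "storage": Categories.MEMORY,
          # each producer entry now maps an oid to a port name
          "producers": [{"B": "event"}],
      },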

(As an aside, is it possible to entertain the idea of changing the collective term "ports" for producers and consumers? Given the code base also contains network and server operations in which ports have a functionally different meaning, it seems like a better naming convention could be devised.)

@myxie
Collaborator Author

myxie commented Apr 19, 2022

An additional comment on the usability of the new approach.

In daliuge-common, dlg.common.__init__.dropdict has the following method, which has led to this issue:

def _addSomething(self, other, key, IdText=None):
    if key not in self:
        self[key] = []
    if other["oid"] not in self[key]:
        append = {other["oid"]: IdText} if IdText else other["oid"]
        self[key].append(append)

This leads to the following competing structures, previous and updated:

# Previous 
"producers":  ["1_-13_0/0/0/0", "1_-15_0/0/0/0"]
# Updated
"producers": [{"1_-13_0/0/0/0": "event"}, {"1_-15_0/0/0/0": "event"},"..."]

If one wants to iterate over all "producers" in the previous example, one can do the following:

for producer in producers:
    nx.graph.nodes(producer) # or something like this

However, the new approach requires the following:

for producer in producers:
    prod_keys = list(producer.keys())  # a list of size 1
    nx.graph.nodes(prod_keys[0])  # or something like this

This adds overhead and makes the code noticeably harder to read.

@myxie
Collaborator Author

myxie commented May 4, 2022

An update here: this leads to a serious performance hit; iterating through a graph of ~8000 nodes now takes 80+ seconds. This is because not everything in the producers list is a dictionary; some entries are plain strings, which makes the following necessary:

            for uentry in element['producers']:
                try:
                    uentry.keys()
                except AttributeError:
                    u = uentry
                else:
                    u = list(uentry.keys())[0]

The data coupling here is odd, too, as evidenced by the need to retrieve the keys of each producer. If "producers" is a list of individual producers, where a single producer is stored in a dict, then the following structure is much more appropriate:

 "producers":[{"id": "1_-13_0/0/0/0", "type:"event"}, {"id":"1_-15_0/0/0/0","type":"event"},...]

This would make ID access a direct O(1) lookup, and it also makes it explicit that there is exactly one producer ID for each "producer" element (the same goes for consumers).
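For comparison, a short sketch of access under that hypothetical structure (field names "id" and "type" as proposed above, not what the code currently produces):

producers = [
    {"id": "1_-13_0/0/0/0", "type": "event"},
    {"id": "1_-15_0/0/0/0", "type": "event"},
]

for producer in producers:
    oid = producer["id"]            # direct field access, no keys() indirection
    port_type = producer["type"]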

@awicenec
Contributor

awicenec commented May 4, 2022

just did a test using the following:

import numpy as np

values = list(range(8000))
keys = [str(v) for v in values]
kv = [{key: value} for key, value in dict(zip(keys, values)).items()]
rr = np.random.randint(0, 8000, 2000)
kv1 = [kv[r] if list(kv[r].values())[0] == r else list(kv[r].keys())[0] for r in rr]

and then run

%timeit kkeys = [list(k.keys())[0] if isinstance(k, dict) else k for k in kv1]
506 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

i.e. kv1 is a list of dictionaries with the same structure we are using now. Off the top of my head I'm not sure where the issue of not always having a dictionary in the producers list comes from, but at least for this case I don't see a performance hit at all. What I would really like to have is a dictionary of dictionaries of the form:

"producers":{"1_-13_0/0/0/0":{"type:"event"}, "1_-15_0/0/0/0":{"type":"event"},...}

@myxie
Collaborator Author

myxie commented May 4, 2022

Apologies - the performance hit was mostly due to a logger.debug call (even though the logging level is set to WARNING...).

There is still a definite performance hit, because it is not just the 8,000 nodes but also the edges being traversed (so north of 10,000 iterations in a nested for loop). The following shows a roughly 10x difference between the two approaches:

    for element in dlg_json_dict:
        if ('storage' in element.keys()) and (
                'producers' in element and 'consumers' in element
        ):
            for uentry in element['producers']:
                a = uentry
                for ventry in element['consumers']:
                    b = ventry
# runtime=0.015253782272338867
    for element in dlg_json_dict:
        if ('storage' in element.keys()) and (
                'producers' in element and 'consumers' in element
        ):
            for uentry in element['producers']:
                try:
                    uentry.keys()
                except AttributeError:
                    u = uentry
                else:
                    u = list(uentry.keys())[0]
                for ventry in element['consumers']:
                    try:
                        ventry.keys()
                    except AttributeError:
                        v = ventry
                    else:
                        v = list(ventry.keys())[0]

# runtime=0.1401374340057373

The runtime is fairly negligible overall, but having to check for different types does lead to a performance degradation.

The previous comments describe how dropdict._addSomething in daliuge-common (dlg.common.__init__) has behaviour that adds either a str element or a dict element to the list:

append = {other["oid"]:IdText} if IdText else other["oid"]

It looks like the else branch may be better replaced with {other["oid"]: None}, to make it explicit that each element of "producers" is an individual node and that every element is guaranteed to be a dict (see the sketch below).
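A sketch of that change, assuming the rest of the method stays as shown earlier (note the membership check would probably also need revisiting, since a bare oid string will never compare equal to a dict entry):

def _addSomething(self, other, key, IdText=None):
    if key not in self:
        self[key] = []
    if other["oid"] not in self[key]:
        # Always append a dict, so code consuming this list never has to
        # handle a bare string entry.
        self[key].append({other["oid"]: IdText})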

@rtobar
Contributor

rtobar commented May 4, 2022

I'm not familiar with the whole story here, but you can replace list(x.keys())[0] by next(iter(x.keys())) to save creating a list each time, hopefully reducing your runtime costs.
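For example (a tiny sketch; entry stands for one of the single-item dicts described above):

entry = {"1_-13_0/0/0/0": "event"}
oid = next(iter(entry))  # first (and only) key, without building a list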

@awicenec
Contributor

awicenec commented May 4, 2022 via email

rtobar added a commit that referenced this issue May 5, 2022
Log message formatting should be left to the logging system instead of
being carried out by users; this is because in many cases, especially in
the lower logging levels like DEBUG, messages will actually be filtered
out and never make it to any of the sinks. This is sadly a very common
mistake one sees in the wild, and can cause real performance issues.

These might have been the slow logging reported in #142, but even if it
isn't, the change is beneficial.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
awicenec pushed a commit that referenced this issue May 19, 2022
pritchardn pushed a commit that referenced this issue May 20, 2022