Updated port names should break existing tests #142

Open
myxie opened this issue Apr 19, 2022 · 6 comments

Comments

@myxie
Collaborator

myxie commented Apr 19, 2022

The change to 'port names' introduced in 01d11f4 should have required updates to the test cases, yet the existing test cases still pass. This points to a defect in the coupling and test coverage of the translator.

Previously, translating from a Logical Graph (LG) to a Physical Graph Template (PGT) produced:

    "producers": [
      "1_-13_0/3/1/1"
    ],
    "consumers": [
      "1_-19_0/3/1/1"
    ]

Now the approach is:

"producers": 
    [{"1_-13_0/0/0/0": "event"}],
"consumers": 
    [{"1_-19_0/0/0/0": "event"}]

This change appears to pass all tests in the translator, but only because test_pg_gen.py does not test the output of the translation. In fact, most of the tests do not contain any assertions, meaning all they do is confirm that no runtime errors occur.
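For illustration, a minimal sketch of the kind of check test_pg_gen.py could run over the translated output (the names pgt and assert_port_structure are hypothetical, not the actual test-suite API); a test like this would have failed when the port-name format changed:

def assert_port_structure(pgt):
    # pgt is assumed to be the list of drop dictionaries produced by the
    # LG -> PGT translation.
    for drop in pgt:
        for key in ("producers", "consumers"):
            for entry in drop.get(key, []):
                # Under the updated scheme every entry should be a
                # single-item dict mapping an oid to a port name.
                assert isinstance(entry, dict) and len(entry) == 1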

The reason for this failure is that the test cases use older graphs (daliuge-engine/test/graphs). The only places I can see where "consumers"/"producers" are referenced are in the daliuge-engine code base (e.g. test_dm.py). However, these drops are created as plain dictionaries rather than as mocked drop objects based on dropdict in daliuge-common. If a foundational class like this changes, the change should be reflected in failing tests across the code bases.

For example, a hard-coded reflection of dropdict in test_dm.py passes just fine:

      {
          "oid": "C",
          "type": "plain",
          "storage": Categories.MEMORY,
          "producers": ["B"],
      },

However, these tests should fail given that they use a now-outdated approach to managing "ports".
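For illustration only (assuming "event" as the port name, which is not what the test currently encodes), the hard-coded drop above would presumably need to look something like this under the updated scheme:

      {
          "oid": "C",
          "type": "plain",
          "storage": Categories.MEMORY,
          # each producer entry now maps an oid to a port name
          "producers": [{"B": "event"}],
      },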

(As an aside, is it possible to entertain the idea of changing the collective term "ports" for producers and consumers? Given the code base also contains network and server operations in which ports have a functionally different meaning, it seems like a better naming convention could be devised.)

@myxie
Collaborator Author

myxie commented Apr 19, 2022

An additional comment on the usability of the new approach.

In daliuge-common, dlg.common.__init__.dropdict has the following method, which has led to this issue:

def _addSomething(self, other, key, IdText=None):
    if key not in self:
        self[key] = []
    if other["oid"] not in self[key]:
        append = {other["oid"]: IdText} if IdText else other["oid"]
        self[key].append(append)

This leads to the following competing structures, previous and updated:

# Previous 
"producers":  ["1_-13_0/0/0/0", "1_-15_0/0/0/0"]
# Updated
"producers": [{"1_-13_0/0/0/0": "event"}, {"1_-15_0/0/0/0": "event"},"..."]

If one wants to iterate over all "producers" in the previous example, one can do the following:

for producer in producers:
    nx.graph.nodes(producer) # or something like this

However, the new approach requires the following:

for producer in producers:
    prod_keys = list(producer.keys())  # a list of size 1
    nx.graph.nodes(prod_keys[0])  # or something like this

This adds overhead and makes the code noticeably harder to read.

@myxie
Collaborator Author

myxie commented May 4, 2022

An update here: this leads to a serious performance hit; iterating through a graph of ~8000 nodes now takes 80+ seconds. This is because not everything in the producers list is a dictionary; some entries are plain strings, which makes the following necessary:

            for uentry in element['producers']:
                try:
                    uentry.keys()
                except AttributeError:
                    u = uentry
                else:
                    u = list(uentry.keys())[0]

The data coupling here is odd, too, as evidenced by the need to retrieve the keys of each producer. If "producers" is a list of individual producers, where a single producer is stored in a dict, then the following structure is much more appropriate:

 "producers":[{"id": "1_-13_0/0/0/0", "type:"event"}, {"id":"1_-15_0/0/0/0","type":"event"},...]

This would make ID access a direct O(1) lookup, and it also makes it explicit that there is exactly one producer ID for each "producer" element (the same goes for consumers).
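For comparison, a short sketch of access under that hypothetical structure (field names "id" and "type" as proposed above, not what the code currently produces):

producers = [
    {"id": "1_-13_0/0/0/0", "type": "event"},
    {"id": "1_-15_0/0/0/0", "type": "event"},
]

for producer in producers:
    oid = producer["id"]            # direct field access, no keys() indirection
    port_type = producer["type"]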

@awicenec
Contributor

awicenec commented May 4, 2022

just did a test using the following:

import numpy as np

values = list(range(8000))
keys = [str(v) for v in values]
kv = [{key: value} for key, value in dict(zip(keys, values)).items()]
rr = np.random.randint(0, 8000, 2000)
kv1 = [kv[r] if list(kv[r].values())[0] == r else list(kv[r].keys())[0] for r in rr]

and then run

%timeit kkeys = [list(k.keys())[0] if isinstance(k, dict) else k for k in kv1]
506 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

i.e. kv1 is a list of dictionaries with the same structure we are using now. Off the top of my head I'm not sure where the issue of not always having a dictionary in the producers list comes from, but at least for this case I don't see a performance hit at all. What I would really like to have is a dictionary of dictionaries of the form:

"producers":{"1_-13_0/0/0/0":{"type:"event"}, "1_-15_0/0/0/0":{"type":"event"},...}

@myxie
Collaborator Author

myxie commented May 4, 2022

Apologies - the performance hit was mostly due to a logger.debug call (even though the logging level is set to WARNING...).

There is still a definite performance hit, because it is not just the 8,000 nodes but also the edges being traversed (so north of 10,000 iterations in a nested for loop). The following shows a roughly 10x difference between the two approaches:

    for element in dlg_json_dict:
        if ('storage' in element.keys()) and (
                'producers' in element and 'consumers' in element
        ):
            for uentry in element['producers']:
                a = uentry
                for ventry in element['consumers']:
                    b = ventry
# runtime=0.015253782272338867
    for element in dlg_json_dict:
        if ('storage' in element.keys()) and (
                'producers' in element and 'consumers' in element
        ):
            for uentry in element['producers']:
                try:
                    uentry.keys()
                except AttributeError:
                    u = uentry
                else:
                    u = list(uentry.keys())[0]
                for ventry in element['consumers']:
                    try:
                        ventry.keys()
                    except AttributeError:
                        v = ventry
                    else:
                        v = list(ventry.keys())[0]

# runtime=0.1401374340057373

The runtime is fairly negligible overall, but having to check for different types does lead to a performance degradation.

The previous comments describe how dropdict._addSomething in daliuge-common (dlg.common.__init__) has behaviour that adds either a str element or a dict element to the list:

append = {other["oid"]:IdText} if IdText else other["oid"]

It looks like the else branch may be better replaced with {other["oid"]: None}, to make it explicit that each element of "producers" is an individual node and that every element is guaranteed to be a dict (see the sketch below).
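A sketch of that change, assuming the rest of the method stays as shown earlier (note the membership check would probably also need revisiting, since a bare oid string will never compare equal to a dict entry):

def _addSomething(self, other, key, IdText=None):
    if key not in self:
        self[key] = []
    if other["oid"] not in self[key]:
        # Always append a dict, so code consuming this list never has to
        # handle a bare string entry.
        self[key].append({other["oid"]: IdText})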

@rtobar
Contributor

rtobar commented May 4, 2022

I'm not familiar with the whole story here, but you can replace list(x.keys())[0] by next(iter(x.keys())) to save creating a list each time, hopefully reducing your runtime costs.
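For example (a tiny sketch; entry stands for one of the single-item dicts described above):

entry = {"1_-13_0/0/0/0": "event"}
oid = next(iter(entry))  # first (and only) key, without building a list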

@awicenec
Contributor

awicenec commented May 4, 2022 via email

rtobar added a commit that referenced this issue May 5, 2022
Log message formatting should be left to the logging system instead of
being carried out by users; this is because in many cases, especially in
the lower logging levels like DEBUG, messages will actually be filtered
out and never make it to any of the sinks. This is sadly a very common
mistake one sees in the wild, and can cause real performance issues.

These might have been the slow logging reported in #142, but even if it
isn't, the change is beneficial.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
awicenec pushed a commit that referenced this issue May 19, 2022
pritchardn pushed a commit that referenced this issue May 20, 2022