Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added skeleton of package for python support #81

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

added skeleton of package for python support #81

wants to merge 3 commits into from

Conversation

skewballfox
Copy link

I added a skeleton of a python library that can be fleshed out to handle the currently supported dataTypes. If I need to make any changes I can, this isn't meant to be working at the current stage so much as provide a bit of structure that makes contributions covering one aspect of functionality easier.

from pyDataExtraction.commonTypes.Graph import Graph

if __name__ == "__main__":
graph_data1 = {"A": ["B", "C"], "B": ["A,C"], "C": ["A,D"], "D": ["A"]}
Copy link

@insolor insolor Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "A,C" and "A,D" must be separated?

graph_data1 = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "D"], "D": ["A"]}

Copy link

@mvoitko mvoitko Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiline formatting of dictionary seems to be more readable

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've mainly be using black for formatting, and it seems to condense it down if a line is less than 88 characters.

also, the main is current just representing how this would be used python side, but this will be where typescript interfaces with the code, passing information and returning information to the typescript for visualization, so this is going to be reworked extensively once I can get a look at the data format passed from the PyEvaluationEngine.ts file.

def __repr__(self):
pass
"""
# NOTE: ran into issue Node object is not json serializable when ecapsulating in own class
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a method in Node class to serialize the data into json.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I'm just now starting to actually learn about serialization due to this project and a personal rust project.

could you point me in the direction of some documentation on how to do so?

Copy link

@insolor insolor Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skewballfox you can simply create a dictionary of the desired form and then serialize it into JSON:

def to_json(self):
    return json.dumps({'from': self.from_node, 'to': self.to})

(this is an example for the Edge class, you can do this for Node class too)

This is a crude approach, which is inconvenient for classes with many fields, but in this case, I think it is quite applicable. Also such a way you can use from_node field name for the 'from' JSON key.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_node would also be more readable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@insolor this is similar to what I was doing in __repr__ but that definitely helps fix that issue in a way that doesn't involve mangling the language. I don't know why overloading from was my first guess.

also I think I figured out why Node wasn't serializable. It was because the __repr__ function doesn't seem to be recursive. as an example When calling print(graph) that doesn't automatically behave as if print(node) were called.

either I could come up with a generic method that does this in DataType to avoid having to define it in every current and future dataType that uses other objects as components(if present), or we have to overload the function in those cases.

I think I might have an idea using __slots__, I'll have to see though

@@ -0,0 +1,10 @@
from json import dumps
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using import json to follow "Explicit is better than implicit" principle.


class Graph(DataType):
"""

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have empty lines here?

super().__init__()
self.kind["graph"] = True
# TODO get working for both a dictionary and an nxn array
# self.nodes = [Node(node) for node in graph data]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have commented out code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented it out mainly to have a reference to the other possible implementation, and because I kept going back and forth between different implementations.

I would rather have Nodes and Edges encapsulated in their own class, in case we want to implement different types of graphs.

the commented out code was meant to be deleted after the class was complete. I can go ahead and delete it now though

# self.nodes = [Node(node) for node in graph data]
for node in graph_data:
self.nodes.append({"id": str(node)})
# TODO change prints to log statements
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a too minor change to left a TODO for it and not implement it right away

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's partially because my experience implementing logs across a library is none, I've only implemented them in a single file before, and didn't try to make it where the log statement was different across classes.

# print("edge: ", graph_data[node][edge_i])
self.edges.append({"from": node, "to": edge})
# edge_i += 1
# edge_i = 0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why keep commented out code?

@@ -0,0 +1,24 @@
from json import dumps
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest just import json for the sake of readability

@@ -0,0 +1,24 @@
from json import dumps
from abc import ABC, abstractmethod
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is abstractmethod used somewhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I didn't realize I had left that in there.

Copy link

@mvoitko mvoitko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments mainly about code style

graph_data1 = {"A": ["B", "C"], "B": ["A,C"], "C": ["A,D"], "D": ["A"]}
graph_data2 = {1: [2, 3], 2: [1, 3], 3: [1, 4], 4: [1]}
graph = Graph(graph_data1)
print(graph)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use logging with respective log level

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean. I was mainly using __main__ to verify that the graph implementation works as I was writing it, though that should be move to a separate directory for testing, as seems to be the standard practice for python libraries.

I'm not sure if we could use the same test directory as node without pytest throwing errors due to the presence of js files. my experience with pytest and how it works is limited.

class Edge:

def __init__(self,from: str, to: str,):
self.fromnode
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use self.from_node

"""
class Edge:

def __init__(self,from: str, to: str,):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add space after comma

self.fromnode

def __repr__(self):
pass
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have an empty method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it needs to be implemented. This was mainly meant to be a skeleton of a library, make it easier for people to contribute to the development of one set of dataTypes. Also we can't use the dataType's __repr__ method for Edge or Node.

for one, json complains Node objects aren't serializable when calling __repr__ on graph.

two, from is a syntax token, and also what is expected to be inside the json representation of a graph. so in Edge's case in particular, it's going to require us to either:

  1. do something hacky to use from as a keyword (not my preferred option),
  2. parse the json representation of it's dict and change every instance of whatever variable we used to represent from
  3. handle all the edge related stuff inside graph(would rather avoid that)

def __init__(self, id: Union[int, str], label: Optional[str] = None):
super().__init__()
self.id = id
if label is None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest simplifying to:
self.lable = lable or id

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is actually a hella useful feature I didn't know about. Thanks for that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, given the implementation, and what we are passing the information to, this likely wouldn't work, as the attribute would still be listed in the json output(at least with the method currently used to produce json.

I'm not sure if this would be the case, but I'm trying to avoid all of the nodes being labeled None in the visualizer. So, I set it up to where attributes were only present in the case where a value was assigned.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example? I'm not able to understand.
As from what I think if we go with
self.label = label or id
If anyone doesn't pass label the default is None and the self.label will be populated by id.
Correct?

# self.nodes = [Node(node) for node in graph data]
self.nodes = []
self.edges = []
if isinstance(graph_data, dict):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not check the type here. We expect graph data to be a dictionary otherwise we should have exception not silence it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, no, the reason being is not counting classes for graphs(just too much trouble to worry about at this moment), you can have a graph represented as a dict or 2d array.

I intended after getting the dictionary implementation working to check else isinstance(graph_data, list) or something similar for explicitly a list of list.

# print(edge_i)
# print("edge: ", graph_data[node][edge_i])
self.edges.append({"from": node, "to": edge})
# edge_i += 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for node, edge in graph_data.items():
     self.node.append({"from": node, "to": edge})
     self.edges.append({"from": node, "to": edge})

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested in ipython

for node, edge in graph_data1.items():
     print(node," : ", edge)

output:

A  :  ['B', 'C']
B  :  ['A', 'C']
C  :  ['A', 'D']
D  :  ['A']

you could do so with a second loop for edge in edges



Args:
DataType (Union[Dict[str,list],Dict[int,list]]):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add speces after commas

from typing import Union, Dict, Optional
from abc import ABC, abstractmethod
from pyDataExtraction.commonTypes.Graph import Graph

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add newline

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can run flake8(for linting) and black(code formatting) in local(console) to do all such changes automatically and consistency will be maintained for all the contributors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently use black for formatting, though I've never used flake8, nor have I used black from the commandline. I know there's a precommit plugin for git that would call both to on attempted commits to make sure that they are used.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You just have to install black in local(or in the virtual environment of python) and run black <file_name> This will do the formatting for the file.
And about flake8, It's just that we can have a consistent style guide for python over this project which follows pep-8 guidelines.
As some of the comments by @mvoitko is regarding this only.
Just a suggestion 😄

Copy link

@mvoitko mvoitko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments

self.kind["text"] = True
self.text = text_data
if mimeType is None:
self.mimeType = mimeType
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you trying to achieve here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, that was probably a missing not

self.text = text_data
if mimeType is None:
self.mimeType = mimeType
if fileName is None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you wanted the opposite:

if file_name:
     self.file_name = file_name

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be when dealing with edge cases far removed from this or a leftover from python 2, but from what I keep reading, using is None seems to be preferred when dealing with values that may be None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, while editing this, I just remembered why I checked. because this library currently relies on json.dumps(self.__dict__) to print the json representation of an object in a format the plugin expects, I was only conditionally instantiating these variables.

they weren't part of the object unless they were explicitly added to the object.

this may be another reason to either find a better method of creating json objects, or use super to get the inherited json representation and return a version altered in some way.

def __init__(
self,
text_data: str,
mimeType: Optional[str] = None,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep it to the pep-8 python style guide:

mime_type,
file_name

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the variable names are mainly because of the use case and the method currently for doing so.

right now I'm directly converting the respective dataTypes to json objects when they are printed or called as a string, and variables are the displayed variables inside the json string, so I've been matching the case of the specification listed on the Readme.

Though since we already need to do some manipulations on the json representation before returning in the case of the Edge class, doing so here shouldn't be a major issue.

btw if we got rid of the reliance on __dict__, we could use __slots__ to slightly improve performance and reduce memory usage.

def __init__(
self,
text_data: str,
mimeType: Optional[str] = None,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please name params in snake case:
mime_type, file_name

def __init__(
self,
text_data: str,
mimeType: Optional[str] = None,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use snake case:
mime_type, file_name

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants