Distance between vertices in the same connected component #421

Open
CarlaFernandez opened this issue Jan 9, 2023 · 0 comments
I know this is a long shot, but I've been stuck on this problem for a while now and still haven't found an answer. Let's see if any of you can figure it out.

Problem

I have a GraphFrames graph, from which I've obtained the connected components. Now I would like to find the distance from a source node to a target node, both belonging to the same component.

id_src  id_dst  component
123     657     1
234     876     2
876     567     2

I would like to calculate the distance from id_src to id_dst for each row in this DataFrame, so the result would look like:

id_src  id_dst  component  distance
123     657     1          4
234     876     2          2
876     567     2          2
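For context, the component column came from GraphFrames' connectedComponents, roughly like this (the checkpoint directory and the DataFrame names are just placeholders):

from graphframes import GraphFrame

# connectedComponents requires a Spark checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

g = GraphFrame(vertices_df, edges_df)  # placeholder DataFrame names
components = g.connectedComponents()   # vertices plus a 'component' column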

I know I need to use the BFS function from GraphFrames, but I can't find a way to run it in parallel while providing the source and destination id for each row.
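For a single pair it works fine sequentially on the driver; a minimal sketch, using the first row's ids as a concrete example:

# One driver-side BFS query between a single pair of vertices.
paths = g.bfs("id = '123'", "id = '657'", maxPathLength=4)

# The result DataFrame has columns from, e0, v1, e1, ..., to;
# with 2k+1 columns the path has k edges.
distance = (len(paths.columns) - 1) // 2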

What I've tried

  1. I've tried to do it through a UDF, with no luck:
from math import floor

from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf
def shortest_path(x):
    # x is a struct of (id_src, id_dst); g is the GraphFrame captured
    # from the enclosing driver scope.
    if x[1] is not None:
        path = g.bfs(f"id = '{x[0]}'", f"id = '{x[1]}'", maxPathLength=4)
        # The BFS result has columns from, e0, v1, ..., to, so the
        # column count encodes the number of edges on the path.
        return floor((len(path.columns) - 2) / 2) + 1

result = vars.select(shortest_path(F.struct("id_src", "id_dst")))

This results in the following exception; I understand it's because I can't parallelize an already parallel function:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
  2. I also thought about using a non-parallel library like networkx or igraph, creating a single graph from each connected component (a rough sketch follows). The problem is I don't know how to generate these single graphs and then reference them from the UDF.
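Something like this is the idea, sketched with networkx on the driver (edges_df and pairs_df are placeholder names for the edge list and the (src, dst) pairs above; this only works if the components are small enough to collect):

import networkx as nx

# Build one networkx graph per connected component from the collected
# edge list (driver-side, so only feasible for small components).
graphs = {}
for row in edges_df.collect():
    graphs.setdefault(row["component"], nx.Graph()) \
          .add_edge(row["id_src"], row["id_dst"])

# Resolve each (src, dst) pair with an ordinary BFS shortest path.
for row in pairs_df.collect():
    comp = graphs[row["component"]]
    dist = nx.shortest_path_length(comp, row["id_src"], row["id_dst"])
    print(row["id_src"], row["id_dst"], dist)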

Any ideas are appreciated, thank you
