Distance between vertices in the same connected component #421

Open
CarlaFernandez opened this issue Jan 9, 2023 · 0 comments
I know this is a long shot, but I've been stuck on this problem for a while now and still haven't found an answer. Let's see if any of you can figure it out.

Problem

I have a GraphFrames graph, from which I've obtained the connected components. Now I would like to find the distance from a source node to a target node, both belonging to the same component.

id_src  id_dst  component
123     657     1
234     876     2
876     567     2

I would like to calculate the distance from id_src to id_dst for each row in this DataFrame, so the result would look like:

id_src  id_dst  component  distance
123     657     1          4
234     876     2          2
876     567     2          2
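For context, the component column came from GraphFrames' connectedComponents, roughly like this (the checkpoint directory and the DataFrame names are just placeholders):

from graphframes import GraphFrame

# connectedComponents requires a Spark checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

g = GraphFrame(vertices_df, edges_df)  # placeholder DataFrame names
components = g.connectedComponents()   # vertices plus a 'component' column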

I know I need to use the BFS function from GraphFrames, but I can't find a way to run it in parallel while providing the source and destination id for each row.
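For a single pair it works fine sequentially on the driver; a minimal sketch, using the first row's ids as a concrete example:

# One driver-side BFS query between a single pair of vertices.
paths = g.bfs("id = '123'", "id = '657'", maxPathLength=4)

# The result DataFrame has columns from, e0, v1, e1, ..., to;
# with 2k+1 columns the path has k edges.
distance = (len(paths.columns) - 1) // 2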

What I've tried

  1. I've tried to do it through a UDF, with no luck:
from math import floor

from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf
def shortest_path(x):
    # x is a struct of (id_src, id_dst); g is the GraphFrame captured
    # from the enclosing driver scope.
    if x[1] is not None:
        path = g.bfs(f"id = '{x[0]}'", f"id = '{x[1]}'", maxPathLength=4)
        # The BFS result has columns from, e0, v1, ..., to, so the
        # column count encodes the number of edges on the path.
        return floor((len(path.columns) - 2) / 2) + 1

result = vars.select(shortest_path(F.struct("id_src", "id_dst")))

This results in the following exception; I understand it's because I can't parallelize an already parallel function:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
  2. I also thought about using a non-parallel library like networkx or igraph, creating a single graph from each connected component (a rough sketch follows). The problem is I don't know how to generate these single graphs and then reference them from the UDF.
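Something like this is the idea, sketched with networkx on the driver (edges_df and pairs_df are placeholder names for the edge list and the (src, dst) pairs above; this only works if the components are small enough to collect):

import networkx as nx

# Build one networkx graph per connected component from the collected
# edge list (driver-side, so only feasible for small components).
graphs = {}
for row in edges_df.collect():
    graphs.setdefault(row["component"], nx.Graph()) \
          .add_edge(row["id_src"], row["id_dst"])

# Resolve each (src, dst) pair with an ordinary BFS shortest path.
for row in pairs_df.collect():
    comp = graphs[row["component"]]
    dist = nx.shortest_path_length(comp, row["id_src"], row["id_dst"])
    print(row["id_src"], row["id_dst"], dist)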

Any ideas are appreciated, thank you
