Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOT and GML format writers are extremely slow when there are many attributes #2567

Open
szhorvat opened this issue Apr 9, 2024 · 8 comments
Assignees
Labels
fuzz Issues found using fuzzing, or related to fuzzing
Milestone

Comments

@szhorvat
Copy link
Member

szhorvat commented Apr 9, 2024

The DOT format writer is surprisingly slow, causing timeouts in the write_all_gml and write_all_graphml fuzz targets. Originally, I believed that the performance issues were due to Flex (which is also an existing problem!), but it seems this is a different thing. Commenting this out makes the problem go away:

file = tmpfile();
IGRAPH_ASSERT(file != NULL);
CHECK_ERR(igraph_write_graph_dot(&g, file));
fclose(file);

The corresponding OSS-Fuzz issue, with test cases, is at:

@szhorvat szhorvat added the fuzz Issues found using fuzzing, or related to fuzzing label Apr 9, 2024
@szhorvat
Copy link
Member Author

szhorvat commented Apr 9, 2024

My hunch is that the issue is the constant memory (re)allocations in dot_escape(). Memory allocations are expensive in general, and particularly expensive with sanitizers enabled.

@ntamas ntamas self-assigned this Apr 9, 2024
@ntamas
Copy link
Member

ntamas commented Apr 9, 2024

Okay, this bothers me and it's a fuzzing issue so let's try to tackle it. I'll make a quick patch with an internal "resizable string" data type and re-write dot_escape() to use that. If it solves the problem, we can do it properly.

ntamas added a commit that referenced this issue Apr 9, 2024
…e memory all the time when escaping strings, refs #2567
@ntamas
Copy link
Member

ntamas commented Apr 9, 2024

Okay, check out the commit above. it rewrites dot_escape() to use an igraph_vector_char_t that is grown only when needed.

@szhorvat
Copy link
Member Author

szhorvat commented Apr 9, 2024

I'm sorry, I seem to have made both a mistake and an incorrect guess. It turns out that the bad performance is both in write_dot() and write_gml(). It happens to be the case that both have similar escaping functions, but the bottleneck is not that, and eliminating reallocation didn't seem to improve the performance. The test case runs in about 87 seconds on my machine, both before and after the commit.

What I should have done form the get go was of course to profile. Profiling points to igraph_i_cattribute_find(), which searching for a string by iterating through a list and comparing each element one by one.

image

Both ExecuteCallback and TryDetectingAMemoryLeak seem to call the same functions, so for our purposes they are equivalent.

image

Digging further down always leads to igraph_i_cattribute_find()

image

In summary, this is not a simple fix, and if we do fix it, it should be on develop, not master, where some of the attribute handling code is already refactored.

@ntamas
Copy link
Member

ntamas commented Apr 10, 2024

Okay, but then why is this not a problem for GraphML? The same construct is being used there (i.e. loop over all vertices/edges, then within each vertex/edge loop over all attributes and look them up with igraph_i_attribute_get_string_vertex_attr()).

@szhorvat
Copy link
Member Author

I was thinking about the same too last night, but I was too tired. It is possible that the string attribute values have characters that are forbidden in XML, and the GraphML writer fails early. See the code here:

file = tmpfile();
IGRAPH_ASSERT(file != NULL);
// Do not check error when writing GraphML because string attributes
// may contain forbidden control characters.
if (igraph_write_graph_graphml(&g, file, false) == IGRAPH_SUCCESS) {
rewind(file);
// Do not check error when reading because strings may not be
// in a valid encoding, which confuses libxml2.
if (igraph_read_graph_graphml(&g2, file, 0) == IGRAPH_SUCCESS) {
igraph_destroy(&g2);
}
}

@szhorvat
Copy link
Member Author

szhorvat commented Apr 11, 2024

@ntamas So the problem is that attribute lookup uses a dumb linear search in the C attribute handler. Does this affect the Python attribute handler? @krlmlr Do you know if this affects the R attribute handler?

If only C is affected, I propose that we defer this to 2.0, as that's when the big attribute handling refactoring will happen (assuming we get funding for it...)

@ntamas
Copy link
Member

ntamas commented Apr 11, 2024

It's only the C attribute handler, which, in its current state, is yet another example of a temporary solution that turned out to be permanent :-D It was never meant to be performant, we just needed something simple to enable us to use attributes from C in unit tests, and then things escalated quickly. Neither the R nor the Python attribute handler are affected. The Python attribute handler uses Python dicts.

@ntamas ntamas added this to the 2.0 milestone Apr 11, 2024
@szhorvat szhorvat changed the title DOT format writer is extremely slow DOT and GML format writers are extremely slow when there are many attributes Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fuzz Issues found using fuzzing, or related to fuzzing
Projects
None yet
Development

No branches or pull requests

2 participants