
Improve performance of fact propagation through the Rete graph #200

Open
snikolayev opened this issue Sep 17, 2019 · 21 comments


@snikolayev
Member

Reduce the number and size of allocations during fact propagation.
Also improve the performance of the Rete graph, particularly when joining facts.

@larryPlayabl

I've done a bit of analysis using the Unity3D profiler and our NRules codebase.

This is running NRules as of:

branch: develop
SHA: 96baec8
Commit: "Fixes error handling with Yield actions. #198"

Our codebase has approximately 110 rules spread among multiple NRules sessions. Many use Or groups, which multiplies the complexity of the Rete networks.

At the point of this snapshot, there were on the order of hundreds of facts in the sessions. All sessions contain the same set of facts.

As you can see by looking at the profile images, the largest offender for allocations is JoinNode.PropagateUpdate, which in this snapshot allocated 1.1MB.

[Profiler screenshots: allocation overview; BetaMemoryNode.PropagateUpdate]

Drilling into that a bit, the most allocation-intensive spots are:

  • JoinNode.MatchesConditions.BetaCondition.IsSatisfiedBy

[Profiler screenshot: BetaCondition overview]

  • JoinNode.MatchesConditions' use of List.GetEnumerator

[Profiler screenshot: List allocation overview]

  • JoinNode.PropagateUpdate's use of List.Add in TupleFactList.Add and BinaryBetaNode.CrossJoin()

[Profiler screenshot: List.Add allocations]

BetaCondition.IsSatisfiedBy() is definitely the most intensive point of allocations in JoinNode.PropagateUpdate, NotNode.PropagateUpdate and ExistsNode.PropagateUpdate.

In this instance, JoinNode.PropagateUpdate.MatchesConditions allocates 247.1KB in its calls to BetaCondition.IsSatisfiedBy().

Of that, 105KB is allocated in calls to Tuple.get_Facts(), which constructs new Tuple instances.

The remainder is not specifically attributed, as seen in the unattributed GC.Alloc line of 142.1KB; Unity's profiler cannot precisely identify which of the other calls those allocations are associated with.

[Profiler screenshot: JoinNode.PropagateUpdate]

NotNode and ExistsNode show a similar pattern of allocations:

[Profiler screenshots: NotNode.PropagateUpdate; ExistsNode allocations]

@larrybehaviorlanguage

@snikolayev I'm bringing our conversation from Gitter over here for context, and I've switched accounts to one that's more appropriate for me.

snikolayev:
I think it's a great idea to explore resource pooling. But I don't think Tuples are a good candidate. Tuples in NRules are stateful and long-lived. They are also small and not expensive to create. The candidates I'm thinking about are arrays of facts created during expression evaluation, which are allocated and then collected right away. Another candidate is the fact enumerator in a tuple, which is created numerous times during fact propagation through the graph. Also, fact lists and tuple lists created during propagation are good candidates for pooling, though these are harder to tackle due to the recursive nature of propagation. One other idea I have is to convert fact propagation from recursive to iterative, at which point pooling fact/tuple lists should be much easier.

larry:
Thanks for the pointers. Yes, it makes sense to start with the easiest options that offer the greatest optimization for amount of effort and risk. I'll openly admit that I'm not especially well versed in this kind of optimization, and would definitely want to keep you in the loop as I experiment to make sure I'm going in the right direction. I know there can be subtle details in the implementation that may make all the difference in how much allocation savings we're actually reaping.

The object arrays created in the LHSExpression Invoke methods look like they'd be pretty straightforward to pool. Do you believe that swapping these arrays out for Lists that are cleared and repopulated each time would offer an allocation advantage?

If that won't help significantly, I could also imagine caching a Dictionary<ITuple, object[]> to be used by these LHSExpression methods. I believe this would work because the object[] will always be the same size for the Tuple. Obviously, this cache would need to be updated whenever a given tuple's facts change. That could either be done by iterating through them and updating them on each call to Invoke, or by checking for a "dirty" flag on the tuple which would indicate that the facts have changed since the last invoke, though that'd be more complex.

For the Tuple.Facts IEnumerable, I could give each Tuple a List that's cleared and repopulated by walking the tuples on each call to get the IEnumerable. However, again, maybe there'd be a more efficient way to do that and only update the Facts list as needed. Think that'd be possible?

Actually, I assume it's possible that the Facts property for a given Tuple will be called while another caller is already iterating through its facts. So that'd mean that clearing the same reusable list with each call wouldn't be safe.

snikolayev:
Regarding LhsExpression.Invoke, swapping arrays with Lists will likely not change anything, as that's still an allocation (and the same array behind the scenes). How about ArrayPool? The rub here is that it's not compatible with .NET Framework 4.5, so that would significantly change which frameworks NRules targets. Regarding Dictionary caching - hard to say without prototyping. My hunch is that stateful objects that live together with the Tuple are likely not going to be suitable for pooling. Regarding Tuple.Facts - materializing lists of facts will likely just make things worse. At least right now the enumerator is ephemeral. But it is the enumerator itself that's instantiated over and over again, and so I was thinking about pooling the enumerators. It's a bit hard to have a good back and forth on Gitter. How about discussing it here: #200 ?

@larrybehaviorlanguage

larrybehaviorlanguage commented Apr 21, 2020

Regarding LhsExpression.Invoke, currently each call to Invoke executes

args = new object[_compiledExpression.ArrayArgumentCount];

If I understand correctly, this means that with each call to Invoke, heap memory is allocated equal to (object pointer size) * args.Length.

Then, when Invoke() returns, that memory becomes eligible for garbage collection.

It seems that instead, the entire class could have just one reusable object[] that's instantiated in the constructor like this:

ReusableArgsArray = new object[_compiledExpression.ArrayArgumentCount];

Then, on each call to invoke it could be repopulated like this:

int index = tuple.Count - 1;
foreach (var tupleFact in tuple.Facts)
{
    IndexMap.SetElementAt(ReusableArgsArray, _factMap[index], tupleFact.Value);
    index--;
}
IndexMap.SetElementAt(ReusableArgsArray, _factMap[tuple.Count], fact.Value);

I believe this would mean that the memory would now be allocated only once for the lifetime of the LhsExpression object. That sounds like a major win, but I may be totally off on this, or there may be a more optimal solution.

The compatibility issue of ArrayPool does seem problematic. I work primarily with the Unity 3D game/interactive application engine, and it doesn't seem to include the System.Buffers namespace by default. However, I was just able to get the System.Buffers DLL via NuGet, and it seems to work at first glance. So I'm not sure how much of an issue this is.
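For reference, a minimal sketch of the rent/return pattern under discussion, using ArrayPool<object>.Shared from System.Buffers (the method name and delegate here are illustrative, not NRules code):

using System;
using System.Buffers;

static object EvaluateWithPooledArgs(int argumentCount, Func<object[], object> evaluate)
{
    // Rent may return an array longer than requested, so callers must
    // index by position rather than rely on args.Length.
    object[] args = ArrayPool<object>.Shared.Rent(argumentCount);
    try
    {
        return evaluate(args);
    }
    finally
    {
        // clearArray: true wipes stale fact references before the
        // array goes back into the pool.
        ArrayPool<object>.Shared.Return(args, clearArray: true);
    }
}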


Regarding Tuple.Facts, pooling enumerators sounds interesting. I haven't explored that before, but sounds like a solid solution. It is my understanding that a List doesn't allocate any memory on a call to List.GetEnumerator(). Supposedly since the enumerator is a struct it's allocated on the stack, not the heap, but I'd like to verify this. If that is the case, then it does also seem that an implementation where each Tuple reuses a List, and updates that List only when the facts list needs to be updated would be an overall optimization. But it sounds more complicated to support that. In the current implementation, where exactly is the memory being allocated in the Facts getter? Is it purely in boxing/unboxing?
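To illustrate the distinction being discussed, a small standalone sketch (not NRules code): foreach over a List<T> binds to the struct List<T>.Enumerator and allocates nothing, while obtaining the same enumerator through the IEnumerator<T> interface boxes it on the heap.

using System.Collections.Generic;

static class EnumeratorAllocationDemo
{
    static void Run()
    {
        var list = new List<int> { 1, 2, 3 };

        // Pattern-based foreach: uses the struct List<int>.Enumerator
        // directly; no heap allocation.
        foreach (var x in list) { }

        // Assigning the struct enumerator to the interface boxes it:
        // one heap allocation per call.
        using (IEnumerator<int> boxed = list.GetEnumerator())
        {
            while (boxed.MoveNext()) { }
        }
    }
}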

Also, to clarify: is it possible that, for a given Tuple, Tuple.Facts can be accessed concurrently with another iteration through Tuple.Facts?

@larrybehaviorlanguage

For what it's worth, I was able to verify via memory profiling that this implementation allocates no memory on calls to the Facts property:

private readonly List<Fact> facts = new List<Fact>();

private void PopulateFacts()
{
    facts.Clear();

    var tuple = this;
    while (tuple != null)
    {
        if (tuple.RightFact != null)
        {
            facts.Add(tuple.RightFact);
        }

        tuple = tuple.LeftTuple;
    }
}

public List<Fact> Facts
{
    get
    {
        PopulateFacts();
        return facts;
    }
}

@larrybehaviorlanguage

I also confirmed that using an ArrayPool doesn't allocate additional memory on Rent/Return. So totally agree that would be a good option if the compatibility issues are solvable.

@snikolayev
Member Author

snikolayev commented Apr 23, 2020

@larrybehaviorlanguage I think it's a promising idea to pre-allocate the array in LhsExpression. Interestingly, in benchmarks, allocating a new array is faster than clearing an existing array with Array.Clear. But I'm verifying whether I even need to clear the array, given that each expression uses the same slots in the array and is guaranteed to overwrite them.
I should be able to commit a fix for this shortly.
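A hedged sketch of how one might measure that comparison with BenchmarkDotNet (the array length is arbitrary; this is not the benchmark actually run):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class AllocateVsClear
{
    private readonly object[] _reusable = new object[8];

    [Benchmark(Baseline = true)]
    public object[] AllocateNew() => new object[8];

    [Benchmark]
    public object[] ClearExisting()
    {
        // Array.Clear zeroes every slot, which for small arrays can cost
        // more than a fresh Gen 0 allocation.
        Array.Clear(_reusable, 0, _reusable.Length);
        return _reusable;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<AllocateVsClear>();
}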

@snikolayev
Member Author

Regarding the tuple enumerator - a tuple in NRules is meant to be a recursive data structure, similar to lists in functional languages, where it's incrementally constructed by adding new facts to an existing tuple, without materializing the whole list at every level. Essentially, each tuple object creates an (N+1)-tuple out of an N-tuple. Creating a list of facts inside each tuple defeats that purpose: it would allocate an array of all N+1 facts, compounding at each level, so even though it's allocated only once per tuple, it would still destroy performance.
The idea of the tuple enumerator is to never materialize that collection.
Internally, NRules only ever enumerates a tuple from a single thread. Externally, exposed via the ITuple interface, it can be enumerated in any way the user wants. One solution is to have separate internal and external enumerators.
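A minimal sketch of that recursive structure (simplified names, not NRules' actual constructor):

public class Tuple
{
    public readonly Tuple LeftTuple;
    public readonly Fact RightFact;

    public Tuple(Tuple left, Fact right)
    {
        LeftTuple = left;
        RightFact = right;
    }
}

// Each step is O(1) and shares the previous tuple; no fact list is copied:
// var t1 = new Tuple(root, factA);   // 1-tuple (A)
// var t2 = new Tuple(t1, factB);     // 2-tuple (A, B)
// var t3 = new Tuple(t2, factC);     // 3-tuple (A, B, C)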

@snikolayev
Member Author

@larrybehaviorlanguage one thing I wanted to ask you: why is it so important for you to reduce allocations? Short-lived Gen 0 allocations are actually very efficient from the GC perspective. What's your overall worry? Also, are there specific performance or memory issues you are experiencing that you believe are due to allocations?

@larrybehaviorlanguage

@snikolayev Great - reusing the array in Invoke sounds even better if you're always guaranteed to overwrite it per call.

@larrybehaviorlanguage

As for the tuple enumerator, that totally makes sense. IEnumerator pooling certainly sounds preferable if possible. I'll think on that and try some more experiments to see if I can come up with a non-allocating implementation. Again, I fear that currently the allocations may actually be partially from boxing/unboxing. So any solution would need to take care of that problem as well.

I suppose another option would be to pool lists and use them for the enumeration as shown above, clearing and populating them on demand. In this way, the tuples would share the resources, instead of creating one List per Tuple. I'll investigate that as well, though pooling just the enumerators definitely sounds better if it can work.

@larrybehaviorlanguage

larrybehaviorlanguage commented Apr 23, 2020

@snikolayev As for why I'm so concerned about allocations, I may be in a minority where my use of NRules is for realtime, performance-intensive graphical applications. Specifically, games using the Unity3D game engine.

In such applications, the graphics thread needs to run at a constant framerate, or the game will feel unresponsive and laggy. Typically that's something on the order of 16 milliseconds to let all game components update, including the graphics rendering to achieve a framerate of 60 frames/second.

In my applications, NRules runs evaluation totally asynchronously so it can't block the main thread, but garbage collection must run on the main thread. So, it's very important to minimize garbage collection as much as possible, even if it's efficient.

Currently, when I trigger a major NRules fact update, I'll see frame-rate drop because of garbage collection.

In a different application, these levels of allocation would likely be completely fine.

@larrybehaviorlanguage

larrybehaviorlanguage commented Apr 23, 2020

@snikolayev After doing some more digging, I have confirmed that the reason we're seeing allocations for Tuple.Facts enumeration is purely because this is a value type that needs to be boxed to be usable as an IEnumerator reference type. This is the same type of boxing allocation you'll see if you store a value type like an int in an object, like this:

int i = 9;
object o = i; // allocates memory on the heap to "box" the value type

If we follow the pattern set forth in List, which uses a struct enumerator and relies on C# "duck typing" rather than the IEnumerator interface, all allocations are eliminated. I did a quick test in a sandbox and confirmed that this causes no garbage-collectable allocations, since everything is on the stack. This solution also addresses any concerns about concurrent access, since each access gets its own enumerator struct, just like with a List.

public class Tuple
{
    public Fact RightFact;
    public Tuple LeftTuple;
    public string name;

    public Tuple(string name)
    {
        this.name = name;
    }

    public TupleFactsEnumerable Facts
    {
        get => new TupleFactsEnumerable(this);
    }
}

public struct TupleFactsEnumerable
{
    private Tuple tuple;

    public TupleFactsEnumerable(Tuple tuple)
    {
        this.tuple = tuple;
    }

    public TupleEnumerator GetEnumerator()
    {
        return new TupleEnumerator(tuple);
    }
}

public struct TupleEnumerator
{
    public Tuple declaringTuple;

    private Fact currentFact;
    private Tuple currentTuple;
    private bool finished;

    public Fact Current => currentFact;

    public bool MoveNext()
    {
        if (finished)
        {
            return false;
        }

        // Walk up the tuple chain, skipping tuples whose RightFact is null,
        // to match the list-based walk in PopulateFacts above.
        do
        {
            currentTuple = currentTuple == null ? declaringTuple : currentTuple.LeftTuple;
        } while (currentTuple != null && currentTuple.RightFact == null);

        if (currentTuple == null)
        {
            finished = true;
            return false;
        }

        currentFact = currentTuple.RightFact;
        return true;
    }

    public TupleEnumerator(Tuple declaringTuple)
    {
        this.declaringTuple = declaringTuple;
        currentTuple = null;
        currentFact = null;
        finished = false;
    }
}

The TupleFactsEnumerable could be eliminated altogether if the GetEnumerator() method were added directly to Tuple in place of the Facts property. It's more or less a wrapper to stay consistent with the current iteration pattern of:

`foreach (Fact fact in tuple.Facts) {}`

Otherwise, it could be updated to allow iteration as:

`foreach (Fact fact in tuple) {}`

One thing this doesn't address is the case of a Tuple's "list" of facts being altered during iteration. However, I don't believe the current implementation addresses this either. Perhaps this is not possible in NRules?

There are various discussions online about this use of "duck typing" for efficient enumeration. It violates some contracts of C# as a strongly typed language and requires a mutable struct, but it allows iteration with no allocation, so it seems like a reasonable trade-off for special cases.

Here's one such article.

@snikolayev
Member Author

snikolayev commented Apr 24, 2020

@larrybehaviorlanguage something does not add up here. Boxing/unboxing would only happen if the enumerator were a struct (like in your prototype). But in NRules, Tuple.Facts is a generator, which Roslyn compiles into a private class. Here is how it looks when I decompile the binary:

.class nested private sealed auto ansi beforefieldinit
    '<get_Facts>d__30'
      extends [mscorlib]System.Object
      implements class [mscorlib]System.Collections.Generic.IEnumerable`1<class NRules.Rete.Fact>, [mscorlib]System.Collections.IEnumerable, class [mscorlib]System.Collections.Generic.IEnumerator`1<class NRules.Rete.Fact>, [mscorlib]System.IDisposable, [mscorlib]System.Collections.IEnumerator
{
//...
}

As you can see, it extends Object, not ValueType, so it's a class, not a struct, and there should be no boxing/unboxing happening.
But of course there is still an allocation.
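For context, the general shape of such a generator (a sketch consistent with the fact walk shown earlier, not NRules' exact source):

public IEnumerable<Fact> Facts
{
    get
    {
        // Roslyn compiles this iterator into a private class; each call
        // to the getter allocates one instance of that class.
        var tuple = this;
        while (tuple != null)
        {
            if (tuple.RightFact != null)
                yield return tuple.RightFact;
            tuple = tuple.LeftTuple;
        }
    }
}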

@larrybehaviorlanguage

Ahh, very interesting. I incorrectly assumed it must be a value type under the hood, but this makes sense. Thanks for explaining.

Regardless, it seems like the struct option eliminates the heap allocation problem.

@larrybehaviorlanguage

Though pooling of these would likely also be an option if you still prefer that.

@snikolayev
Member Author

Yep, I'm going to look at both options and see what works best.

@snikolayev
Member Author

@larrybehaviorlanguage unfortunately, reusing argument arrays for expression evaluation didn't work. A Rete network (and expressions are part of the Rete network) can be shared between multiple sessions. So, if multiple sessions are used from different threads, these shared arrays will lead to data corruption. I reverted the change and will now look at other mechanisms for reducing allocations.

@larrybehaviorlanguage

@snikolayev Ah, gotcha. I've never used asynchronous sessions for the same rete, but that totally makes sense.

Would it instead be possible to give each expression a synchronized Stack<object[]>? During Invoke, you could try to pop an array from the stack to use. If none are available, just instantiate one. Then, at the end of Invoke, push it back onto the stack.
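A rough sketch of that idea (illustrative names; a lock-free stack stands in for "synchronized"):

using System.Collections.Concurrent;

internal sealed class ArgsArrayPool
{
    private readonly ConcurrentStack<object[]> _pool = new ConcurrentStack<object[]>();
    private readonly int _length;

    public ArgsArrayPool(int length)
    {
        _length = length;
    }

    // Pop a spare array if one is available; otherwise allocate.
    public object[] Rent()
    {
        return _pool.TryPop(out var args) ? args : new object[_length];
    }

    // Push the array back for reuse by the next Invoke.
    public void Return(object[] args)
    {
        _pool.Push(args);
    }
}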

@snikolayev
Member Author

@larrybehaviorlanguage I'm going to experiment with ArrayPool and thread local storage. I want to reduce allocations, but at the same time I want to make sure this does not happen at the expense of performance, so need to carefully navigate this line.
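A minimal sketch of the thread-local variant (hypothetical; not what NRules ultimately shipped):

using System.Threading;

internal sealed class ThreadLocalArgs
{
    private readonly ThreadLocal<object[]> _args;

    public ThreadLocalArgs(int argumentCount)
    {
        // Each thread lazily gets its own array, so sessions running on
        // different threads never share a buffer.
        _args = new ThreadLocal<object[]>(() => new object[argumentCount]);
    }

    public object[] Buffer => _args.Value;
}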

@larrybehaviorlanguage

@snikolayev That makes total sense. I'd imagine that replacing allocations with reuse of arrays or collections would generally be a performance improvement, with the side benefit of eliminating the allocation itself. However, I may be off on that, or there may be exceptions. Regardless, one would hope that Microsoft's ArrayPool is more optimal than a homegrown solution like the synchronized Stack I suggested. I just know you were concerned that depending on ArrayPool would limit the platform targets NRules can support.

snikolayev added a commit that referenced this issue May 14, 2020
- Optimize tuple looping by using a struct iterator and eliminating allocations
- Optimize LHS expression by directly weaving tuple unwrapping into expression, eliminating array allocation
- Don't call expression evaluation hooks if no subscribers, eliminating result boxing
TODO:
- Lazily populate arguments array on expression evaluated subscription
- Optimize RHS and filter expressions
snikolayev added a commit that referenced this issue May 23, 2020
- Weave tuple unwrapping into optimized expression, to eliminate array allocation
- Weave dependency resolution into optimized expression if rule uses dependencies
- Eliminate terminal node and index map at the activation level and bake it into the compiled rule and action expressions
snikolayev added a commit that referenced this issue May 24, 2020
- Only pay the cost of allocation and evaluation for expression arguments if there are subscribers to lifecycle events, or if accessing via action invocation
snikolayev added a commit that referenced this issue May 24, 2020
* Optimize expressions #200

- Optimize tuple looping by using a struct iterator and eliminating allocations
- Optimize LHS expression by directly weaving tuple unwrapping into expression, eliminating array allocation
- Don't call expression evaluation hooks if no subscribers, eliminating result boxing

- Weave tuple unwrapping into optimized expression, to eliminate array allocation
- Weave dependency resolution into optimized expression if rule uses dependencies
- Eliminate terminal node and index map at the activation level and bake it into the compiled rule and action expressions

- Eliminate terminal node

- Only pay the cost of allocation and evaluation for expression arguments if there are subscribers to lifecycle events, or if accessing via action invocation
@snikolayev
Member Author

@larrybehaviorlanguage I managed to get rid of all allocations around expression evaluation, eliminate boxing, and also improve performance. Details are under #227. It ended up more complex than you and I discussed, but I'm very happy with the result.
This overall story is far from done, but this, I think, is a step in the right direction.
