Fetch<dynamic> caching #657

Open
SimonHurrNZ opened this issue Sep 29, 2022 · 33 comments

@SimonHurrNZ

SimonHurrNZ commented Sep 29, 2022

We have a stored procedure that can return a variety of different result sets, with the shape determined by its parameters.

I've been using Fetch to support this, as the calling code has no idea what the stored procedure will return.
However, when two calls to the stored procedure return the same number of columns, PetaPoco is returning the column names of the first call for both.

To reproduce:

CREATE TABLE tmpA (a int, b int, c nvarchar(50), d bit)
CREATE TABLE tmpB (e int, f int)
CREATE TABLE tmpC (g int, h datetime, i money, j nvarchar(50))
CREATE TABLE tmpD (k datetime, l datetime)

INSERT INTO tmpA (a,b,c,d) VALUES (1, 2, 'test', 0)
INSERT INTO tmpB (e,f) VALUES (3, 4)
INSERT INTO tmpC (g,h,i,j) VALUES (1, GetDate(), 1.5,'test2')
INSERT INTO tmpD (k,l) VALUES (GetDate(),GetDate())
GO

CREATE PROCEDURE GetData(@selector int) AS
	IF @selector = 1 SELECT * FROM tmpA
	IF @selector = 2 SELECT * FROM tmpB
	IF @selector = 3 SELECT * FROM tmpC
	IF @selector = 4 SELECT * FROM tmpD

PetaPoco code:

for (int i = 1; i < 5; i++)
{
    var proc = Sql.Builder.Append(";EXEC GetData @@selector = @selector", new { selector = i });
    var data = db.Fetch<dynamic>(proc);
    Console.WriteLine(JsonConvert.SerializeObject(data));
}

Results from the output; as can be seen, the column names for the last two calls are those from the first two calls.

[{"a":1,"b":2,"c":"test","d":false}]
[{"e":3,"f":4}]
[{"a":1,"b":"2022-09-29T12:57:43.303","c":1.5000,"d":"test2"}]
[{"e":"2022-09-29T12:57:43.303","f":"2022-09-29T12:57:43.303"}]
@pleb
Member

pleb commented Sep 29, 2022

@SimonHurrNZ I would have thought you'd need to pass a generic argument to db.Fetch(proc);?

I suspect the generic argument is being kept the same for each outcome of the stored proc. PetaPoco uses the generic argument to decide how to map the results from the DB to the POCO.


@SimonHurrNZ
Author

Apologies, my simple example didn't include the generic argument.
Fetch causes the output shown.

@SimonHurrNZ
Author

Ah, it's the markdown excluding the dynamic argument. Fetch<dynamic>(proc)

@pleb
Member

pleb commented Sep 29, 2022

@SimonHurrNZ I suspect the dynamic types are wrongly being treated as equal based on the number of properties they contain. Seems to be how .NET works. We would have to try and find a workaround for this.

In the meantime, you could try adding PocoData.FlushCaches(); before the Fetch (as shown below)

for (int i = 1; i < 5; i++)
{
    var proc = Sql.Builder.Append(";EXEC GetData @@selector = @selector", new { selector = i });
    PocoData.FlushCaches();
    var data = db.Fetch<dynamic>(proc);
    Console.WriteLine(JsonConvert.SerializeObject(data));
}

Can you let me know if ☝️ works? Also, the version of PetaPoco and the database you're using would help too.

It might be some time before I can look at this. I'm unsure of @asherber's availability, but hopefully one of us will get to it. Unless you're feeling keen to jump into the internals to see if you can resolve it yourself.

@SimonHurrNZ
Author

I'm on v6.0.441 at present, which doesn't have that FlushCaches() method exposed.
I'll see if I can figure out what version I need to have that method available, and I'll give it a try.
The database is SQL Server.

Thanks for looking at this so promptly!

@asherber
Collaborator

I'm not able to reproduce using Fetch<dynamic> with two consecutive SELECT statements that return different columns, so it might have something to do with the stored proc as intermediary. I'll take a closer look tomorrow.

@asherber
Collaborator

Yes, I can reproduce using a stored proc, even if I'm using FetchProc() instead of Fetch() with EXEC. It's very curious that this happens with stored procs but not with SELECT.

@asherber
Collaborator

I think the issue is here. When we create and cache a factory method for turning the IDataReader into a poco, it's keyed by the SQL statement and the field count (plus a couple other things that aren't relevant). When you're calling a stored proc like this, the SQL and the field count are the same, so the second call will reuse the first factory method. (The SQL params are different, but that's not taken into account.) It works with SELECT because the SQL is different between the two calls.

I'm not sure right now what the best fix is for this, will need to think on it. It seems like a bit of an edge case -- I think the general assumption is that a stored proc will always return the same set of columns. And if we add the params to the cache key, that will sort of defeat the general intent of the cache.
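
To illustrate the collision, here is a simplified sketch (the real key is built inside PocoData.GetFactory; the connection string below is just a stand-in):

// Parameter values are not part of the key, so selector = 2 (tmpB) and
// selector = 4 (tmpD), which both return two columns, produce identical keys.
const string sql = ";EXEC GetData @@selector = @selector";
const string connectionString = "..."; // whatever the Database is using

var keyForSelector2 = Tuple.Create(sql, connectionString, 0, 2);
var keyForSelector4 = Tuple.Create(sql, connectionString, 0, 2);

Console.WriteLine(keyForSelector2.Equals(keyForSelector4)); // True -- the second call reuses the first factory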

@SimonHurrNZ
Author

I can confirm that upgrading to v6.0.480 and using FlushCaches() has given us a usable workaround.

I'm not certain how much of a performance hit there will be with the cache constantly being flushed by use of this, but I'm very grateful to have a viable path for now. We'll start testing this tomorrow.

@pleb
Member

pleb commented Sep 29, 2022

@asherber I think you're correct. Great find. I agree that a proc returning different result sets seems a bit strange. But I have to admit I don't use procs all that much.

I wonder if the following would address the issue whilst still being performant?

            // Factor in stored procs that return different result sets
            var resultFields = string.Empty;
            if (sql.StartsWith(";"))
                resultFields = string.Join(",", Enumerable.Range(0, reader.FieldCount).Select(reader.GetName).ToArray());

            // Check cache
            var key = Tuple.Create<string, string, string, int, int>(sql, connectionString, resultFields, firstColumn, countColumns);

@asherber
Collaborator

I don't think that using ; to determine what to do will work -- you don't need to start with ; in order to call a stored proc, and in fact you could start any SQL statement with one if you wanted. We could use the field names as part of the cache key for all queries; a slight performance hit, but probably not too bad.

In general, with all respect to @SimonHurrNZ, my initial reaction is to say that this is enough of an edge case that maybe we don't need to account for it.

@SimonHurrNZ
Author

We'll monitor to see to what degree flushing the whole application's cache before each call to that method impacts overall performance for clients using that particular feature.

Is there perhaps scope to add an overload / parameter / setting that would bypass the cache build/lookup for a given query, instead of flushing the entire cache before making the call? That would be a perfect outcome from our requirements point of view, as these queries are essentially ad-hoc and thus there is no basis for caching the factory method.

@pleb
Member

pleb commented Sep 30, 2022

I agree that using the ; isn't foolproof. I did steal it from here though 👀

Also agree that if there isn't a concrete way of supporting stored procs with differing result sets, then maybe it's not worth supporting at all. I wonder if the compromise is a way to disable the cache for a particular operation 🤔

@pleb
Member

pleb commented Sep 30, 2022

@SimonHurrNZ Sorry. I didn't see your comment before replying. It seems we have come to the same conclusion.

@asherber
Collaborator

Well, the library does look for the ; to determine whether or not to attempt to add the SELECT part of the query. So right, an EXEC statement would need to start with a semicolon -- unless db.EnableAutoSelect is false.

I'm still not sure this is something the library needs to support. The user could, for example, design his stored proc so that it always returns the same set of columns, even if some of them are null. Something like:

CREATE PROCEDURE GetData(@selector int) AS
	IF @selector = 1 SELECT tmpA.*, null as e, null as f FROM tmpA
	IF @selector = 2 SELECT tmpB.*, null as a, null as b, null as c, null as d FROM tmpB

Now the proc will always return columns a-f. You could even add in each SELECT an ignore column containing a comma-delimited list of the columns in the result set that should be ignored; the calling code could use that to determine which columns in the result set to pay attention to.
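
For illustration only, the calling side of that approach could look something like the sketch below; the "ignore" column name and the filtering are assumptions, and db is the PetaPoco Database instance from the repro above:

using System;
using System.Collections.Generic;
using System.Linq;

// Assumes every SELECT in the proc now returns columns a-f plus an "ignore"
// column holding a comma-delimited list of column names the caller should discard.
var rows = db.Fetch<dynamic>(";EXEC GetData @@selector = @selector", new { selector = 2 });

foreach (IDictionary<string, object> row in rows)
{
    var ignored = new HashSet<string>(
        (row["ignore"] as string ?? string.Empty)
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(c => c.Trim()),
        StringComparer.OrdinalIgnoreCase);

    // Keep only the columns this particular result set actually populated.
    var usable = row.Where(kvp => kvp.Key != "ignore" && !ignored.Contains(kvp.Key))
                    .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
    // ... serialize or otherwise work with 'usable'
}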

The user could also move the logic into code, so that based on the value of the selector, different stored procs are called.

If the library did start supporting bypassing the factory cache, I think there's a question as to whether it's supported on the operation level (similar to passing in a custom mapper) or on the database level (similar to EnableAutoSelect and EnableNamedParams).

@Curlack
Contributor

Curlack commented Sep 30, 2022

Hi guys,
When I saw this issue, I had deja vu. When we were still using the old version, we solved it as below. Before you ask, I can't remember which version, but I know it was when PetaPoco was one file called DatabaseContext.cs. I've also added some reference code that may help the trained eye identify which version I'm referring to.
Apologies in advance for the long post, not sure if I should attach large pieces of code another way.

It took me a while to find the code, hence me posting this late. So here it is:

  // Create factory function that can convert a IDataReader record into a POCO
  public Delegate GetFactory(string sql, string connString, int firstColumn, int countColumns, IDataReader r) {
    ...
    il.Emit(OpCodes.Dup);                // obj, obj

    //Old
    il.Emit(OpCodes.Ldstr, r.GetName(i));// obj, obj, fieldname
  
    //New
    il.Emit(OpCodes.Ldarg_0);            // obj, obj, rdr
    il.Emit(OpCodes.Ldc_I4, i);          // obj, obj, rdr, i
    //read column names from IDataReader instead of Cache
    il.Emit(OpCodes.Callvirt, fnGetName);// obj, obj, fieldname
    ...
  }

  static List<Func<object, object>> _converters = new List<Func<object, object>>();
  //New
  static MethodInfo fnGetName = typeof(IDataRecord).GetMethod("GetName", new Type[] { typeof(int) });
  static MethodInfo fnGetValue = typeof(IDataRecord).GetMethod("GetValue", new Type[] { typeof(int) });

I haven't done any benchmarking to prove/disprove whether this change causes any performance issues, but it worked well for us at the time. I've used the following code to test it:

[TestMethod]
public void PetaPocoShouldNotCacheDynamicObjectColumnNames()
{
   //Arrange
  var db = new PersistanceManager("TestConnection");
  db.Execute(Scripts.Drop.ColumnNameCacheTest);
  ExecuteBatch(db, Scripts.Create.ColumnNameCacheTest);

  var expectedFirst = new Dictionary<string, object>
  {
    { "One", 1 },
    { "Two", 2 },
    { "Three", 3 }
  };
  var expectedSecond = new Dictionary<string, object>
  {
    { "Ten", 10 },
    { "Twenty", 20 },
    { "Thirty", 30 }
  };

  //Act
  var actualFirst = (IDictionary<string, object>)db.Fetch<dynamic>("EXEC dbo.CSP_ColumnsTest @0", 0).First(); //<--Must output one, two, three
  var actualSecond = (IDictionary<string, object>)db.Fetch<dynamic>("EXEC dbo.CSP_ColumnsTest @0", 1).First(); //<--Must output ten, twenty, thirty

  try
  {
    //Assert
    foreach (var item in actualFirst)
    {
      Assert.IsTrue(expectedFirst.ContainsKey(item.Key), $"First result: '{item.Key}' property was unexpected");
      Assert.AreEqual(expectedFirst[item.Key], item.Value, $"First result: '{item.Key}' property value was unexpected.");
    }

    //The second result previously had the correct values, but first result's keys
    foreach (var item in actualSecond)
    {
      Assert.IsTrue(expectedSecond.ContainsKey(item.Key), $"Second result: '{item.Key}' property was unexpected");
      Assert.AreEqual(expectedSecond[item.Key], item.Value, $"Second result: '{item.Key}' property value was unexpected.");
    }
  } finally
  {
    db.Execute(Scripts.Drop.ColumnNameCacheTest);
  }
}

public static class Scripts
{
  public static class Create
  {
    public const string ColumnNameCacheTest = @"SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[CSP_ColumnsTest] @@i INT = NULL
AS
    BEGIN
        IF(@@i = 0)
            BEGIN
                SELECT 1 AS One, 
                       2 AS Two, 
                       3 AS Three;
            END;
        ELSE
            BEGIN
                SELECT 10 AS Ten, 
                       20 AS Twenty, 
                       30 AS Thirty;
            END;
    END;";
  }

  public static class Drop
  {
    public const string ColumnNameCacheTest = @"IF (OBJECT_ID('[dbo].[CSP_ColumnsTest]') IS NOT NULL) DROP PROCEDURE [dbo].[CSP_ColumnsTest];";
  }
}

Not sure if this solution is still viable or implementable (if there's such a word) in the new version of PetaPoco, but hopefully it can help @SimonHurrNZ with his issue.

@asherber
Collaborator

@Curlack Interesting, thanks! 

So I think this doesn't bypass the factory cache entirely; it's just that when the generic type is object (or dynamic, which behaves the same way), the cached factory method doesn't rely on cached column names.

This would solve for the current edge case, but at the expense of the much more common case (slight performance hit for reading the column names each time).

Current code is here.

@Curlack
Contributor

Curlack commented Sep 30, 2022

@asherber, @pleb I would be more than happy to do the work if you guys consider the proposed solution to be a candidate. I can add an option like EnableAutoSelect, maybe call it DisableDynamicColumnCache (eish! variable naming...). It would be false by default, of course. When true, it will use this new feature. I can also do a benchmark to determine how much of a performance hit it is. We can discuss the metrics for the benchmark, 'cause I'm not sure how to 'purify' the result to ignore things like database latency. Maybe I can use BenchmarkDotNet out of the box.
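
For reference, a rough BenchmarkDotNet sketch of the kind of comparison involved. Since the IL change above isn't published, this one measures the cost of the existing workaround instead, comparing normal cached fetches against fetches that call PocoData.FlushCaches() first; the connection string, provider and table are assumptions.

using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using PetaPoco;
using PetaPoco.Core;      // PocoData; the namespace may differ between PetaPoco versions
using PetaPoco.Providers;

[MemoryDiagnoser]
public class FactoryCacheBenchmarks
{
    private IDatabase _db;

    [GlobalSetup]
    public void Setup()
    {
        // Assumed connection details; point at whatever test database holds tmpA.
        _db = DatabaseConfiguration.Build()
            .UsingConnectionString("Server=.;Database=Test;Trusted_Connection=True;")
            .UsingProvider<SqlServerDatabaseProvider>()
            .Create();
    }

    [Benchmark(Baseline = true)]
    public List<dynamic> CachedFactory() => _db.Fetch<dynamic>("SELECT * FROM tmpA");

    [Benchmark]
    public List<dynamic> FlushedFactory()
    {
        PocoData.FlushCaches(); // the workaround discussed earlier in the thread
        return _db.Fetch<dynamic>("SELECT * FROM tmpA");
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<FactoryCacheBenchmarks>();
}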

@asherber
Collaborator

One problem is that PocoData and its factory caches live outside of the databases. If db1 has EnableCache = true and makes a call, then the column names will be cached in the factory. If db2 with EnableCache = false then makes a call, they're going to get the same column names from the previously cached factory.
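
Roughly, the concern is this (EnableCache is the hypothetical per-database flag mentioned above, and BuildDatabase a made-up helper; PocoData's caches are static and shared across all Database instances):

// Hypothetical per-database flag; the factory cache itself is static.
var db1 = BuildDatabase(enableCache: true);
var db2 = BuildDatabase(enableCache: false);

db1.Fetch<dynamic>(";EXEC GetData @@selector = @selector", new { selector = 1 }); // builds and caches the factory
db2.Fetch<dynamic>(";EXEC GetData @@selector = @selector", new { selector = 2 }); // reuses db1's cached factory,
                                                                                   // column names and all, despite db2's setting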

@Curlack
Contributor

Curlack commented Sep 30, 2022 via email

@SimonHurrNZ
Author

Thanks @Curlack - if nothing else, it's nice to know I'm not the only developer on the planet who has had this issue!

In response to the question of what level a 'bypass cache' flag would be held at: we know exactly which call(s) require it, as in our implementation it would only need to be set for a call that returns a dynamic type.

@asherber - I can see that implementing a bypass flag in PocoData could be problematic, although if it could be reduced to a flag like 'DisableDynamicColumnCaching' that only checked and bypassed the cache for calls returning type <dynamic>, I would assume that would minimise the performance impact.

Ideally it would be a query-level overload or parameter, but I can see the issue of polluting the interface to cater for an edge case like this.

@asherber
Collaborator

asherber commented Oct 3, 2022

I ran a few quick benchmarks, and on 10K selects of a single row with 59 columns, the difference between caching the field names or not was about 32 sec vs. 37 sec. That's not a small difference, percentage-wise, even though the absolute overall effect may not be a big deal. 

I'm not crazy about making this change globally, and personally I'm still not convinced that it's a problem that needs fixing. If we want to do it, my suggestion would be a static flag on PocoData, and changing the flag should flush the cache, to avoid the issue I mentioned above.
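
Something like this, purely as a sketch (the member name is made up; nothing like it exists in PetaPoco today):

// Inside PocoData (sketch). Defaults to today's behaviour.
private static bool _cacheDynamicColumnNames = true;

public static bool CacheDynamicColumnNames
{
    get => _cacheDynamicColumnNames;
    set
    {
        if (_cacheDynamicColumnNames == value)
            return;

        _cacheDynamicColumnNames = value;
        FlushCaches(); // changing the flag invalidates factories built under the old setting
    }
}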

I also think this needs more testing. So far, I think we've all been looking at simple Fetch() calls, but we need to make sure it doesn't affect multi poco and multi result set queries.

@Curlack
Contributor

Curlack commented Oct 3, 2022

Wow! 5 seconds! I agree with @asherber, it's too much of a performance hit. It's a shame actually. I do have one last suggestion for @SimonHurrNZ though. What if we solve it outside of Peta Poco?
I came across this article a while back: Creating a dynamic, extensible C# Expando Object. I had to add some meat to make it production ready for our purposes, but the gist is that you can have a strongly typed class inherit from Expando while still treating it like dynamic, and it's castable to IDictionary<string,object>. Then only PetaPoco would "know" the type, but the rest of your project wouldn't be the wiser.
The only drawback on your side would be that the logic in the stored proc would need to be duplicated in the code, so the fetch can use the correct type. An adapter, so to speak.

@SimonHurrNZ
Author

Unfortunately @Curlack, that's simply not within the purview of our API. Our API only has the job of authenticating, building an SP query based on a request, executing it and outputting the result set. There's no knowledge of any given SP within the API itself. The SPs involved have many different potential responsibilities, and our clients work differently enough from each other that it's more straightforward to have a generic API call custom SPs to make sure each specific client executes the specific set of business logic they need. This is why the API has no idea what an SP is going to return, and relies on the dynamic data type to get an object it can serialize and output.

I'm in complete agreement that a general implementation of making dynamic result sets work within the caching system is not really called for. Unfortunately, our API has other responsibilities, which is why flushing the cache completely is likely to cause something of a performance hit. That is what has me wondering if being able to set a flag telling the factory to temporarily bypass cache lookup for dynamic result sets would be both feasible and performant. In this scenario, we would disable the caching, execute the query, and re-enable caching again immediately after the query completed.

If this cannot be achieved, we'll change our design to minimize the performance hit of the FlushCaches() call, which will be about all we can do without a fairly radical design change to certain aspects of our API.

@pleb
Member

pleb commented Oct 4, 2022

@SimonHurrNZ, maybe a better workaround, though a bit hacky, might be:

for (int i = 1; i < 5; i++)
{
    var proc = Sql.Builder.Append($";EXEC GetData @@selector = @selector -- {i}", new { selector = i });
    var data = db.Fetch<dynamic>(proc);
    Console.WriteLine(JsonConvert.SerializeObject(data));
}

The cache key includes the SQL statement, giving a new key per selector.

Note: I haven't tried it, and be careful of potential SQL injection attacks.

@asherber
Collaborator

asherber commented Oct 4, 2022

@pleb That's a great idea! If the selector is always an int, then there shouldn't be any issues with injection, but if it's a string you'd want to be careful.

@asherber
Collaborator

asherber commented Oct 4, 2022

Possible way to deal with strings:

$";EXEC GetData @@selector = @selector -- {i.GetHashCode()}", new { selector = i }

These methods make the cache work for you, by letting the cached factory store the correct field names for each parameterized call to the stored proc. You could also completely bust the cache by using the same technique, but with a random number at the end of each call. This would essentially make every call to the stored proc look completely unique, whether or not it reuses a previous selector. The drawback here is that the cache would grow for every single call, which would eventually cause memory issues (unless you periodically flush the cache).
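
A sketch of that cache-busting variant, building on the repro code above; it trades unbounded cache growth for correctness, so it needs the periodic flush mentioned:

for (int i = 1; i < 5; i++)
{
    // Guid.NewGuid() in a trailing comment makes every statement unique,
    // so PetaPoco never reuses a previously cached factory for this proc.
    var sql = $";EXEC GetData @@selector = @selector -- {Guid.NewGuid():N}";
    var data = db.Fetch<dynamic>(sql, new { selector = i });
    Console.WriteLine(JsonConvert.SerializeObject(data));
}

// The factory cache now grows on every call; flush it periodically
// (e.g. PocoData.FlushCaches()) to keep memory bounded.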

@pleb
Member

pleb commented Oct 4, 2022

Another way to tackle this and make the support more formal would be to expose a noCacheQueryHint prop on the DB provider and check for it in the SQL string.

var db = DatabaseConfiguration
    .Build()
    .UsingCommandTimeout(180)
    .WithAutoSelect()
    .WithNoCacheQueryHint()
    .Create();

// PocoData.GetFactory

var cacheFactory = true;

if (db.EnableNoCacheQueryHint && sql.Contains(provider.NoCacheQueryHint))
{
    cacheFactory = false;
}

This should have little impact on existing users while providing an official way to skip caching.

Though, maybe with the workaround above, and given it's not a common issue, it might not be of much value.

@SimonHurrNZ
Author

@pleb, is that suggesting that the db instantiation sets a property to allow for the option of bypassing cache on a query by query basis, and an individual query then uses a query hint to disable the cache lookup for that specific query?

If so, that sounds like it would be an ideal solution.

Failing that, we may be able to implement a variation on the workaround suggested by you and @asherber above, although I should confess that the code snippet I supplied isn't the actual production code, but rather a minimal implementation that can reproduce the underlying issue. I wasn't entirely sure what to expect when I submitted it!

@asherber
Collaborator

asherber commented Oct 5, 2022

PocoData.GetFactory() doesn't know anything about the db that's calling it -- you'd have to change the signature on this public method in order to pass in the db. I would vote against this.

@pleb I'm not clear on what you're saying about the provider having a property that can control this. Are you suggesting that there should be a magic string that tells PocoData to bypass the cache, along the lines of a leading semicolon bypassing AutoGenerateSelect? And if so, what does the provider have to do with it?

@pleb
Member

pleb commented Oct 6, 2022

@asherber I thought the provider would be needed because I assumed varying comment syntax between DBs. A quick search seems to suggest they all support the -- comment style consistently.

If the comment style isn't an issue, I guess the code could be simplified to

var db = DatabaseConfiguration
    .Build()
    .UsingCommandTimeout(180)
    .WithAutoSelect()
    .WithNoCacheQueryHint()
    .Create();

// PocoData.GetFactory

var cacheFactory = true;

if (checkForSkipCache && sql.EndsWith("-- nocache"))
{
    cacheFactory = false;
}

Yes, an additional argument would be needed on the GetFactory method.

@asherber
Collaborator

asherber commented Oct 7, 2022

Ah, okay. I still don't like the idea. I feel like it's enough of an edge case, and there are enough other ways to work around it, that adding this to the library isn't necessary.

@Michael-P-Bost

Michael-P-Bost commented Mar 29, 2024

Not sure if this will help anyone, but I stumbled across this post while dealing with the same issue for concrete types. The problem is the same: the results returned from the stored procedure have dynamic columns.

We already have our own class inheriting from Database to do pre and post command work, so I leveraged that to build a system allowing for additional cache data to be passed in.

  • Create overloads for QueryProc / FetchProc
  • Append hashed / serialized cache data to the SQL, and store the new key and the old key in a dictionary for lookup later.
  • OnExecutingCommand - check the dictionary against the command text. If an entry is found, reset the command text to the original value.

So now I can gather any properties or data that affect the shape of the stored procedure results and have that unique collection be part of my SQL cache string.

In short: Make the SQL used by PetaPoco for caching unique, and then reset it before sending to the DB.

//all below code is within class inheriting Database

private static ConcurrentDictionary<string, string> _cacheKeyToProcName = new ConcurrentDictionary<string, string>();

//combines the name and key, and also stores in our dictionary
private string AppendCacheValue(string storedProcedureName, string key)
{
    var newName = $"{storedProcedureName} cache=||{key}||";

    _cacheKeyToProcName.TryAdd(newName, storedProcedureName);

    return newName;
}

//other overloads like fetch just use this. 
//of course the string cache key can easily be serialized JSON
//I left any of that responsibility outside of this.
public IEnumerable<T> QueryProc<T>(string storedProcedureName, string cacheKey, params object[] args)
{
    storedProcedureName = AppendCacheValue(storedProcedureName, cacheKey);

    return base.QueryProc<T>(storedProcedureName, args);
}

public override void OnExecutingCommand(IDbCommand cmd)
{
    //see if this was an updated sp command via cache buster 
    if (_cacheKeyToProcName.TryGetValue(cmd.CommandText, out var originalName))
    {
        cmd.CommandText = originalName;
    }          

    //.... other code and stuff

    base.OnExecutingCommand(cmd);
}
