What the heck is "isvariablelength" for? Nanoseconds wat? #325

Open
goodboy opened this issue May 28, 2020 · 1 comment

goodboy commented May 28, 2020

I'm looking at the relevant server code sections:

Here we write the data payload of the buffer to the end of the data file. Prior to writing the new data, we fetch any previously written data and prepend it to the current data. This implements "append"

You know what's handy, putting the name Append somewhere in the func name 😉

Secondly,

Also it's probably worth mentioning that there's some kind of relationship with the Nanoseconds column. I got real confused when using the client to do stuff and got weird different numbers back depending on whether I wrote the Nanoseconds field and used isvariablelength (look, the unit tests and me are the same 🥰).

That is, if isvariablelength is set:

We have to add a column named nanoseconds to the datashapes for a variable record type. This is true because the read() function for variable types inserts a 32-bit nanoseconds column.

Ok so let's stop and think here.

We're removing Nanoseconds because, before writing to disk, we convert ColumnSeries -> RowSeries without passing through the rowType flag, which would make NewRowSeries add the 'Nanoseconds' DataShape which we apparently need:

because the read() function for variable types inserts a 32-bit nanoseconds column.

But really it's because we already got the Nanoseconds out and are passing it as a []time.Time to Writer.WriteRecords()?

Uh, ok, so I guess "the read()" means when GetTime() gets called, or?

Again, comment says we need this Nanoseconds field for "reading" and GetTime() seems to need it for generating a []time.Time output, if there is a Nanoseconds column. Well that's good because (as mentioned in last bullet ^) we are calling it then handing it to Writer.WriteRecords().

Let's note that ColumnSeriesMap.FilterColumns() method requires Nanoseconds as part of the index.

Ok where are we again?

  • ColumnSeriesMap needs Nanoseconds for the index but ColumnSeries doesn't
  • when we're writing to disk, loop through all the ColumnSeries in the ColumnSeriesMap and remove the Nanoseconds fields because when converting to a RowSeries we don't pass through the rowType flag which would add that DataShape for Nanoseconds
  • But before all that we call times := cs.GetTime() and eventually pass that to the writer routine whilst documenting in that method that we need Nanoseconds in for reading
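
To make those three bullets concrete, here's a toy Python model of the write path as described above. Everything in it (the dict-of-lists stand-in for a ColumnSeries, `get_times`, `strip_nanoseconds`) is hypothetical and only mirrors the order of operations, not the actual Go code:

```python
def get_times(columns):
    """Combine Epoch seconds and optional Nanoseconds into ns timestamps."""
    epochs = columns["Epoch"]
    # Assumption: a missing Nanoseconds column counts as all zeros.
    nanos = columns.get("Nanoseconds", [0] * len(epochs))
    return [e * 10**9 + n for e, n in zip(epochs, nanos)]


def strip_nanoseconds(columns):
    """Return a copy without the Nanoseconds column (variable-length case)."""
    return {k: v for k, v in columns.items() if k != "Nanoseconds"}


cs = {"Epoch": [1451642400], "Bid": [3.0], "Nanoseconds": [140000]}
times = get_times(cs)                # taken out *before* the column is dropped
row_columns = strip_nanoseconds(cs)  # what would get turned into rows
```

In this toy version the sub-second part isn't lost when the column is dropped; it rides along in `times`, which is what gets handed to the writer.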

🤯. So this all seems pretty circular.

Alright let's go back to what we were doing. Right, WriteCSM(), we're writing our ColumnSeriesMap to disk!

tbi, err := cDir.GetLatestTimeBucketInfoFromKey(&tbk)

Ok so if a tbi doesn't exist and isVariableLength is set, we're gonna pass recordType=io.VARIABLE to io.NewTimeBucketInfo().

So we have a ColumnSeries with no Nanoseconds DataShape (if isVariableLength is set) and we're making a new TimeBucketInfo with a "variable length" meaning this stuff gets set:

} else if f.recordType == VARIABLE {
    f.recordLength = 24 // Length of the indirect data pointer {index, offset, len}
    f.variableRecordLength = 0
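
For what it's worth, 24 bytes splits evenly into three 8-byte fields. Assuming (the snippet doesn't confirm this) that index, offset and len are each a little-endian int64, the indirect pointer could be packed like this in Python:

```python
import struct

# Assumed layout: three little-endian int64 values {index, offset, len}.
POINTER = struct.Struct("<qqq")

packed = POINTER.pack(7, 4096, 16)  # made-up values, purely for illustration
index, offset, length = POINTER.unpack(packed)
```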

Cool, let's go back to WriteCSM()...

    // Check if the previously-written data schema matches the input
    columnMismatchError := "unable to match data columns (%v) to bucket columns (%v)"
    dbDSV := tbi.GetDataShapesWithEpoch()
    csDSV := cs.GetDataShapes()
    if len(dbDSV) != len(csDSV) {
        return fmt.Errorf(columnMismatchError, csDSV, dbDSV)
    }
    missing, coercion := GetMissingAndTypeCoercionColumns(dbDSV, csDSV)
    if missing != nil || coercion != nil {
        return fmt.Errorf(columnMismatchError, csDSV, dbDSV)
    }
    /*
        Create a writer for this TimeBucket
    */
    w, err := NewWriter(tbi, ThisInstance.TXNPipe, cDir)
    if err != nil {
        return err
    }
    w.WriteRecords(times, rowdata)
}

So if the columns in the TimeBucketInfo and the ColumnSeries match, we're golden and ready to write to disk the new RowSeries we just rendered.
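
A rough Python rendering of that check, using the column shapes from the client error message further down. The function name and the tuple representation of a DataShape are made up for illustration; the real logic lives in GetMissingAndTypeCoercionColumns, which this only approximates:

```python
def check_columns(bucket_shapes, input_shapes):
    """Raise if the input columns don't line up with the bucket's columns."""
    msg = "unable to match data columns (%s) to bucket columns (%s)" % (
        input_shapes,
        bucket_shapes,
    )
    if len(bucket_shapes) != len(input_shapes):
        raise ValueError(msg)
    have = dict(input_shapes)
    # Columns the bucket expects but the input lacks.
    missing = [name for name, _ in bucket_shapes if name not in have]
    # Columns present in both but with a different element type.
    coercion = [
        name for name, typ in bucket_shapes if name in have and have[name] != typ
    ]
    if missing or coercion:
        raise ValueError(msg)


bucket = [("Epoch", "INT64"), ("Bid", "FLOAT32"), ("Nanoseconds", "INT32")]
check_columns(bucket, list(bucket))  # identical shapes: passes silently
```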

Ok so now Writer.WriteRecords() gets called with the times and the row data and sends a command over a channel to write the data to disk.

So everything should be fine? Nanoseconds is written to disk when isVariableLength is set, but only because it's always written, even when isVariableLength is false?

That seems to fit with the testing comments minus some mysterious precision problem.

But then I found this rewritebuffer.go and started getting worried:

// RewriteBuffer converts variable_length records to the result buffer.
//
// variable records in a file: [Actual Data (VarRecLen-4 byte) , Interval Ticks(4 byte) ]
// rewriteBuffer converts the binary data to [EpochSecond(8 byte), Actual Data(VarRecLen-4 byte), Nanoseconds(4 byte) ] format.
//
// buffer
// +-----------------------VarRecLen [byte]---+-----------------------+
// + Actual Data(Ask,Bid, etc.)               | IntervalTicks(4byte)  |
// +------------------------------------------+-----------------------+
//
// ↓ rewriteBuffer
//
// rbTemp (= temporary result buffer)
// +--------------------+--VarRecLen + 8 [byte]-----+-------------------+
// + EpochSecond(8byte) | Actual Data(Ask,Bid, etc) | Nanosecond(4byte) |
// +--------------------+---------------------------+-------------------+
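
Here's how that conversion might look in Python for a single record. Big caveats: the little-endian byte order and the way IntervalTicks maps to time (assumed here to be a fraction of the interval scaled to 2**32) are guesses, not something confirmed from the server code, and `rewrite_record` is a made-up name:

```python
import struct

TICKS_PER_INTERVAL = 2**32  # assumption: ticks are a fraction of the interval


def rewrite_record(record, interval_start, interval_secs=1):
    """[data..., ticks(4)] -> [epoch(8), data..., nanoseconds(4)]."""
    data = record[:-4]
    (ticks,) = struct.unpack("<I", record[-4:])
    # Offset into the interval in nanoseconds (integer math, rounds down).
    offset_ns = ticks * interval_secs * 10**9 // TICKS_PER_INTERVAL
    epoch = interval_start + offset_ns // 10**9
    nanos = offset_ns % 10**9
    return struct.pack("<q", epoch) + data + struct.pack("<i", nanos)


# One float32 Bid followed by a tick count halfway through a 1Sec interval.
stored = struct.pack("<f", 3.0) + struct.pack("<I", 2**31)
rewritten = rewrite_record(stored, interval_start=1451642400)
```

If something like this is what happens on read, the Nanoseconds you get back are derived from the stored ticks rather than copied from a column, which would fit with a variable-length bucket always returning a Nanoseconds field.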

Oh man there's more isVariableLength stuff 😿

It turns out that's used when reading back data for queries...that explains that test that doesn't work.

So as far as I can tell (which is really, really questionable), Nanoseconds written by the client are always written to disk by marketstore regardless of isvariablelength (still unclear why that is), and when you read those same records back, the rewrite buffer calculates its own Nanoseconds (if it needs to?). But only if isvariablelength=True do you always read back a Nanoseconds field, regardless of whether you wrote one in the first place?

Summary

  • maybe mention that isvariablelength should be documented as an append operation and then maybe even make that a separate Client.append() method?
  • Nanoseconds are always written as a field if you use isvariablelength=True (despite the comments and server code making it super confusing 😢)
  • Nanoseconds values are written by some client tests storing tick data
  • a bunch of feeders also write Nanoseconds for orders tracking
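
A separate append method could be as thin as the sketch below. `append` is hypothetical (nothing in this issue shows such a method existing in the client); it only forwards to the same client.write(..., isvariablelength=True) call used in the demos in this thread:

```python
def append(client, data, tbk):
    """Hypothetical helper: write `data` to `tbk` as a variable-length append."""
    return client.write(data, tbk, isvariablelength=True)
```

The point would be naming: a caller reads "append" instead of having to know what isvariablelength implies.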

PS

Sorry about the long write-up but I tend to want to get to know the projects I'm seriously eyeing up for production use 👍


goodboy commented May 28, 2020

Maybe also a small demonstration of the client behaviors:

  • isvariablelength=False aka not an append but writing an explicit Nanoseconds:
[nav] In [32]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140000)],
          ...:     dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'Monkey/1Sec/TICK', isvariablelength=False),
Out[32]: ({'responses': None},)

[nav] In [33]: client.query(pymarketstore.Params('Monkey', '1Sec', 'TICK')).first().df()                                       
Out[33]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0       140000

[nav] In [34]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140001)],
          ...:     dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'Monkey/1Sec/TICK', isvariablelength=False),
Out[34]: ({'responses': None},)

[nav] In [35]: client.query(pymarketstore.Params('Monkey', '1Sec', 'TICK')).first().df()                                       
Out[35]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0       140001

So if you never wrote Nanoseconds and isvariablelength=False then you don't get it magically created:

[ins] In [45]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)],
          ...:     dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'Monkey_NO_NANO/1Sec/TICK', isvariablelength=False),
Out[45]: ({'responses': None},)

[ins] In [46]: client.query(pymarketstore.Params('Monkey_NO_NANO', '1Sec', 'TICK')).first().df()                               
Out[46]: 
                           Bid
Epoch                         
2016-01-01 10:00:00+00:00  3.0

  • isvariablelength=True aka an append:
[nav] In [37]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140000)],
          ...:     dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'APPEND/1Sec/TICK', isvariablelength=True)
Out[37]: {'responses': None}
[ins] In [39]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140001)],
          ...:     dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'APPEND/1Sec/TICK', isvariablelength=True),
Out[39]: ({'responses': None},)
[ins] In [40]: client.query(pymarketstore.Params('APPEND', '1Sec', 'TICK')).first().df()                                       
Out[40]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0       140000
2016-01-01 10:00:00+00:00  3.0       140001

But wait let's continue with that and find our magic Nanoseconds created for us always:

[nav] In [41]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)],
          ...:     dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'APPEND/1Sec/TICK', isvariablelength=True),
Out[41]: ({'responses': None},)

[nav] In [42]: client.query(pymarketstore.Params('APPEND', '1Sec', 'TICK')).first().df()                                       
Out[42]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0            0
2016-01-01 10:00:00+00:00  3.0       140000
2016-01-01 10:00:00+00:00  3.0       140001

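^ That doesn't happen if the bucket was first written with isvariablelength=False (here 'Monkey', created above with an explicit Nanoseconds column). Writing without Nanoseconds then just gets a column mismatch error, even with isvariablelength=True: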

[ins] In [44]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)],
          ...:     dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'Monkey/1Sec/TICK', isvariablelength=True),
Out[44]: 
({'responses': [{'error': 'unable to match data columns ([{Epoch INT64} {Bid FLOAT32}]) to bucket columns ([{Epoch INT64} {Bid FLOAT32} {Nanoseconds INT32}])',
    'version': '34352c9738c9164d7c65264a532d99341c57fae2'}]},)
