Cep27 #259

jbae11 · 2017-09-14T19:31:16Z

Abstract

@scopatz
This CEP proposes to restructure the |cyclus| output database structure in order to
reduce the number of tables and redundancy of data, and ultimately reduce the number
of joins required for data analysis. Doing so would reduce the computing time
for end-user analysis, and allow for a clearer, more concise output database.

Updates to CEP27

gonuke

It's probably useful to explicitly mention the need for a major version bump due to this.

gonuke · 2017-09-15T01:34:20Z

source/cep/cep27.rst

+The **ExplicitInventory** table and **ExplicitInventoryCompact**
+table should be merged to a single table, called **Inventories**,
+with the following columns:
+


Since I'm unfamiliar with the previous tables, can you clarify what is changing here by showing the old table layout? (as you did with the others)

I am not sure I completely understand the question, but the old table layout is shown so that the reader can clearly understand what is being modified / removed with the proposed change.

The Explicitinventory table corresponds to a table with all the different inventory in all the different facilities:
SimId -- AgentId -- Time -- InventoryName -- NucId -- Quantity

where the ExplicitInventoryCompact looks more like the new "Inventory"
SimId -- AgentId -- Time -- InventoryName -- Quantity -- Composition

@bam241 answered my question - the old table layout was NOT shown for this case and perhaps should be...

gonuke · 2017-09-15T01:36:38Z

source/cep/cep27.rst

+============  ==========
+SimId           uuid
+Recipes         string
+Composition     map<int,double>


The previous layout anticipated the desire to select on nuclide in the query, and hence a different column for each NucId. Perhaps this has not emerged in the wild, but it seems that a consequence of this change would make this no longer possible.

Hello, that is a valid point.
Maybe it's necessary for us to define what
For example, if one wants the timeseries mass of Pu239,
the query would be like the following:

SELECT sum(massfrac*quantity) FROM resources INNER JOIN compositions ON resources.qualid = compositions.qualid WHERE nucid=942390000 GROUP BY timecreated

in the newer database structure, it would be:

SELECT quantity, timecreated, composition FROM materials

followed by a script that processes the result:

Get query results loop through every row of query results for every row, look in composition for nucid 94239, get its massfrac value multiply that by quantity add that to the list of pu239 value timeseries list

So I do assume that it would take a longer time to accomplish
what you mentioned ( and also needs additional scripting outside of the sqlite query)...

You probably know much more than me, but @scopatz and my initial thought was that
this would have more benefit than loss. What do you think?

I believe this will not be optimized for a large calculation (1000-10000 facilities), if you want to see the plutonium inventory in the fleet, you will need to load all the composition, get the informations you need and then re-generate a table.

I would prefer a system that allow us to filter using facility's name and nucid, but I am not sure it is possible without having a gigantic table :(

bam241 · 2017-09-17T20:25:18Z

There is a table that is natively in CLASS output file, that might be to consider here:
We have tables that regroup information by prototype definition(sum over the AgendId if they are the same prototypes).

I know that will create redondant informations, but it will also allows analyst to access the inventory of all the PWR_UOX without any computation.

something like:
SimId -- Prototype -- Time -- InventoryName -- NucId -- Quantity

In the general case, this will not be a problem because one usually have few different archetype specialization (I usually have 5-20 different facilities in my calculations)...
This will be an issue when you have a specific prototype declaration for each agents you are using.

This light InventoryTable could provide a quick access to the informations that are need 80% of the time -- at least for me (I generally never access specific agent informations)

I know and understand why we want to avoid having informations in 2 places, or to avoid to store informations that we can retrieve with some computations. But I also believe that the user analysis experience needs to be fast for the simple query (like show me the Pu inventory in my calculation) and I am not sure we can achieve that on a typical US-fleet simulation

jbae11 · 2017-09-18T14:17:16Z

@bam241 if I understood correctly, would there be multiple rows of nucid& quantity for the same inventoryname&time?

SimId -- PWR_UOX -- 1 -- fuel -- 922350000 -- 123
SimId -- PWR_UOX -- 1 -- fuel -- 922380000 -- 1245
SimId -- PWR_UOX -- 1 -- fuel -- 942390000 -- 124
and so on?

bam241 · 2017-09-18T14:21:21Z

Yes, but I am not sure everybody will agree on this one, but having the full inventory could allow quick analysis, and restricting it to Prototypes could limits its size.

maybe also have the capability to disable the table from the input file could be nice, (or maybe making it optional)

jbae11 · 2017-09-18T14:25:32Z

yeah I was thinking of a case where there are ~30 prototypes (predicting the past), and each of them would get a table. So having it optional would be necessary, I think. Do you want to maybe make some edits to the CEP to pitch that in? (If that's allowed, I'm not sure)

bam241 · 2017-09-18T14:28:11Z

I'll wait a bit to see how other react to it. If they agree I'll PR an edit.

30 prototypes is not that much, I have actually about 3000 agent on my FCO simulations, if you track less than 100 nuclides your table will be smaller than the actual ExplicitInventory...

scopatz · 2017-09-26T01:35:06Z

Hello All, I'll discuss the filtering by nuclide question here. There are a few alternatives to what is proposed here.

The current situation, ie a NucId column
Have a Material (or Composition) datatype in type system, and use this instead of map<int, double>
Have a column for each nuclide tracked.

The current situation

Advantages: The storage is still sparse, in the sense that it doesn't store nuclides that don't exist in the material. This method also does allow direct filtering.

Disadvantages: Currently, we store the SimId (uuid, 20 bytes) and the QualId (int, 4 bytes) for every NucId (int, 4 bytes) and MassFrac (double, 8 bytes) that we store. Thus, the overhead from the SimId & QualId is 2/3rds of the total storage size. So in addition to issue of having to roll through a huge number of rows to find the material of interest, the storage is also inefficient.

Material Datatype

We have a fairly sophisticated type system that allows us to store custom classes as unique types. Nothing stops us from having cyclus::Material, cyclus::Product, Resource Buffers, etc from being first class members of the type system.

Advantages: In fact, they already are! These types were added to aid in the C++/Python inter-language communication. Using Material as the type instead of map<int, double> would allow the backends to determine how they wish to store the data in these classes. This could be done in such a way that enables direct filtering by SQLite, etc.

Disadvantages: Someone would actually have to implement these datatypes (or at least Material) for the backends. I could work on the HDF5 one, but that still leaves SQLite. Additionally, I suspect that if the SQLite backend was implemented in such a way as to allow direct filtering with SQL queries that it will end up having similar data overheads as in the current situation option.

A column for every nuclide

Instead of having a long table, we could have a wider table where every nuclide (H1, H2, ... U235, U236...) gets its own column.

Advantages: This has relatively low SimId & QualId overhead compared to the current situation. It allows direct querying of individual nuclides by picking out the column of interest.

Disadvantages: This is not sparse; nuclides that are not present in the Material get a concrete 0.0 value stored in the database. Additionally, this may necessitate a global tracked nuclide set, whereby nuclides in not in this set will not be valid in Cyclus. This is a little antithetical to Cyclus' notion of being robust to whatever an archetype developer wants to supply. Also, for a sufficiently comprehensive set of tracked nuclides (1000 - 4000), there could be technical issues in the backends. SQLite might have a column limit. HDF5 has 64 kb of header data per table, which I have exceeded with the strategy and 1500 nuclide columns (I think) .

proposal

So in summary, the proposed version went in as it did because:

map<int, double> is available and implemented in all backends
it is sparse,
minimizes SimId and QualId overhead.

Yes, it does this at the cost of not being able to directly filter based on NucId. But all things considered, I think this is a worthy trade off. Doing a secondary filter after the material has been pulled out isn't hard, as @gonuke said this feature isn't used much, and it is expensive when it is used.

We can think about other mitigation strategies, such as convenience functions for material filtering or putting in (cy)metrics that do this filtering.

I am open to other options, of course, or if people prefer the material or the column strategy listed here, that would be great to know!

bam241 · 2017-09-26T02:46:46Z

this seems reasonable to me.

I would also like to suggest to add an option where we don't track agents and we only differentiate by prototype.
I don't know for you but in most of my simulation, I don't need to track individual agent by only prototypes.
this could reduce a lot the size of the database, and make it cheaper to analyze the output.

even in prediction the past I don't think you need a agent tracking.
(for me, it actually should the default, but I am probably bias by my own applications)

scopatz · 2017-09-26T02:52:12Z

I would also like to suggest to add an option where we don't track agents and we only differentiate by prototype.

You wouldn't be able to restart, but this is probably OK. In general, this is a good suggestion :) Which table do you propose be dropped or altered?

(for me, it actually should the default, but I am probably bias by my own applications)

I think the rule of thumb for the defaults should be that we store everything because we don't know what will be useful. Then if you know what you are doing (or are prepared for the consequences _), you can start turning things off ;)

bam241 · 2017-09-26T03:20:12Z

the tables that would need to be altered would be:

Resources
Materials
Products
Transactions
ResCreators
Inventories

the idea would be to replace any AgentId/SenderId/ReceiverId/ParentX into:
Prototype name.
this will imply to recompute the recipes and the compositions of the different materials.

We can also imagine the possibility to have a cohabitation between the Agent's tables and the Prototype's table....

I can submit a PR.

gonuke · 2024-02-07T03:08:37Z

@jbae11 - is this still relevant? If so, can you update the CEP number to CEP29. I think we accidentally took CEP27 so this will need a new number and CEP28 is already in the queue.

We'll also a need a tutorial on what it's about and where the discussion stands.

jbae11 and others added 3 commits September 13, 2017 11:16

cep27 rough draft

069dfe8

updates to CEP27

3230670

Merge pull request #1 from scopatz/cep27

eec6cbc

Updates to CEP27

gonuke reviewed Sep 15, 2017

View reviewed changes

jbae11 added 3 commits September 18, 2017 08:52

added explicit inventory tables for reference

1d67bca

past tables

b63c111

spelling..

b06a941

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cep27 #259

Cep27 #259

jbae11 commented Sep 14, 2017

gonuke left a comment

gonuke Sep 15, 2017

jbae11 Sep 17, 2017

bam241 Sep 17, 2017 •

edited

gonuke Sep 18, 2017

jbae11 Sep 18, 2017

gonuke Sep 15, 2017

jbae11 Sep 17, 2017

bam241 Sep 17, 2017

bam241 commented Sep 17, 2017 •

edited

jbae11 commented Sep 18, 2017

bam241 commented Sep 18, 2017 •

edited

jbae11 commented Sep 18, 2017

bam241 commented Sep 18, 2017

scopatz commented Sep 26, 2017

bam241 commented Sep 26, 2017 •

edited

scopatz commented Sep 26, 2017

bam241 commented Sep 26, 2017 •

edited

gonuke commented Feb 7, 2024

Cep27 #259

Are you sure you want to change the base?

Cep27 #259

Conversation

jbae11 commented Sep 14, 2017

Abstract

gonuke left a comment

Choose a reason for hiding this comment

gonuke Sep 15, 2017

Choose a reason for hiding this comment

jbae11 Sep 17, 2017

Choose a reason for hiding this comment

bam241 Sep 17, 2017 • edited

Choose a reason for hiding this comment

gonuke Sep 18, 2017

Choose a reason for hiding this comment

jbae11 Sep 18, 2017

Choose a reason for hiding this comment

gonuke Sep 15, 2017

Choose a reason for hiding this comment

jbae11 Sep 17, 2017

Choose a reason for hiding this comment

bam241 Sep 17, 2017

Choose a reason for hiding this comment

bam241 commented Sep 17, 2017 • edited

jbae11 commented Sep 18, 2017

bam241 commented Sep 18, 2017 • edited

jbae11 commented Sep 18, 2017

bam241 commented Sep 18, 2017

scopatz commented Sep 26, 2017

The current situation

Material Datatype

A column for every nuclide

proposal

bam241 commented Sep 26, 2017 • edited

scopatz commented Sep 26, 2017

bam241 commented Sep 26, 2017 • edited

gonuke commented Feb 7, 2024

bam241 Sep 17, 2017 •

edited

bam241 commented Sep 17, 2017 •

edited

bam241 commented Sep 18, 2017 •

edited

bam241 commented Sep 26, 2017 •

edited

bam241 commented Sep 26, 2017 •

edited