Food for thought: SymbolId can carry a pointer to the owning SymbolTable #3054

hs-apotell · 2022-06-16T09:06:22Z

SymbolId in itself is just a number and to resolve it correctly it has to know which instance of SymbolTable it needs to use. This is very error prone and mistakes aren't very obvious. (I actually ended up introducing a bug and running the processes ended up creating files and directories in folders that were hard to get rid of because of special characters in them).

If SymbolId were to carry a pointer against which to resolve, it will eliminate lot of these issues. Basically, treating the SymbolId more like a handle then as an id.

Simplifies the API and usage of SymbolId and SymbolTable. Instead of symbolTable->getSymbol(symbolId) one would use symbolId.value()
Avoids copying the same symbol into multiple SymbolTables. For instance, symbols from CompilerDesign's::SymbolTable gets copied to ErrorContainer::SymbolTable for reporting an error. Instead one would just add the SymbolId (and thus different symbols within the errors list could point to different instances of SymbolTable).

I can't envision a scenario where something like this would interfere with multithreading and/or multiprocessing since these are read operations and SymbolTable is guaranteeing location persistence.

@alaindargelas @hzeller Thoughts??

The text was updated successfully, but these errors were encountered:

alaindargelas · 2022-06-16T18:18:56Z

Sounds like a waste of memory (2x for the AST) and potentially on disk too if you persist that field along the IDs.

hs-apotell · 2022-06-17T05:14:18Z

Is memory increase really a concern here? Even for a reasonable sized design, this wouldn't be any substantial increase.

Currently, VObject is 48 bytes of which 16 bytes is two SymbolIds. It adds another 16 bytes to each instance, rounding it to 64 bytes each instance. Runtime memory lost here is easily gained by not having to clone SymbolIds from one SymbolTable to other.

alaindargelas · 2022-06-18T02:59:08Z

"I can't envision a scenario where something like this would interfere with multithreading and/or multiprocessing since these are read operations and SymbolTable is guaranteeing location persistence."

When parsing in multithreading, all symbol tables (one per file) are populated in parallel with no synchronization.
How do you propose doing that safely with a single table?

Moreover a symbol table per file is what enables partial recompile, where only files that changed gets recompiled and error messages for those files get updated. Finally all files get recombined and symbols merged.

hzeller · 2022-06-29T17:10:32Z

What would be the motiviation to do so ?

If motivation is error-avoidance: maybe make it an option that can be enabled in debug mode and can discover wrong uses, but otherwise is 'lean'.
for compactness: Can a symbolID just be the pointer to the unified string_view ? That way, if we have a symbolID, getting the symbol is just essentially dereferencing it. Disadvantage of course is, that this is not providing nice consecutive numbers that are predictable, so printing will not be pleasent and everything that needs a consecutive index (e.g. for serialization) needs to be handled differently.

hs-apotell · 2022-06-29T20:22:16Z

If motivation is error-avoidance: maybe make it an option that can be enabled in debug mode and can discover wrong uses, but otherwise is 'lean'.

That's one reason.

for compactness: Can a symbolID just be the pointer to the unified string_view ? That way, if we have a symbolID, getting the symbol is just essentially dereferencing it. Disadvantage of course is, that this is not providing nice consecutive numbers that are predictable, so printing will not be pleasent and everything that needs a consecutive index (e.g. for serialization) needs to be handled differently.

I have considered having the index plus the string_view. @alaindargelas seems to be concerned about adding 8 bytes to the implementation and this would double that to additional 16 bytes. I don't have a good justification to defend such a change.

The other reason (apart from the error-avoidance) why I think a index + SymbolTable pointer works better is it avoids having to duplicate symbols in multiple SymbolTables. Treat SymbolId as a pinned (immovable) string and as long as the lifetime of the SymbolTable is longer than the contained symbols itself, all this duplication can be avoided. There's a lot of code where a Symbol is copied from one SymbolTable to other and that to me seems very redundant. This doesn't mean there's only one SymbolTable (contrary to what @alaindargelas suggested in his comment above) across the board. There are just as many SymbolTable instances as there currently are but they will be smaller in size.

hs-apotell · 2022-06-29T20:29:17Z

When parsing in multithreading, all symbol tables (one per file) are populated in parallel with no synchronization.
How do you propose doing that safely with a single table?

I am not proposing single table. The number of SymbolTable remains the same as it is today. Except that if I have a SymbolId instance, I don't have to know which SymbolTable it belong to also - the SymbolId instance holds that information. In a collection of SymbolId' each instance can be pinned in a different SymbolTable and that would just work. At the moment, to create such a collection, all the SymbolIds have to belong to the same SymbolTable.

Moreover a symbol table per file is what enables partial recompile, where only files that changed gets recompiled and error messages for those files get updated. Finally all files get recombined and symbols merged.

How does this process break because of the proposed change?

hs-apotell · 2022-06-30T18:31:55Z

Can a symbolID just be the pointer to the unified string_view ?

On a second though - This is likely to have a performance impact. At the moment, SymbolId are compared merely as an integer comparison. With SymbolId being a wrapper around std::string_view it will be more expensive. Whether this has an overall impact or not is something that would need actual experimentation but at least in theory a integer comparison is preferred over a string comparison.

hzeller · 2022-07-06T22:24:31Z

What I mean is to compare the address of the string, as it is uniqified in the symbol table. So integer or pointer comparison has the same cost.

hs-apotell · 2022-07-06T23:22:38Z

That might work. With any of these solutions, we could also potentially save on memory in SymbolTable since there is no need for the index based lookup table anymore i.e. SymbolTable::m_id2Symbol.

Two questions arise for any of these solutions to work -

Is there ever a need to know which SymbolTable does a symbol belong to?
Is the lifetime of a SymbolTable guaranteed to be longer than the contained symbols? If this can't be guaranteed, falling back to std::shared_ptr might be a solution to ensure that the string memory doesn't get released even if the parent SymbolTable is destroyed. SymbolId will wrap around a std::shared_ptr in this case.

alaindargelas · 2022-07-07T03:33:29Z

I'm not following all the details, but here is a hard constraint:
Symbols have to be small numbers (Not pointers) starting at 0, I can read and remember number IDs when debugging AST dumps, I can't read pointers and remember them, it would render debugging horrendous.

Some answers:
In the current implementation we always know which symbol table holds a symbol because we are getting it from the table (SymbolId id = fC->Name(id)), in your implementation I guess that is not required.

The symbol tables always outlive the strings they represent (Else it is a bug).

hs-apotell · 2022-07-07T18:24:30Z

One side affect or unintended behavior change. Consider two symbol ids that both say resolve to bad symbols. In the current implementation their native ids would be different and so they would be considered not equal. But if the SymbolId was a wrapper on the string (or some representation of it) both would be pointing to the same bad raw symbol making them equal.

I don't know if this difference matter but worth at least pointing it out since it is different from current behavior.

hs-apotell · 2022-09-10T07:03:49Z

Opened a discussion #3228

hs-apotell · 2022-11-12T22:09:23Z

@hzeller @alaindargelas I guess, neither of you are a fan of this approach. I haven't got any feedback on the discussion about PathId using this strategy here #3228. If we aren't making any progress on this I would recommend it closing this issue with your thoughts.

hzeller · 2022-11-14T16:19:47Z

I still think we should have a the filesystem keep track of its own IDs, so e.g. have its own SymbolTable.
If it can be accessed from different threads, then it should just wrap it with a mutex.

hs-apotell · 2022-11-14T19:17:44Z

In the past @alaindargelas has had reservations against using synchronization primitives especially for SymbolTable. Also, it does get a bit complicated with batch jobs since they run on the same process instance sharing the same FileSystem (and with it owning the SymbolTable, it will be bloated).

I don't think the current solution is bad (may be the class name does exactly fit the logic behind it). But, I am open to suggestions on improving the existing solution, for both SymbolId & PathId.

hs-apotell mentioned this issue Sep 9, 2022

IO Abstraction: Introducing PathId & FileSystem (Phase 1) #3219

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Food for thought: SymbolId can carry a pointer to the owning SymbolTable #3054

Food for thought: SymbolId can carry a pointer to the owning SymbolTable #3054

hs-apotell commented Jun 16, 2022

alaindargelas commented Jun 16, 2022

hs-apotell commented Jun 17, 2022

alaindargelas commented Jun 18, 2022

hzeller commented Jun 29, 2022

hs-apotell commented Jun 29, 2022

hs-apotell commented Jun 29, 2022

hs-apotell commented Jun 30, 2022

hzeller commented Jul 6, 2022

hs-apotell commented Jul 6, 2022

alaindargelas commented Jul 7, 2022

hs-apotell commented Jul 7, 2022

hs-apotell commented Sep 10, 2022

hs-apotell commented Nov 12, 2022

hzeller commented Nov 14, 2022

hs-apotell commented Nov 14, 2022

Food for thought: SymbolId can carry a pointer to the owning SymbolTable #3054

Food for thought: SymbolId can carry a pointer to the owning SymbolTable #3054

Comments

hs-apotell commented Jun 16, 2022

alaindargelas commented Jun 16, 2022

hs-apotell commented Jun 17, 2022

alaindargelas commented Jun 18, 2022

hzeller commented Jun 29, 2022

hs-apotell commented Jun 29, 2022

hs-apotell commented Jun 29, 2022

hs-apotell commented Jun 30, 2022

hzeller commented Jul 6, 2022

hs-apotell commented Jul 6, 2022

alaindargelas commented Jul 7, 2022

hs-apotell commented Jul 7, 2022

hs-apotell commented Sep 10, 2022

hs-apotell commented Nov 12, 2022

hzeller commented Nov 14, 2022

hs-apotell commented Nov 14, 2022