Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Food for thought: SymbolId can carry a pointer to the owning SymbolTable #3054

Open
hs-apotell opened this issue Jun 16, 2022 · 15 comments
Open

Comments

@hs-apotell
Copy link
Collaborator

SymbolId in itself is just a number and to resolve it correctly it has to know which instance of SymbolTable it needs to use. This is very error prone and mistakes aren't very obvious. (I actually ended up introducing a bug and running the processes ended up creating files and directories in folders that were hard to get rid of because of special characters in them).

If SymbolId were to carry a pointer against which to resolve, it will eliminate lot of these issues. Basically, treating the SymbolId more like a handle then as an id.

  • Simplifies the API and usage of SymbolId and SymbolTable. Instead of symbolTable->getSymbol(symbolId) one would use symbolId.value()
  • Avoids copying the same symbol into multiple SymbolTables. For instance, symbols from CompilerDesign's::SymbolTable gets copied to ErrorContainer::SymbolTable for reporting an error. Instead one would just add the SymbolId (and thus different symbols within the errors list could point to different instances of SymbolTable).

I can't envision a scenario where something like this would interfere with multithreading and/or multiprocessing since these are read operations and SymbolTable is guaranteeing location persistence.

@alaindargelas @hzeller Thoughts??

@alaindargelas
Copy link
Collaborator

Sounds like a waste of memory (2x for the AST) and potentially on disk too if you persist that field along the IDs.

@hs-apotell
Copy link
Collaborator Author

Is memory increase really a concern here? Even for a reasonable sized design, this wouldn't be any substantial increase.

Currently, VObject is 48 bytes of which 16 bytes is two SymbolIds. It adds another 16 bytes to each instance, rounding it to 64 bytes each instance. Runtime memory lost here is easily gained by not having to clone SymbolIds from one SymbolTable to other.

@alaindargelas
Copy link
Collaborator

"I can't envision a scenario where something like this would interfere with multithreading and/or multiprocessing since these are read operations and SymbolTable is guaranteeing location persistence."

When parsing in multithreading, all symbol tables (one per file) are populated in parallel with no synchronization.
How do you propose doing that safely with a single table?

Moreover a symbol table per file is what enables partial recompile, where only files that changed gets recompiled and error messages for those files get updated. Finally all files get recombined and symbols merged.

@hzeller
Copy link
Collaborator

hzeller commented Jun 29, 2022

What would be the motiviation to do so ?

  • If motivation is error-avoidance: maybe make it an option that can be enabled in debug mode and can discover wrong uses, but otherwise is 'lean'.
  • for compactness: Can a symbolID just be the pointer to the unified string_view ? That way, if we have a symbolID, getting the symbol is just essentially dereferencing it. Disadvantage of course is, that this is not providing nice consecutive numbers that are predictable, so printing will not be pleasent and everything that needs a consecutive index (e.g. for serialization) needs to be handled differently.

@hs-apotell
Copy link
Collaborator Author

If motivation is error-avoidance: maybe make it an option that can be enabled in debug mode and can discover wrong uses, but otherwise is 'lean'.

That's one reason.

for compactness: Can a symbolID just be the pointer to the unified string_view ? That way, if we have a symbolID, getting the symbol is just essentially dereferencing it. Disadvantage of course is, that this is not providing nice consecutive numbers that are predictable, so printing will not be pleasent and everything that needs a consecutive index (e.g. for serialization) needs to be handled differently.

I have considered having the index plus the string_view. @alaindargelas seems to be concerned about adding 8 bytes to the implementation and this would double that to additional 16 bytes. I don't have a good justification to defend such a change.

The other reason (apart from the error-avoidance) why I think a index + SymbolTable pointer works better is it avoids having to duplicate symbols in multiple SymbolTables. Treat SymbolId as a pinned (immovable) string and as long as the lifetime of the SymbolTable is longer than the contained symbols itself, all this duplication can be avoided. There's a lot of code where a Symbol is copied from one SymbolTable to other and that to me seems very redundant. This doesn't mean there's only one SymbolTable (contrary to what @alaindargelas suggested in his comment above) across the board. There are just as many SymbolTable instances as there currently are but they will be smaller in size.

@hs-apotell
Copy link
Collaborator Author

When parsing in multithreading, all symbol tables (one per file) are populated in parallel with no synchronization.
How do you propose doing that safely with a single table?

I am not proposing single table. The number of SymbolTable remains the same as it is today. Except that if I have a SymbolId instance, I don't have to know which SymbolTable it belong to also - the SymbolId instance holds that information. In a collection of SymbolId' each instance can be pinned in a different SymbolTable and that would just work. At the moment, to create such a collection, all the SymbolIds have to belong to the same SymbolTable.

Moreover a symbol table per file is what enables partial recompile, where only files that changed gets recompiled and error messages for those files get updated. Finally all files get recombined and symbols merged.

How does this process break because of the proposed change?

@hs-apotell
Copy link
Collaborator Author

Can a symbolID just be the pointer to the unified string_view ?

On a second though - This is likely to have a performance impact. At the moment, SymbolId are compared merely as an integer comparison. With SymbolId being a wrapper around std::string_view it will be more expensive. Whether this has an overall impact or not is something that would need actual experimentation but at least in theory a integer comparison is preferred over a string comparison.

@hzeller
Copy link
Collaborator

hzeller commented Jul 6, 2022

What I mean is to compare the address of the string, as it is uniqified in the symbol table. So integer or pointer comparison has the same cost.

@hs-apotell
Copy link
Collaborator Author

That might work. With any of these solutions, we could also potentially save on memory in SymbolTable since there is no need for the index based lookup table anymore i.e. SymbolTable::m_id2Symbol.

Two questions arise for any of these solutions to work -

  • Is there ever a need to know which SymbolTable does a symbol belong to?
  • Is the lifetime of a SymbolTable guaranteed to be longer than the contained symbols? If this can't be guaranteed, falling back to std::shared_ptr might be a solution to ensure that the string memory doesn't get released even if the parent SymbolTable is destroyed. SymbolId will wrap around a std::shared_ptr in this case.

@alaindargelas
Copy link
Collaborator

I'm not following all the details, but here is a hard constraint:
Symbols have to be small numbers (Not pointers) starting at 0, I can read and remember number IDs when debugging AST dumps, I can't read pointers and remember them, it would render debugging horrendous.

Some answers:
In the current implementation we always know which symbol table holds a symbol because we are getting it from the table (SymbolId id = fC->Name(id)), in your implementation I guess that is not required.

The symbol tables always outlive the strings they represent (Else it is a bug).

@hs-apotell
Copy link
Collaborator Author

One side affect or unintended behavior change. Consider two symbol ids that both say resolve to bad symbols. In the current implementation their native ids would be different and so they would be considered not equal. But if the SymbolId was a wrapper on the string (or some representation of it) both would be pointing to the same bad raw symbol making them equal.

I don't know if this difference matter but worth at least pointing it out since it is different from current behavior.

@hs-apotell
Copy link
Collaborator Author

Opened a discussion #3228

@hs-apotell
Copy link
Collaborator Author

@hzeller @alaindargelas I guess, neither of you are a fan of this approach. I haven't got any feedback on the discussion about PathId using this strategy here #3228. If we aren't making any progress on this I would recommend it closing this issue with your thoughts.

@hzeller
Copy link
Collaborator

hzeller commented Nov 14, 2022

I still think we should have a the filesystem keep track of its own IDs, so e.g. have its own SymbolTable.
If it can be accessed from different threads, then it should just wrap it with a mutex.

@hs-apotell
Copy link
Collaborator Author

In the past @alaindargelas has had reservations against using synchronization primitives especially for SymbolTable. Also, it does get a bit complicated with batch jobs since they run on the same process instance sharing the same FileSystem (and with it owning the SymbolTable, it will be bloated).

I don't think the current solution is bad (may be the class name does exactly fit the logic behind it). But, I am open to suggestions on improving the existing solution, for both SymbolId & PathId.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants