Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update grammar to accept user defined literal suffixes #14054

Merged

Conversation

matheusaaguiar
Copy link
Collaborator

Task of #12656.

@matheusaaguiar matheusaaguiar added roadmap has dependencies The PR depends on other PRs that must be merged first labels Mar 16, 2023
@matheusaaguiar matheusaaguiar self-assigned this Mar 16, 2023
@matheusaaguiar
Copy link
Collaborator Author

matheusaaguiar commented Mar 16, 2023

AFAIU, whitespaces are ignored in the lexer and then values like the one below (from a failing test) are parsed by antlr with no errors:

The same already happened before for denominations, but we don't seem to have any tests for this case. The following will also be parsed successfully by antlr:

contract C {
 uint x = 1000gwei;
}

docs/grammar/SolidityParser.g4 Outdated Show resolved Hide resolved
docs/grammar/SolidityParser.g4 Outdated Show resolved Hide resolved
@cameel
Copy link
Member

cameel commented Mar 20, 2023

AFAIU, whitespaces are ignored in the lexer and then values like the one below (from a failing test) are parsed by antlr with no errors

I see we have a WS token that represents whitespace. Try putting it before identifier and see if that makes ANTLR require whitespace there.

Also, please add a syntax test for each of the whitespace characters that is allowed there - mine only has the space.

The same already happened before for denominations, but we don't seem to have any tests for this case.

Good find. We should also do the same for subdenominations: fix the grammar and add syntax tests.

@matheusaaguiar matheusaaguiar force-pushed the literal_suffix_functions_update_grammar branch from 7ef6987 to 4af093a Compare March 21, 2023 03:22
@matheusaaguiar
Copy link
Collaborator Author

I see we have a WS token that represents whitespace. Try putting it before identifier and see if that makes ANTLR require whitespace there.

I tried it but doesn't work.
In the previous message, I should have explained better that the WS token defined in the lexer recognizes whitespaces and then discards them using the skip command. So, when the parser gets the tokens, none of them are WS and thus it never chooses suffixedLiteral or numberLiteralWithSubdenomination rules.

I also found a problem with the functionDefinition rule. Since it can recognize multiple modifierInvocation and suffix is not a language keyword, the antlr parser will fail to error out a function with repeated suffix specifiers:

function suffix1(uint) pure suffix suffix returns (uint) {}
// ----
// ParserError 2878: (82-88): Suffix already specified.

@cameel
Copy link
Member

cameel commented Mar 21, 2023

I also found a problem with the functionDefinition rule. Since it can recognize multiple modifierInvocation and suffix is not a language keyword, the antlr parser will fail to error out a function with repeated suffix specifiers:

Hmm... Indeed looks like a problem. Actually, why virtual and override do not cause similar problems? The current grammar allows them on free function definitions. Do we just have no syntax test that tries to put virtual on a free function?

But anyway, one way out would be to split functionDefinition into freeFunctionDefinition and contractFunctionDefinition. It actually would not be that bad because it would make the grammar more precise and freeFunctionDefinition would be quite simple because it does not allow a lot of stuff that's allowed in contract functions.

Another one would be to allow only either modifierInvocation or Suffix. But I suspect that the only nice way to achieve that in ANTLR may be the split I mentioned above.

@cameel
Copy link
Member

cameel commented Mar 21, 2023

I tried it but doesn't work.

ok, so here's another idea. Looks like our scanner actually properly parses 1000suffix as a number followed by a literal and just has an extra validation that rejects it:

// The source character immediately following a numeric literal must
// not be an identifier start or a decimal digit; see ECMA-262
// section 7.8.3, page 17 (note that we read only one decimal digit
// if the value is 0).
if (isDecimalDigit(m_char) || isIdentifierStart(m_char))
return setError(ScannerError::IllegalNumberEnd);

We could simulate this by recognizing a number followed by literal as its own thing. Try to define DecimalNumberFollowedByIdentifier, which can basically consists of chars allowed in DecimalNumber, and then chars allowed in identifier. Just defining such a rule should make ANTLR recognize that 1000suffix is not a decimal number.

Also, to make sure it works, please add test cases for hex and fractional numbers (I suspect you'll need separate rules to handle them): 0x1000suffix, 1000.0suffix 1000.0e-5suffix and 0x1000abcdefgh.

@cameel
Copy link
Member

cameel commented Mar 21, 2023

Also, these grammar changes (i.e. split into free/contract function, number+identifier, renaming of NumberUnit) are quite general and do not depend on suffixes. Once you get them to pass all tests, I'd suggest to move them (along with relevant test cases) to another PR, straight to develop and here leave only changes that introduce suffix. We can get them merged separately and keep the main suffix PR smaller that way.

@matheusaaguiar
Copy link
Collaborator Author

Actually, why virtual and override do not cause similar problems? The current grammar allows them on free function definitions. Do we just have no syntax test that tries to put virtual on a free function?

We do have tests for that, but they generate SyntaxError rather than ParserError, which is the only category test_antlr_grammar.sh looks for in order to detect divergence with solc parser.

if grep -qE "^\/\/ ParserError" "${SOL_FILE}"; then
if [[ "${output}" != "" ]]
then
echo -e "${SGR_BLUE}[${cur}/${max}] Testing ${SOL_FILE}${SGR_RESET} ${SGR_BOLD}${SGR_GREEN}FAILED AS EXPECTED${SGR_RESET}"
else
echo -e "${SGR_BLUE}[${cur}/${max}] Testing ${SOL_FILE}${SGR_RESET} ${SGR_BOLD}${SGR_RED}SUCCEEDED DESPITE PARSER ERROR${SGR_RESET}"

But anyway, one way out would be to split functionDefinition into freeFunctionDefinition and contractFunctionDefinition.

Ok, agreed.

@cameel
Copy link
Member

cameel commented Mar 21, 2023

We do have tests for that, but they generate SyntaxError rather than ParserError, which is the only category test_antlr_grammar.sh looks for in order to detect divergence with solc parser.

Interesting. Well, then maybe working around it test_antlr_grammar.sh in some way would not be out of the question if adjusting it turns out to be too complicated. But first let's try to do it the nice way. We should be able to match the grammar.

Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, looks good now apart from some minor tweaks.

Please now extract the generic part into a PR on develop as I suggested in an earlier comment (note that this will require changing some tests to use denominations rather than suffixes).

test/libsolidity/syntaxTests/invalid_octal_number.sol Outdated Show resolved Hide resolved
test/libsolidity/syntaxTests/invalid_hex_number.sol Outdated Show resolved Hide resolved
docs/grammar/SolidityParser.g4 Outdated Show resolved Hide resolved
@cameel
Copy link
Member

cameel commented Mar 22, 2023

By the way, please remember to mark PRs as reviewable once you're done implementing them.

@cameel
Copy link
Member

cameel commented Mar 24, 2023

I rebased the suffix PR on #14066, so if you rebase this one on the suffix PR, we can now continue review here.

@matheusaaguiar matheusaaguiar force-pushed the literal_suffix_functions_update_grammar branch from d133142 to f54ddfc Compare March 24, 2023 20:20
Comment on lines 190 to 193
(
{!$mutabilitySet}? stateMutability {$mutabilitySet = true;}
| {!$suffixSpecifierSet}? Suffix {$suffixSpecifierSet = true;}
)*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With just mutability and suffix we could still do it without variables and keep it simple:

Suggested change
(
{!$mutabilitySet}? stateMutability {$mutabilitySet = true;}
| {!$suffixSpecifierSet}? Suffix {$suffixSpecifierSet = true;}
)*
stateMutability Suffix?
| Suffix stateMutability?


literal: stringLiteral | numberLiteral | booleanLiteral | hexStringLiteral | unicodeStringLiteral;

literalWithSubDenomination: numberLiteral SubDenomination;

suffixedLiteral: literal identifier;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, just realized that this is not entirely correct. It does not have to be a single literal. Can be member access as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I follow here. Should we be doing something like this?

Suggested change
suffixedLiteral: literal identifier;
memberAcess: identifier (Period identifier)+;
suffixedLiteral: (literal | memberAcess) identifier;

What would be a member access that is equivalent to a literal?

Copy link
Member

@cameel cameel Mar 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that something like 123 M.suffix is allowed because you can do import "A.sol" as M and A.sol can have suffix defined in it. identifier alone does not allow it.

Not sure if we want to define memberAccess explicitly though. I see that inside expression that's defined in place like this:

expression Period (identifier | Address) # MemberAccess

So I guess the solution is either to repeat that for suffixedLiteral or indeed define it as memberAccess.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, understood it now. I was thinking about the literal, not the identifier.
We have a test with suffix function imported from other file:

import "A.sol" as A;
import "B.sol" as B;
contract C {
function f() pure public {
1 s;
2 z;
3 A.s;
4 B.B.B.A.s;

I have removed the failing test in chk_antlr_grammar before running it locally and then got no errors, so I think that the current grammar already covers that somehow, but not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the current grammar already covers that somehow, but not sure.

But how? It can't be. expression does account for that explicitly so the literals should have to as well.

Comment on lines 92 to 95
Ufixed: 'ufixed' | ('ufixed' [1-9][0-9]+ 'x' [1-9][0-9]+);
Unchecked: 'unchecked';
Unicode: 'unicode';
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so a keyword was missing here. This is a general change that should go straight to develop.

With a test case preferably.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Created #14078 .


literal: stringLiteral | numberLiteral | booleanLiteral | hexStringLiteral | unicodeStringLiteral;

literalWithSubDenomination: numberLiteral SubDenomination;

suffixedLiteral: literal identifier;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the current grammar already covers that somehow, but not sure.

But how? It can't be. expression does account for that explicitly so the literals should have to as well.

@matheusaaguiar
Copy link
Collaborator Author

But how? It can't be. expression does account for that explicitly so the literals should have to as well.

Yeah, I am not sure why it works. I will investigate it.

@cameel
Copy link
Member

cameel commented Apr 3, 2023

Yeah, I am not sure why it works. I will investigate it.

So, did you figure it out?

@matheusaaguiar
Copy link
Collaborator Author

So, did you figure it out?

Not yet. I will prioritize it though.

@matheusaaguiar
Copy link
Collaborator Author

matheusaaguiar commented Apr 3, 2023

I checked the test_antlr_grammar script and found out that tests that are multi-source files like the one I mentioned before are excluded before actually running the ANTLR parser.

==== Source: A.sol ====
function s(uint) pure suffix returns (uint) { return 1; }
==== Source: B.sol ====
import {s} from "A.sol";
import {s as z} from "A.sol";

Still, seems like the parser accepts member access as literal suffixes...

@cameel
Copy link
Member

cameel commented Apr 3, 2023

Still, seems like the parser accepts member access as literal suffixes...

You mean ANTLR or solc parser? Because the latter obviously does, but ANTLR should too.

In any case, this explains why it did not fail. This means that the grammar is really incomplete, we just can't properly test it. It needs to be updated to cover this case.

@matheusaaguiar
Copy link
Collaborator Author

You mean ANTLR or solc parser?

I meant ANTLR parser. I just removed the ==== Source ... separators from the file and then let the script apply the ANTLR parser on it, which reported success.

It needs to be updated to cover this case.

Are we going to do this in the context of the literal suffix PR or is it a task we leave for later?

@cameel
Copy link
Member

cameel commented Apr 3, 2023

Are we going to do this in the context of the literal suffix PR or is it a task we leave for later?

In this PR. Without it the grammar is incomplete.

I just removed the ==== Source ... separators from the file and then let the script apply the ANTLR parser on it, which reported success.

This sounds weird. Is it only for suffixes or does it let you use dotted paths in other places that grammar does not officially include?

@matheusaaguiar
Copy link
Collaborator Author

I checked with the ANTLR test tool using the options -tokens and -gui (manually typing in CLI). When parsing expression uint x = 1 A.b.c; it produces the following output:

[@0,0:3='uint',<UnsignedIntegerType>,1:0]
[@1,5:12='constant',<'constant'>,1:5]
[@2,14:14='x',<Identifier>,1:14]
[@3,16:16='=',<'='>,1:16]
[@4,18:18='1',<DecimalNumber>,1:18]
[@5,20:20='A',<Identifier>,1:20]
[@6,21:21='.',<Period>,1:21]
[@7,22:22='b',<Identifier>,1:22]
[@8,23:23='.',<Period>,1:23]
[@9,24:24='c',<Identifier>,1:24]
[@10,25:25=';',<Semicolon>,1:25]
[@11,27:26='<EOF>',<EOF>,2:0]

The visualization tool produces this tree:
antlr4_parse_tree

It first recognizes the value of the assignment as a member access, and then decomposes the expression before the . as either another member access or finally as a suffixedLiteral.

@matheusaaguiar
Copy link
Collaborator Author

This sounds weird. Is it only for suffixes or does it let you use dotted paths in other places that grammar does not officially include?

I am trying to find if there's other cases.

@cameel
Copy link
Member

cameel commented Apr 4, 2023

Oh, that's very helpful. I see what's happening there.

Basically, ANTLR sees it as if it was uint x = (1 A).b.c;, which is already a valid expression. solc parser on the other hand interprets it as uint x = 1 (A.b.c);. We do need an extra rule to get that second interpretation.

@cameel
Copy link
Member

cameel commented Apr 4, 2023

I pushed a new version of the base PR. Please rebase.

@matheusaaguiar matheusaaguiar force-pushed the literal_suffix_functions_update_grammar branch from b21d2f9 to 1dc56e6 Compare April 4, 2023 15:46
@matheusaaguiar matheusaaguiar force-pushed the literal_suffix_functions_update_grammar branch from 1dc56e6 to a1059cf Compare April 5, 2023 13:45
@@ -129,7 +129,8 @@ done < <(
grep -v -E 'license/license_hidden_unicode.sol' |
grep -v -E 'license/license_unicode.sol' |
# Skipping tests with 'something.address' as 'address' as the grammar fails on those
grep -v -E 'inlineAssembly/external_function_pointer_address.*.sol'
grep -v -E 'inlineAssembly/external_function_pointer_address.*.sol' |
grep -v -E 'literalSuffixes/application/invalid_address_member_on_suffix.sol'
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this because I noticed we already had a workaround for the case.

Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, looks good now!

@cameel cameel marked this pull request as ready for review April 5, 2023 15:36
@cameel cameel merged commit 67ca210 into literal_suffix_functions Apr 5, 2023
1 check passed
@cameel cameel deleted the literal_suffix_functions_update_grammar branch April 5, 2023 15:36
@cameel cameel mentioned this pull request Apr 17, 2023
52 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has dependencies The PR depends on other PRs that must be merged first roadmap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants