
Proposed changes to TokenReader #697

Open
gabejohnson opened this issue Apr 27, 2017 · 11 comments

@gabejohnson
Member

gabejohnson commented Apr 27, 2017

I've been thinking about changes to the reader API to make it simpler and add better state management and caching.

It would make sense to postpone these features until reader macros are in the wild to see what pain points there are.

It would also make sense to see whether the API would be informed by a new implementation.

Big concerns:

  • Language composability
  • State management
  • Caching
  • Laziness

Example:

const fromDefaultState = override => Object.assign({}, {
  line: 1,
  column: 1,
  pos: 0,
  filename: '',
  allowExprs: false,
  result: List()
}, override);

let _, wsTable;

// the Reader adds source segment/action mappings to the table (cache)
// 'segment': { read: always({sometoken}) }
[_, {table: wsTable}] =
  Reader('\t\v\f \u00A0...')
  .run(fromDefaultState({table: Readtable({
    '': {
      read(prefix, source) {
        return [null, source];
      }
    }
  }
)}));

// to add 'data' we can map over the entries in the table
wsTable = wsTable.map(v => (v.data = {terminates: true}, v));

let kwTable;
[_, {table: kwTable}] =
  Reader('await delete typeof void break case catch class continue debugger...')
  .run(fromDefaultState({table: wsTable.concat(Readtable({
  '': {
    read(prefix, source) {
      let char = source[0];
      const {data: {terminates}, read} = this.table(char);
      return terminates ?
        [{
          token: {
            type: 'KeywordToken',
            value: prefix
          }
        }, source] :
        read(prefix + char, source.slice(1));
    }
  }
}))}));

let [[[token]]] = Reader('typeof').run(fromDefaultState({table: kwTable}));
token // { type: 'KeywordToken', value: 'typeof', slice: {...}}

let syntaxTemplateTable = Readtable({
  '#`': {
    read: readSyntaxTemplate,
    data: {terminates: true}
  }
});

let numericTable = Readtable().addRange(48, 57, {
  read(prefix, source) {
    ...
  }
});

let delimTable = kwTable.concat(Readtable({
  '{': {
    read(prefix, source) {
      const { locationInfo: { line, column }, allowExprs, result, table } = this;
      return Reader.until('}', source)
             .run({
               locationInfo: { line, column },
               allowExprs: isExprPrefix(line, allowExprs, result),
               result: List(),
               table
             })[0]; // [result, source]
    },
    data: {terminates: true}
  },
  '}': {
    read(prefix, source) {
      throw Error(...);
    }
  }
}));

Using this API would aid language composability as the user is simply concatenating two readtables together (Readtable would form a Monoid).
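
For concreteness, a minimal sketch of that monoid: a toy, plain-object Readtable with a last-write-wins concat (not the real implementation):

// toy Readtable: an immutable map from source prefixes to entries
function Readtable(entries = {}) {
  return {
    entries,
    // mappend: merge two tables, the right table winning on key collisions
    concat(other) {
      return Readtable(Object.assign({}, entries, other.entries));
    },
    // map over entries, returning a new table (as used above to add `data`)
    map(f) {
      const next = {};
      for (const key of Object.keys(entries)) next[key] = f(entries[key]);
      return Readtable(next);
    }
  };
}
// mempty: the identity for concat
Readtable.empty = Readtable();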

'lang sweet.js, decorator' could simply result in a call to composeLanguages:

function composeLanguages(langs) {
  return langs.reduce((table, lang) => table.concat(require(`${lang}-readtable`)), Readtable.empty);
}

Behind the scenes, the readtable is being "updated" (it's immutable) with the slice of the source string that was consumed mapped to a thunk returning the token(s) produced by the reader. This would allow large portions of the source to avoid analysis and tokenization in the incremental compilation case.
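
A rough sketch of that cache shape (hypothetical helper names, reusing the toy Readtable above):

// speculative: consumed source slices map to thunks producing their tokens
function cacheTokens(table, slice, tokens) {
  // "updating" the immutable table yields a new table with the mapping added
  return table.concat(Readtable({ [slice]: { read: () => tokens } }));
}

function lookupCached(table, slice) {
  const entry = table.entries[slice];
  return entry ? entry.read() : null; // a hit skips re-reading that slice
}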

Tokens would be decorated with a slice containing location information by the reader.

I have struggled with state management in the reader for a while now and am starting to suspect that a state monad transformer would be useful here. Currently both the reader and the CharStream contain state necessary to create a token. I would like to get rid of CharStream altogether, as IMO it doesn't provide much beyond holding location state if the readtable can return matched prefixes (à la a trie).
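
To make the state-threading idea concrete, a toy State-style sketch (nothing transformer-grade; every name below is hypothetical):

// a reader step is a function from state to [value, nextState]
const of = value => state => [value, state];

// chain sequences two steps, threading the state without mutation
const chain = (step, next) => state => {
  const [value, nextState] = step(state);
  return next(value)(nextState);
};

// example step: consume one character, advancing pos and column
const readChar = source => state => [
  source[state.pos],
  Object.assign({}, state, { pos: state.pos + 1, column: state.column + 1 })
];

// two sequenced reads; position/column bookkeeping happens in one place
const twoChars = chain(readChar('ab'), first =>
  chain(readChar('ab'), second => of(first + second)));
twoChars(fromDefaultState({})); // => ['ab', { pos: 2, column: 3, ... }]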

I don't know if this caching would actually result in a performance improvement. It's currently just speculation.

As for laziness, I was considering redefining TokenTree as:

type TokenTree = Token | Iterator<TokenTree>

Then read would be:

function read(source: string, filename: ?string): Iterator<TokenTree> {...}
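
A generator satisfies that signature directly; a hedged sketch of a lazy read, where readOneToken is a hypothetical helper returning [tokenTree, remainingSource]:

// lazy read: yields one TokenTree at a time instead of building the whole list
function* read(source, filename) {
  let rest = source;
  while (rest.length > 0) {
    const [tokenTree, remaining] = readOneToken(rest, filename);
    if (tokenTree != null) yield tokenTree; // e.g. skip pure whitespace
    rest = remaining;
  }
}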

/cc @disnet

@disnet
Member

disnet commented May 1, 2017

That all sounds great to me. Use all the FP!

Your composability goal still comes down to "last write wins" right? Just with a well defined method for doing that composition.

Are you sure you want to expose the current API first? Your proposed changes seem clearly superior, so we could make the changes first and then expose them, but admittedly I don't have a clear picture of the details yet.

@gabejohnson
Member Author

gabejohnson commented May 1, 2017

Your composability goal still comes down to "last write wins" right?

Yes. It's just concat/mappend.

Are you sure you want to expose the current API first?

No. Users of the API would just get annoyed when we break it. I'll get started on this after we hash out what constitutes a language. I know we've discussed it before, but I'd like to be solid on that first.

My thinking from a while back was that a language would be composed of a readtable and one or more syntactic macros that would be automatically imported (and potentially invoked).

input:

// mylang.js
'lang sweet.js'
...
export default lang;
export { table };

// main.js
'lang mylang';
...

output:

import mylang from 'mylang';
mylang
...

We pass table to the reader and prepend the source with the import ... mylang;

mylang can do whatever it wants. I'm thinking other macro imports are going to be the common case though.

As far as language composability is concerned I'm thinking:

'lang yourlang, mylang';

to:

import yourlang from 'yourlang';
import mylang from 'mylang';
yourlang
mylang

We concat the tables and then pass to the reader.

I've been looking for the right structure to represent the readtable for several months now. I think a trie is the way to go, but I think I'll start with POJOs and linear algorithms first to test out the API.
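
For reference, the trie lookup I have in mind is longest-prefix match over the table's keys, roughly (sketch only, immutable nodes):

// insert a key into the trie, returning a new root
function trieInsert(node = { children: {} }, key, entry) {
  if (key === '') return Object.assign({}, node, { entry });
  const head = key[0];
  const child = trieInsert(node.children[head], key.slice(1), entry);
  const children = Object.assign({}, node.children, { [head]: child });
  return Object.assign({}, node, { children });
}

// walk the source, remembering the deepest node that carries an entry
function longestPrefix(node, source) {
  let best = null;
  let current = node;
  for (let i = 0; i < source.length; i++) {
    current = current.children[source[i]];
    if (current == null) break;
    if (current.entry) best = { entry: current.entry, length: i + 1 };
  }
  return best; // null: no keyed prefix matched, fall back to the default ('') entry
}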

This is probably going to take a while. Fortunately I don't think it's all or nothing. I can implement something like the above API and see what the default table looks like once that's done (caching should be in place at that point). If everything looks good, I'll find/make a better data structure for the table as I expect the first pass to be dog slow. Laziness is a nice to have AFAIC.

@disnet
Member

disnet commented May 2, 2017

I think a trie is the way to go, but I think I'll start with POJOs and linear algorithms first to test out the API.

👍

I'll get started on this after we hash out what constitutes a language. I know we've discussed it before, but I'd like to be solid on that first.

Good call.

My initial proposal is to do the most general thing possible and then consider how to layer API conveniences on top of it. Fundamentally a language needs to:

  • define its lexical structure
  • give meaning to "primitive" bindings

I think this is satisfied by a language definition simply being a function from source to a list of TokenTrees (basically read):

// L.js
export default function read(source: string): List<TokenTree>

// main.js
'lang L.js';
// ...

The language's read export can of course use a readtable or do it all itself. Giving meaning to primitive bindings can be done by splicing into the TokenTree list either imports or regular syntax declarations. Or, if we want to get fancy (and I really do), go for Implicit form bindings.
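
For example, a language's read could splice a prelude in front of the user's tokens; a sketch, where tokenize and readWithTable are hypothetical helpers and List comes from Immutable.js:

// L.js (sketch): prepend tokens for the language's own runtime import
import { List } from 'immutable';
import { tokenize, readWithTable } from './L-reader'; // hypothetical helpers
import table from './L-readtable';                    // hypothetical

export default function read(source: string): List<TokenTree> {
  const prelude = tokenize(`import { runtime } from 'L-runtime';`);
  return List(prelude).concat(readWithTable(table, source));
}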

As far as composability goes I'm not sure what the difference between

'lang L.js, K.js';

and

'lang LK.js';

// LK.js
import LReadtable from '...';
import KReadtable from '...';

export default function read(source: string) {
  return L.concat(K).read(source); // or whatever
}

is/should be. In other words, do we need something specific in the 'lang ...' syntax that "does composability" when you specify multiple languages, or should that just be a property of readtables, so you can only specify a single language in 'lang ...', which may itself compose multiple readtables?

Just some initial thoughts, I don't have strong opinions yet :)

@gabejohnson
Member Author

gabejohnson commented May 2, 2017

I think this is satisfied by a language definition simply being a function from source to a list of TokenTrees (basically read)

This also requires exporting the readtable from each language file, which is fine. But it doesn't make languages composable unless I can turn the List<TokenTree> back into a string and then run it through my custom read function.

As far as composability goes I'm not sure what the difference between...and...is/should be.

The difference is that languages can now be very small and as a user of a language I can compose without learning about readtables and TokenTrees.

Say L gives me a binding operator :: and K gives me extended Unicode identifiers (e.g. 💎 , ❤️). I want to use both but don't want to care about concatenating their respective readtables or defining a new read function.

@disnet
Member

disnet commented May 2, 2017

The concern I have about allowing the language consumer to do arbitrary composition is that readtables are not a monoid (they are not associative) and they don't commute. I worry that having order subtly matter will be a real footgun.

@disnet
Member

disnet commented May 2, 2017

Oh wait, they are associative aren't they? Regardless, ordering matters so my point still stands.
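
To be precise: associative but not commutative. A quick toy check with a plain last-write-wins merge:

// order decides which entry wins on a collision, even though either
// grouping of the merges gives the same result
const merge = (a, b) => Object.assign({}, a, b);
const L = { '{': 'L-brace' };
const K = { '{': 'K-brace' };
const M = { '#': 'M-hash' };
JSON.stringify(merge(merge(L, K), M)) ===
  JSON.stringify(merge(L, merge(K, M))); // true (associative)
merge(L, K)['{']; // 'K-brace'
merge(K, L)['{']; // 'L-brace' (not commutative)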

@gabejohnson
Member Author

I see your point, but ordering matters any time you do binding. And declaring a language binds tokenizers to string patterns.

I think what you're getting at is that a user of composable languages would have to be nearly as sophisticated as a developer of languages and have knowledge of the internals of an implementation.

@gabejohnson
Member Author

I'm still worried about extensibility though. Are you thinking readtables will be exported in addition to readers?

@disnet
Member

disnet commented May 4, 2017

I think what you're getting at is that a user of composable languages would have to be nearly as sophisticated as a developer of languages and have knowledge of the internals of an implementation.

Right, ordering where the consequences of that ordering are obvious (like binding) is perfectly fine. If I bind x to 'foo' on line 1 and then re-bind x to 'bar' on line 5 I'm not surprised. But, saying 'lang L.js, K.js' means to the user "I want all the features from language L and K", not (what it would actually mean) "I want all the features from language K and all the features in L that are not in K".

We could address this a couple of ways.

One, allow your proposed syntax of 'lang L, K' to mean compose L and K's readtables, but if there is any overlap, throw an error. Pro: most of the time the readtables won't collide, and in the cases where they do, at least the user won't be surprised by it subtly working wrong. Con: sometimes you get an error in your face.

Two, don't allow any composition syntax; users can only specify one language. Languages can still be composed of multiple independent readtables following your proposed API, it just has to be done by language authors. Pro: simplest for the user. Con: much more work to create compositions.

Three, take inspiration from traits: 'lang L, K' means compose the readtables suppressing the conflicting keys. Or maybe have multiple composition operators:

  • L+K suppressing conflicted
  • L<K take the conflicted from L
  • L>K take the conflicted from K
  • L!K throw error if conflicted

Pro: most general, users have all the power. Con: potentially confusing, unclear if additional operators would be useful in practice.
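
A sketch of how those operators could fall out of a single merge-with-policy over readtable entries (toy plain-object tables, names hypothetical):

// policy is one of 'suppress' (L+K), 'preferLeft' (L<K),
// 'preferRight' (L>K), or 'error' (L!K)
function composeTables(left, right, policy) {
  const merged = {};
  const keys = new Set([...Object.keys(left), ...Object.keys(right)]);
  for (const key of keys) {
    const conflict = key in left && key in right;
    if (!conflict) {
      merged[key] = key in left ? left[key] : right[key];
    } else if (policy === 'suppress') {
      continue; // drop conflicting keys entirely
    } else if (policy === 'preferLeft') {
      merged[key] = left[key];
    } else if (policy === 'preferRight') {
      merged[key] = right[key];
    } else {
      throw new Error(`readtable conflict on ${JSON.stringify(key)}`);
    }
  }
  return merged;
}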

@gabejohnson
Member Author

gabejohnson commented May 4, 2017

take inspiration from traits

I like this idea a lot. But we don't know how much power people will want/need. Let's start just exposing one language and seeing what people's pain points are. We can always add operators later.

To be clear, a language module must have a read function as default export. Optionally, it can expose >= 1 readtable and whatever else it wants. Correct?

@disnet
Member

disnet commented May 4, 2017

To be clear, a language module must have a read function as default export. Optionally, it can expose >= 1 readtable and whatever else it wants. Correct?

I think that's sufficient. The expander needs to know how to read a file, the pragma tells the expander where to find the right reader, and readers can use readtables or whatever else they want to do the actual reading. Future composition operators in the pragma might require a different export API, but that will be opt-in.
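
i.e. something with roughly this shape (sketch; readWithTable and the module names are hypothetical):

// mylang.js: read is the default export, the readtable export is optional
import { readWithTable } from 'sweet-reader';  // hypothetical
import table from './mylang-readtable';        // hypothetical

export default function read(source: string, filename: ?string) {
  return readWithTable(table, source, filename);
}

export { table }; // optional: lets other language authors compose it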
