Allow to use arbitrary sequences as elements, not only strings #13

Open
dragoon opened this issue Jul 2, 2014 · 4 comments
Comments

dragoon commented Jul 2, 2014

I tried to construct the following trie:

trie = marisa_trie.Trie([('New', 'York'), ('New', 'Castle')])

This raised AttributeError: 'tuple' object has no attribute 'encode'. So I suppose the library accepts only strings as keys, but sometimes you want other structures.

derpston commented Jul 2, 2014

Have you tried using the RecordTrie instead? (same module)

dragoon (Author) commented Jul 2, 2014

I don't really understand that structure. It maps keys to values, while I only have the tuples themselves and nothing to map them to.

derpston commented Jul 2, 2014

Ah, I see what you mean now; disregard my earlier comment. As far as I'm aware, Trie only accepts unicode strings.

kmike (Member) commented Jul 2, 2014

@dragoon I'm not sure that supporting arbitrary objects as keys is a good idea, because I don't know how to implement it efficiently.

We can't just store the id of the object (that would defeat the purpose of marisa-trie), so the key has to be serialized to bytes somehow. For strings, the wrapper encodes unicode input to UTF-8.
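
To make the error from the original report concrete, here is a minimal sketch of that encoding step. `to_trie_key` is a hypothetical helper written for illustration, not the wrapper's actual code; it only assumes what is stated above, namely that unicode keys are encoded to UTF-8 bytes.

```python
def to_trie_key(key):
    """Hypothetical stand-in for the wrapper's key handling."""
    # Strings have .encode(); tuples and most other objects do not.
    return key.encode("utf8")


print(to_trie_key("New York"))  # b'New York'

try:
    to_trie_key(("New", "York"))
except AttributeError as exc:
    # Same kind of failure as in the report at the top of this issue.
    print(exc)  # 'tuple' object has no attribute 'encode'
```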

To support arbitrary objects we could use pickle, but I'm not sure how compressible the result is, and better task-specific serialization methods usually exist. For example, in your case (a tuple of 2 strings) it makes sense to join the strings with some separator before adding them to the trie, and to split on that separator when retrieving. You don't need marisa-trie support to do this.
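
The join/split approach could look roughly like the sketch below. The separator choice (`"\x1f"`, the ASCII unit separator) and the helper names `join_key` and `split_key` are assumptions made for illustration. The helpers are plain Python; the last few lines only run if marisa_trie is installed.

```python
SEP = "\x1f"  # assumed separator; must never occur inside a tuple element


def join_key(parts):
    """Turn a non-empty tuple of strings into a single string key."""
    if not parts:
        raise ValueError("empty tuples are not supported by this sketch")
    if any(SEP in part for part in parts):
        raise ValueError("separator found inside a tuple element")
    return SEP.join(parts)


def split_key(key):
    """Recover the original tuple from a joined key."""
    return tuple(key.split(SEP))


keys = [join_key(t) for t in [("New", "York"), ("New", "Castle")]]

try:
    import marisa_trie  # optional; the helpers above do not depend on it
except ImportError:
    marisa_trie = None

if marisa_trie is not None:
    trie = marisa_trie.Trie(keys)
    print(join_key(("New", "York")) in trie)
    # All stored tuples whose first element is exactly "New".
    print(sorted(split_key(k) for k in trie.keys("New" + SEP)))
```

Rejecting elements that contain the separator keeps the round trip lossless, at the cost of refusing some inputs.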

It's true that there are some edge cases (what if the separator appears inside a tuple element?), that splitting/joining tuples could be more efficient if implemented in Cython, and that storing tuples of strings is quite common. So I think adding a trie subclass that allows tuples of strings as keys is a good idea; ngram storage is a common use case. Pull requests are welcome :)
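
One way to handle the separator-inside-an-element edge case is to escape it. The following is a pure-Python sketch under stated assumptions, not the proposed Cython subclass: `SEP`, `ESC`, `encode_tuple` and `decode_tuple` are hypothetical names, and a plain list with `startswith` stands in for a trie prefix search.

```python
SEP = "\x1f"  # assumed element separator
ESC = "\x1b"  # assumed escape character


def encode_tuple(parts):
    """Join strings with SEP, escaping any SEP or ESC inside an element."""
    escaped = (p.replace(ESC, ESC + ESC).replace(SEP, ESC + SEP) for p in parts)
    return SEP.join(escaped)


def decode_tuple(key):
    """Inverse of encode_tuple: split on unescaped SEP and undo the escaping."""
    parts, current = [], []
    chars = iter(key)
    for ch in chars:
        if ch == ESC:
            # The character after ESC is taken literally.
            current.append(next(chars, ""))
        elif ch == SEP:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return tuple(parts)


ngrams = [("New", "York"), ("New", "Castle"), ("Newark", "Bay")]
keys = [encode_tuple(t) for t in ngrams]

# Appending SEP to the prefix matches whole elements only, so "Newark" is excluded.
prefix = encode_tuple(("New",)) + SEP
matches = [decode_tuple(k) for k in keys if k.startswith(prefix)]
print(matches)  # [('New', 'York'), ('New', 'Castle')]
```

Because tuples that share leading elements also share a leading run of characters in their encoded form, they should still benefit from prefix sharing in a trie.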
