
word_language_model/data.py - two areas of redundant code #1227

Open
drtonyr opened this issue Feb 5, 2024 · 0 comments
drtonyr commented Feb 5, 2024

As this is (extremely useful!) example code, it should be as clean as possible.

I'm looking at word_language_model/data.py, and there are two places where both clarity and speed could be improved by removing redundant code.

  1. tokenize() runs in two passes, labelled # Add words to the dictionary and # Tokenize file content. The first pass calls add_word(), which both adds the word to the dictionary and returns its token, so everything can be done in a single pass. The cleanest fix is to remove the first pass entirely and change the line ids.append(self.dictionary.word2idx[word]) to ids.append(self.dictionary.add_word(word)).

  2. In # Tokenize file content, a list of torch tensors is built and then torch.cat() merges them into the final tensor. It is both cleaner and faster to skip the intermediate tensors and simply do:

        # Tokenize file content 
        with open(path, 'r', encoding="utf8") as f:
            ids = []
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids.append(self.dictionary.word2idx[word])

        return torch.tensor(ids).type(torch.int64)
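Putting both suggestions together, here is a minimal self-contained sketch of the single-pass version. The Dictionary class below is a simplified stand-in for the one in data.py, included only so the example runs on its own:

```python
import os
import torch

class Dictionary:
    """Simplified stand-in for the Dictionary class in data.py."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Registers the word if unseen and, crucially, returns its index,
        # which is what makes the single-pass tokenize possible.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

class Corpus:
    def __init__(self):
        self.dictionary = Dictionary()

    def tokenize(self, path):
        """Tokenizes a text file in one pass, with no intermediate tensors."""
        assert os.path.exists(path)
        ids = []
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                for word in line.split() + ['<eos>']:
                    # add_word() both adds the word and returns its token,
                    # so the separate dictionary-building pass is unnecessary.
                    ids.append(self.dictionary.add_word(word))
        return torch.tensor(ids).type(torch.int64)
```

Building the plain Python list and converting once at the end avoids allocating one small tensor per line and the final torch.cat() over all of them, which is where the speedup on a large corpus comes from.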

In both cases I've simply tried to take out redundant code, making things cleaner to read and faster to execute (data load was about 20 minutes for the billion word corpus).
