
word_language_model/data.py - two areas of redundant code #1227

Open
drtonyr opened this issue Feb 5, 2024 · 0 comments
drtonyr commented Feb 5, 2024

As this is (extremely useful!) example code, it should be as clean as possible.

I'm looking at word_language_model/data.py, and there are two places where both clarity and speed could be improved by removing redundant code.

  1. tokenize() runs in two passes, labelled # Add words to the dictionary and # Tokenize file content. The first pass calls add_word(), which both adds the word to the dictionary and returns its token, so everything can be done in a single pass. The cleanest fix is to remove the first pass entirely and change the line ids.append(self.dictionary.word2idx[word]) to ids.append(self.dictionary.add_word(word)).

  2. In # Tokenize file content, a list of torch tensors is built and then torch.cat() merges them into the final tensor. It is both cleaner and faster to skip the intermediate tensors and simply do:

        # Tokenize file content 
        with open(path, 'r', encoding="utf8") as f:
            ids = []
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids.append(self.dictionary.word2idx[word])

        return torch.tensor(ids).type(torch.int64)
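Putting both suggestions together, here is a minimal self-contained sketch of the single-pass version. The Dictionary class below is a simplified stand-in for the one in data.py, included only so the example runs on its own:

```python
import os
import torch

class Dictionary:
    """Simplified stand-in for the Dictionary class in data.py."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Registers the word if unseen and, crucially, returns its index,
        # which is what makes the single-pass tokenize possible.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

class Corpus:
    def __init__(self):
        self.dictionary = Dictionary()

    def tokenize(self, path):
        """Tokenizes a text file in one pass, with no intermediate tensors."""
        assert os.path.exists(path)
        ids = []
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                for word in line.split() + ['<eos>']:
                    # add_word() both adds the word and returns its token,
                    # so the separate dictionary-building pass is unnecessary.
                    ids.append(self.dictionary.add_word(word))
        return torch.tensor(ids).type(torch.int64)
```

Building the plain Python list and converting once at the end avoids allocating one small tensor per line and the final torch.cat() over all of them, which is where the speedup on a large corpus comes from.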

In both cases I've simply tried to take out redundant code, making things cleaner to read and faster to execute (data load was about 20 minutes for the billion word corpus).
