Korean text normalization and language preparation package for LM in Kaldi-based ASR system
Updated Apr 23, 2020 (Python)
Keyword Search Recipe for Subword ASR
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
Effective Subword Segmentation for Text Comprehension (TASLP 2019)
An implementation of the subword division algorithm proposed by T. Mikolov (2012).
johnny - a neural network graph based DEPendency Parser
Unsupervised Word Segmentation using Minimum Description Length for Neural Machine Translation (NMT)
This repository contains the source code for assignments in NTU's MSAI course AI6127, Deep Neural Networks for Natural Language Processing (2019 Sem 2).
Simple-to-use scoring function for arbitrarily tokenized texts.
The concept of directed acyclic word graphs (DAWGs) is based on: Blumer, A. et al. (1985). The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40, 31–55.
Tokenization separates a piece of text into smaller units called tokens, which can be words, characters, or subwords. It can therefore be broadly classified into three types: word, character, and subword (character n-gram) tokenization.
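The three granularities can be illustrated with a minimal Python sketch. It is not taken from any repository listed here; the function names, whitespace-based word splitting, and the default n-gram size of 3 are all assumptions chosen for illustration.

```python
def word_tokens(text):
    """Word-level tokenization: split on whitespace (a simplifying assumption)."""
    return text.split()


def char_tokens(text):
    """Character-level tokenization: every non-whitespace character is a token."""
    return [ch for ch in text if not ch.isspace()]


def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a single word.

    Words shorter than n are returned whole, so no word is dropped.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def subword_tokens(text, n=3):
    """Subword-level tokenization: character n-grams of each whitespace-separated word."""
    return [gram for word in text.split() for gram in char_ngrams(word, n)]


if __name__ == "__main__":
    sample = "low lower"
    print(word_tokens(sample))
    print(char_tokens(sample))
    print(subword_tokens(sample))
```

Fixed-size character n-grams are only one way to form subwords; learned vocabularies such as byte-pair encoding pick variable-length units from corpus statistics instead.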
A framework for generating a subword vocabulary from a TensorFlow dataset and building custom BERT tokenizer models.
Subword Neural Machine Translation