From 6f99bcd5871281bcb379dc6144c1276773ee91b7 Mon Sep 17 00:00:00 2001 From: romanalakomcikova Date: Fri, 10 Nov 2023 19:36:49 +0100 Subject: [PATCH] Improve nlpO3 tutorial --- source/notebooks/nlpo3ipynb.ipynb | 469 +++++++++++++++--------------- 1 file changed, 238 insertions(+), 231 deletions(-) diff --git a/source/notebooks/nlpo3ipynb.ipynb b/source/notebooks/nlpo3ipynb.ipynb index ebb6ec3..02f506f 100644 --- a/source/notebooks/nlpo3ipynb.ipynb +++ b/source/notebooks/nlpo3ipynb.ipynb @@ -1,248 +1,255 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "1dUnwVxgISWq" + }, + "source": [ + "# nlpO3\n", + "\n", + "nlpO3 is a Rust natural language processing library for Thai with Python and Node bindings. Similarly to newmm, it comes with a maximal-matching dictionary-based tokenizer, which honors Thai character cluster boundaries. However, compared to newmm, which is a pure Python implementation, nlpO3 is much faster. For a comparison, refer to [Benchmark nlpo3.segment](https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb). Learn more about nlpO3 [here](https://github.com/PyThaiNLP/nlpo3).\n", + "\n", + "In this tutorial, you will learn how to use nlpO3 to tokenize a text with a pre-prepared list of words serving as a custom dictionary." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NBfhmYbiIboQ" + }, + "source": [ + "### Installation\n", + "\n", + "We install the Python binding using pip." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { "colab": { - "name": "nlpo3.ipynb", - "provenance": [], - "collapsed_sections": [] - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" + "base_uri": "https://localhost:8080/" }, - "language_info": { - "name": "python" + "id": "97rhSDE8IG8X", + "outputId": "04523717-c31a-4f41-cd40-9e3e7e7c7628" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting nlpo3\n", + "Successfully installed nlpo3-1.1.2\n" + ] } + ], + "source": [ + "!pip install nlpo3" + ] }, - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "1dUnwVxgISWq" - }, - "source": [ - "# nlpO3\n", - "\n", - "Thai Natural Language Processing in Rust, with Python-binding.\n", - "\n", - "## Features\n", - "- newmm dictionary-based word tokenization, at ultra fast speed\n", - "- support custom dictionary\n", - "\n", - "GitHub: https://github.com/PyThaiNLP/nlpo3" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NBfhmYbiIboQ" - }, - "source": [ - "## Install" - ] - }, - { - "cell_type": "code", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "97rhSDE8IG8X", - "outputId": "04523717-c31a-4f41-cd40-9e3e7e7c7628" - }, - "source": [ - "!pip install nlpo3" - ], - "execution_count": 1, - "outputs": [ - { - "output_type": "stream", - "text": [ - "Collecting nlpo3\n", - "Successfully installed nlpo3-1.1.2\n" - ], - "name": "stdout" - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qgAabbuvIdS8" - }, - "source": [ - "## Using" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zp1hmAJ6I0a2" - }, - "source": [ - "### PyThaiNLP dictionary\n", - "\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "4_v-LpBbIKN9" - }, - "source": [ - "from nlpo3 import segment" - ], - "execution_count": 2, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "colab": { - "base_uri": 
"https://localhost:8080/" - }, - "id": "ihpxBQvtIgqJ", - "outputId": "31a01e32-fac7-4f8c-d0ad-f197f035a850" - }, - "source": [ - "segment(\"ทดสอบตัดคำภาษาไทย\")" - ], - "execution_count": 3, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 3 - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "b_O9xI5fI96A" - }, - "source": [ - "### custom dictionary\n", - "\n", - "We try to use a thai countries dictionary for segment text." - ] + { + "cell_type": "markdown", + "metadata": { + "id": "zp1hmAJ6I0a2" + }, + "source": [ + "### PyThaiNLP dictionary\n", + "\n", + "First we try segmenting a Thai sentence into a list of words without specifying a dictionary parameter." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "4_v-LpBbIKN9" + }, + "outputs": [], + "source": [ + "from nlpo3 import segment" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "ihpxBQvtIgqJ", + "outputId": "31a01e32-fac7-4f8c-d0ad-f197f035a850" + }, + "outputs": [ { - "cell_type": "code", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "rRMh4bXxJDRg", - "outputId": "21a142e7-7aa1-4a99-b341-31a1535a3928" - }, - "source": [ - "!wget https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt" - ], - "execution_count": 4, - "outputs": [ - { - "output_type": "stream", - "text": [ - "--2021-06-22 05:14:58-- https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt\n", - "Resolving github.com (github.com)... 140.82.112.4\n", - "Connecting to github.com (github.com)|140.82.112.4|:443... connected.\n", - "HTTP request sent, awaiting response... 302 Found\n", - "Location: https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt [following]\n", - "--2021-06-22 05:14:58-- https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt\n", - "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", - "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 7622 (7.4K) [text/plain]\n", - "Saving to: ‘countries_th.txt’\n", - "\n", - "countries_th.txt 100%[===================>] 7.44K --.-KB/s in 0s \n", - "\n", - "2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]\n", - "\n" - ], - "name": "stdout" - } + "data": { + "text/plain": [ + "['ทดสอบ', 'ตัด', 'คำ', 'ภาษาไทย']" ] + }, + "execution_count": 3, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" + } + ], + "source": [ + "segment(\"ทดสอบตัดคำภาษาไทย\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b_O9xI5fI96A" + }, + "source": [ + "### Custom dictionary\n", + "\n", + "Now we enhance the tokenization with a pre-prepared list of countries in Thai, which will serve as a custom dictionary.\n", + "\n", + "We use the `wget` command to download the list from GitHub. It's a plain text file containing one word per line. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "rRMh4bXxJDRg", + "outputId": "21a142e7-7aa1-4a99-b341-31a1535a3928" + }, + "outputs": [ { - "cell_type": "code", - "metadata": { - "id": "-DT_WnJeIk6q" - }, - "source": [ - "from nlpo3 import segment, load_dict" - ], - "execution_count": 5, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3Sv0hrNVJOFA", - "outputId": "97f6c5a2-3402-47f1-b568-b92b95654bc8" - }, - "source": [ - "load_dict(\"countries_th.txt\", \"countries\")" - ], - "execution_count": 9, - "outputs": [ - { - "output_type": "stream", - "text": [ - "Successful: dictionary name countries from file countries_th.txt has been successfully loaded\n" - ], - "name": "stdout" - } - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "--2021-06-22 05:14:58-- https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt\n", + "Resolving github.com (github.com)... 140.82.112.4\n", + "Connecting to github.com (github.com)|140.82.112.4|:443... connected.\n", + "HTTP request sent, awaiting response... 302 Found\n", + "Location: https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt [following]\n", + "--2021-06-22 05:14:58-- https://raw.githubusercontent.com/PyThaiNLP/pythainlp/dev/pythainlp/corpus/countries_th.txt\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 7622 (7.4K) [text/plain]\n", + "Saving to: ‘countries_th.txt’\n", + "\n", + "countries_th.txt 100%[===================>] 7.44K --.-KB/s in 0s \n", + "\n", + "2021-06-22 05:14:58 (70.3 MB/s) - ‘countries_th.txt’ saved [7622/7622]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://github.com/PyThaiNLP/pythainlp/raw/dev/pythainlp/corpus/countries_th.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use the `load_dict` function to load the contents of the downloaded file into the `countries` dictionary." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "-DT_WnJeIk6q" + }, + "outputs": [], + "source": [ + "from nlpo3 import segment, load_dict" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "3Sv0hrNVJOFA", + "outputId": "97f6c5a2-3402-47f1-b568-b92b95654bc8" + }, + "outputs": [ { - "cell_type": "code", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "N_Ni-39BJR5Q", - "outputId": "03e69197-6871-4257-ae35-e1949733cc63" - }, - "source": [ - "segment(\"สวัสดีครับประเทศไทย เกาหลี\", \"countries\")" - ], - "execution_count": 11, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 11 - } - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "Successful: dictionary name countries from file countries_th.txt has been successfully loaded\n" + ] + } + ], + "source": [ + "load_dict(\"countries_th.txt\", \"countries\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we call the `segment` method on a Thai sentence specifying the `countries` dictionary in the parameters. " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "N_Ni-39BJR5Q", + "outputId": "03e69197-6871-4257-ae35-e1949733cc63" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "8Ng4h-gkLsQK" - }, - "source": [ - "for speed of word segmentation benchmark, you can read more at https://github.com/PyThaiNLP/nlpo3/blob/main/nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb" + "data": { + "text/plain": [ + "['สวัสดีครับประเทศ', 'ไทย', ' ', 'เกาหลี']" ] + }, + "execution_count": 7, + "metadata": { + "tags": [] + }, + "output_type": "execute_result" } - ] + ], + "source": [ + "segment(\"สวัสดีครับประเทศไทย เกาหลี\", \"countries\")" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "nlpo3.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 1 }
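For a quick check of the updated tutorial, the cells added by this patch can be condensed into a standalone script. The following is only a sketch that restates the notebook's own calls (`load_dict("countries_th.txt", "countries")` and `segment(..., "countries")`); it assumes the `nlpo3` package is installed and that `countries_th.txt` has already been downloaded into the working directory.

```python
# Sketch of the workflow from the updated notebook, assuming `pip install nlpo3`
# has been run and countries_th.txt (one Thai word per line) is in the working directory.
from nlpo3 import load_dict, segment

# Register the downloaded word list under the dictionary name "countries".
load_dict("countries_th.txt", "countries")

# Tokenize with the default dictionary, then with the custom "countries" dictionary.
print(segment("ทดสอบตัดคำภาษาไทย"))
print(segment("สวัสดีครับประเทศไทย เกาหลี", "countries"))
```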