diff --git a/examples/demo_parsing_instructions.ipynb b/examples/demo_parsing_instructions.ipynb new file mode 100644 index 0000000..609111e --- /dev/null +++ b/examples/demo_parsing_instructions.ipynb @@ -0,0 +1,615 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# LlamaParse - Parsing comic books with parsing instructions\n", + "Parsing instructions allow you to instruct our parsing model the same way you would instruct an LLM!\n", + "\n", + "They can be useful to help the parser get better results on complex document layouts, to extract data in a specific format, or to transform the document in other ways.\n", + "\n", + "With parsing instructions you can get better results out of LlamaParse on complicated documents and simplify your application code." + ], + "metadata": { + "id": "eld1dKaN7P8B" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Installation\n", + "\n", + "Parsing instructions are part of the LlamaParse API. 
They can be accessed by directly specifying the `parsing_instruction` parameter in the API or by using the LlamaParse Python module (which we will use for this tutorial).\n", + "\n", + "To install llama-parse, get it from PyPI with pip:" + ], + "metadata": { + "id": "goB1sV8zu_Xl" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install llama-parse" + ], + "metadata": { + "id": "7Y3_BwQLu-qK", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "652f8957-ae7a-48e3-86ae-2e9e885c4e1e" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting llama-parse\n", + " Downloading llama_parse-0.3.8-py3-none-any.whl (6.7 kB)\n", + "Collecting llama-index-core>=0.10.7 (from llama-parse)\n", + " Downloading llama_index_core-0.10.19-py3-none-any.whl (15.3 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m15.3/15.3 MB\u001b[0m \u001b[31m31.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: PyYAML>=6.0.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (6.0.1)\n", + "Requirement already satisfied: SQLAlchemy[asyncio]>=1.4.49 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (2.0.28)\n", + "Requirement already satisfied: aiohttp<4.0.0,>=3.8.6 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (3.9.3)\n", + "Collecting dataclasses-json (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)\n", + "Collecting deprecated>=1.2.9.3 (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)\n", + "Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading dirtyjson-1.0.8-py3-none-any.whl (25 kB)\n", + "Requirement already satisfied: fsspec>=2023.5.0 in 
/usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (2023.6.0)\n", + "Collecting httpx (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading httpx-0.27.0-py3-none-any.whl (75 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.6/75.6 kB\u001b[0m \u001b[31m6.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting llamaindex-py-client<0.2.0,>=0.1.13 (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading llamaindex_py_client-0.1.13-py3-none-any.whl (107 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m108.0/108.0 kB\u001b[0m \u001b[31m10.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: nest-asyncio<2.0.0,>=1.5.8 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (1.6.0)\n", + "Requirement already satisfied: networkx>=3.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (3.2.1)\n", + "Requirement already satisfied: nltk<4.0.0,>=3.8.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (3.8.1)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (1.25.2)\n", + "Collecting openai>=1.1.0 (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading openai-1.13.3-py3-none-any.whl (227 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.4/227.4 kB\u001b[0m \u001b[31m16.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (1.5.3)\n", + "Requirement already satisfied: pillow>=9.0.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (9.4.0)\n", + "Requirement already 
satisfied: requests>=2.31.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (2.31.0)\n", + "Requirement already satisfied: tenacity<9.0.0,>=8.2.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (8.2.3)\n", + "Collecting tiktoken>=0.3.3 (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m43.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: tqdm<5.0.0,>=4.66.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (4.66.2)\n", + "Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core>=0.10.7->llama-parse) (4.10.0)\n", + "Collecting typing-inspect>=0.8.0 (from llama-index-core>=0.10.7->llama-parse)\n", + " Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.6->llama-index-core>=0.10.7->llama-parse) (1.3.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.6->llama-index-core>=0.10.7->llama-parse) (23.2.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.6->llama-index-core>=0.10.7->llama-parse) (1.4.1)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.6->llama-index-core>=0.10.7->llama-parse) (6.0.5)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.6->llama-index-core>=0.10.7->llama-parse) (1.9.4)\n", + "Requirement 
already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.8.6->llama-index-core>=0.10.7->llama-parse) (4.0.3)\n", + "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.10/dist-packages (from deprecated>=1.2.9.3->llama-index-core>=0.10.7->llama-parse) (1.14.1)\n", + "Requirement already satisfied: pydantic>=1.10 in /usr/local/lib/python3.10/dist-packages (from llamaindex-py-client<0.2.0,>=0.1.13->llama-index-core>=0.10.7->llama-parse) (2.6.3)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core>=0.10.7->llama-parse) (3.7.1)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core>=0.10.7->llama-parse) (2024.2.2)\n", + "Collecting httpcore==1.* (from httpx->llama-index-core>=0.10.7->llama-parse)\n", + " Downloading httpcore-1.0.4-py3-none-any.whl (77 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.8/77.8 kB\u001b[0m \u001b[31m8.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core>=0.10.7->llama-parse) (3.6)\n", + "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core>=0.10.7->llama-parse) (1.3.1)\n", + "Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx->llama-index-core>=0.10.7->llama-parse)\n", + " Downloading h11-0.14.0-py3-none-any.whl (58 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m5.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk<4.0.0,>=3.8.1->llama-index-core>=0.10.7->llama-parse) (8.1.7)\n", + "Requirement already satisfied: joblib in 
/usr/local/lib/python3.10/dist-packages (from nltk<4.0.0,>=3.8.1->llama-index-core>=0.10.7->llama-parse) (1.3.2)\n", + "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk<4.0.0,>=3.8.1->llama-index-core>=0.10.7->llama-parse) (2023.12.25)\n", + "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai>=1.1.0->llama-index-core>=0.10.7->llama-parse) (1.7.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->llama-index-core>=0.10.7->llama-parse) (3.3.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.31.0->llama-index-core>=0.10.7->llama-parse) (2.0.7)\n", + "Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy[asyncio]>=1.4.49->llama-index-core>=0.10.7->llama-parse) (3.0.3)\n", + "Collecting mypy-extensions>=0.3.0 (from typing-inspect>=0.8.0->llama-index-core>=0.10.7->llama-parse)\n", + " Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)\n", + "Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json->llama-index-core>=0.10.7->llama-parse)\n", + " Downloading marshmallow-3.21.1-py3-none-any.whl (49 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m4.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core>=0.10.7->llama-parse) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core>=0.10.7->llama-parse) (2023.4)\n", + "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx->llama-index-core>=0.10.7->llama-parse) (1.2.0)\n", + 
"Requirement already satisfied: packaging>=17.0 in /usr/local/lib/python3.10/dist-packages (from marshmallow<4.0.0,>=3.18.0->dataclasses-json->llama-index-core>=0.10.7->llama-parse) (23.2)\n", + "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->llamaindex-py-client<0.2.0,>=0.1.13->llama-index-core>=0.10.7->llama-parse) (0.6.0)\n", + "Requirement already satisfied: pydantic-core==2.16.3 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->llamaindex-py-client<0.2.0,>=0.1.13->llama-index-core>=0.10.7->llama-parse) (2.16.3)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->llama-index-core>=0.10.7->llama-parse) (1.16.0)\n", + "Installing collected packages: dirtyjson, mypy-extensions, marshmallow, h11, deprecated, typing-inspect, tiktoken, httpcore, httpx, dataclasses-json, openai, llamaindex-py-client, llama-index-core, llama-parse\n", + "Successfully installed dataclasses-json-0.6.4 deprecated-1.2.14 dirtyjson-1.0.8 h11-0.14.0 httpcore-1.0.4 httpx-0.27.0 llama-index-core-0.10.19 llama-parse-0.3.8 llamaindex-py-client-0.1.13 marshmallow-3.21.1 mypy-extensions-1.0.0 openai-1.13.3 tiktoken-0.6.0 typing-inspect-0.9.0\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## API key\n", + "\n", + "The use of LlamaParse requires an API key which you can get here: https://cloud.llamaindex.ai/parse" + ], + "metadata": { + "id": "i-Rg2D_Rvf2i" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "af6i2P1vuU-U" + }, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Async (Notebook only)\n", + "llama-parse is async-first, so running the code in a notebook requires the use of nest_asyncio\n" + ], + "metadata": { + "id": "p8Eq-aX-wAEo" + } + }, + { + "cell_type": 
"code", + "source": [ + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply()" + ], + "metadata": { + "id": "4OB0BkTqv_0l" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Import the package" + ], + "metadata": { + "id": "dz927ecMyYo_" + } + }, + { + "cell_type": "code", + "source": [ + "from llama_parse import LlamaParse" + ], + "metadata": { + "id": "nSW-6sEwyXwx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Using llamaparse for getting better results (on Manga!)\n", + "\n", + "Sometimes the layout of a page is unusual and you will get sub-optimal reading order results with LlamaParse. For example, when parsing manga you expect the reading order to be right to left even if the content is in English!" + ], + "metadata": { + "id": "l_D4YsAHwUSk" + } + }, + { + "cell_type": "markdown", + "source": [ + "Let's download an extract of a great manga \"The manga guide to calculus\", by Hiroyuki Kojima (https://www.amazon.com/Manga-Guide-Calculus-Hiroyuki-Kojima/dp/1593271948)\n", + "\n" + ], + "metadata": { + "id": "SV4K2RivxzJG" + } + }, + { + "cell_type": "code", + "source": [ + "! wget \"https://drive.usercontent.google.com/uc?id=1tZJhcpepLRdQFJFCFX50QIqLyLgqzZsY&export=download\" -O ./manga.pdf" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d3qeuiyawT0U", + "outputId": "e6da0635-dea2-4f2b-ec03-d8db99a75e17" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "--2024-03-13 13:57:19-- https://drive.usercontent.google.com/uc?id=1tZJhcpepLRdQFJFCFX50QIqLyLgqzZsY&export=download\n", + "Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 173.194.211.132, 2607:f8b0:400c:c10::84\n", + "Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|173.194.211.132|:443... connected.\n", + "HTTP request sent, awaiting response... 
303 See Other\n", + "Location: https://drive.usercontent.google.com/download?id=1tZJhcpepLRdQFJFCFX50QIqLyLgqzZsY&export=download [following]\n", + "--2024-03-13 13:57:19-- https://drive.usercontent.google.com/download?id=1tZJhcpepLRdQFJFCFX50QIqLyLgqzZsY&export=download\n", + "Reusing existing connection to drive.usercontent.google.com:443.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 3041634 (2.9M) [application/octet-stream]\n", + "Saving to: ‘./manga.pdf’\n", + "\n", + "./manga.pdf 100%[===================>] 2.90M --.-KB/s in 0.04s \n", + "\n", + "2024-03-13 13:57:20 (78.6 MB/s) - ‘./manga.pdf’ saved [3041634/3041634]\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Without parsing instructions\n", + "For the sake of comparison, let's first parse without any instructions." + ], + "metadata": { + "id": "Gbr8RiHEyF3-" + } + }, + { + "cell_type": "code", + "source": [ + "vanilaParsing = LlamaParse(result_type=\"markdown\").load_data(\"./manga.pdf\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3jKnXCuAyQ9_", + "outputId": "8ab58d56-8c51-44de-8d34-0d94b1bb440f" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Started parsing the file under job_id 25bf4202-78d8-4705-88cf-c616ae7c82af\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "As you can see below, LlamaParse is not doing a great job here. It is interpreting the grid of comic panels as a table, and trying to fit the dialogue into a table. It's very hard to follow." 
+ ], + "metadata": { + "id": "p4GVOdWzzvYg" + } + }, + { + "cell_type": "code", + "source": [ + "print(vanilaParsing[0].text[100:1000])" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ZMhWfKrzzhgQ", + "outputId": "5426c0cc-7e62-4836-9877-87258e1f0b6f" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "The Asagake Times Sanda-Cho Distributor\n", + "\n", + "A newspaper distributor? do I have the wrong map?\n", + "\n", + "You’re looking It’s next for the Sanda-cho door. branch office? Everybody mistakes us for the office because we are larger. What Is a Function? 3\n", + "---\n", + "## Calculating the Derivative of a Constant, Linear, or Quadratic Function\n", + "\n", + "|1.|Let’s find the derivative of constant function f(x) = α. The differential coefficient of f(x) at x = a is|\n", + "|---|---|\n", + "| |lim ε→0 (f(a + ε) - f(a)) / ε = lim ε→0 (α - α) = lim ε→0 0 = 0|\n", + "| |Thus, the derivative of f(x) is f′(x) = 0. This makes sense, since our function is constant—the rate of change is 0.|\n", + "\n", + "Note: The differential coefficient of f(x) at x = a is often simply called the derivative of f(x) at x = a, or just f′(a).\n", + "\n", + "|2.|Let’s calculate the derivative of linear function f(x) = αx + β. The derivative of f(x) at x = α is|\n", + "|---|---|\n", + "| |lim ε→0 (f(α + ε) - f(a)) = \n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Using parsing instructions\n", + "Let's try to parse the manga with custom instructions:\n", + "\n", + "\"The provided document is a manga comic book. Most pages do NOT have title. It does not contain tables. 
Try to reconstruct the dialogue happening in a cohesive way.\"\n", + "\n", + "To do so, pass the parsing instruction as a parameter to LlamaParse:" + ], + "metadata": { + "id": "sUq6znUryiu0" + } + }, + { + "cell_type": "code", + "source": [ + "parsingInstructionManga = \"\"\"The provided document is a manga comic book, most page do NOT have title.\n", + "It does not contain table.\n", + "Try to reconstruct the dialog happening in a cohesive way.\"\"\"\n", + "withInstructionParsing = LlamaParse(result_type=\"markdown\", parsing_instruction=parsingInstructionManga).load_data(\"./manga.pdf\")" + ], + "metadata": { + "id": "dEX7Mv9V0UvM", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "184b77b7-7e3a-4991-f2c9-eb9105a35a7b" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Started parsing the file under job_id 88ab273e-b2a7-4f84-8e72-e9367cf6b114\n", + "." + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Let's see how the two results compare on page 3! We encourage you to change the target page and explore other pages. As you will see, the parsing instruction allowed LlamaParse to make sense of the document!\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ], + "metadata": { + "id": "-UQcA-YW2kjd" + } + }, + { + "cell_type": "code", + "source": [ + "target_page=1\n", + "print(vanilaParsing[0].text.split('\\n---\\n')[target_page])\n", + "print(\"\\n\\n------------------------------------------------------------\\n\\n\")\n", + "print(withInstructionParsing[0].text.split('\\n---\\n')[target_page])" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0oPHXg0F0yAS", + "outputId": "38b729bc-b0b7-42f2-97e0-b4e9d7d459a1" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "The Asagake Times Sanda-Cho Distributor\n", + "\n", + "A newspaper distributor? 
do I have the wrong map?\n", + "\n", + "You’re looking It’s next for the Sanda-cho door. branch office? Everybody mistakes us for the office because we are larger. What Is a Function? 3\n", + "\n", + "\n", + "------------------------------------------------------------\n", + "\n", + "\n", + "# The Asagake Times\n", + "\n", + "Sanda-Cho Distributor\n", + "\n", + "A newspaper distributor?\n", + "\n", + "Do I have the wrong map?\n", + "\n", + "You're looking for the Sanda-cho branch office?\n", + "\n", + "It's next door.\n", + "\n", + "Everybody mistakes us for the office because we are larger.\n", + "\n", + "What Is a Function? 3\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Math - doing more with parsing instructions!\n", + "\n", + "This manga is about math and is full of equations, so why not ask the parser to output them in **LaTeX**?\n", + "\n", + "" + ], + "metadata": { + "id": "yU_jyYWI5fMH" + } + }, + { + "cell_type": "code", + "source": [ + "parsingInstructionMangaLatex = \"\"\"The provided document is a manga comic book, most page do NOT have title.\n", + "It does not contain table. Do not output table.\n", + "Try to reconstruct the dialog happening in a cohesive way.\n", + "Output any math equation in LATEX markdown (between $$)\"\"\"\n", + "withLatex = LlamaParse(result_type=\"markdown\", parsing_instruction=parsingInstructionMangaLatex).load_data(\"./manga.pdf\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FP_YdO2y5e5o", + "outputId": "45171cdd-298a-4f1f-e67d-6be3cb1a1156" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Started parsing the file under job_id 3a055e64-d91e-484e-b9b0-99a2e637c08d\n", + "." 
+ ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "target_page=2\n", + "print(\"\\n\\n[Without instruction]------------------------------------------------------------\\n\\n\")\n", + "print(vanilaParsing[0].text.split('\\n---\\n')[target_page])\n", + "print(\"\\n\\n[With instruction to output math in LATEX!]------------------------------------------------------------\\n\\n\")\n", + "print(withLatex[0].text.split('\\n---\\n')[target_page])\n" + ], + "metadata": { + "id": "TntdRRGp6Rui", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e2e503fa-eb87-4f78-83d8-209fe7cebd9e" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "\n", + "[Without instruction]------------------------------------------------------------\n", + "\n", + "\n", + "## Calculating the Derivative of a Constant, Linear, or Quadratic Function\n", + "\n", + "|1.|Let’s find the derivative of constant function f(x) = α. The differential coefficient of f(x) at x = a is|\n", + "|---|---|\n", + "| |lim ε→0 (f(a + ε) - f(a)) / ε = lim ε→0 (α - α) = lim ε→0 0 = 0|\n", + "| |Thus, the derivative of f(x) is f′(x) = 0. This makes sense, since our function is constant—the rate of change is 0.|\n", + "\n", + "Note: The differential coefficient of f(x) at x = a is often simply called the derivative of f(x) at x = a, or just f′(a).\n", + "\n", + "|2.|Let’s calculate the derivative of linear function f(x) = αx + β. The derivative of f(x) at x = α is|\n", + "|---|---|\n", + "| |lim ε→0 (f(α + ε) - f(a)) = lim ε→0 (α(a + ε) + β - (αa + β)) = lim ε→0 α = α|\n", + "| |Thus, the derivative of f(x) is f′(x) = α, a constant value. This result should also be intuitive—linear functions have a constant rate of change by definition.|\n", + "\n", + "|3.|Let’s find the derivative of f(x) = x^2, which appeared in the story. 
The differential coefficient of f(x) at x = a is|\n", + "|---|---|\n", + "| |lim ε→0 ((a + ε)^2 - a^2) / ε = lim (a^2 + 2aε + ε^2 - a^2) / ε = lim (2aε + ε^2) = lim (2a + ε) = 2a|\n", + "| |Thus, the differential coefficient of f(x) at x = a is 2a, or f′(a) = 2a. Therefore, the derivative of f(x) is f′(x) = 2x.|\n", + "\n", + "## Summary\n", + "\n", + "- The calculation of a limit that appears in calculus is simply a formula calculating an error.\n", + "- A limit is used to obtain a derivative.\n", + "- The derivative is the slope of the tangent line at a given point.\n", + "- The derivative is nothing but the rate of change.\n", + "\n", + "## Chapter 1 Let’s Differentiate a Function!\n", + "\n", + "\n", + "[With instruction to output math in LATEX!]------------------------------------------------------------\n", + "\n", + "\n", + "# Derivative of Constant, Linear, or Quadratic Function\n", + "\n", + "## Calculating the Derivative of a Constant, Linear, or Quadratic Function\n", + "\n", + "1. Let’s find the derivative of constant function f(x) = α. The differential coefficient of f(x) at x = a is\n", + "\n", + "$$\n", + "\\begin{align*}\n", + "&\\lim_{{\\varepsilon \\to 0}} \\left( \\frac{f(a + \\varepsilon) - f(a)}{\\varepsilon} \\right) = \\lim_{{\\varepsilon \\to 0}} \\frac{\\alpha - \\alpha}{\\varepsilon} = \\lim_{{\\varepsilon \\to 0}} 0 = 0 \\\\\n", + "\\end{align*}\n", + "$$\n", + "Thus, the derivative of f(x) is f′(x) = 0. This makes sense, since our function is constant—the rate of change is 0.\n", + "\n", + "Note: The differential coefficient of f(x) at x = a is often simply called the derivative of f(x) at x = a, or just f′(a).\n", + "\n", + "2. Let’s calculate the derivative of linear function f(x) = αx + β. 
The derivative of f(x) at x = α is\n", + "\n", + "$$\n", + "\\begin{align*}\n", + "&\\lim_{{\\varepsilon \\to 0}} \\left( \\frac{f(\\alpha + \\varepsilon) - f(a)}{\\varepsilon} \\right) = \\lim_{{\\varepsilon \\to 0}} \\frac{\\alpha(a + \\varepsilon) + \\beta - (\\alpha a + \\beta)}{\\varepsilon} = \\lim_{{\\varepsilon \\to 0}} \\alpha = \\alpha \\\\\n", + "\\end{align*}\n", + "$$\n", + "Thus, the derivative of f(x) is f′(x) = α, a constant value. This result should also be intuitive—linear functions have a constant rate of change by definition.\n", + "\n", + "3. Let’s find the derivative of f(x) = x2. The differential coefficient of f(x) at x = a is\n", + "\n", + "$$\n", + "\\begin{align*}\n", + "&\\lim_{{\\varepsilon \\to 0}} \\left( \\frac{f(a + \\varepsilon) - f(a)}{\\varepsilon} \\right) = \\lim_{{\\varepsilon \\to 0}} \\left( (a + \\varepsilon)^2 - a^2 \\right) = \\lim_{{\\varepsilon \\to 0}} 2a\\varepsilon + \\varepsilon = \\lim_{{\\varepsilon \\to 0}} (2a + \\varepsilon) = 2a \\\\\n", + "\\end{align*}\n", + "$$\n", + "Thus, the differential coefficient of f(x) at x = a is 2a, or f′(a) = 2a. Therefore, the derivative of f(x) is f′(x) = 2x.\n", + "\n", + "### Summary\n", + "\n", + "- The calculation of a limit that appears in calculus is simply a formula calculating an error.\n", + "- A limit is used to obtain a derivative.\n", + "- The derivative is the slope of the tangent line at a given point.\n", + "- The derivative is nothing but the rate of change.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "And here is the result as rendered by https://upmath.me/.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "In this short notebook, we saw how to use parsing instructions to improve the quality and accuracy of parsing with LlamaParse!" + ], + "metadata": { + "id": "rfFdeWZKmmLW" + } + } + ] +} \ No newline at end of file