From 82f8bd5b58e8d742797ef19bc8ab031b940df75f Mon Sep 17 00:00:00 2001 From: Jerry Liu Date: Thu, 14 Mar 2024 00:47:55 -0700 Subject: [PATCH] cr --- examples/other_files/demo_ppt_financial.ipynb | 422 ++++++++++++++++++ 1 file changed, 422 insertions(+) create mode 100644 examples/other_files/demo_ppt_financial.ipynb diff --git a/examples/other_files/demo_ppt_financial.ipynb b/examples/other_files/demo_ppt_financial.ipynb new file mode 100644 index 0000000..947fadf --- /dev/null +++ b/examples/other_files/demo_ppt_financial.ipynb @@ -0,0 +1,422 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "eld1dKaN7P8B" + }, + "source": [ + "# LlamaParse - Parsing Financial PowerPoints 📊\n", + "\n", + "In this cookbook we show you how to use LlamaParse to parse a financial PowerPoint presentation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "goB1sV8zu_Xl" + }, + "source": [ + "## Installation\n", + "\n", + "Parsing instructions are part of the LlamaParse API. They can be accessed by directly specifying the parsing_instruction parameter in the API or by using the LlamaParse Python module (which we will use for this tutorial).\n", + "\n", + "To install llama-parse, just get it from `pip`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7Y3_BwQLu-qK", + "outputId": "b1129c52-7a70-44cc-ad03-1f8d3a8c794a" + }, + "outputs": [], + "source": [ + "!pip install llama-index\n", + "!pip install llama-parse\n", + "!pip install torch transformers python-pptx Pillow" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i-Rg2D_Rvf2i" + }, + "source": [ + "## API Key\n", + "\n", + "Using LlamaParse requires an API key, which you can get here: https://cloud.llamaindex.ai/parse" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "af6i2P1vuU-U" + }, + "outputs": [], + "source": [ + "import os\n", + 
"os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8Eq-aX-wAEo" + }, + "source": [ + "**NOTE**: Since LlamaParse is natively async, running the sync code in a notebook requires the use of nest_asyncio.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "4OB0BkTqv_0l", + "tags": [] + }, + "outputs": [], + "source": [ + "import nest_asyncio\n", + "\n", + "nest_asyncio.apply()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dz927ecMyYo_" + }, + "source": [ + "## Importing the package\n", + "\n", + "To import llama_parse simply do:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nSW-6sEwyXwx", + "tags": [] + }, + "outputs": [], + "source": [ + "from llama_parse import LlamaParse" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l_D4YsAHwUSk" + }, + "source": [ + "## Using LlamaParse to Parse Presentations\n", + "\n", + "Presentations such as PowerPoint files are often hard to extract for RAG. With LlamaParse we can now parse them and unlock their content for RAG.\n", + "\n", + "Let's download a financial report from the World Meteorological Organization." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d3qeuiyawT0U", + "outputId": "cec0ea0a-be8b-49b6-9376-797c91f63be7", + "tags": [] + }, + "outputs": [], + "source": [ + "! 
mkdir -p data; wget \"https://meetings.wmo.int/Cg-19/PublishingImages/SitePages/FINAC-43/7%20-%20EC-77-Doc%205%20Financial%20Statements%20for%202022%20(FINAC).pptx\" -O data/presentation.pptx" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gbr8RiHEyF3-" + }, + "source": [ + "### Parsing the presentation\n", + "\n", + "Now let's parse it with both the default LlamaIndex reader and LlamaParse, which outputs Markdown.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "osocsofoJ42S" + }, + "source": [ + "#### LlamaIndex default" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "PTVy5XCNJwW-", + "outputId": "d0e2cc4b-1407-45a9-b5e6-d06f91a533b4", + "tags": [] + }, + "outputs": [], + "source": [ + "from llama_index.core import SimpleDirectoryReader\n", + "\n", + "vanilla_documents = SimpleDirectoryReader(\"./data/\").load_data()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oucbsciZJwxt" + }, + "source": [ + "#### LlamaParse" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3jKnXCuAyQ9_", + "outputId": "1f668f17-1e20-46e5-fbab-9a55e4b28891", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Started parsing the file under job_id 56724c0d-e45a-4e30-ae8c-e416173c608a\n" + ] + } + ], + "source": [ + "llama_parse_documents = LlamaParse(result_type=\"markdown\").load_data(\"./data/presentation.pptx\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's take a look at the parsed output from an example slide (see image below).\n", + "\n", + "As we can see, the table is faithfully extracted!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ation and mitigation\n", + "---\n", + "|Item|31 Dec 2022|31 Dec 2021|Change|\n", + "|---|---|---|---|\n", + "|Payables and accruals|4,685|4,066|619|\n", + "|Employee benefits|127,215|84,676|42,539|\n", + "|Contributions received in advance|6,975|10,192|(3,217)|\n", + "|Unearned revenue from exchange transactions|20|651|(631)|\n", + "|Deferred Revenue|71,301|55,737|15,564|\n", + "|Borrowings|28,229|29,002|(773)|\n", + "|Funds held in trust|30,373|29,014|1,359|\n", + "|Provisions|1,706|1,910|(204)|\n", + "|Total Liabilities|270,504|215,248|55,256|\n", + "---\n", + "## Liabilities\n", + "\n", + "Employee Ben\n" + ] + } + ], + "source": [ + "print(llama_parse_documents[0].get_content()[-2800:-2300])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Compare this against the original slide image.\n", + "![Demo](demo_ppt_financial_1.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p4GVOdWzzvYg" + }, + "source": [ + "## Comparing the two for RAG\n", + "\n", + "The main difference between LlamaParse and the previous directory reader approach is that LlamaParse extracts the document in a structured format, allowing better RAG." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oVcdGus5NDxi" + }, + "source": [ + "### Query Engine on SimpleDirectoryReader results" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "DqXYsLCWNg9_", + "tags": [] + }, + "outputs": [], + "source": [ + "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n", + "\n", + "vanilla_index = VectorStoreIndex.from_documents(vanilla_documents)\n", + "vanilla_query_engine = vanilla_index.as_query_engine()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZLkHt9l2Nbxx" + }, + "source": [ + "### Query Engine on LlamaParse Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "id": "ZllaDcfRNLv3", + "tags": [] + }, + "outputs": [], + "source": [ + "llama_parse_index = VectorStoreIndex.from_documents(llama_parse_documents)\n", + "llama_parse_query_engine = llama_parse_index.as_query_engine()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0dY_0_1bNg0X", + "tags": [] + }, + "source": [ + "### Liability provision\n", + "What was the liability provision as of Dec 31 2021?\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Tmn-qNTEN-cb", + "outputId": "a9bffc00-9cfc-43d8-b159-596a6c1aca64", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The liability provision as of December 31, 2021, included Employee Benefit Liabilities, Contributions received in advance (assessed contributions), and Deferred revenue.\n" + ] + } + ], + "source": [ + "vanilla_response = vanilla_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n", + "print(vanilla_response)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4EZ_uqlROP7R", + "outputId": 
"0645a159-06c6-411e-d1f6-79ea95d32b42", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The liability provision as of December 31, 2021, was 1,910 CHF.\n" + ] + } + ], + "source": [ + "llama_parse_response = llama_parse_query_engine.query(\"What was the liability provision as of Dec 31 2021?\")\n", + "print(llama_parse_response)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "llama_parse", + "language": "python", + "name": "llama_parse" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}