README Fixing #3481

Open
wants to merge 8 commits into main
2 changes: 1 addition & 1 deletion CODEOWNERS
@@ -1,7 +1,7 @@
* @yk @andreaskoepf
/website/ @AbdBarho @notmd @yk @andreaskoepf
/website/src/data/team.json @yk @andreaskoepf @fozziethebeat @AbdBarho @notmd @theblackcat102 @sanagno @olliestanley @andrewm4894
/model/ @theblackcat102 @sanagno @dvruette @andreaskoepf @yk @jordiclive @shahules786
/model/ @theblackcat102 @sanagno @dvruette @andreaskoepf @yk
/copilot/ @andreaskoepf @yk
/docs/ @andrewm4894 @olliestanley @andreaskoepf @yk
/.devcontainer/ @andrewm4894 @andreaskoepf @yk
14 changes: 7 additions & 7 deletions backend/README.md
@@ -11,7 +11,7 @@ In root directory, run
a database. The default settings are already configured to connect to the
database at `localhost:5432`. (See
[FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend)
if you face any docker problems).
if you face any Docker problems).

> **Note:** when running on MacOS with an M1 chip you have to use:
> `DB_PLATFORM=linux/x86_64 docker compose ...`
@@ -21,7 +21,7 @@ the `.python-version` in the project root directory.

### Python Packages

Next, to install all requirements, You can run
Next, to install all requirements, you can run:

1. `pip install -r backend/requirements.txt`
2. `pip install -e ./oasst-shared/.`
@@ -58,7 +58,7 @@ information.
Once you have successfully started the backend server, you can access the
default api docs at `localhost:8080/docs`. If you need to update the exported
openapi.json in the docs/ folder you can run below command to `wget` them from
the relevant local fastapi endpoint. This will enable anyone to just see API
the relevant local FastAPI endpoint. This will enable anyone to just see API
docs via something like
[Swagger.io](https://editor.swagger.io/?url=https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/docs/docs/api/openapi.json)
without having to actually set up and run a development backend.
@@ -68,16 +68,16 @@ without having to actually set up and run a development backend.
wget localhost:8080/api/v1/openapi.json -O docs/docs/api/backend-openapi.json
```

Note: The api docs should be automatically updated by the
Note: The API docs should be automatically updated by the
`test-api-contract.yaml` workflow. (TODO)

## Running Celery Worker(s) for API and periodic tasks

Celery workers are used for Huggingface API calls like toxicity and feature
Celery workers are used for HuggingFace API calls like toxicity and feature
extraction. Celery Beat along with worker is used for periodic tasks like user
streak update

To run APIs locally
To run APIs locally:

- update HUGGING_FACE_API_KEY in backend/oasst_backend/config.py with the
correct API_KEY
@@ -87,7 +87,7 @@ To run APIs locally
- run start_worker.sh in backend dir
- to see logs , use `tail -f celery.log` and `tail -f celery.beat.log`

In CI
In CI:

- set `DEBUG_SKIP_TOXICITY_CALCULATION=False` and
`DEBUG_SKIP_EMBEDDING_COMPUTATION=False` in docker-compose.yaml
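For readers unfamiliar with the worker/beat split this README describes, here is a minimal, self-contained Celery sketch. The broker URL, module name, and task bodies are illustrative assumptions and do not mirror `oasst_backend`'s actual layout.

```python
# celery_sketch.py -- illustrative only; not the backend's real module layout
from celery import Celery

app = Celery("sketch", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")

@app.task(name="check_toxicity")
def check_toxicity(text: str) -> dict:
    """Stand-in for a HuggingFace API call; a worker process executes this on demand."""
    return {"text": text, "toxic": False}

@app.task(name="update_user_streak")
def update_user_streak() -> None:
    """Stand-in for the periodic user-streak update."""
    pass

# Celery Beat only schedules periodic tasks; a worker still executes them.
app.conf.beat_schedule = {
    "update-user-streak-hourly": {"task": "update_user_streak", "schedule": 3600.0},
}

# Run (in separate shells, with Redis available):
#   celery -A celery_sketch worker --loglevel=INFO
#   celery -A celery_sketch beat --loglevel=INFO
```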
4 changes: 2 additions & 2 deletions backend/oasst_backend/prompt_repository.py
@@ -30,8 +30,8 @@
from oasst_backend.models.payload_column_type import PayloadContainer
from oasst_backend.task_repository import TaskRepository, validate_frontend_message_id
from oasst_backend.user_repository import UserRepository
from oasst_backend.utils import discord
from oasst_backend.utils.database_utils import CommitMode, db_lang_to_postgres_ts_lang, managed_tx_method
from oasst_backend.utils.discord import send_new_report_message
from oasst_shared.exceptions import OasstError, OasstErrorCode
from oasst_shared.schemas import protocol as protocol_schema
from oasst_shared.schemas.protocol import SystemStats
@@ -595,7 +595,7 @@ def store_text_labels(self, text_labels: protocol_schema.TextLabels) -> tuple[Te
message_id, protocol_schema.EmojiOp.add, protocol_schema.EmojiCode.red_flag
)

discord.send_new_report_message(message=message, label_text=text_labels.text, user_id=self.user_id)
send_new_report_message.delay(message=message, label_text=text_labels.text, user_id=self.user_id)

# update existing record for repeated updates (same user no task associated)
existing_text_label = self.fetch_non_task_text_labels(message_id, self.user_id)
3 changes: 3 additions & 0 deletions backend/oasst_backend/utils/discord.py
@@ -2,15 +2,18 @@

import requests
from loguru import logger
from oasst_backend.celery_worker import app as celery_app
from oasst_backend.config import settings
from oasst_backend.models.message import Message

ROOT_ENDPOINT = "https://discord.com/api/v10"


@celery_app.task(name="send_new_report_message")
def send_new_report_message(message: Message, label_text: str, user_id: UUID):
"""
Send a message to the Discord channel when a new message is flagged.
Note: this is a Celery task.

Args:
message (Message): the flagged message
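The two hunks above move the Discord report notification onto Celery: the function gains a `@celery_app.task` decorator and the caller switches from a direct call to `.delay()`. A rough, hypothetical sketch of that caller-side difference (simplified argument types, placeholder broker URL, not the project's actual code):

```python
from celery import Celery

celery_app = Celery("sketch", broker="redis://localhost:6379/0")  # placeholder broker URL

@celery_app.task(name="send_new_report_message")
def send_new_report_message(message: dict, label_text: str, user_id: str) -> None:
    """In the real task this would POST the report to a Discord webhook."""
    print(f"flagged message for user {user_id}: {label_text}")

# Before: a direct call that runs inside the API request and blocks it.
send_new_report_message({"id": 1}, "red_flag", "user-123")

# After: .delay() serialises the arguments onto the broker and returns immediately;
# a separate Celery worker picks the job up (requires a running broker such as Redis).
send_new_report_message.delay({"id": 1}, "red_flag", "user-123")
```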
4 changes: 2 additions & 2 deletions backend/requirements.txt
@@ -2,7 +2,7 @@ aiohttp==3.8.3
alembic==1.8.1
asgiref==3.6.0
Celery==5.2.0
cryptography==39.0.0
cryptography==41.0.0
fastapi==0.88.0
fastapi-limiter==0.1.5
fastapi-utils==0.2.1
@@ -15,7 +15,7 @@ pydantic[email]==1.10.4
python-dotenv==0.21.0
python-jose[cryptography]==3.3.0
redis==4.5.5
requests==2.30.0
requests==2.31.0
scipy==1.8.1
SQLAlchemy==1.4.41
sqlmodel==0.0.8
4 changes: 2 additions & 2 deletions backend/update_message_attributes.py
@@ -2,7 +2,7 @@

from loguru import logger
from oasst_backend.models import ApiClient, Message
from oasst_backend.scheduled_tasks import hf_feature_extraction, toxicity
from oasst_backend.scheduled_tasks import check_toxicity, hf_feature_extraction
from oasst_backend.utils.database_utils import default_session_factory
from sqlmodel import text

@@ -71,7 +71,7 @@ def find_and_update_toxicity(message_ids):
text = result.payload.payload.text
api_client = session.query(ApiClient).filter(ApiClient.id == api_client_id).first()
if api_client is not None and text is not None:
toxicity(text=text, message_id=message_id, api_client=api_client.__dict__)
check_toxicity(text=text, message_id=message_id, api_client=api_client.__dict__)
# to not get rate limited from HF
time.sleep(10)
except Exception as e:
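The script above spaces out its HuggingFace calls with a fixed `time.sleep(10)` to stay under the rate limit. A more general pattern, sketched here against the public Inference API with placeholder model and key values, is to back off only when the API actually returns HTTP 429:

```python
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/<model-id>"   # placeholder model id
HEADERS = {"Authorization": "Bearer <HUGGING_FACE_API_KEY>"}          # placeholder key

def query_with_backoff(text: str, max_retries: int = 5) -> dict:
    """POST to the Inference API, doubling the wait whenever the call is rate-limited."""
    delay = 10.0
    for _ in range(max_retries):
        response = requests.post(API_URL, headers=HEADERS, json={"inputs": text}, timeout=30)
        if response.status_code == 429:   # rate limited: wait, then retry
            time.sleep(delay)
            delay *= 2
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("HuggingFace API still rate-limited after retries")
```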
4 changes: 2 additions & 2 deletions copilot/README.md
@@ -25,14 +25,14 @@ Replace with a proper domain to setup SSL certificates.
copilot env deploy
```

This will create a variety of aws roles and services needed for deployment.
This will create a variety of AWS roles and services needed for deployment.

```sh
copilot deploy
```

This will deploy the services but it won't be 100% ready for usage. Before being
ready, we have to inspect the AWS Secrets manager and extract out the database
ready, we have to inspect the AWS Secrets Manager and extract out the database
credentials. Read those credentials then put them, and a few other secrets, in a
`secrets.yml` file like the following:

@@ -3,7 +3,7 @@ from an annotated version of the code-search-net dataset. The annotated version
of code-search-net dataset can be found
[here](https://huggingface.co/datasets/Nan-Do/code-search-net-python).

The dataset contains around 450000 python annotated functions. The dataset is
The dataset contains around 450000 Python annotated functions. The dataset is
split into two blocks, one in which the task is starting from the annotated
summary to generate an instruction to generate the code as a response, and
another one in which the expected response is to generate a description of the
@@ -38,7 +38,7 @@
"\n",
"this list was build from https://anvaka.github.io/redsim. Can be used to expand the list of favourable subreddits.\n",
"\n",
"taking these for now"
"takeing these for now"
]
},
{
2 changes: 1 addition & 1 deletion data/datasets/oa_leet10k/oa_leet10k.ipynb
@@ -7,7 +7,7 @@
"source": [
"Takes this Kaggle dataset 'leetcode-solutions'\n",
"https://www.kaggle.com/datasets/erichartford/leetcode-solutions, and turns them into basic\n",
"dialogue using a preset list of user prompt tempaltes."
"dialogue using a preset list of user prompt templates."
]
},
{
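The notebook above turns the Kaggle 'leetcode-solutions' records into basic dialogue with a preset list of user prompt templates. A stripped-down sketch of that idea, with templates and field names invented purely for illustration:

```python
import random

# Hypothetical templates; the notebook keeps its own preset list.
USER_PROMPT_TEMPLATES = [
    "Can you solve the LeetCode problem '{title}' for me?",
    "Please write a solution to '{title}'.",
]

def to_dialogue(title: str, solution: str) -> dict:
    """Turn one problem/solution pair into a single-turn dialogue record."""
    return {
        "INSTRUCTION": random.choice(USER_PROMPT_TEMPLATES).format(title=title),
        "RESPONSE": solution,
    }

print(to_dialogue("Two Sum", "def two_sum(nums, target): ..."))
```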
16 changes: 8 additions & 8 deletions data/datasets/poetry_instruction/README.md
@@ -10,17 +10,17 @@ Languages English

Dataset Structure This dataset follows the OA format, which is:

INSTRUCTION (string): The user asks for a poem (from a variety of premade
prompts) with topics (tags). If the given poem has no tags, the user asks for a
poem on its own.
- INSTRUCTION (string): The user asks for a poem (from a variety of premade
prompts) with topics (tags). If the given poem has no tags, the user asks for
a poem on its own.

RESPONSE (string): The assistant replies with the poem and title (from a variety
of premade prompts).
- RESPONSE (string): The assistant replies with the poem and title (from a
variety of premade prompts).

SOURCE (string): The source is PoetryFoundation.org and the poet's name.
- SOURCE (string): The source is PoetryFoundation.org and the poet's name.

METADATA (JSON String): {"author": "author of the original poem", "title":
"title of the poem", "tags": "tags from poetry foundation."}
- METADATA (JSON String): {"author": "author of the original poem", "title":
"title of the poem", "tags": "tags from poetry foundation."}

Preparing the Dataset The dataset can be created with prepare.py. Make sure to
install the required libraries in requirements.txt!
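As a hedged illustration of the OA format the README describes (all field values below are made up), a single row could be assembled like this before prepare.py writes the dataset out:

```python
import json
import pandas as pd

row = {
    "INSTRUCTION": "Write me a poem about the sea.",       # premade prompt, optionally with tags
    "RESPONSE": "Here is a poem titled 'Tides':\n...",      # poem plus its title
    "SOURCE": "PoetryFoundation.org - <poet name>",         # placeholder poet
    "METADATA": json.dumps({"author": "<author>", "title": "Tides", "tags": "sea"}),
}

df = pd.DataFrame([row])
df.to_parquet("poetry_instruction.parquet", index=False)    # illustrative output path
```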
2 changes: 1 addition & 1 deletion data/datasets/prosocial_confessions/README.md
@@ -6,7 +6,7 @@
- A [classifier](https://huggingface.co/shahules786/prosocial-classifier)
trained on prosocial dialog dataset is used for pseudo labeling.
- More information on dataset can be found
[here](https://huggingface.co/datasets/shahules786/prosocial-confessions)
[here](https://huggingface.co/datasets/shahules786/prosocial-confessions).

## Example

2 changes: 1 addition & 1 deletion data/datasets/reasoning_gsm_qna_oa/README.MD
@@ -5,7 +5,7 @@
License: MIT. Contains Parquet of a list of instructions and answers (English
only). Reasoning, logic and programming.

Each row consists of
Each row consists of:

- INSTRUCTION
- RESPONSE
2 changes: 1 addition & 1 deletion data/datasets/recipes/README.md
@@ -2,7 +2,7 @@

Here we convert several existing recipe ingredient and instructions datasets
into dialogue. Each notebook processes a different dataset and creates a final
dataset to be uploaded to huggingface.
dataset to be uploaded to HuggingFace.

## tasty_recipes.ipynb

2 changes: 1 addition & 1 deletion data/datasets/recipes/tasty_recipes.ipynb
@@ -5,7 +5,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Takes this Kaggle dataset 'Recipes from Tasty' https://www.kaggle.com/datasets/zeeenb/recipes-from-tasty?select=ingredient_and_instructions.json, and turns them into basic dialogue using a preset list of user prompt tempaltes."
"Takes this Kaggle dataset 'Recipes from Tasty' https://www.kaggle.com/datasets/zeeenb/recipes-from-tasty?select=ingredient_and_instructions.json, and turns them into basic dialogue using a preset list of user prompt templates."
]
},
{
2 changes: 1 addition & 1 deletion data/datasets/safety_directory/child_help/child_help.py
@@ -951,7 +951,7 @@
"Parent-Child Support Line": {
"region": "Hong Kong (China)",
"page": "https://childhelplineinternational.org/hong-kong-china-parent-child-support-line/",
"description": "Operated by Action Againt Abuse (ACA), the Parent-Child Support Line provides service where parents, children, professionals and the public can call the hotline 2755 1122, or go to the ACA centre to report suspected child abuse cases or ask questions about any issues they are facing. It is also a support and hotline for children to express their voices and opinions. The personal data and case content of the data provider/reporter are kept strictly confidential.",
"description": "Operated by Action Against Abuse (ACA), the Parent-Child Support Line provides service where parents, children, professionals and the public can call the hotline 2755 1122, or go to the ACA centre to report suspected child abuse cases or ask questions about any issues they are facing. It is also a support and hotline for children to express their voices and opinions. The personal data and case content of the data provider/reporter are kept strictly confidential.",
"contacts": {
"Website": {"type": "website", "link": "https://www.aca.org.hk/index.php#.YmRbANNBw-Q"},
"116 111": {"type": "phone", "link": "tel:"},
6 changes: 3 additions & 3 deletions data/datasets/tv_dialogue/imsdb.ipynb
@@ -231,9 +231,9 @@
"        text += f\"{speaker}\\r\\n\"\n",
"    if not re.findall(r\"\\[.+?\\] .+?\\r\\n\\r\\n\\[.+?\\] .+?\\r\\n\\r\\n\", text):\n",
"        return \"\"\n",
"    first_occurance = re.findall(r\"\\[.+?\\] \", text)[0]\n",
"    if len(re.findall(re.escape(first_occurance), text)) == 1:\n",
"        text = re.sub(re.escape(first_occurance), f\"{first_occurance[1:-2]}\\r\\n\", text)\n",
"    first_occurrence = re.findall(r\"\\[.+?\\] \", text)[0]\n",
"    if len(re.findall(re.escape(first_occurrence), text)) == 1:\n",
"        text = re.sub(re.escape(first_occurrence), f\"{first_occurrence[1:-2]}\\r\\n\", text)\n",
"\n",
"    text = text.replace(\"&amp;\", \"&\")\n",
"    text = \"\\r\\n\".join(text.splitlines())\n",
4 changes: 2 additions & 2 deletions data/datasets/zhihu-kol/convert_parquet.py
@@ -3,7 +3,7 @@
import pandas as pd


def reformat_csv_to_openassitant(df: pd.DataFrame) -> pd.DataFrame:
def reformat_csv_to_openassistant(df: pd.DataFrame) -> pd.DataFrame:
"""
Reformat the downloaded CSV into either Instruction or Text format
so that it could be directly ingested into the training pipeline.
@@ -44,6 +44,6 @@ def reformat_csv_to_openassistant(df: pd.DataFrame) -> pd.DataFrame:
input_csv = "zhihu.csv"
# Create a pandas dataframe from your dataset file(s)
df = pd.read_csv(input_csv) # or any other way
df = reformat_csv_to_openassitant(df)
df = reformat_csv_to_openassistant(df)
# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
4 changes: 2 additions & 2 deletions data/datasets/zhihu-kol/main.py
@@ -155,7 +155,7 @@ def get_answer_content(qid: str, aid) -> str:
return content


def reformat_csv_to_openassitant(df: pd.DataFrame) -> pd.DataFrame:
def reformat_csv_to_openassistant(df: pd.DataFrame) -> pd.DataFrame:
"""
Reformat the downloaded CSV into either Instruction or Text format
so that it could be directly ingested into the training pipeline.
@@ -226,7 +226,7 @@ def start(qid: str, aid: str):
start(qid, aid)
multitasking.wait_for_tasks()
df["回答内容"] = df["问题ID"].apply(lambda x: content_list[x])
updated_df = reformat_csv_to_openassitant(df)
updated_df = reformat_csv_to_openassistant(df)
updated_df.to_csv(csv_path, encoding="utf-8-sig", index=None)
bar.close()
print(f"url_token 为 {url_token} 的用户回答数据已存储到文件:{csv_path}")
2 changes: 1 addition & 1 deletion docker/grafana/README.md
@@ -9,6 +9,6 @@ This folder contains various configuration files for Grafana.
Grafana where some pre-configured dashboards live.
- [`./dashboards/fastapi-backend.json`](./dashboards/fastapi-backend.json) - A
json representation of a saved Grafana dashboard focusing on some high level
api endpoint metrics etc.
API endpoint metrics etc.
- [`./datasources/datasource.yml`](./datasources/datasource.yml) - A config file
to set up Grafana to read from the local Prometheus source.
@@ -8,7 +8,7 @@ image: https://img.youtube.com/vi/5IymlBZDw-0/0.jpg

import ReactPlayer from "react-player";

Livestream playing around with Open Assistant and AI allignement :)
Livestream playing around with Open Assistant and AI alignment :)

https://open-assistant.io/chat

2 changes: 1 addition & 1 deletion docs/docs/architecture/README.md
@@ -4,6 +4,6 @@

The Inference architecture is comprised of several core components: a text, or
frontend client, a FastAPI webserver, a database with several tables, Reddis
used for queueing, and distributed gpu workers.
used for queueing, and distributed GPU workers.

A more detailed overview can be viewed [here](inference.md).
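As a rough sketch of the queueing idea only (not the actual inference server code), a FastAPI endpoint could push generation requests onto a Redis list that GPU workers pop from:

```python
import json
import uuid

import redis
from fastapi import FastAPI

app = FastAPI()
queue = redis.Redis(host="localhost", port=6379, db=0)  # assumed local Redis instance

@app.post("/chat")
def create_chat(prompt: str) -> dict:
    """Enqueue a generation request; a GPU worker would BRPOP it, run the model,
    and stream tokens back for the client-facing response."""
    job_id = str(uuid.uuid4())
    queue.lpush("work_queue", json.dumps({"id": job_id, "prompt": prompt}))
    return {"job_id": job_id}
```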
2 changes: 1 addition & 1 deletion docs/docs/plugins/README.md
@@ -6,7 +6,7 @@

:::note

In the GitHub repo You can see all issues and PR's with the
In the GitHub repo, you can see all issues and PR's with the
[`plugins`](https://github.com/LAION-AI/Open-Assistant/issues?q=label%3Aplugins)
label if you want to dive deeper.

2 changes: 1 addition & 1 deletion docs/docs/research/retrieval.md
@@ -139,7 +139,7 @@ i.e. the 7B can utilize 40 nearest neighbor chunks, a 172M model only 10 NNs.
### Bertsch et al. 2023: Unlimiformer: Long-Range Transformers with Unlimited Length Input

Idea: Use retrieval to actually maximize overlap of "query embeddings" with
embeddings from an encoder (in a encoder-decoder architecture). Essentially it
embeddings from an encoder (in an encoder-decoder architecture). Essentially it
is an ideal approximation of the softmax in the Cross-Attention over all
previous tokens (in the encoder inputs).

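To make the retrieval-as-attention idea concrete, here is a toy NumPy sketch (illustrative only, single head, no scaling) that attends over just the k encoder states closest to the query instead of all of them:

```python
import numpy as np

def knn_cross_attention(query: np.ndarray, keys: np.ndarray, values: np.ndarray, k: int) -> np.ndarray:
    """Approximate full cross-attention by keeping only the k keys that best match the
    query (by dot product), then applying softmax over that retrieved subset."""
    scores = keys @ query                       # similarity of the query to every encoder state
    top_k = np.argsort(scores)[-k:]             # indices of the k best-matching states
    weights = np.exp(scores[top_k] - scores[top_k].max())
    weights /= weights.sum()                    # softmax over the retrieved subset only
    return weights @ values[top_k]              # weighted sum over the retrieved values

rng = np.random.default_rng(0)
keys = rng.normal(size=(10_000, 64))            # a long encoder input
values = rng.normal(size=(10_000, 64))
query = rng.normal(size=64)
print(knn_cross_attention(query, keys, values, k=16).shape)  # (64,)
```

With an approximate nearest-neighbour index in place of the full `argsort`, the same softmax-over-top-k gives a cheap approximation of cross-attention over arbitrarily long inputs, which is the core of the Unlimiformer argument.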
4 changes: 2 additions & 2 deletions inference/README.md
@@ -75,8 +75,8 @@ Navigate to http://0.0.0.0:8089/ to view the locust UI.

## API Docs

To update the api docs, once the inference server is running run below command
to download the inference openapi json into the relevant folder under `/docs`:
To update the API docs, once the inference server is running run below command
to download the inference OpenAPI json into the relevant folder under `/docs`:

```bash
wget localhost:8000/openapi.json -O docs/docs/api/inference-openapi.json
2 changes: 1 addition & 1 deletion inference/server/README.md
@@ -3,7 +3,7 @@
Workers communicate with the `/work` endpoint via Websocket. They provide their
configuration and if a task is available, the server returns it. The worker then
performs the task and returns the result in a streaming fashion to the server,
also via websocket.
also via Websocket.

Clients first call `/chat` to make a new chat, then add to that via
`/chat/<id>/message`. The response is a SSE event source, which will send tokens
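For orientation only, a minimal worker-side sketch of the flow described above, using the third-party `websockets` package; the endpoint path and message shapes here are assumptions, not the server's actual schema:

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

async def worker_loop(url: str = "ws://localhost:8000/work") -> None:
    """Connect, announce our configuration, then stream tokens back for each task."""
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"model": "<model-name>", "max_length": 512}))  # assumed config shape
        while True:
            task = json.loads(await ws.recv())        # server sends a task when one is available
            for token in ["Hello", ",", " world"]:    # stand-in for real model generation
                await ws.send(json.dumps({"task_id": task.get("id"), "token": token}))

if __name__ == "__main__":
    asyncio.run(worker_loop())
```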