README Fixing #3481

Open
wants to merge 8 commits into main
2 changes: 1 addition & 1 deletion CODEOWNERS
@@ -1,7 +1,7 @@
* @yk @andreaskoepf
/website/ @AbdBarho @notmd @yk @andreaskoepf
/website/src/data/team.json @yk @andreaskoepf @fozziethebeat @AbdBarho @notmd @theblackcat102 @sanagno @olliestanley @andrewm4894
/model/ @theblackcat102 @sanagno @dvruette @andreaskoepf @yk @jordiclive @shahules786
/model/ @theblackcat102 @sanagno @dvruette @andreaskoepf @yk
/copilot/ @andreaskoepf @yk
/docs/ @andrewm4894 @olliestanley @andreaskoepf @yk
/.devcontainer/ @andrewm4894 @andreaskoepf @yk
14 changes: 7 additions & 7 deletions backend/README.md
@@ -11,7 +11,7 @@ In root directory, run
a database. The default settings are already configured to connect to the
database at `localhost:5432`. (See
[FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend)
if you face any docker problems).
if you face any Docker problems).

> **Note:** when running on MacOS with an M1 chip you have to use:
> `DB_PLATFORM=linux/x86_64 docker compose ...`
@@ -21,7 +21,7 @@ the `.python-version` in the project root directory.

### Python Packages

Next, to install all requirements, You can run
Next, to install all requirements, you can run:

1. `pip install -r backend/requirements.txt`
2. `pip install -e ./oasst-shared/.`
@@ -58,7 +58,7 @@ information.
Once you have successfully started the backend server, you can access the
default api docs at `localhost:8080/docs`. If you need to update the exported
openapi.json in the docs/ folder you can run below command to `wget` them from
the relevant local fastapi endpoint. This will enable anyone to just see API
the relevant local FastAPI endpoint. This will enable anyone to just see API
docs via something like
[Swagger.io](https://editor.swagger.io/?url=https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/docs/docs/api/openapi.json)
without having to actually set up and run a development backend.
@@ -68,16 +68,16 @@ without having to actually set up and run a development backend.
wget localhost:8080/api/v1/openapi.json -O docs/docs/api/backend-openapi.json
```

Note: The api docs should be automatically updated by the
Note: The API docs should be automatically updated by the
`test-api-contract.yaml` workflow. (TODO)

## Running Celery Worker(s) for API and periodic tasks

Celery workers are used for Huggingface API calls like toxicity and feature
Celery workers are used for HuggingFace API calls like toxicity and feature
extraction. Celery Beat along with worker is used for periodic tasks like user
streak update

To run APIs locally
To run APIs locally:

- update HUGGING_FACE_API_KEY in backend/oasst_backend/config.py with the
correct API_KEY
@@ -87,7 +87,7 @@ To run APIs locally
- run start_worker.sh in backend dir
- to see logs , use `tail -f celery.log` and `tail -f celery.beat.log`

In CI
In CI:

- set `DEBUG_SKIP_TOXICITY_CALCULATION=False` and
`DEBUG_SKIP_EMBEDDING_COMPUTATION=False` in docker-compose.yaml
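For readers unfamiliar with the worker/beat split this README describes, here is a minimal, self-contained Celery sketch. The broker URL, module name, and task bodies are illustrative assumptions and do not mirror `oasst_backend`'s actual layout.

```python
# celery_sketch.py -- illustrative only; not the backend's real module layout
from celery import Celery

app = Celery("sketch", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")

@app.task(name="check_toxicity")
def check_toxicity(text: str) -> dict:
    """Stand-in for a HuggingFace API call; a worker process executes this on demand."""
    return {"text": text, "toxic": False}

@app.task(name="update_user_streak")
def update_user_streak() -> None:
    """Stand-in for the periodic user-streak update."""
    pass

# Celery Beat only schedules periodic tasks; a worker still executes them.
app.conf.beat_schedule = {
    "update-user-streak-hourly": {"task": "update_user_streak", "schedule": 3600.0},
}

# Run (in separate shells, with Redis available):
#   celery -A celery_sketch worker --loglevel=INFO
#   celery -A celery_sketch beat --loglevel=INFO
```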
4 changes: 2 additions & 2 deletions backend/oasst_backend/prompt_repository.py
@@ -30,8 +30,8 @@
from oasst_backend.models.payload_column_type import PayloadContainer
from oasst_backend.task_repository import TaskRepository, validate_frontend_message_id
from oasst_backend.user_repository import UserRepository
from oasst_backend.utils import discord
from oasst_backend.utils.database_utils import CommitMode, db_lang_to_postgres_ts_lang, managed_tx_method
from oasst_backend.utils.discord import send_new_report_message
from oasst_shared.exceptions import OasstError, OasstErrorCode
from oasst_shared.schemas import protocol as protocol_schema
from oasst_shared.schemas.protocol import SystemStats
@@ -595,7 +595,7 @@ def store_text_labels(self, text_labels: protocol_schema.TextLabels) -> tuple[Te
message_id, protocol_schema.EmojiOp.add, protocol_schema.EmojiCode.red_flag
)

discord.send_new_report_message(message=message, label_text=text_labels.text, user_id=self.user_id)
send_new_report_message.delay(message=message, label_text=text_labels.text, user_id=self.user_id)

# update existing record for repeated updates (same user no task associated)
existing_text_label = self.fetch_non_task_text_labels(message_id, self.user_id)
3 changes: 3 additions & 0 deletions backend/oasst_backend/utils/discord.py
@@ -2,15 +2,18 @@

import requests
from loguru import logger
from oasst_backend.celery_worker import app as celery_app
from oasst_backend.config import settings
from oasst_backend.models.message import Message

ROOT_ENDPOINT = "https://discord.com/api/v10"


@celery_app.task(name="send_new_report_message")
def send_new_report_message(message: Message, label_text: str, user_id: UUID):
"""
Send a message to the Discord channel when a new message is flagged.
Note: this is a Celery task.

Args:
message (Message): the flagged message
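The two hunks above move the Discord report notification onto Celery: the function gains a `@celery_app.task` decorator and the caller switches from a direct call to `.delay()`. A rough, hypothetical sketch of that caller-side difference (simplified argument types, placeholder broker URL, not the project's actual code):

```python
from celery import Celery

celery_app = Celery("sketch", broker="redis://localhost:6379/0")  # placeholder broker URL

@celery_app.task(name="send_new_report_message")
def send_new_report_message(message: dict, label_text: str, user_id: str) -> None:
    """In the real task this would POST the report to a Discord webhook."""
    print(f"flagged message for user {user_id}: {label_text}")

# Before: a direct call that runs inside the API request and blocks it.
send_new_report_message({"id": 1}, "red_flag", "user-123")

# After: .delay() serialises the arguments onto the broker and returns immediately;
# a separate Celery worker picks the job up (requires a running broker such as Redis).
send_new_report_message.delay({"id": 1}, "red_flag", "user-123")
```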
4 changes: 2 additions & 2 deletions backend/requirements.txt
@@ -2,7 +2,7 @@ aiohttp==3.8.3
alembic==1.8.1
asgiref==3.6.0
Celery==5.2.0
cryptography==39.0.0
cryptography==41.0.0
fastapi==0.88.0
fastapi-limiter==0.1.5
fastapi-utils==0.2.1
@@ -15,7 +15,7 @@ pydantic[email]==1.10.4
python-dotenv==0.21.0
python-jose[cryptography]==3.3.0
redis==4.5.5
requests==2.30.0
requests==2.31.0
scipy==1.8.1
SQLAlchemy==1.4.41
sqlmodel==0.0.8
4 changes: 2 additions & 2 deletions backend/update_message_attributes.py
@@ -2,7 +2,7 @@

from loguru import logger
from oasst_backend.models import ApiClient, Message
from oasst_backend.scheduled_tasks import hf_feature_extraction, toxicity
from oasst_backend.scheduled_tasks import check_toxicity, hf_feature_extraction
from oasst_backend.utils.database_utils import default_session_factory
from sqlmodel import text

@@ -71,7 +71,7 @@ def find_and_update_toxicity(message_ids):
text = result.payload.payload.text
api_client = session.query(ApiClient).filter(ApiClient.id == api_client_id).first()
if api_client is not None and text is not None:
toxicity(text=text, message_id=message_id, api_client=api_client.__dict__)
check_toxicity(text=text, message_id=message_id, api_client=api_client.__dict__)
# to not get rate limited from HF
time.sleep(10)
except Exception as e:
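The script above spaces out its HuggingFace calls with a fixed `time.sleep(10)` to stay under the rate limit. A more general pattern, sketched here against the public Inference API with placeholder model and key values, is to back off only when the API actually returns HTTP 429:

```python
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/<model-id>"   # placeholder model id
HEADERS = {"Authorization": "Bearer <HUGGING_FACE_API_KEY>"}          # placeholder key

def query_with_backoff(text: str, max_retries: int = 5) -> dict:
    """POST to the Inference API, doubling the wait whenever the call is rate-limited."""
    delay = 10.0
    for _ in range(max_retries):
        response = requests.post(API_URL, headers=HEADERS, json={"inputs": text}, timeout=30)
        if response.status_code == 429:   # rate limited: wait, then retry
            time.sleep(delay)
            delay *= 2
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("HuggingFace API still rate-limited after retries")
```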
4 changes: 2 additions & 2 deletions copilot/README.md
@@ -25,14 +25,14 @@ Replace with a proper domain to setup SSL certificates.
copilot env deploy
```

This will create a variety of aws roles and services needed for deployment.
This will create a variety of AWS roles and services needed for deployment.

```sh
copilot deploy
```

This will deploy the services but it won't be 100% ready for usage. Before being
ready, we have to inspect the AWS Secrets manager and extract out the database
ready, we have to inspect the AWS Secrets Manager and extract out the database
credentials. Read those credentials then put them, and a few other secrets, in a
`secrets.yml` file like the following:

@@ -3,7 +3,7 @@ from an annotated version of the code-search-net dataset. The annotated version
of code-search-net dataset can be found
[here](https://huggingface.co/datasets/Nan-Do/code-search-net-python).

The dataset contains around 450000 python annotated functions. The dataset is
The dataset contains around 450000 Python annotated functions. The dataset is
split into two blocks, one in which the task is starting from the annotated
summary to generate an instruction to generate the code as a response, and
another one in which the expected response is to generate a description of the
@@ -38,7 +38,7 @@
"\n",
"this list was build from https://anvaka.github.io/redsim. Can be used to expand the list of favourable subreddits.\n",
"\n",
"taking these for now"
"takeing these for now"
]
},
{
2 changes: 1 addition & 1 deletion data/datasets/oa_leet10k/oa_leet10k.ipynb
@@ -7,7 +7,7 @@
"source": [
"Takes this Kaggle dataset 'leetcode-solutions'\n",
"https://www.kaggle.com/datasets/erichartford/leetcode-solutions, and turns them into basic\n",
"dialogue using a preset list of user prompt tempaltes."
"dialogue using a preset list of user prompt templates."
]
},
{
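The notebook above turns the Kaggle 'leetcode-solutions' records into basic dialogue with a preset list of user prompt templates. A stripped-down sketch of that idea, with templates and field names invented purely for illustration:

```python
import random

# Hypothetical templates; the notebook keeps its own preset list.
USER_PROMPT_TEMPLATES = [
    "Can you solve the LeetCode problem '{title}' for me?",
    "Please write a solution to '{title}'.",
]

def to_dialogue(title: str, solution: str) -> dict:
    """Turn one problem/solution pair into a single-turn dialogue record."""
    return {
        "INSTRUCTION": random.choice(USER_PROMPT_TEMPLATES).format(title=title),
        "RESPONSE": solution,
    }

print(to_dialogue("Two Sum", "def two_sum(nums, target): ..."))
```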
16 changes: 8 additions & 8 deletions data/datasets/poetry_instruction/README.md
@@ -10,17 +10,17 @@ Languages English

Dataset Structure This dataset follows the OA format, which is:

INSTRUCTION (string): The user asks for a poem (from a variety of premade
prompts) with topics (tags). If the given poem has no tags, the user asks for a
poem on its own.
- INSTRUCTION (string): The user asks for a poem (from a variety of premade
prompts) with topics (tags). If the given poem has no tags, the user asks for
a poem on its own.

RESPONSE (string): The assistant replies with the poem and title (from a variety
of premade prompts).
- RESPONSE (string): The assistant replies with the poem and title (from a
variety of premade prompts).

SOURCE (string): The source is PoetryFoundation.org and the poet's name.
- SOURCE (string): The source is PoetryFoundation.org and the poet's name.

METADATA (JSON String): {"author": "author of the original poem", "title":
"title of the poem", "tags": "tags from poetry foundation."}
- METADATA (JSON String): {"author": "author of the original poem", "title":
"title of the poem", "tags": "tags from poetry foundation."}

Preparing the Dataset The dataset can be created with prepare.py. Make sure to
install the required libraries in requirements.txt!
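As a hedged illustration of the OA format the README describes (all field values below are made up), a single row could be assembled like this before prepare.py writes the dataset out:

```python
import json
import pandas as pd

row = {
    "INSTRUCTION": "Write me a poem about the sea.",       # premade prompt, optionally with tags
    "RESPONSE": "Here is a poem titled 'Tides':\n...",      # poem plus its title
    "SOURCE": "PoetryFoundation.org - <poet name>",         # placeholder poet
    "METADATA": json.dumps({"author": "<author>", "title": "Tides", "tags": "sea"}),
}

df = pd.DataFrame([row])
df.to_parquet("poetry_instruction.parquet", index=False)    # illustrative output path
```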
2 changes: 1 addition & 1 deletion data/datasets/prosocial_confessions/README.md
@@ -6,7 +6,7 @@
- A [classifier](https://huggingface.co/shahules786/prosocial-classifier)
trained on prosocial dialog dataset is used for pseudo labeling.
- More information on dataset can be found
[here](https://huggingface.co/datasets/shahules786/prosocial-confessions)
[here](https://huggingface.co/datasets/shahules786/prosocial-confessions).

## Example

2 changes: 1 addition & 1 deletion data/datasets/reasoning_gsm_qna_oa/README.MD
@@ -5,7 +5,7 @@
License: MIT. Contains Parquet of a list of instructions and answers (English
only). Reasoning, logic and programming.

Each row consists of
Each row consists of:

- INSTRUCTION
- RESPONSE
2 changes: 1 addition & 1 deletion data/datasets/recipes/README.md
@@ -2,7 +2,7 @@

Here we convert several existing recipe ingredient and instructions datasets
into dialogue. Each notebook processes a different dataset and creates a final
dataset to be uploaded to huggingface.
dataset to be uploaded to HuggingFace.

## tasty_recipes.ipynb

2 changes: 1 addition & 1 deletion data/datasets/recipes/tasty_recipes.ipynb
@@ -5,7 +5,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Takes this Kaggle dataset 'Recipes from Tasty' https://www.kaggle.com/datasets/zeeenb/recipes-from-tasty?select=ingredient_and_instructions.json, and turns them into basic dialogue using a preset list of user prompt tempaltes."
"Takes this Kaggle dataset 'Recipes from Tasty' https://www.kaggle.com/datasets/zeeenb/recipes-from-tasty?select=ingredient_and_instructions.json, and turns them into basic dialogue using a preset list of user prompt templates."
]
},
{
2 changes: 1 addition & 1 deletion data/datasets/safety_directory/child_help/child_help.py
@@ -951,7 +951,7 @@
"Parent-Child Support Line": {
"region": "Hong Kong (China)",
"page": "https://childhelplineinternational.org/hong-kong-china-parent-child-support-line/",
"description": "Operated by Action Againt Abuse (ACA), the Parent-Child Support Line provides service where parents, children, professionals and the public can call the hotline 2755 1122, or go to the ACA centre to report suspected child abuse cases or ask questions about any issues they are facing. It is also a support and hotline for children to express their voices and opinions. The personal data and case content of the data provider/reporter are kept strictly confidential.",
"description": "Operated by Action Against Abuse (ACA), the Parent-Child Support Line provides service where parents, children, professionals and the public can call the hotline 2755 1122, or go to the ACA centre to report suspected child abuse cases or ask questions about any issues they are facing. It is also a support and hotline for children to express their voices and opinions. The personal data and case content of the data provider/reporter are kept strictly confidential.",
"contacts": {
"Website": {"type": "website", "link": "https://www.aca.org.hk/index.php#.YmRbANNBw-Q"},
"116 111": {"type": "phone", "link": "tel:"},
6 changes: 3 additions & 3 deletions data/datasets/tv_dialogue/imsdb.ipynb
@@ -231,9 +231,9 @@
"        text += f\"{speaker}\\r\\n\"\n",
"    if not re.findall(r\"\\[.+?\\] .+?\\r\\n\\r\\n\\[.+?\\] .+?\\r\\n\\r\\n\", text):\n",
"        return \"\"\n",
"    first_occurance = re.findall(r\"\\[.+?\\] \", text)[0]\n",
"    if len(re.findall(re.escape(first_occurance), text)) == 1:\n",
"        text = re.sub(re.escape(first_occurance), f\"{first_occurance[1:-2]}\\r\\n\", text)\n",
"    first_occurrence = re.findall(r\"\\[.+?\\] \", text)[0]\n",
"    if len(re.findall(re.escape(first_occurrence), text)) == 1:\n",
"        text = re.sub(re.escape(first_occurrence), f\"{first_occurrence[1:-2]}\\r\\n\", text)\n",
"\n",
"    text = text.replace(\"&amp;\", \"&\")\n",
"    text = \"\\r\\n\".join(text.splitlines())\n",
4 changes: 2 additions & 2 deletions data/datasets/zhihu-kol/convert_parquet.py
@@ -3,7 +3,7 @@
import pandas as pd


def reformat_csv_to_openassitant(df: pd.DataFrame) -> pd.DataFrame:
def reformat_csv_to_openassistant(df: pd.DataFrame) -> pd.DataFrame:
"""
Reformat the downloaded CSV into either Instruction or Text format
so that it could be directly ingested into the training pipeline.
@@ -44,6 +44,6 @@ def reformat_csv_to_openassistant(df: pd.DataFrame) -> pd.DataFrame:
input_csv = "zhihu.csv"
# Create a pandas dataframe from your dataset file(s)
df = pd.read_csv(input_csv) # or any other way
df = reformat_csv_to_openassitant(df)
df = reformat_csv_to_openassistant(df)
# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
4 changes: 2 additions & 2 deletions data/datasets/zhihu-kol/main.py
@@ -155,7 +155,7 @@ def get_answer_content(qid: str, aid) -> str:
return content


def reformat_csv_to_openassitant(df: pd.DataFrame) -> pd.DataFrame:
def reformat_csv_to_openassistant(df: pd.DataFrame) -> pd.DataFrame:
"""
Reformat the downloaded CSV into either Instruction or Text format
so that it could be directly ingested into the training pipeline.
@@ -226,7 +226,7 @@ def start(qid: str, aid: str):
start(qid, aid)
multitasking.wait_for_tasks()
df["回答内容"] = df["问题ID"].apply(lambda x: content_list[x])
updated_df = reformat_csv_to_openassitant(df)
updated_df = reformat_csv_to_openassistant(df)
updated_df.to_csv(csv_path, encoding="utf-8-sig", index=None)
bar.close()
print(f"url_token 为 {url_token} 的用户回答数据已存储到文件:{csv_path}")
2 changes: 1 addition & 1 deletion docker/grafana/README.md
@@ -9,6 +9,6 @@ This folder contains various configuration files for Grafana.
Grafana where some pre-configured dashboards live.
- [`./dashboards/fastapi-backend.json`](./dashboards/fastapi-backend.json) - A
json representation of a saved Grafana dashboard focusing on some high level
api endpoint metrics etc.
API endpoint metrics etc.
- [`./datasources/datasource.yml`](./datasources/datasource.yml) - A config file
to set up Grafana to read from the local Prometheus source.
@@ -8,7 +8,7 @@ image: https://img.youtube.com/vi/5IymlBZDw-0/0.jpg

import ReactPlayer from "react-player";

Livestream playing around with Open Assistant and AI allignement :)
Livestream playing around with Open Assistant and AI alignment :)

https://open-assistant.io/chat

2 changes: 1 addition & 1 deletion docs/docs/architecture/README.md
@@ -4,6 +4,6 @@

The Inference architecture is comprised of several core components: a text, or
frontend client, a FastAPI webserver, a database with several tables, Reddis
used for queueing, and distributed gpu workers.
used for queueing, and distributed GPU workers.

A more detailed overview can be viewed [here](inference.md).
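As a rough sketch of the queueing idea only (not the actual inference server code), a FastAPI endpoint could push generation requests onto a Redis list that GPU workers pop from:

```python
import json
import uuid

import redis
from fastapi import FastAPI

app = FastAPI()
queue = redis.Redis(host="localhost", port=6379, db=0)  # assumed local Redis instance

@app.post("/chat")
def create_chat(prompt: str) -> dict:
    """Enqueue a generation request; a GPU worker would BRPOP it, run the model,
    and stream tokens back for the client-facing response."""
    job_id = str(uuid.uuid4())
    queue.lpush("work_queue", json.dumps({"id": job_id, "prompt": prompt}))
    return {"job_id": job_id}
```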
2 changes: 1 addition & 1 deletion docs/docs/plugins/README.md
@@ -6,7 +6,7 @@

:::note

In the GitHub repo You can see all issues and PR's with the
In the GitHub repo, you can see all issues and PR's with the
[`plugins`](https://github.com/LAION-AI/Open-Assistant/issues?q=label%3Aplugins)
label if you want to dive deeper.

2 changes: 1 addition & 1 deletion docs/docs/research/retrieval.md
@@ -139,7 +139,7 @@ i.e. the 7B can utilize 40 nearest neighbor chunks, a 172M model only 10 NNs.
### Bertsch et al. 2023: Unlimiformer: Long-Range Transformers with Unlimited Length Input

Idea: Use retrieval to actually maximize overlap of "query embeddings" with
embeddings from an encoder (in a encoder-decoder architecture). Essentially it
embeddings from an encoder (in an encoder-decoder architecture). Essentially it
is an ideal approximation of the softmax in the Cross-Attention over all
previous tokens (in the encoder inputs).

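To make the retrieval-as-attention idea concrete, here is a toy NumPy sketch (illustrative only, single head, no scaling) that attends over just the k encoder states closest to the query instead of all of them:

```python
import numpy as np

def knn_cross_attention(query: np.ndarray, keys: np.ndarray, values: np.ndarray, k: int) -> np.ndarray:
    """Approximate full cross-attention by keeping only the k keys that best match the
    query (by dot product), then applying softmax over that retrieved subset."""
    scores = keys @ query                       # similarity of the query to every encoder state
    top_k = np.argsort(scores)[-k:]             # indices of the k best-matching states
    weights = np.exp(scores[top_k] - scores[top_k].max())
    weights /= weights.sum()                    # softmax over the retrieved subset only
    return weights @ values[top_k]              # weighted sum over the retrieved values

rng = np.random.default_rng(0)
keys = rng.normal(size=(10_000, 64))            # a long encoder input
values = rng.normal(size=(10_000, 64))
query = rng.normal(size=64)
print(knn_cross_attention(query, keys, values, k=16).shape)  # (64,)
```

With an approximate nearest-neighbour index in place of the full `argsort`, the same softmax-over-top-k gives a cheap approximation of cross-attention over arbitrarily long inputs, which is the core of the Unlimiformer argument.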
4 changes: 2 additions & 2 deletions inference/README.md
@@ -75,8 +75,8 @@ Navigate to http://0.0.0.0:8089/ to view the locust UI.

## API Docs

To update the api docs, once the inference server is running run below command
to download the inference openapi json into the relevant folder under `/docs`:
To update the API docs, once the inference server is running run below command
to download the inference OpenAPI json into the relevant folder under `/docs`:

```bash
wget localhost:8000/openapi.json -O docs/docs/api/inference-openapi.json
2 changes: 1 addition & 1 deletion inference/server/README.md
@@ -3,7 +3,7 @@
Workers communicate with the `/work` endpoint via Websocket. They provide their
configuration and if a task is available, the server returns it. The worker then
performs the task and returns the result in a streaming fashion to the server,
also via websocket.
also via Websocket.

Clients first call `/chat` to make a new chat, then add to that via
`/chat/<id>/message`. The response is a SSE event source, which will send tokens
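For orientation only, a minimal worker-side sketch of the flow described above, using the third-party `websockets` package; the endpoint path and message shapes here are assumptions, not the server's actual schema:

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

async def worker_loop(url: str = "ws://localhost:8000/work") -> None:
    """Connect, announce our configuration, then stream tokens back for each task."""
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"model": "<model-name>", "max_length": 512}))  # assumed config shape
        while True:
            task = json.loads(await ws.recv())        # server sends a task when one is available
            for token in ["Hello", ",", " world"]:    # stand-in for real model generation
                await ws.send(json.dumps({"task_id": task.get("id"), "token": token}))

if __name__ == "__main__":
    asyncio.run(worker_loop())
```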