README Fixing #3481

Open: wants to merge 8 commits into `main`. Changes from 5 commits.

14 changes: 7 additions & 7 deletions backend/README.md
@@ -11,7 +11,7 @@ In root directory, run
a database. The default settings are already configured to connect to the
database at `localhost:5432`. (See
[FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend)
-if you face any docker problems).
+if you face any Docker problems).

> **Note:** when running on MacOS with an M1 chip you have to use:
> `DB_PLATFORM=linux/x86_64 docker compose ...`
@@ -21,7 +21,7 @@ the `.python-version` in the project root directory.

### Python Packages

-Next, to install all requirements, You can run
+Next, to install all requirements, you can run:

1. `pip install -r backend/requirements.txt`
2. `pip install -e ./oasst-shared/.`
@@ -58,7 +58,7 @@ information.
Once you have successfully started the backend server, you can access the
default api docs at `localhost:8080/docs`. If you need to update the exported
openapi.json in the docs/ folder you can run below command to `wget` them from
-the relevant local fastapi endpoint. This will enable anyone to just see API
+the relevant local FastAPI endpoint. This will enable anyone to just see API
docs via something like
[Swagger.io](https://editor.swagger.io/?url=https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/docs/docs/api/openapi.json)
without having to actually set up and run a development backend.
@@ -68,16 +68,16 @@ without having to actually set up and run a development backend.

```
wget localhost:8080/api/v1/openapi.json -O docs/docs/api/backend-openapi.json
```

-Note: The api docs should be automatically updated by the
+Note: The API docs should be automatically updated by the
`test-api-contract.yaml` workflow. (TODO)

## Running Celery Worker(s) for API and periodic tasks

-Celery workers are used for Huggingface API calls like toxicity and feature
+Celery workers are used for HuggingFace API calls like toxicity and feature
extraction. Celery Beat along with worker is used for periodic tasks like user
streak update

-To run APIs locally
+To run APIs locally:

- update HUGGING_FACE_API_KEY in backend/oasst_backend/config.py with the
correct API_KEY
@@ -87,7 +87,7 @@ To run APIs locally
- run start_worker.sh in backend dir
- to see logs , use `tail -f celery.log` and `tail -f celery.beat.log`

-In CI
+In CI:

- set `DEBUG_SKIP_TOXICITY_CALCULATION=False` and
`DEBUG_SKIP_EMBEDDING_COMPUTATION=False` in docker-compose.yaml
4 changes: 2 additions & 2 deletions copilot/README.md
@@ -25,14 +25,14 @@ Replace with a proper domain to setup SSL certificates.
copilot env deploy
```

-This will create a variety of aws roles and services needed for deployment.
+This will create a variety of AWS roles and services needed for deployment.

```sh
copilot deploy
```

This will deploy the services but it won't be 100% ready for usage. Before being
-ready, we have to inspect the AWS Secrets manager and extract out the database
+ready, we have to inspect the AWS Secrets Manager and extract out the database
credentials. Read those credentials then put them, and a few other secrets, in a
`secrets.yml` file like the following:

@@ -3,7 +3,7 @@ from an annotated version of the code-search-net dataset. The annotated version
of code-search-net dataset can be found
[here](https://huggingface.co/datasets/Nan-Do/code-search-net-python).

-The dataset contains around 450000 python annotated functions. The dataset is
+The dataset contains around 450000 Python annotated functions. The dataset is
split into two blocks, one in which the task is starting from the annotated
summary to generate an instruction to generate the code as a response, and
another one in which the expected response is to generate a description of the
8 changes: 4 additions & 4 deletions data/datasets/poetry_instruction/README.md
@@ -10,16 +10,16 @@ Languages English

Dataset Structure This dataset follows the OA format, which is:

-INSTRUCTION (string): The user asks for a poem (from a variety of premade
+- INSTRUCTION (string): The user asks for a poem (from a variety of premade
prompts) with topics (tags). If the given poem has no tags, the user asks for a
poem on it's own.

-RESPONSE (string): The assistant replies with the poem and title (from a variety
+- RESPONSE (string): The assistant replies with the poem and title (from a variety
of premade prompts).

-SOURCE (string): The source is PoetryFoundation.org and the poet's name.
+- SOURCE (string): The source is PoetryFoundation.org and the poet's name.

-METADATA (JSON String): {"author": "author of the original poem", "title":
+- METADATA (JSON String): {"author": "author of the original poem", "title":
  "title of the poem", "tags": "tags from poetry foundation."}

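For illustration, a single row in this format might look like the following sketch; every field value below is invented, not taken from the dataset:

```python
# Hypothetical example of one row in the OA format described above;
# the poem, author, title, and tags are invented for illustration.
example_row = {
    "INSTRUCTION": "Can you write me a poem about hope and spring?",
    "RESPONSE": "A New Season\n\nOut of the frozen ground a green shoot climbs...",
    "SOURCE": "PoetryFoundation.org - Jane Doe",
    "METADATA": '{"author": "Jane Doe", "title": "A New Season", "tags": "hope, spring"}',
}
```
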
Preparing the Dataset The dataset can be created with prepare.py. Make sure to
2 changes: 1 addition & 1 deletion data/datasets/prosocial_confessions/README.md
@@ -6,7 +6,7 @@
- A [classifier](https://huggingface.co/shahules786/prosocial-classifier)
trained on prosocial dialog dataset is used for pseudo labeling.
- More information on dataset can be found
-[here](https://huggingface.co/datasets/shahules786/prosocial-confessions)
+[here](https://huggingface.co/datasets/shahules786/prosocial-confessions).

## Example

2 changes: 1 addition & 1 deletion data/datasets/reasoning_gsm_qna_oa/README.MD
@@ -5,7 +5,7 @@
License: MIT. Contains Parquet of a list of instructions and answers (English
only). Reasoning, logic and programming.

-Each row consists of
+Each row consists of:

- INSTRUCTION
- RESPONSE
2 changes: 1 addition & 1 deletion data/datasets/recipes/README.md
@@ -2,7 +2,7 @@

Here we convert several existing recipe ingredient and instructions datasets
into dialogue. Each notebook processes a different dataset and creates a final
-dataset to be uploaded to huggingface.
+dataset to be uploaded to HuggingFace.

## tasty_recipes.ipynb

2 changes: 1 addition & 1 deletion docker/grafana/README.md
@@ -9,6 +9,6 @@ This folder contains various configuration files for Grafana.
Grafana where some pre-configured dashboards live.
- [`./dashboards/fastapi-backend.json`](./dashboards/fastapi-backend.json) - A
json representation of a saved Grafana dashboard focusing on some high level
-api endpoint metrics etc.
+API endpoint metrics etc.
- [`./datasources/datasource.yml`](./datasources/datasource.yml) - A config file
to set up Grafana to read from the local Prometheus source.
2 changes: 1 addition & 1 deletion docs/docs/architecture/README.md
@@ -4,6 +4,6 @@

The Inference architecture is comprised of several core components: a text, or
frontend client, a FastAPI webserver, a database with several tables, Reddis
-used for queueing, and distributed gpu workers.
+used for queueing, and distributed GPU workers.

A more detailed overview can be viewed [here](inference.md).
2 changes: 1 addition & 1 deletion docs/docs/plugins/README.md
@@ -6,7 +6,7 @@

:::note

-In the GitHub repo You can see all issues and PR's with the
+In the GitHub repo, you can see all issues and PR's with the
[`plugins`](https://github.com/LAION-AI/Open-Assistant/issues?q=label%3Aplugins)
label if you want to dive deeper.

4 changes: 2 additions & 2 deletions inference/README.md
@@ -75,8 +75,8 @@ Navigate to http://0.0.0.0:8089/ to view the locust UI.

## API Docs

-To update the api docs, once the inference server is running run below command
-to download the inference openapi json into the relevant folder under `/docs`:
+To update the API docs, once the inference server is running run below command
+to download the inference OpenAPI json into the relevant folder under `/docs`:

```bash
wget localhost:8000/openapi.json -O docs/docs/api/inference-openapi.json
```
2 changes: 1 addition & 1 deletion inference/server/README.md
@@ -3,7 +3,7 @@
Workers communicate with the `/work` endpoint via Websocket. They provide their
configuration and if a task is available, the server returns it. The worker then
performs the task and returns the result in a streaming fashion to the server,
-also via websocket.
+also via Websocket.

Clients first call `/chat` to make a new chat, then add to that via
`/chat/<id>/message`. The response is a SSE event source, which will send tokens
4 changes: 2 additions & 2 deletions inference/worker/README.md
@@ -2,8 +2,8 @@

## Running the worker

-To run the worker, you need to have docker installed, including the docker
-nvidia runtime if you want to use a GPU. We made a convenience-script you can
+To run the worker, you need to have Docker installed, including the Docker
+NVIDIA runtime if you want to use a GPU. We made a convenience-script you can
download and run to start the worker:

```bash
```
8 changes: 4 additions & 4 deletions model/README.md
@@ -83,7 +83,7 @@ To change the model used, i.e. larger pythia version create a new config in
`EleutherAI/pythia-{size}-deduped`. Larger models will probably need to also
adjust the `--learning_rate` and `--per_device_train_batch_size` flags.

-4. Get SFT trained model
+4. Get SFT trained model.

```bash
# choose a specific checkpoint
```
@@ -95,14 +95,14 @@ export SFT_MODEL=$MODEL_PATH/sft_model/$(ls -t $MODEL_PATH/sft_model/ | head -n

### RM Training

-5. Train the reward model
+5. Train the reward model.

```bash
cd ../reward/instructor
python trainer.py configs/deberta-v3-base.yml --output_dir $MODEL_PATH/reward_model
```

-6. Get RM trained model
+6. Get RM trained model.

```bash
# choose a specific checkpoint
```
@@ -114,7 +114,7 @@ export REWARD_MODEL=$MODEL_PATH/reward_model/$(ls -t $MODEL_PATH/reward_model/ |

### RL Training

-7. Train the RL agent
+7. Train the RL agent.

```bash
cd ../../model_training
```
4 changes: 2 additions & 2 deletions notebooks/closed-book-qa/README.md
@@ -1,6 +1,6 @@
# Generate Topics, Questions, and Answers from a paragraph of text

-This python code can be used to generate topics, questions, and answers from a
+This Python code can be used to generate topics, questions, and answers from a
paragraph of text. This is a good way to generate ground truth knowledge about a
topic from a trusted source.

@@ -38,7 +38,7 @@ The output of this is a dictionary with the following information:

## Requirements

-This code is verified to work on a 24GB vram graphics card (like an RTX3090). We
+This code is verified to work on a 24GB VRAM graphics card (like an RTX3090). We
are working on getting it to run on Google Colab TPUs, and also it may be
possible to use smaller T5 models like the 3 billion parameter model and still
get acceptable results.
4 changes: 2 additions & 2 deletions notebooks/data-augmentation/essay-instructions/README.md
@@ -2,10 +2,10 @@

Essay Instructions is a notebook that takes an essay as an input and generates
instructions on how to generate that essay. This will be very useful for data
-collecting for the model
+collecting for the model.

## Contributing

Feel free to contribute to this notebook, it's nowhere near perfect but it's a
good start. If you want to contribute finding a new model that better suits this
-task would be great. Huggingface has a lot of models that could help.
+task would be great. HuggingFace has a lot of models that could help.
10 changes: 5 additions & 5 deletions notebooks/detoxify-evaluation/README.md
@@ -1,7 +1,7 @@
# Detoxify evaluation

[Detoxify](https://github.com/unitaryai/detoxify) is a open source model used to
-identify prompts as toxic
+identify prompts as toxic.

<img src="https://raw.githubusercontent.com/unitaryai/detoxify/master/examples.png" alt="Image from detoxify github that shows the example input/output of their model" />

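For quick reference, a minimal usage sketch (assuming the pip-installable `detoxify` package, whose README documents this `Detoxify(...).predict(...)` API):

```python
# Minimal sketch, assuming `pip install detoxify`.
from detoxify import Detoxify

# "original" is one of the pretrained variants evaluated in this README.
scores = Detoxify("original").predict("example prompt to score")
print(scores)  # dict of per-category toxicity scores
```
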
@@ -16,15 +16,15 @@ trained on

Unbiased and original models also have a 'small' version - but since normal
models are not memory heavy, and small models perform noticeably worse, they are
-only described in the notebook
+only described in the notebook.

## All tests below were ran on a 3090TI

# Inference and training times and memory usages

Charts showing detailed memory usages and times for different sentence lengths
and batch sizes are inside the notebook Quick overview batch size 16, sentence
-length 4k for training, batch size 128 sentence length 4k for Inference
+length 4k for training, batch size 128 sentence length 4k for Inference.

| Model name | Training memory | Training speed | Inference Memory | Inference Speed |
| :----------: | :-------------: | :------------: | :--------------: | :-------------: |
@@ -34,7 +34,7 @@ length 4k for training, batch size 128 sentence length 4k for Inference

# Filtering quality

-Detoxify was tested on 4 different types of inputs
+Detoxify was tested on 4 different types of inputs:

- Not obviously toxic
- Not obviously non-toxic
@@ -57,7 +57,7 @@ toxicity if it's presented in formal language.

With some caution it can be used to filter prompts but I would suggest also
using someone for verification of messages that are marked as toxic but still
-below 90% confidence
+below 90% confidence.

# Licensing

4 changes: 2 additions & 2 deletions oasst-data/README.md
@@ -8,7 +8,7 @@ If you got the exception `ModuleNotFoundError: No module named 'oasst_data'` you
first need to install the `oasst_data` package:

Run `pip install -e .` in the `oasst-data/` directory of the Open-Assistant
-repository to install the `oasst_data` python package in editable mode.
+repository to install the `oasst_data` Python package in editable mode.

## Reading Open-Assistant Export Files

@@ -41,7 +41,7 @@ which is used to load Open-Assistant export data for supervised fine-tuning
(training) of our language models.

You can also load jsonl data completely without dependencies to `oasst_data`
-solely with standard python libraries. In this case the json objects are loaded
+solely with standard Python libraries. In this case the json objects are loaded
as nested dicts which need to be 'parsed' manually by you:

```python
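# Minimal sketch (not the repository's own example): read an export
# .jsonl or .jsonl.gz file using only the standard library; every line
# is one JSON object that becomes a plain nested dict.
import gzip
import json

path = "example_export.jsonl.gz"  # illustrative file name
opener = gzip.open if path.endswith(".gz") else open
with opener(path, "rt", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        print(list(obj.keys()))  # inspect the fields, then parse manually
```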
2 changes: 1 addition & 1 deletion scripts/data-collection/twitter/README.md
@@ -75,6 +75,6 @@ conversation, or at least as a prompt with replies.
- Write script that matches the original tweets and their text with the archive
data to create the prompt/reply dataset. (Optional)
- Decide on final output format and storage options for the dataset. Currently
-in JSONL with tree / node architecture as python dicts which is acceptable I
+in JSONL with tree / node architecture as Python dicts which is acceptable I
believe.
- Alternatively: Store processed tweets into DB or alternative option.(Optional)
8 changes: 4 additions & 4 deletions website/README.md
@@ -36,7 +36,7 @@ To contribute to the website, make sure you have the following setup and install

1. Node 16: if you are on windows, you can [download node from their website](https://nodejs.org/en/download/releases),
if you are on linux, use [NVM](https://github.com/nvm-sh/nvm) (Once installed, run `nvm use 16`)
-1. [Docker](https://www.docker.com/): We use docker to simplify running dependent services.
+1. [Docker](https://www.docker.com/): We use Docker to simplify running dependent services.

### Getting everything up and running

@@ -48,11 +48,11 @@ If you're doing active development we suggest the following workflow:
- If you want to work on the chat api, you need to run the inference profile as well. Your new command would look
like: `docker compose --profile frontend-dev --profile inference up --build --attach-dependencies`
- See [FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend) if you face any
-docker problems.
+Docker problems.
- Leave this running in the background and continue:
1. Open another terminal tab, navigate to `${OPEN_ASSISTANT_ROOT/website`.
1. Run `npm ci`
-1. Run `npx prisma db push` (This is also needed when you restart the docker stack from scratch).
+1. Run `npx prisma db push` (This is also needed when you restart the Docker stack from scratch).
1. Run `npm run dev`. Now the website is up and running locally at `http://localhost:3000`.
1. To create an account, login via the user using email authentication and navigate to `http://localhost:1080`. Check
the email listed and click the log in link. You're now logged in and authenticated.
@@ -63,7 +63,7 @@ If you're doing active development we suggest the following workflow:
You can use the debug credentials provider to log in without fancy emails or OAuth.

1. This feature is automatically on in development mode, i.e. when you run `npm run dev`. In case you want to do the
-same with a production build (for example, the docker image), then run the website with environment variable
+same with a production build (for example, the Docker image), then run the website with environment variable
`DEBUG_LOGIN=true`.
1. Use the `Login` button in the top right to go to the login page.
1. You should see a section for debug credentials. Enter any username you wish, you will be logged in as that user.
2 changes: 1 addition & 1 deletion website/cypress/README.md
@@ -1,6 +1,6 @@
# Component and e2e testing with Cypress

-[Cypress](https://www.cypress.io/) is used for both component- and end-to-end testing. Below there's a few examples for
+[Cypress](https://www.cypress.io/) is used for both component and end-to-end testing. Below there's a few examples for
the context of this site. To learn more, the
[Cypress documentation](https://docs.cypress.io/guides/getting-started/opening-the-app) has it all.

Expand Down