README Fixing #3481

Open: wants to merge 8 commits into `main`. Changes from 5 commits.

14 changes: 7 additions & 7 deletions backend/README.md
@@ -11,7 +11,7 @@ In root directory, run
a database. The default settings are already configured to connect to the
database at `localhost:5432`. (See
[FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend)
-if you face any docker problems).
+if you face any Docker problems).

> **Note:** when running on MacOS with an M1 chip you have to use:
> `DB_PLATFORM=linux/x86_64 docker compose ...`
@@ -21,7 +21,7 @@ the `.python-version` in the project root directory.

### Python Packages

-Next, to install all requirements, You can run
+Next, to install all requirements, you can run:

1. `pip install -r backend/requirements.txt`
2. `pip install -e ./oasst-shared/.`
@@ -58,7 +58,7 @@ information.
Once you have successfully started the backend server, you can access the
default api docs at `localhost:8080/docs`. If you need to update the exported
openapi.json in the docs/ folder you can run below command to `wget` them from
-the relevant local fastapi endpoint. This will enable anyone to just see API
+the relevant local FastAPI endpoint. This will enable anyone to just see API
docs via something like
[Swagger.io](https://editor.swagger.io/?url=https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/docs/docs/api/openapi.json)
without having to actually set up and run a development backend.
@@ -68,16 +68,16 @@ without having to actually set up and run a development backend.

```
wget localhost:8080/api/v1/openapi.json -O docs/docs/api/backend-openapi.json
```

-Note: The api docs should be automatically updated by the
+Note: The API docs should be automatically updated by the
`test-api-contract.yaml` workflow. (TODO)

## Running Celery Worker(s) for API and periodic tasks

-Celery workers are used for Huggingface API calls like toxicity and feature
+Celery workers are used for HuggingFace API calls like toxicity and feature
extraction. Celery Beat along with worker is used for periodic tasks like user
streak update

-To run APIs locally
+To run APIs locally:

- update HUGGING_FACE_API_KEY in backend/oasst_backend/config.py with the
correct API_KEY
@@ -87,7 +87,7 @@ To run APIs locally
- run start_worker.sh in backend dir
- to see logs , use `tail -f celery.log` and `tail -f celery.beat.log`

-In CI
+In CI:

- set `DEBUG_SKIP_TOXICITY_CALCULATION=False` and
`DEBUG_SKIP_EMBEDDING_COMPUTATION=False` in docker-compose.yaml
4 changes: 2 additions & 2 deletions copilot/README.md
@@ -25,14 +25,14 @@ Replace with a proper domain to setup SSL certificates.
copilot env deploy
```

-This will create a variety of aws roles and services needed for deployment.
+This will create a variety of AWS roles and services needed for deployment.

```sh
copilot deploy
```

This will deploy the services but it won't be 100% ready for usage. Before being
-ready, we have to inspect the AWS Secrets manager and extract out the database
+ready, we have to inspect the AWS Secrets Manager and extract out the database
credentials. Read those credentials then put them, and a few other secrets, in a
`secrets.yml` file like the following:

@@ -3,7 +3,7 @@ from an annotated version of the code-search-net dataset. The annotated version
of code-search-net dataset can be found
[here](https://huggingface.co/datasets/Nan-Do/code-search-net-python).

-The dataset contains around 450000 python annotated functions. The dataset is
+The dataset contains around 450000 Python annotated functions. The dataset is
split into two blocks, one in which the task is starting from the annotated
summary to generate an instruction to generate the code as a response, and
another one in which the expected response is to generate a description of the
8 changes: 4 additions & 4 deletions data/datasets/poetry_instruction/README.md
@@ -10,16 +10,16 @@ Languages English

Dataset Structure This dataset follows the OA format, which is:

-INSTRUCTION (string): The user asks for a poem (from a variety of premade
+- INSTRUCTION (string): The user asks for a poem (from a variety of premade
prompts) with topics (tags). If the given poem has no tags, the user asks for a
poem on it's own.

-RESPONSE (string): The assistant replies with the poem and title (from a variety
+- RESPONSE (string): The assistant replies with the poem and title (from a variety
of premade prompts).

-SOURCE (string): The source is PoetryFoundation.org and the poet's name.
+- SOURCE (string): The source is PoetryFoundation.org and the poet's name.

-METADATA (JSON String): {"author": "author of the original poem", "title":
+- METADATA (JSON String): {"author": "author of the original poem", "title":
  "title of the poem", "tags": "tags from poetry foundation."}

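For illustration, a single row in this format might look like the following sketch; every field value below is invented, not taken from the dataset:

```python
# Hypothetical example of one row in the OA format described above;
# the poem, author, title, and tags are invented for illustration.
example_row = {
    "INSTRUCTION": "Can you write me a poem about hope and spring?",
    "RESPONSE": "A New Season\n\nOut of the frozen ground a green shoot climbs...",
    "SOURCE": "PoetryFoundation.org - Jane Doe",
    "METADATA": '{"author": "Jane Doe", "title": "A New Season", "tags": "hope, spring"}',
}
```
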
Preparing the Dataset The dataset can be created with prepare.py. Make sure to
2 changes: 1 addition & 1 deletion data/datasets/prosocial_confessions/README.md
@@ -6,7 +6,7 @@
- A [classifier](https://huggingface.co/shahules786/prosocial-classifier)
trained on prosocial dialog dataset is used for pseudo labeling.
- More information on dataset can be found
-[here](https://huggingface.co/datasets/shahules786/prosocial-confessions)
+[here](https://huggingface.co/datasets/shahules786/prosocial-confessions).

## Example

2 changes: 1 addition & 1 deletion data/datasets/reasoning_gsm_qna_oa/README.MD
@@ -5,7 +5,7 @@
License: MIT. Contains Parquet of a list of instructions and answers (English
only). Reasoning, logic and programming.

-Each row consists of
+Each row consists of:

- INSTRUCTION
- RESPONSE
2 changes: 1 addition & 1 deletion data/datasets/recipes/README.md
@@ -2,7 +2,7 @@

Here we convert several existing recipe ingredient and instructions datasets
into dialogue. Each notebook processes a different dataset and creates a final
-dataset to be uploaded to huggingface.
+dataset to be uploaded to HuggingFace.

## tasty_recipes.ipynb

2 changes: 1 addition & 1 deletion docker/grafana/README.md
@@ -9,6 +9,6 @@ This folder contains various configuration files for Grafana.
Grafana where some pre-configured dashboards live.
- [`./dashboards/fastapi-backend.json`](./dashboards/fastapi-backend.json) - A
json representation of a saved Grafana dashboard focusing on some high level
-api endpoint metrics etc.
+API endpoint metrics etc.
- [`./datasources/datasource.yml`](./datasources/datasource.yml) - A config file
to set up Grafana to read from the local Prometheus source.
2 changes: 1 addition & 1 deletion docs/docs/architecture/README.md
@@ -4,6 +4,6 @@

The Inference architecture is comprised of several core components: a text, or
frontend client, a FastAPI webserver, a database with several tables, Reddis
-used for queueing, and distributed gpu workers.
+used for queueing, and distributed GPU workers.

A more detailed overview can be viewed [here](inference.md).
2 changes: 1 addition & 1 deletion docs/docs/plugins/README.md
@@ -6,7 +6,7 @@

:::note

-In the GitHub repo You can see all issues and PR's with the
+In the GitHub repo, you can see all issues and PR's with the
[`plugins`](https://github.com/LAION-AI/Open-Assistant/issues?q=label%3Aplugins)
label if you want to dive deeper.

4 changes: 2 additions & 2 deletions inference/README.md
@@ -75,8 +75,8 @@ Navigate to http://0.0.0.0:8089/ to view the locust UI.

## API Docs

-To update the api docs, once the inference server is running run below command
-to download the inference openapi json into the relevant folder under `/docs`:
+To update the API docs, once the inference server is running run below command
+to download the inference OpenAPI json into the relevant folder under `/docs`:

```bash
wget localhost:8000/openapi.json -O docs/docs/api/inference-openapi.json
```
2 changes: 1 addition & 1 deletion inference/server/README.md
@@ -3,7 +3,7 @@
Workers communicate with the `/work` endpoint via Websocket. They provide their
configuration and if a task is available, the server returns it. The worker then
performs the task and returns the result in a streaming fashion to the server,
-also via websocket.
+also via Websocket.

Clients first call `/chat` to make a new chat, then add to that via
`/chat/<id>/message`. The response is a SSE event source, which will send tokens
4 changes: 2 additions & 2 deletions inference/worker/README.md
@@ -2,8 +2,8 @@

## Running the worker

-To run the worker, you need to have docker installed, including the docker
-nvidia runtime if you want to use a GPU. We made a convenience-script you can
+To run the worker, you need to have Docker installed, including the Docker
+NVIDIA runtime if you want to use a GPU. We made a convenience-script you can
download and run to start the worker:

```bash
```
8 changes: 4 additions & 4 deletions model/README.md
@@ -83,7 +83,7 @@ To change the model used, i.e. larger pythia version create a new config in
`EleutherAI/pythia-{size}-deduped`. Larger models will probably need to also
adjust the `--learning_rate` and `--per_device_train_batch_size` flags.

-4. Get SFT trained model
+4. Get SFT trained model.

```bash
# choose a specific checkpoint
```
@@ -95,14 +95,14 @@ export SFT_MODEL=$MODEL_PATH/sft_model/$(ls -t $MODEL_PATH/sft_model/ | head -n

### RM Training

-5. Train the reward model
+5. Train the reward model.

```bash
cd ../reward/instructor
python trainer.py configs/deberta-v3-base.yml --output_dir $MODEL_PATH/reward_model
```

-6. Get RM trained model
+6. Get RM trained model.

```bash
# choose a specific checkpoint
```
@@ -114,7 +114,7 @@ export REWARD_MODEL=$MODEL_PATH/reward_model/$(ls -t $MODEL_PATH/reward_model/ |

### RL Training

-7. Train the RL agent
+7. Train the RL agent.

```bash
cd ../../model_training
```
4 changes: 2 additions & 2 deletions notebooks/closed-book-qa/README.md
@@ -1,6 +1,6 @@
# Generate Topics, Questions, and Answers from a paragraph of text

-This python code can be used to generate topics, questions, and answers from a
+This Python code can be used to generate topics, questions, and answers from a
paragraph of text. This is a good way to generate ground truth knowledge about a
topic from a trusted source.

@@ -38,7 +38,7 @@ The output of this is a dictionary with the following information:

## Requirements

-This code is verified to work on a 24GB vram graphics card (like an RTX3090). We
+This code is verified to work on a 24GB VRAM graphics card (like an RTX3090). We
are working on getting it to run on Google Colab TPUs, and also it may be
possible to use smaller T5 models like the 3 billion parameter model and still
get acceptable results.
4 changes: 2 additions & 2 deletions notebooks/data-augmentation/essay-instructions/README.md
@@ -2,10 +2,10 @@

Essay Instructions is a notebook that takes an essay as an input and generates
instructions on how to generate that essay. This will be very useful for data
-collecting for the model
+collecting for the model.

## Contributing

Feel free to contribute to this notebook, it's nowhere near perfect but it's a
good start. If you want to contribute finding a new model that better suits this
-task would be great. Huggingface has a lot of models that could help.
+task would be great. HuggingFace has a lot of models that could help.
10 changes: 5 additions & 5 deletions notebooks/detoxify-evaluation/README.md
@@ -1,7 +1,7 @@
# Detoxify evaluation

[Detoxify](https://github.com/unitaryai/detoxify) is a open source model used to
-identify prompts as toxic
+identify prompts as toxic.

<img src="https://raw.githubusercontent.com/unitaryai/detoxify/master/examples.png" alt="Image from detoxify github that shows the example input/output of their model" />

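For quick reference, a minimal usage sketch (assuming the pip-installable `detoxify` package, whose README documents this `Detoxify(...).predict(...)` API):

```python
# Minimal sketch, assuming `pip install detoxify`.
from detoxify import Detoxify

# "original" is one of the pretrained variants evaluated in this README.
scores = Detoxify("original").predict("example prompt to score")
print(scores)  # dict of per-category toxicity scores
```
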
@@ -16,15 +16,15 @@ trained on

Unbiased and original models also have a 'small' version - but since normal
models are not memory heavy, and small models perform noticeably worse, they are
-only described in the notebook
+only described in the notebook.

## All tests below were ran on a 3090TI

# Inference and training times and memory usages

Charts showing detailed memory usages and times for different sentence lengths
and batch sizes are inside the notebook Quick overview batch size 16, sentence
-length 4k for training, batch size 128 sentence length 4k for Inference
+length 4k for training, batch size 128 sentence length 4k for Inference.

| Model name | Training memory | Training speed | Inference Memory | Inference Speed |
| :----------: | :-------------: | :------------: | :--------------: | :-------------: |
@@ -34,7 +34,7 @@ length 4k for training, batch size 128 sentence length 4k for Inference

# Filtering quality

-Detoxify was tested on 4 different types of inputs
+Detoxify was tested on 4 different types of inputs:

- Not obviously toxic
- Not obviously non-toxic
@@ -57,7 +57,7 @@ toxicity if it's presented in formal language.

With some caution it can be used to filter prompts but I would suggest also
using someone for verification of messages that are marked as toxic but still
-below 90% confidence
+below 90% confidence.

# Licensing

4 changes: 2 additions & 2 deletions oasst-data/README.md
@@ -8,7 +8,7 @@ If you got the exception `ModuleNotFoundError: No module named 'oasst_data'` you
first need to install the `oasst_data` package:

Run `pip install -e .` in the `oasst-data/` directory of the Open-Assistant
-repository to install the `oasst_data` python package in editable mode.
+repository to install the `oasst_data` Python package in editable mode.

## Reading Open-Assistant Export Files

@@ -41,7 +41,7 @@ which is used to load Open-Assistant export data for supervised fine-tuning
(training) of our language models.

You can also load jsonl data completely without dependencies to `oasst_data`
-solely with standard python libraries. In this case the json objects are loaded
+solely with standard Python libraries. In this case the json objects are loaded
as nested dicts which need to be 'parsed' manually by you:

```python
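# Minimal sketch (not the repository's own example): read an export
# .jsonl or .jsonl.gz file using only the standard library; every line
# is one JSON object that becomes a plain nested dict.
import gzip
import json

path = "example_export.jsonl.gz"  # illustrative file name
opener = gzip.open if path.endswith(".gz") else open
with opener(path, "rt", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        print(list(obj.keys()))  # inspect the fields, then parse manually
```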
2 changes: 1 addition & 1 deletion scripts/data-collection/twitter/README.md
@@ -75,6 +75,6 @@ conversation, or at least as a prompt with replies.
- Write script that matches the original tweets and their text with the archive
data to create the prompt/reply dataset. (Optional)
- Decide on final output format and storage options for the dataset. Currently
-in JSONL with tree / node architecture as python dicts which is acceptable I
+in JSONL with tree / node architecture as Python dicts which is acceptable I
believe.
- Alternatively: Store processed tweets into DB or alternative option.(Optional)
8 changes: 4 additions & 4 deletions website/README.md
@@ -36,7 +36,7 @@ To contribute to the website, make sure you have the following setup and install

1. Node 16: if you are on windows, you can [download node from their website](https://nodejs.org/en/download/releases),
if you are on linux, use [NVM](https://github.com/nvm-sh/nvm) (Once installed, run `nvm use 16`)
-1. [Docker](https://www.docker.com/): We use docker to simplify running dependent services.
+1. [Docker](https://www.docker.com/): We use Docker to simplify running dependent services.

### Getting everything up and running

@@ -48,11 +48,11 @@ If you're doing active development we suggest the following workflow:
- If you want to work on the chat api, you need to run the inference profile as well. Your new command would look
like: `docker compose --profile frontend-dev --profile inference up --build --attach-dependencies`
- See [FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend) if you face any
-docker problems.
+Docker problems.
- Leave this running in the background and continue:
1. Open another terminal tab, navigate to `${OPEN_ASSISTANT_ROOT/website`.
1. Run `npm ci`
-1. Run `npx prisma db push` (This is also needed when you restart the docker stack from scratch).
+1. Run `npx prisma db push` (This is also needed when you restart the Docker stack from scratch).
1. Run `npm run dev`. Now the website is up and running locally at `http://localhost:3000`.
1. To create an account, login via the user using email authentication and navigate to `http://localhost:1080`. Check
the email listed and click the log in link. You're now logged in and authenticated.
@@ -63,7 +63,7 @@ If you're doing active development we suggest the following workflow:
You can use the debug credentials provider to log in without fancy emails or OAuth.

1. This feature is automatically on in development mode, i.e. when you run `npm run dev`. In case you want to do the
-same with a production build (for example, the docker image), then run the website with environment variable
+same with a production build (for example, the Docker image), then run the website with environment variable
`DEBUG_LOGIN=true`.
1. Use the `Login` button in the top right to go to the login page.
1. You should see a section for debug credentials. Enter any username you wish, you will be logged in as that user.
2 changes: 1 addition & 1 deletion website/cypress/README.md
@@ -1,6 +1,6 @@
# Component and e2e testing with Cypress

-[Cypress](https://www.cypress.io/) is used for both component- and end-to-end testing. Below there's a few examples for
+[Cypress](https://www.cypress.io/) is used for both component and end-to-end testing. Below there's a few examples for
the context of this site. To learn more, the
[Cypress documentation](https://docs.cypress.io/guides/getting-started/opening-the-app) has it all.

Expand Down