README Fixing #3481

Open. Wants to merge 8 commits into base: main.
14 changes: 7 additions & 7 deletions backend/README.md
@@ -11,7 +11,7 @@ In root directory, run
a database. The default settings are already configured to connect to the
database at `localhost:5432`. (See
[FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend)
-if you face any docker problems).
+if you face any Docker problems).

> **Note:** when running on MacOS with an M1 chip you have to use:
> `DB_PLATFORM=linux/x86_64 docker compose ...`
@@ -21,7 +21,7 @@ the `.python-version` in the project root directory.

### Python Packages

-Next, to install all requirements, You can run
+Next, to install all requirements, you can run:

1. `pip install -r backend/requirements.txt`
2. `pip install -e ./oasst-shared/.`
@@ -58,7 +58,7 @@ information.
Once you have successfully started the backend server, you can access the
default api docs at `localhost:8080/docs`. If you need to update the exported
openapi.json in the docs/ folder you can run below command to `wget` them from
-the relevant local fastapi endpoint. This will enable anyone to just see API
+the relevant local FastAPI endpoint. This will enable anyone to just see API
docs via something like
[Swagger.io](https://editor.swagger.io/?url=https://raw.githubusercontent.com/LAION-AI/Open-Assistant/main/docs/docs/api/openapi.json)
without having to actually set up and run a development backend.
@@ -68,16 +68,16 @@ without having to actually set up and run a development backend.
```
wget localhost:8080/api/v1/openapi.json -O docs/docs/api/backend-openapi.json
```

-Note: The api docs should be automatically updated by the
+Note: The API docs should be automatically updated by the
`test-api-contract.yaml` workflow. (TODO)

## Running Celery Worker(s) for API and periodic tasks

-Celery workers are used for Huggingface API calls like toxicity and feature
+Celery workers are used for HuggingFace API calls like toxicity and feature
extraction. Celery Beat along with worker is used for periodic tasks like user
streak update
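
As a rough sketch of the worker-plus-Beat setup described above (the app name, broker URL, and task body below are placeholders for illustration, not the actual backend code):

```python
# Illustrative sketch only: a Celery app with one Beat-scheduled periodic task.
# The broker URL, app name, and task are assumptions, not Open-Assistant's code.
from celery import Celery

app = Celery("oasst_sketch", broker="redis://localhost:6379/0")

@app.task
def update_user_streaks() -> None:
    # Placeholder for a periodic "user streak update" style job.
    pass

app.conf.beat_schedule = {
    "update-user-streaks": {
        "task": update_user_streaks.name,  # registered task name
        "schedule": 60 * 60 * 24,          # run once a day (seconds)
    },
}
```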

-To run APIs locally
+To run APIs locally:

- update HUGGING_FACE_API_KEY in backend/oasst_backend/config.py with the
correct API_KEY
@@ -87,7 +87,7 @@ To run APIs locally
- run start_worker.sh in backend dir
- to see logs , use `tail -f celery.log` and `tail -f celery.beat.log`

-In CI
+In CI:

- set `DEBUG_SKIP_TOXICITY_CALCULATION=False` and
`DEBUG_SKIP_EMBEDDING_COMPUTATION=False` in docker-compose.yaml
2 changes: 1 addition & 1 deletion copilot/README.md
@@ -25,7 +25,7 @@ Replace with a proper domain to setup SSL certificates.
```
copilot env deploy
```

-This will create a variety of aws roles and services needed for deployment.
+This will create a variety of AWS roles and services needed for deployment.

```sh
copilot deploy
```
@@ -3,7 +3,7 @@ from an annotated version of the code-search-net dataset. The annotated version
of code-search-net dataset can be found
[here](https://huggingface.co/datasets/Nan-Do/code-search-net-python).

-The dataset contains around 450000 python annotated functions. The dataset is
+The dataset contains around 450000 Python annotated functions. The dataset is
split into two blocks, one in which the task is starting from the annotated
summary to generate an instruction to generate the code as a response, and
another one in which the expected response is to generate a description of the
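
For reference, a minimal sketch of loading the annotated dataset with the `datasets` library (the split name and column layout are assumptions and may differ from the actual dataset):

```python
# Sketch: load the annotated code-search-net-python dataset; the split and
# column names are assumptions for illustration, not the confirmed schema.
from datasets import load_dataset

ds = load_dataset("Nan-Do/code-search-net-python", split="train")
print(ds)      # inspect the available columns
print(ds[0])   # look at one annotated function record
```
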
8 changes: 4 additions & 4 deletions data/datasets/poetry_instruction/README.md
@@ -10,16 +10,16 @@ Languages English

Dataset Structure This dataset follows the OA format, which is:

-INSTRUCTION (string): The user asks for a poem (from a variety of premade
+- NSTRUCTION (string): The user asks for a poem (from a variety of premade
Collaborator comment: Should be INSTRUCTION

prompts) with topics (tags). If the given poem has no tags, the user asks for a
poem on it's own.

-RESPONSE (string): The assistant replies with the poem and title (from a variety
+- RESPONSE (string): The assistant replies with the poem and title (from a variety
of premade prompts).

-SOURCE (string): The source is PoetryFoundation.org and the poet's name.
+- SOURCE (string): The source is PoetryFoundation.org and the poet's name.

-METADATA (JSON String): {"author": "author of the original poem", "title":
+- METADATA (JSON String): {"author": "author of the original poem", "title":
andrewm4894 marked this conversation as resolved.
"title of the poem", "tags": "tags from poetry foundation."}

Preparing the Dataset The dataset can be created with prepare.py. Make sure to
2 changes: 1 addition & 1 deletion data/datasets/prosocial_confessions/README.md
@@ -6,7 +6,7 @@
- A [classifier](https://huggingface.co/shahules786/prosocial-classifier)
trained on prosocial dialog dataset is used for pseudo labeling.
- More information on dataset can be found
-[here](https://huggingface.co/datasets/shahules786/prosocial-confessions)
+[here](https://huggingface.co/datasets/shahules786/prosocial-confessions).
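
As an illustration of the pseudo-labeling step, the classifier could plausibly be used via the standard `transformers` pipeline (whether the model is compatible with this task type is an assumption):

```python
# Sketch of pseudo-labeling a confession with the prosocial classifier.
# Assumes the model works with the text-classification pipeline; the returned
# labels and scores are whatever the model defines, not shown here.
from transformers import pipeline

classifier = pipeline("text-classification", model="shahules786/prosocial-classifier")
print(classifier("I read my roommate's diary without asking."))
```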

## Example

2 changes: 1 addition & 1 deletion data/datasets/reasoning_gsm_qna_oa/README.MD
@@ -5,7 +5,7 @@
License: MIT. Contains Parquet of a list of instructions and answers (English
only). Reasoning, logic and programming.

-Each row consists of
+Each row consists of:

- INSTRUCTION
- RESPONSE
4 changes: 2 additions & 2 deletions data/datasets/recipes/README.md
@@ -2,7 +2,7 @@

Here we convert several existing recipe ingredient and instructions datasets
into dialogue. Each notebook processes a different dataset and creates a final
-dataset to be uploaded to huggingface.
+dataset to be uploaded to HuggingFace.

## tasty_recipes.ipynb

@@ -14,7 +14,7 @@ dialogue using a preset list of user prompt templates.
### Some ideas for extending this dataset

This dataset is nicely structured, and the ingredients section includes the
-quantities and units separated out. Some, but not all already include a
+quantities and units separated out. Somehow, but not all already include a
Collaborator comment: This is not correct - "Some" is right

primary_unit (US) and metric_unit. We could find all recipes with both units and
generate dialogue for the prompt 'convert the ingredients into metric', 'what
are the ingredients in UK measurements'? etc..
2 changes: 1 addition & 1 deletion docker/grafana/README.md
@@ -9,6 +9,6 @@ This folder contains various configuration files for Grafana.
Grafana where some pre-configured dashboards live.
- [`./dashboards/fastapi-backend.json`](./dashboards/fastapi-backend.json) - A
json representation of a saved Grafana dashboard focusing on some high level
-api endpoint metrics etc.
+API endpoint metrics etc.
- [`./datasources/datasource.yml`](./datasources/datasource.yml) - A config file
to set up Grafana to read from the local Prometheus source.
4 changes: 2 additions & 2 deletions inference/README.md
@@ -75,8 +75,8 @@ Navigate to http://0.0.0.0:8089/ to view the locust UI.

## API Docs

-To update the api docs, once the inference server is running run below command
-to download the inference openapi json into the relevant folder under `/docs`:
+To update the API docs, once the inference server is running run below command
+to download the inference OpenAPI json into the relevant folder under `/docs`:

```bash
wget localhost:8000/openapi.json -O docs/docs/api/inference-openapi.json
```
8 changes: 4 additions & 4 deletions model/README.md
@@ -83,7 +83,7 @@ To change the model used, i.e. larger pythia version create a new config in
`EleutherAI/pythia-{size}-deduped`. Larger models will probably need to also
adjust the `--learning_rate` and `--per_device_train_batch_size` flags.

-4. Get SFT trained model
+4. Get SFT trained model.

```bash
# choose a specific checkpoint
```
@@ -95,14 +95,14 @@ export SFT_MODEL=$MODEL_PATH/sft_model/$(ls -t $MODEL_PATH/sft_model/ | head -n

### RM Training

-5. Train the reward model
+5. Train the reward model.

```bash
cd ../reward/instructor
python trainer.py configs/deberta-v3-base.yml --output_dir $MODEL_PATH/reward_model
```

-6. Get RM trained model
+6. Get RM trained model.

```bash
# choose a specific checkpoint
```
@@ -114,7 +114,7 @@ export REWARD_MODEL=$MODEL_PATH/reward_model/$(ls -t $MODEL_PATH/reward_model/ |

### RL Training

-7. Train the RL agent
+7. Train the RL agent.

```bash
cd ../../model_training
```
2 changes: 1 addition & 1 deletion notebooks/closed-book-qa/README.md
@@ -1,6 +1,6 @@
# Generate Topics, Questions, and Answers from a paragraph of text

-This python code can be used to generate topics, questions, and answers from a
+This Python code can be used to generate topics, questions, and answers from a
paragraph of text. This is a good way to generate ground truth knowledge about a
topic from a trusted source.

4 changes: 2 additions & 2 deletions oasst-data/README.md
@@ -8,7 +8,7 @@ If you got the exception `ModuleNotFoundError: No module named 'oasst_data'` you
first need to install the `oasst_data` package:

Run `pip install -e .` in the `oasst-data/` directory of the Open-Assistant
-repository to install the `oasst_data` python package in editable mode.
+repository to install the `oasst_data` Python package in editable mode.

## Reading Open-Assistant Export Files

@@ -41,7 +41,7 @@ which is used to load Open-Assistant export data for supervised fine-tuning
(training) of our language models.

You can also load jsonl data completely without dependencies to `oasst_data`
-solely with standard python libraries. In this case the json objects are loaded
+solely with standard Python libraries. In this case the json objects are loaded
as nested dicts which need to be 'parsed' manually by you:

```python
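# Illustrative sketch (not the README's original snippet, which is truncated in
# this diff view): read an export .jsonl file with only the standard library;
# each line is one JSON object that becomes a nested dict you walk yourself.
# The file name below is a placeholder.
import json

records = []
with open("oasst_export.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))  # one nested dict per line

print(f"loaded {len(records)} records")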
```
2 changes: 1 addition & 1 deletion scripts/data-collection/twitter/README.md
@@ -75,6 +75,6 @@ conversation, or at least as a prompt with replies.
- Write script that matches the original tweets and their text with the archive
data to create the prompt/reply dataset. (Optional)
- Decide on final output format and storage options for the dataset. Currently
-in JSONL with tree / node architecture as python dicts which is acceptable I
+in JSONL with tree / node architecture as Python dicts which is acceptable I
believe.
- Alternatively: Store processed tweets into DB or alternative option.(Optional)
8 changes: 4 additions & 4 deletions website/README.md
@@ -36,7 +36,7 @@ To contribute to the website, make sure you have the following setup and install

1. Node 16: if you are on windows, you can [download node from their website](https://nodejs.org/en/download/releases),
if you are on linux, use [NVM](https://github.com/nvm-sh/nvm) (Once installed, run `nvm use 16`)
-1. [Docker](https://www.docker.com/): We use docker to simplify running dependent services.
+1. [Docker](https://www.docker.com/): We use Docker to simplify running dependent services.

### Getting everything up and running

@@ -48,11 +48,11 @@ If you're doing active development we suggest the following workflow:
- If you want to work on the chat api, you need to run the inference profile as well. Your new command would look
like: `docker compose --profile frontend-dev --profile inference up --build --attach-dependencies`
- See [FAQ](https://projects.laion.ai/Open-Assistant/docs/faq#enable-dockers-buildkit-backend) if you face any
-docker problems.
+Docker problems.
- Leave this running in the background and continue:
1. Open another terminal tab, navigate to `${OPEN_ASSISTANT_ROOT/website`.
1. Run `npm ci`
-1. Run `npx prisma db push` (This is also needed when you restart the docker stack from scratch).
+1. Run `npx prisma db push` (This is also needed when you restart the Docker stack from scratch).
1. Run `npm run dev`. Now the website is up and running locally at `http://localhost:3000`.
1. To create an account, login via the user using email authentication and navigate to `http://localhost:1080`. Check
the email listed and click the log in link. You're now logged in and authenticated.
@@ -63,7 +63,7 @@ If you're doing active development we suggest the following workflow:
You can use the debug credentials provider to log in without fancy emails or OAuth.

1. This feature is automatically on in development mode, i.e. when you run `npm run dev`. In case you want to do the
-same with a production build (for example, the docker image), then run the website with environment variable
+same with a production build (for example, the Docker image), then run the website with environment variable
`DEBUG_LOGIN=true`.
1. Use the `Login` button in the top right to go to the login page.
1. You should see a section for debug credentials. Enter any username you wish, you will be logged in as that user.