Mitchish kempner #352

Open
wants to merge 7 commits into main

Conversation

ibeltagy (Contributor)

No description provided.

@ibeltagy ibeltagy changed the base branch from main to Llama October 31, 2023 23:38
load_path: null

max_duration: 423855 # 2T tokens
global_train_batch_size: 1536
Member

This is putting only 3M tokens into one batch.
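A quick sanity check of the arithmetic behind this comment (the 2048-token sequence length is an assumption; it does not appear in this diff):

```python
# Tokens per optimizer step implied by the config above.
sequence_length = 2048          # assumed; not shown in this diff
global_train_batch_size = 1536  # from the config
max_duration = 423_855          # steps, annotated "# 2T tokens" in the config

tokens_per_batch = global_train_batch_size * sequence_length
total_tokens = tokens_per_batch * max_duration

print(f"tokens per batch: {tokens_per_batch:,}")      # ~3.1M, hence "only 3M"
print(f"tokens over the full run: {total_tokens:,}")  # ~1.3T, short of 2T
```

Under that assumption, 423,855 steps at ~3.1M tokens each falls well short of the 2T-token target in the comment next to `max_duration`.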


activation_checkpointing: by_layer

compile: null
Member

Mitchish has a different config here.

compile:
  fullgraph: false

precision: amp_bf16

max_grad_norm: 1.0

Member

We should explicitly set max_grad_norm_ratio as well, just in case.
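One way to do that, as a sketch in the same YAML format as the config above (whether `null` disables ratio-based clipping depends on the trainer and is an assumption here):

```yaml
max_grad_norm: 1.0
max_grad_norm_ratio: null  # set explicitly so ratio-based clipping is knowingly disabled
```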

--run_name=kempner_mitchish7_${SLURM_JOB_ID} \
--save_folder=/n/holyscratch01/kempner_lab/Lab/checkpoints/${SLURM_JOB_ID}/ \
--data.num_workers=4 \
--device_train_microbatch_size=6 \
Member

This only works because the global batch size happens to be divisible by 6, but it should not be. The fact that it is divisible is itself a bug.
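The constraint at play can be sketched as a small check (the function and parameter names are illustrative, not the trainer's API, and the 32-rank world size is an assumption):

```python
def accumulation_steps(global_batch: int, world_size: int, microbatch: int) -> int:
    """Gradient-accumulation steps per rank, failing loudly when sizes don't divide."""
    if global_batch % world_size != 0:
        raise ValueError("global batch size must divide evenly across ranks")
    per_rank = global_batch // world_size
    if per_rank % microbatch != 0:
        raise ValueError("per-rank batch must be a multiple of the microbatch size")
    return per_rank // microbatch

# With the values in this PR (a world size of 32 is assumed):
steps = accumulation_steps(global_batch=1536, world_size=32, microbatch=6)
print(steps)  # 8 accumulation steps per rank
```

The reviewer's point is that `--device_train_microbatch_size=6` passes this kind of check only by coincidence for this particular global batch size.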

--distribution=block:block \
--kill-on-bad-exit \
scripts/run_with_environment.sh \
$HOME/miniconda3/envs/LLM/bin/python -u scripts/train.py configs/llama7.yaml \
Member

This is referring to the wrong config.

Base automatically changed from Llama to main November 2, 2023 00:34