Distributed RL batch training on Azure error #101

Open
wonjoonSeol opened this issue Apr 4, 2019 · 1 comment


wonjoonSeol commented Apr 4, 2019

It is very difficult to reproduce the result shown in the paper by following the steps in the tutorial.

Issue 1: Unable to run jobs on Azure using LaunchTrainingJob.ipynb

[Screenshot: azure error2]

Running the LaunchTrainingJob notebook results in:
TaskSchedulingConstraintFailed Reason: The user used to run the task is not found

We specified batch_job_user_name in the JSON, but the task still fails with this error.
The problem goes away once I change the task's user identity to Task user (Admin).
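
For reference, the Task user (Admin) setting appears to correspond to an auto-user with admin elevation in the task's `userIdentity`. The following is a minimal sketch, assuming the task is submitted as a JSON body using the Batch REST field names; the helper `with_admin_auto_user` and the sample task are hypothetical and not part of the tutorial.

```python
import copy
import json


def with_admin_auto_user(task_body: dict) -> dict:
    """Return a copy of a task body that runs as an admin auto-user."""
    patched = copy.deepcopy(task_body)
    # Replace whatever identity was configured with an auto-user. Batch creates
    # this account itself, so it does not need to exist on the pool beforehand.
    patched["userIdentity"] = {
        "autoUser": {"scope": "task", "elevationLevel": "admin"}
    }
    return patched


# Hypothetical task body, used only to show the resulting JSON.
task = {"id": "agent-task-0", "commandLine": "cmd /c echo hello"}
print(json.dumps(with_admin_auto_user(task), indent=2))
```

The helper copies the input rather than mutating it, so the original task definition can still be compared against the patched one while debugging.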

[Screenshot: azure error]

After fixing that, I get a different error: The specified command program is not found

CommandLine: call C:\\prereq\\mount.bat && C:\\ProgramData\\Anaconda3\\Scripts\\activate.bat py36 && python -u Z:\\scripts_downpour\\app\\distributed_agent.py data_dir=Z: role=agent max_epoch_runtime_sec=30 per_iter_epsilon_reduction=0.003000 min_epsilon=0.100000 batch_size=32 replay_memory_size=2000 experiment_name=distributed_rl_75726dee-3f90-41e4-8657-3f7ae8dc924d weights_path=Z:\data\pretrain_model_weights.h5 train_conv_layers=false
Message: The system cannot find the file specified.

Notice that weights_path=Z:\data\pretrain_model_weights.h5 (generated by the code) uses single backslashes, unlike the other paths, which have the extra escape character '\'. I tried adding it there too, but I still get the same error.
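
One way to rule out inconsistent escaping is to build every path in one place so that all separators are single backslashes. This is a minimal sketch, assuming the Z: mount and the file layout shown in the command line above; `build_agent_command` is a hypothetical helper, and only a subset of the original arguments is included. It may not fix the "file not found" error, since adding the extra backslash by hand did not help either.

```python
from pathlib import PureWindowsPath


def build_agent_command(data_dir: str, experiment_name: str) -> str:
    """Assemble the agent command line with uniform single-backslash paths."""
    root = PureWindowsPath(data_dir + "\\")  # "Z:" becomes the drive root "Z:\"
    script = root / "scripts_downpour" / "app" / "distributed_agent.py"
    weights = root / "data" / "pretrain_model_weights.h5"
    parts = [
        # Raw strings keep each backslash literal, so nothing gets doubled.
        r"call C:\prereq\mount.bat",
        r"C:\ProgramData\Anaconda3\Scripts\activate.bat py36",
        f"python -u {script} data_dir={data_dir} role=agent "
        f"experiment_name={experiment_name} weights_path={weights}",
    ]
    return " && ".join(parts)


cmd = build_agent_command("Z:", "distributed_rl_example")
print(cmd)
```

`PureWindowsPath` only manipulates strings, so the sketch runs on any operating system and never touches the file system.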

I honestly don't think anyone who starred this repo has actually run the code themselves.
Issue 1 is the most critical because it stops me from running the training job at all.

Issue 2: SetupCluster.ipynb

This one is just a bug report.

with open('Template\\pool.json.template', 'r') as f:
    pool_config = f.read()
    
pool_config = pool_config\
                .replace('{batch_pool_name}', NOTEBOOK_CONFIG['batch_pool_name'])\
                .replace('{subscription_id}', NOTEBOOK_CONFIG['subscription_id'])\
                .replace('{resource_group_name}', NOTEBOOK_CONFIG['resource_group_name'])\
                .replace('{storage_account_name}', NOTEBOOK_CONFIG['storage_account_name'])\
                .replace('{batch_job_user_name}', NOTEBOOK_CONFIG['batch_job_user_name'])\
                .replace('{batch_job_user_password}', NOTEBOOK_CONFIG['batch_job_user_password'])\
                .replace('{batch_pool_size}', str(NOTEBOOK_CONFIG['batch_pool_size']))

with open('pool.json', 'w') as f:
    f.write(pool_config)
    
create_cmd = 'powershell.exe ".\ProvisionCluster.ps1 -subscriptionId {0} -resourceGroupName {1} -batchAccountName {2}"'\
    .format(NOTEBOOK_CONFIG['subscription_id'], NOTEBOOK_CONFIG['resource_group_name'], NOTEBOOK_CONFIG['batch_account_name'])
    
print('Executing command. Check the terminal output for authentication instructions.')

os.system(create_cmd)

This code no longer works because the JSON file it creates no longer contains enough information to create a pool on the latest Azure cloud.

I created a pool manually using Batch Explorer and noticed that the pool has to be created without a 'Start Task'; the Start Task should be set separately after the pool exists. Otherwise, you end up with this error:

InvalidPropertyValue
The value provided for one of the properties in the request body is invalid.

PropertyName: dataDisks
Reason: Only one of dataDisks and virtualMachineImageId can be specified
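
The two-step workaround can be sketched as splitting one pool definition into a create body without a start task and a follow-up update body that carries only the start task. This assumes the pool is described as a dict using the Batch REST field name `startTask`; `split_start_task` and the sample values are hypothetical.

```python
import copy


def split_start_task(pool_config: dict) -> tuple:
    """Split a pool definition into (create_body, patch_body)."""
    create_body = copy.deepcopy(pool_config)
    # Remove the start task from the body used to create the pool.
    start_task = create_body.pop("startTask", None)
    # The second body holds only the start task, to be applied afterwards.
    patch_body = {"startTask": start_task} if start_task is not None else {}
    return create_body, patch_body


# Hypothetical pool definition, used only to exercise the helper.
pool = {
    "id": "example-pool",
    "vmSize": "STANDARD_NV6",
    "startTask": {"commandLine": "cmd /c echo setup", "waitForSuccess": True},
}
create_body, patch_body = split_start_task(pool)
print(sorted(create_body))
print(patch_body)
```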

LaunchTrainingJob.ipynb

Outdated call signature in the code:
batch_client = batch.BatchServiceClient(batch_credentials, base_url=NOTEBOOK_CONFIG['batch_account_url'])

Should be:

batch_client = batch.BatchServiceClient(credentials=batch_credentials, batch_url=NOTEBOOK_CONFIG['batch_account_url'])
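
If the notebook has to run against both SDK generations, the keyword can be chosen at run time by inspecting the constructor. This is a minimal sketch, assuming the only difference is the `base_url` versus `batch_url` keyword; `make_batch_client` is a hypothetical helper, and the two classes are stand-ins rather than the real `BatchServiceClient`.

```python
import inspect


def make_batch_client(client_cls, credentials, url):
    """Instantiate client_cls with whichever URL keyword its constructor accepts."""
    params = inspect.signature(client_cls.__init__).parameters
    for name in ("batch_url", "base_url"):
        if name in params:
            return client_cls(credentials=credentials, **{name: url})
    raise TypeError("constructor accepts neither batch_url nor base_url")


# Stand-ins for the two SDK generations, used only to exercise the helper.
class OldStyleClient:
    def __init__(self, credentials, base_url=None):
        self.url = base_url


class NewStyleClient:
    def __init__(self, credentials, batch_url):
        self.url = batch_url


print(make_batch_client(OldStyleClient, "creds", "https://example.invalid").url)
print(make_batch_client(NewStyleClient, "creds", "https://example.invalid").url)
```

Pinning the SDK version the tutorial was written against would be the simpler alternative; the shim is only useful when the installed version is not under your control.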

Similarly,

job = batch.models.JobAddParameter(
        job_id,
        batch.models.PoolInformation(pool_id=NOTEBOOK_CONFIG['batch_pool_name']))

batch_client.job.add(job)

Should be:

job = batch.models.JobAddParameter(
        id=job_id,
        pool_info=batch.models.PoolInformation(pool_id=NOTEBOOK_CONFIG['batch_pool_name']))
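
The positional call presumably fails because the newer model constructors accept keyword arguments only. The stand-in class below (not the real `JobAddParameter`) illustrates that behavior in plain Python; the job and pool names are placeholders.

```python
class JobAddParameterStandIn:
    """Stand-in with a keyword-only constructor; not the real SDK class."""

    def __init__(self, *, id, pool_info):
        self.id = id
        self.pool_info = pool_info


try:
    # Positional arguments, as in the tutorial's original call.
    JobAddParameterStandIn("job-1", "pool-1")
    positional_ok = True
except TypeError:
    positional_ok = False

# Keyword arguments, as in the corrected call.
job = JobAddParameterStandIn(id="job-1", pool_info="pool-1")
print(positional_ok, job.id, job.pool_info)
```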

Miscellaneous

  • Be careful when choosing the Azure region. Few regions offer NV6, and trying to create a pool in a region without it causes an error. (I am currently using US East.)
  • Make sure to upgrade your free-trial subscription to pay-as-you-go and request a higher Batch quota via a support ticket. The free-trial subscription doesn't offer NV6.

@mitchellspryn (Contributor)

Thanks for the report. This worked a year ago when we initially wrote the tutorial; it looks like the API has changed a bit from under us. We'll look at updating it.
