Sagemaker 'NoneType object' issue with data in 'walkthrough-classification-mxnet-sagemaker' example #90

Open
joshwapiano opened this issue Jul 8, 2018 · 8 comments

Comments

@joshwapiano

I've been following the walkthrough found here (albeit with a smaller bounding box) and have started a SageMaker notebook instance. The data.npz file is sitting in the sagemaker folder, and I have no problem reading it when I run the relevant sections of mx_lenet_sagemaker.py in a new notebook on the instance. However, when I run the second cell of SageMaker_mx-lenet I hit the following error:

ValueError: Error training sagemaker-mxnet-2018-07-08-18-12-13-217: Failed Reason: AlgorithmError: uncaught exception during training: 'NoneType' object has no attribute 'read'
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 36, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/mxnet_container/train.py", line 191, in train
    model = user_module.train(**kwargs_to_pass)
  File "/opt/ml/code/mx_lenet_sagemaker.py", line 92, in train
    train_iter, val_iter = prep_data(data_path)
  File "/opt/ml/code/mx_lenet_sagemaker.py", line 14, in prep_data
    data = np.load(find_file(data_path, 'data.npz'))
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 402, in load
    magic = fid.read(N)
AttributeError: 'NoneType' object has no attribute 'read'

After several hours of trying different fixes I've had little to no luck debugging. Could you check the example to make sure it still runs when you attempt it?
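
The last frames of the traceback show np.load being handed None, which suggests that find_file did not locate data.npz under data_path inside the training container. The implementation of find_file is not shown in this thread, so the following is only a minimal sketch, assuming it walks a directory tree looking for a file name. `find_file_or_raise` is a hypothetical name; it raises instead of returning None so that the error names the directory that was searched.

```python
import os


def find_file_or_raise(root, name):
    """Return the first path under root whose file name equals name.

    Hypothetical stand-in for the walkthrough's find_file helper: it raises
    instead of returning None, so the failure reports what was searched.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            return os.path.join(dirpath, name)
    raise IOError("{} not found under {}".format(name, root))
```

With a helper like this, a missing file would surface as "data.npz not found under ..." rather than as an AttributeError deep inside NumPy.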

@joshwapiano joshwapiano changed the title Sagemaker issue with 'walkthrough-classification-mxnet-sagemaker' example Sagemaker 'NoneType object' issue with data in 'walkthrough-classification-mxnet-sagemaker' example Jul 8, 2018
@Geoyi
Contributor

Geoyi commented Jul 9, 2018

@joshwapiano, my guess is that something went wrong with the data (it is None), and there are several possible causes.

  • Problem 1: missing tiles

Can you go back to the tiles folder that Label Maker created and check whether every tile was written correctly? We've seen that if the Mapbox imagery API token is not set up correctly, the image tiles can be blank.

  • Problem 2: data.npz is not in the directory where the model expects to read it

For this one, can you run the function prep_data (together with find_file) separately in a notebook cell and check that it prints the correct input shapes? The shapes should be ((1831, 3, 256, 256), (458, 3, 256, 256), (1831,), (458,)) for X_train, X_test, Y_train and Y_test.

  • Problem 3: S3 bucket

This applies if the first two turn out not to be the problem.
You did not mention setting up an S3 bucket, so I'm guessing that could cause a problem too. I remember saving data.npz in an S3 bucket and passing the S3 URL to mxnet_estimator.fit in the second cell, which is where you hit the error above. Reading data.npz from the root directory was not a problem back then, but the AWS team keeps updating SageMaker, so I'm not sure whether you are now expected to feed data through S3. For details see here.
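
For the shape check described under the second problem, a quick way to see what is inside data.npz without running the whole training script is to list each stored array and its shape. This is a minimal sketch; `describe_npz` is a hypothetical helper name and not part of Label Maker.

```python
import numpy as np


def describe_npz(path):
    """Return {array_name: shape} for every array stored in an .npz file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}
```

Calling `describe_npz('data.npz')` in a notebook cell shows the shapes as stored on disk, before any reshaping done by prep_data.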

Let me know if we can help further.

@joshwapiano
Author

@Geoyi Thanks for getting back to me. I had written a much longer response, but for some reason GitHub has not saved this comment.

Essentially I have investigated both problem 1 and problem 2 and both give the expected results.

I have a feeling that the issue is with the S3 bucket and have tried multiple approaches, none of which has been successful. Would you be able to run the example on your own SageMaker notebook instance and, if it works as expected, share the syntax/approach used in the mxnet_estimator.fit argument, along with any other changes you make?

Many thanks

@mapmeld

mapmeld commented Jul 22, 2018

I'm getting this same problem.

  • re problem 1: I downloaded the jpg files from label_maker and there are satellite images. Some are ocean tiles. Could this be affecting the model?

  • re problem 2: I have used S3, but only by copying examples, so I am unfamiliar with the basics, with your configuration, and with S3 + SageMaker. Would a good S3 URL look like s3://sagemaker-data-npz/data.npz, or just s3://sagemaker-data-npz, or do these both look wrong to you? Was I supposed to do some previous work on IAM keys or make the bucket public?

  • I have used notebooks before. Where would I run or insert prep_data(find_file( ... )) to test that part correctly?
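
On the two URL forms in the question above: the first names a single object and the second names only the bucket. boto3's `download_file` takes the bucket name and the object key as separate arguments, so it can help to see how a URL splits into those two parts. A small Python 3 sketch follows; `split_s3_url` is a hypothetical helper, not a boto3 or SageMaker function.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key); key is '' for a bare bucket."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URL: {!r}".format(url))
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_url("s3://sagemaker-data-npz/data.npz"))  # ('sagemaker-data-npz', 'data.npz')
print(split_s3_url("s3://sagemaker-data-npz"))           # ('sagemaker-data-npz', '')
```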

@Geoyi
Contributor

Geoyi commented Jul 23, 2018

@joshwapiano and @mapmeld, I will spin up the SageMaker notebook and check this next week. Let me know if you solve the problem before I get back to you. Sorry for the delay.

@joshwapiano
Author

@Geoyi thanks for getting back to us - still not having any luck producing the correct data format/feed for the sagemaker notebook. I think they have made changes to sagemaker/mxnet without providing adequate documentation. Good luck, looking forward to hearing from you.

@Geoyi
Contributor

Geoyi commented Aug 1, 2018

Phewwww, I finally solved the problem; it took me the whole morning today, @joshwapiano and @mapmeld.
You're right about the S3 bucket and prep_data(find_file( ... )), @mapmeld. I deleted the find_file( ... ) function. And @joshwapiano, the SageMaker team doesn't do a good job of documenting their work.

Additional things I did:

  • The S3 bucket needs to have 'sagemaker' in its name;
  • Create the notebook using conda-mxnet_p27 instead of their Python 3.6 kernel. Apparently MXNet only works with Python 2.7 and 3.5, so their conda-mxnet_p36 won't work; it gave me a "no sagemaker.mxnet module" error.
  • The 'NoneType' object has no attribute 'read' error you originally got occurred because the data could not be passed to the model correctly from the S3 bucket. Other people have recently had this problem when reading data from a private AWS S3 bucket.

Here are the scripts to replace the scripts in this notebook:

%%file mx_lenet_sagemaker.py
### use this to replace the first cell


import logging
from os import path as op
import os

import mxnet as mx
import numpy as np
import boto3


batch_size = 64
num_cpus = 0
num_gpus = 1

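s3_url = "Your_s3_bucket_URL"
s3_client = boto3.client('s3')
# Download data.npz from the bucket into the working directory.
# Replace 'Bucket-name' with the name of your bucket (the name only, not the s3:// URL).
s3_client.download_file('Bucket-name', "data.npz", "data.npz")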

def prep_data():
    """
    Load data.npz from the working directory and convert the numpy arrays
    to MXNet NDArray iterators.
    Returns
    -------
    train_iter, val_iter: iterators over the training and validation data.
    """
    data_file = np.load(op.join(os.getcwd(), 'data.npz'))
    x_train = data_file['x_train']
    y_train = data_file['y_train'][:, :1]  # keep only the first label column, shape (N, 1)
    x_test = data_file['x_test']
    y_test = data_file['y_test'][:, :1]
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    print(x_train.shape, x_train.mean())
    # mean and standard deviation over every axis except the last one
    img_mean = np.mean(x_train, axis=(0, 1, 2))
    img_std = np.std(x_train, axis=(0, 1, 2))

    # normalize both sets with the training statistics
    x_train -= img_mean
    x_train /= img_std
    x_test -= img_mean
    x_test /= img_std

    img_rows = 256
    img_cols = 256

    x_train = x_train.reshape(x_train.shape[0], 3, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 3, img_rows, img_cols)
    y_train = y_train.reshape(y_train.shape[0], )  # flatten labels to (N,) instead of (N, 1)
    y_test = y_test.reshape(y_test.shape[0], )
    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

    train_iter = mx.io.NDArrayIter(x_train, y_train, batch_size, shuffle=True)
    val_iter = mx.io.NDArrayIter(x_test, y_test, batch_size)

    return train_iter, val_iter

def mx_lenet():
    """Build a LeNet-style convolutional neural net with three conv layers using MXNet."""
    data = mx.sym.var('data')
    data_dp = mx.symbol.Dropout(data, p=0.2)  # drop 20% of the input during training
    # first conv layer
    conv1 = mx.sym.Convolution(data=data_dp, kernel=(5, 5), num_filter=20)
    tanh1 = mx.sym.Activation(data=conv1, act_type="tanh")
    pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2, 2), stride=(2, 2))
    # second conv layer
    conv2 = mx.sym.Convolution(data=pool1, kernel=(5, 5), num_filter=50)
    tanh2 = mx.sym.Activation(data=conv2, act_type="tanh")
    pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2, 2), stride=(2, 2))

    # third conv layer: chained onto pool2 (the earlier wiring reused pool1,
    # conv2 and tanh2, which left conv3 and tanh3 unused)
    conv3 = mx.sym.Convolution(data=pool2, kernel=(5, 5), num_filter=50)
    tanh3 = mx.sym.Activation(data=conv3, act_type="tanh")
    pool3 = mx.sym.Pooling(data=tanh3, pool_type="max", kernel=(2, 2), stride=(2, 2))

    # first fully connected layer
    flatten = mx.sym.flatten(data=pool3)
    fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=500)
    tanh4 = mx.sym.Activation(data=fc1, act_type="tanh")
    # second fully connected layer: two output classes
    fc2 = mx.sym.FullyConnected(data=tanh4, num_hidden=2)
    # softmax loss
    return mx.sym.SoftmaxOutput(data=fc2, name='softmax')


def train(num_cpus, num_gpus, **kwargs):
    """
    Train the image classification neural net.
    Parameters
    ----------
    num_cpus: 0 when training on an AWS GPU instance (with num_gpus = 1), and the reverse on a CPU instance.
    num_gpus: 1 when training on an AWS GPU instance, 0 on a CPU instance.
    """
    train_iter, val_iter = prep_data()
    lenet = mx_lenet()
    lenet_model = mx.mod.Module(
        symbol=lenet,
        context=get_train_context(num_cpus, num_gpus))
    logging.getLogger().setLevel(logging.DEBUG)
    lenet_model.fit(train_iter,
                    eval_data=val_iter,
                    optimizer='sgd',
                    optimizer_params={'learning_rate': 0.1},
                    eval_metric='acc',
                    batch_end_callback=mx.callback.Speedometer(batch_size, 16),
                    num_epoch=100)
    return lenet_model


def get_train_context(num_cpus, num_gpus):
    """
    Define the model training context.
    Parameters
    ----------
    num_cpus: number of CPUs on the training instance.
    num_gpus: number of GPUs on the training instance; train on GPU if greater than 0.
    """
    if num_gpus > 0:
        print("Training on GPU ({} available)".format(num_gpus))
        return mx.gpu()
    print("Training on CPU ({} available)".format(num_cpus))
    return mx.cpu()
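
One detail of prep_data may be worth checking. The per-channel mean is taken over axes (0, 1, 2), which suggests x_train is stored channels-last, with shape (N, 256, 256, 3). If that assumption holds, reshape to (N, 3, 256, 256) yields the right shape but does not move the channel axis. The toy example below, which uses made-up data rather than anything from this thread, illustrates the difference between reshape and transpose.

```python
import numpy as np

# Tiny stand-in batch: 1 image, 2x2 pixels, 3 channels, stored channels-last.
x = np.arange(12).reshape(1, 2, 2, 3)

reshaped = x.reshape(1, 3, 2, 2)             # reinterprets the flat buffer
transposed = np.transpose(x, (0, 3, 1, 2))   # moves the channel axis forward

print(x[0, :, :, 0].ravel())      # channel 0 of the original image: [0 3 6 9]
print(transposed[0, 0].ravel())   # [0 3 6 9], channel 0 is preserved
print(reshaped[0, 0].ravel())     # [0 1 2 3], values from all three channels
```

Both results have shape (1, 3, 2, 2), so the shape printout alone cannot tell them apart. Whether this matters here depends on how data.npz was written, which this thread does not show.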

And use this for the second cell:

%%time
from sagemaker.mxnet import MXNet
from sagemaker import get_execution_role

s3_url = "Your_s3_bucket_URL"
mxnet_estimator = MXNet("mx_lenet_sagemaker.py",
                        role=get_execution_role(),
                        output_path=s3_url,
                        train_instance_type="ml.p2.xlarge",
                        train_instance_count=1)

mxnet_estimator.fit(s3_url)
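
As a sanity check on the feature-map sizes in mx_lenet, the spatial size after each conv + pool stage can be worked out by hand. The sketch below assumes MXNet's defaults of stride 1 and no padding for Convolution and the 'valid' convention for Pooling, none of which the script overrides. `conv_pool_sizes` is a hypothetical helper written for this illustration.

```python
def conv_pool_sizes(size, stages, conv_kernel=5, pool_kernel=2, pool_stride=2):
    """Spatial size after each conv + max-pool stage (stride-1, unpadded conv)."""
    sizes = []
    for _ in range(stages):
        size = size - conv_kernel + 1                   # unpadded 5x5 convolution
        size = (size - pool_kernel) // pool_stride + 1  # 2x2 max pooling, stride 2
        sizes.append(size)
    return sizes


print(conv_pool_sizes(256, 3))  # [126, 61, 28]
```

For 256 x 256 tiles, the flattened input to the first fully connected layer therefore has 50 * 61 * 61 = 186050 values after two stages and 50 * 28 * 28 = 39200 after three.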

@mapmeld

mapmeld commented Aug 5, 2018

@Geoyi this works for me - thank you so much for fixing this!

@joshwapiano
Author

@Geoyi Many thanks for providing this! Looking forward to trying it out. Will let you know how I get on!
I've come across an issue with the labelling that label-maker is producing, which I'll raise in a separate issue.
