Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Container localhost does not exist. #16481

Closed
eaplatanios opened this issue Jan 27, 2018 · 13 comments
Closed

Container localhost does not exist. #16481

eaplatanios opened this issue Jan 27, 2018 · 13 comments
Assignees
Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug

Comments

@eaplatanios
Copy link
Contributor

Hi,

I upgraded from 1.5.0-rc1 to the current master branch and I started receiving the following error:

2018-01-27 02:48:38.928667: W tensorflow/core/framework/op_kernel.cc:1201] OP_REQUIRES failed at lookup_table_op.cc:656 : Not found: Container localhost does not exist. (Could not find resource: localhost/hash_table_/Users/anthony/Development/GitHub/symphony-mt/temp/data/iwslt-15/vocab.vi_WHOLE_LINE_LINE_NUMBER)
2018-01-27 02:48:38.928786: W tensorflow/core/framework/op_kernel.cc:1201] OP_REQUIRES failed at iterator_ops.cc:855 : Not found: Container localhost does not exist. (Could not find resource: localhost/hash_table_/Users/anthony/Development/GitHub/symphony-mt/temp/data/iwslt-15/vocab.vi_WHOLE_LINE_LINE_NUMBER)
	 [[Node: Lookup_1/LookupTableFind = LookupTableFindV2[Tin=DT_STRING, Tout=DT_INT64](lookup_1_placeholder, input_1, lookup_1_placeholder_1)]]
Exception in thread "main" org.platanios.tensorflow.jni.NotFoundException: Container localhost does not exist. (Could not find resource: localhost/hash_table_/Users/anthony/Development/GitHub/symphony-mt/temp/data/iwslt-15/vocab.vi_WHOLE_LINE_LINE_NUMBER)
	 [[Node: Lookup_1/LookupTableFind = LookupTableFindV2[Tin=DT_STRING, Tout=DT_INT64](lookup_1_placeholder, input_1, lookup_1_placeholder_1)]]
	 [[Node: Model/Model/Iterator/Next = IteratorGetNext[output_shapes=[[?,?], [?], [?,?], [?,?], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Model/Model/Iterator)]]

It's hard to reproduce this error but a summary of the context is that I have a lookup table op inside a dataset map operator and I get this error when I try to execute the corresponding iterator "GetNext" op. I'm looking for information in how to parse and debug this error. I never explicitly set any containers for my variables or lookup tables (i.e., leave them to the default value; an empty string). Were there any changes introduced recently that could result in this error? Note that this happens with my Scala API but not with the Python API and so it may be that I haven't updated something in my code. I just don't really know where to look for this.

Thanks!

@mrry
Copy link
Contributor

mrry commented Jan 27, 2018

I think this could be a knock-on effect from 9f4118d, which modifies (and in principle simplifies) how DT_RESOURCE tensors are captured inside a Dataset transformation (such as Dataset.map()).

It sounds like there is some disagreement between the code that creates the lookup table and the op that performs the lookup itself. Could you share the code for creating the table and the Dataset.map()?

/cc @rohan100jain

@mrry mrry added the stat:awaiting response Status - Awaiting response from author label Jan 27, 2018
@eaplatanios
Copy link
Contributor Author

@mrry I'd love to share the code but it's written in Scala using my API and it might be hard to get familiar with the codebase. I can do so if you think that can help. However, in order to give you a summary, the lookup table is created right before being used in the dataset map. It is also initialized prior to being used by calling GetNext. What kind of disagreement could cause such an error? Is there an easy way to debug? Also, is localhost the default container used when not providing and container?

@mrry
Copy link
Contributor

mrry commented Jan 29, 2018

Hmm, what you're doing doesn't sound like it's wrong, because that's effectively what the Python version would be doing too. And the recent change should make it less likely to see a NotFound error, because it changes the function implementation to share the same ResourceMgr (which defines the namespace for resource names and containers), whereas before there was an explicit copying-and-rewriting step for captured resource handles.

Could you possibly create a minimal example that exhibits the problem in both Python and Scala, and share the code for these? If you could also capture the text-format GraphDef, it might be possible to inspect the graphs and find the discrepancy.

Also, is localhost the default container used when not providing and container?

That's right. Are you using the container attr in your code? (It should still work, but it's possible there's a bug there....)

@eaplatanios
Copy link
Contributor Author

@mrry I can look into creating a minimal example this week, but I have found some information that might be useful. It has to do with creating functions and capturing control dependencies of these functions. Currently, functions are created in a separate graphs and tensors that are used as inputs are captured and replaced with placeholders to be fed later. What about control dependencies that some of the new ops might have on ops defined outside the function? For example, I have a dataset map op that uses a lookup table, but I want the iterator initializer op for that dataset to have a control dependency on the lookup table initializer op of the outer graph. How should I go about handling that? Currently I get an error of the form Source ... not in the main body, where source refers to the source of the control edge. I think that's what's causing the error because I cannot properly initialize the iterator.

@eaplatanios
Copy link
Contributor Author

@mrry Actually never mind, even if I work around this initialization issue and manage to run all the initializer ops fine, I still get the same error once I get to invoke GetNext. I'll try to create a minimal example for this, but in the meantime, do you have any idea of what could be causing the container to not be found? Shouldn't the localhost container always exist given that it's the default one?

@eaplatanios
Copy link
Contributor Author

Note also that if I put a breakpoint in my Scala code, this error:

W tensorflow/core/framework/op_kernel.cc:1201] OP_REQUIRES failed at lookup_table_op.cc:656 : Not found: Container localhost does not exist. (Could not find resource: localhost/hash_table_/Users/anthony/Development/GitHub/symphony-mt/temp/data/iwslt-15/vocab.vi_WHOLE_LINE_LINE_NUMBER)

keeps being printed over and over again indefinitely as if it keeps being thrown from within some loop running in another thread.

@eaplatanios
Copy link
Contributor Author

@mrry Given that the lookup table initializer op and the iterator initializer op run without any issues, is there any chance the resource manager is cleared/reset afterwards in between the session runs? Is there any tips you might have on how to debug this sort of error?

@eaplatanios
Copy link
Contributor Author

@mrry I'm wondering if this has to do with me using the C API TF_GraphToFunction method. I'm using it to create the function provided to the dataset map. The lookup table which is used by the function is replaced with a placeholder of type RESOURCE. The initialized lookup table resource tensor is then passed as input to the created function op. Could this be causing the problem? It was all working fine with version 1.5.0-rc1.

@eaplatanios
Copy link
Contributor Author

@mrry I finally resolved this. I was accidentally setting the shared name of the iterator I was creating. I'm not sure why this caused the problem, but without setting the shared name, everything works fine. Do you know why that could have happened? I'm actually curious.

@mrry
Copy link
Contributor

mrry commented Feb 1, 2018

Thanks for tracking that down! I think this is a legitimate bug, introduced in 9f4118d. That change modifies most iterators to use the same Device, FunctionLibraryRuntime, and ResourceMgr as the op that created them, which enables the resource-capturing logic to be simplified, because handles are valid in both the caller and the callee.

However, when the shared_name is set, the iterator might outlive the FunctionLibraryRuntime used by the op that created it. The new code identifies this case and creates its own private FunctionLibrary, Device, and ResourceMgr that will outlive the caller. Unfortunately, this means that the handles from the caller and no longer valid in the callee, which manifests as a NotFoundError.

Thanks for the promising lead on this bug! We'll try to get a fix in soon.

@mrry mrry added stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug and removed stat:awaiting response Status - Awaiting response from author labels Feb 1, 2018
@mrry mrry self-assigned this Feb 1, 2018
@eaplatanios
Copy link
Contributor Author

@mrry Thanks for tracking this down and fixing it! :)

@mrry
Copy link
Contributor

mrry commented Feb 5, 2018

Thanks for digging into it and making it much easier to fix!

@Koushik667
Copy link

Hi @mrry @eaplatanios can you please take a look if the issue #50801 is related to this in any way?
Thanks in advance :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug
Projects
None yet
Development

No branches or pull requests

3 participants