ES Universe package on DC/OS Packet does not run #565

Open
olafmol opened this issue May 16, 2016 · 19 comments

Comments

@olafmol

olafmol commented May 16, 2016

It keeps cycling through deploying, waiting, and failing on DC/OS 1.7.x on Packet. It seems to be unable to bind to the expected ports.

@philwinder
Contributor

Please provide steps and configuration to recreate. We don't use Packet and don't test on DC/OS. Just plain old Mesos.

@olafmol
Author

olafmol commented May 16, 2016

Using this Terraform script: https://dcos.io/docs/1.7/administration/installing/cloud/packet/
After a successful install, go to "Universe" in the DC/OS dashboard and install the ES package.
The same issue appears when using these Marathon installation instructions: http://mesos-elasticsearch.readthedocs.io/en/latest/#getting-started

(BTW, it seems to work correctly when installing DC/OS on Google Cloud, so it might be a Packet-specific thing.)

@philwinder
Contributor

Ok, thanks. I can't vouch for the DC/OS installer, as that hasn't been updated for a long time. But the Marathon command should work.

When you say "expected ports", how are you specifying them? By default, ES lets Mesos pick random ports from its pool. You can override this using the elasticsearchPorts option.
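
If you did want to pin the ports, they can be passed when the scheduler is launched, roughly like this (a hedged sketch: the flag spelling follows the framework docs, and the ZooKeeper URL and 9200,9300 values are only example placeholders):

$ docker run mesos/elasticsearch-scheduler --zookeeperMesosUrl zk://master.mesos:2181/mesos --elasticsearchPorts 9200,9300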

@olafmol
Author

olafmol commented May 16, 2016

I don't specify a specific port.

@zsmithnyc

The issue seems to be that Elasticsearch's Java process can't get the local address:

java.net.UnknownHostException: zac-dcos-agent-03: zac-dcos-agent-03: unknown error
    at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
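
That exception typically means Java's InetAddress.getLocalHost() can't resolve the agent's own hostname. A quick check on the agent, plus a common stopgap, might look like this (a sketch; 10.0.0.5 is only a placeholder for the agent's real IP):

$ getent hosts zac-dcos-agent-03                              # no output means the hostname doesn't resolve
$ echo "10.0.0.5 zac-dcos-agent-03" | sudo tee -a /etc/hosts  # stopgap: map the hostname locally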

@zsmithnyc

@philwinder how does this container attempt to get its address? Is it using a metadata service?

@jstabenow

In my case the problem is the statically configured --default.network.publish_host=_non_loopback:ipv4_. I have tested this with DC/OS on Docker, and the executor always picks up the IPv4 address of the spartan interface.
A solution could be --default.network.publish_host=$(hostname -i). Maybe it's possible to add a parameter for this setting, e.g. --executorNetworkPublishHost=_non_loopback:ipv4_.
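
A quick way to see the mismatch on a DC/OS agent might be (a sketch; the spartan interface name and the 198.51.100.x range come from this thread, nothing else is assumed):

$ ip -4 -o addr show spartan | awk '{print $4}'   # the 198.51.100.x address the executor currently publishes
$ hostname -i                                     # what the publish_host override above would use instead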

@jstabenow

@zsmith928
I also had trouble with this package on DC/OS-Docker and have tried to find a solution. It would be nice if you could verify whether it also runs on your system.

Just do:

dcos package repo add universe-jstabenow https://github.com/jstabenow/dcos-packages/archive/version-2.x.zip
dcos package install elasticsearch

Here is my workaround for the wrong "publish_host" on the executor:
https://github.com/jstabenow/docker-images/tree/master/dcos-elasticsearch

I only replace the framework's argument with --default.network.publish_host=$LIBPROCESS_IP

@jstabenow

update: #569

@jbirch

jbirch commented May 28, 2016

Unfortunately, taking @jstabenow's helpful repo for a spin doesn't seem to help us. We're seeing the same thing: Java complains about not knowing what the AWS-supplied hostname ip-10-1-23-254 is, and then fails to bind to the local host.

@jstabenow

jstabenow commented May 29, 2016

hey @jbirch

that sounds like a similar problem.

There are many articles on Google about network problems with Elasticsearch/Java. That's why I added publish_host as a parameter.

In my case, Elasticsearch elected the wrong interface for publish_host.
In your case, it's a problem with resolving the elected interface, so let's play with this parameter.

Can you post the executor log and the output of the following commands when run on your machine "ip-10-1-23-254"?

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="10.1.23.254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="ip-10-1-23-254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=$(hostname -i)

This is what the correct log should look like:

[2016-05-29 12:19:40,859][INFO ][transport                ] [Storm] publish_address {10.1.23.254:9300}, bound_addresses {[::]:9300}
[2016-05-29 12:19:44,093][INFO ][cluster.service          ] [Storm] new_master {Storm}{N93bF9aPT1SEaqsHGsF6Eg}{10.1.23.254}{10.1.23.254:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-05-29 12:19:44,210][INFO ][http                     ] [Storm] publish_address {10.1.23.254:9200}, bound_addresses {[::]:9200}

And can you post the environment variables available inside a running Docker container on your DC/OS cluster?

Hope we can find the problem and the right setting for you.

@jbirch

jbirch commented May 29, 2016

Hey @jstabenow, thanks for taking the time to reply on the weekend to a stranger. I appreciate it.

With respect to your commands:

"10.1.23.19: Comes up and binds to the given IP.
"ip-10-1-23-19: Fails to resolve ip-10-1-23-19, and then fails to start
$(hostname -i):

ERROR: Parameter [fe80::42:f5ff:feb0:2cb1%docker0]does not start with --

"$(hostname -i)":

java.net.UnknownHostException: no such interface eth0 fe80::42:f5ff:feb0:2cb1%docker0 fe80::707a:26ff:feb3:dbb1%spartan fe80::8045:21ff:fe59:a821%veth6d620c6 fe80::a8f5:c6ff:fee7:3af3%veth8af0e4f 10.1.23.19 172.17.0.1 198.51.100.1 198.51.100.2 198.51.100.3

"$LIPPROCESS_IP": Starts up and binds to 198.51.100.1.

The issue here is that I have no trouble starting plain elasticsearch:latest in DC/OS. It'll bind to 198.51.100.1 and start, much the same as if I hadn't provided the --default.network.publish_host argument. My hope was that your package would help with mesos/elasticsearch-scheduler having a bad time.

Regarding an existing env, here's the output of docker inspect --format '{{ .Config.Env }}' 7ef131bf3c5a | tr ' ' '\n' on the Universe-provided weavescope-probe container:

[MARATHON_APP_LABEL_DCOS_PACKAGE_SOURCE=https://universe.mesosphere.com/repo
MARATHON_APP_VERSION=2016-05-24T19:28:35.443Z
HOST=10.1.23.19
MARATHON_APP_RESOURCE_CPUS=0.05
MARATHON_APP_LABEL_DCOS_PACKAGE_REGISTRY_VERSION=2.0
PORT_10102=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_RELEASE=1
MARATHON_APP_DOCKER_IMAGE=weaveworks/scope:0.15.0
MARATHON_APP_LABEL_DCOS_PACKAGE_NAME=weavescope-probe
MARATHON_APP_LABEL_DCOS_PACKAGE_VERSION=0.15.0
MESOS_TASK_ID=weavescope-probe.f4fc1fba-21e5-11e6-b902-e6205eb290e4
PORT=18179
MARATHON_APP_RESOURCE_MEM=256.0
PORTS=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_IS_FRAMEWORK=true
MARATHON_APP_RESOURCE_DISK=0.0
MARATHON_APP_LABELS=DCOS_PACKAGE_RELEASE
DCOS_PACKAGE_SOURCE
DCOS_PACKAGE_REGISTRY_VERSION
DCOS_PACKAGE_VERSION
DCOS_PACKAGE_NAME
DCOS_PACKAGE_IS_FRAMEWORK
MARATHON_APP_ID=/weavescope-probe
PORT0=18179
LIBPROCESS_IP=10.1.23.19

@jstabenow

Hey @jbirch
no problem :-) Please try ${ENV} instead of $ENV

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${LIBPROCESS_IP}
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${HOST}

These two ENV variables should work:

HOST=10.1.23.19
LIBPROCESS_IP=10.1.23.19

@jstabenow

Ah sorry ... this can't work because the container isn't created by Mesos, so there's no ENV ;-)
Please try my ES package again and replace ${LIBPROCESS_IP} with ${HOST}.
But those were supposed to be the same. Strange...

[screenshot: bildschirmfoto 2016-05-30 um 00 14 06]

@philwinder
Contributor

Hi all. Thanks @jstabenow for continuing to help out on this. To answer a previous question:

  • The executors are Elasticsearch. So the executors obtain their IP address according to the Elasticsearch code. AFAIK, it's a typical Java InetAddress call, which gets the first available adapter.

Remember that you can pass your own settings file and that the ES containers can be overridden. So I would oppose any core code changes for something that can already be achieved that way.
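
For example, a custom elasticsearch.yml that pins network.publish_host could be handed to the executors at launch, roughly like this (a sketch assuming the --elasticsearchSettingsLocation flag described in the project docs; the host value, ZooKeeper URL, and file path are placeholders):

$ cat elasticsearch.yml
network.publish_host: 10.1.23.19
$ docker run mesos/elasticsearch-scheduler --zookeeperMesosUrl zk://master.mesos:2181/mesos --elasticsearchSettingsLocation /path/to/elasticsearch.yml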

@jstabenow

Hey @philwinder
No problem. I will close my PR.

@jbirch

jbirch commented May 30, 2016

Hi @philwinder,

We still have the case where mesos/elasticsearch-scheduler, whether installed via Universe or via the instructions at https://mesos-elasticsearch.readthedocs.io/en/latest/#how-to-install-on-marathon, fails to work out of the box, whereas mesos/elasticsearch does work. It looks like this case might be limited to the default resolver settings you get when bringing up the world in AWS, but I think (apropos of no hard data) that it'd be a common configuration.

Note that the thing that fails to do the binding is https://github.com/mesos/elasticsearch/blob/1.0.1/commons/src/main/java/org/apache/mesos/elasticsearch/common/util/NetworkUtils.java:30, not Elasticsearch itself.
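
Since that utility ultimately asks the JVM for the local host address (the InetAddress behaviour mentioned above), the failure can be reproduced outside the framework with a plain resolution check on the box running the scheduler (a sketch; ip-10-1-23-254 stands in for whatever hostname the instance reports):

$ hostname                      # e.g. ip-10-1-23-254
$ getent hosts $(hostname)      # no output here is exactly what trips up the InetAddress lookup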

Noting that there are myriad deployment options for the underlying platform on which mesos/elasticsearch-scheduler can run, I don't want to ask anyone to be in the business of making specific changes to support one particular option when it works generally.

Caveat here being that maybe it's actually totally fine and my environment is just screwed up :)

@philwinder
Contributor

@jbirch I did all my manual testing on AWS, so I'm surprised there's a problem here. But I used vanilla Mesos, not DCOS, so I assume it's some difference there.

Can you post the log that is showing the error? That might help decide what to do.

Thanks, Phil

@jbirch

jbirch commented May 31, 2016

I'm almost certain it's an issue on our end, and isn't indicative of the package itself generally "not working".

I would expect something like dig -tANY $(hostname) @169.254.169.253 +short to work out-of-the-box on any AWS instance with DNS enabled in the VPC. In our case, it doesn't, and I think that's why we eventually fail to run mesos/elasticsearch-scheduler (I'd suspect the default resolver of 198.51.100.1 eventually chains up to it).

Tentatively let's call this one a layer 8 problem and I'll try to get things shored up on our end. It really does look more like "DNS isn't 100%" rather than "mesos/elasticsearch-scheduler has a bug".
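
A quick way to check that chain from an agent is something like this (a sketch; the spartan address 198.51.100.1 and the VPC resolver 169.254.169.253 are taken from this thread):

$ cat /etc/resolv.conf                       # should list the spartan address 198.51.100.1
$ dig $(hostname) @198.51.100.1 +short       # does the DC/OS resolver answer for the AWS hostname?
$ dig $(hostname) @169.254.169.253 +short    # does the VPC resolver itself answer?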
