ES Universe package on DC/OS Packet does not run #565

Open
olafmol opened this issue May 16, 2016 · 19 comments

Comments

@olafmol

olafmol commented May 16, 2016

It keeps cycling through deploying, waiting, and failing on DC/OS 1.7.x on Packet. It seems to be unable to bind to the expected ports.

@philwinder
Contributor

Please provide steps and configuration to recreate. We don't use Packet and don't test on DC/OS. Just plain old Mesos.

@olafmol
Author

olafmol commented May 16, 2016

Using this Terraform script: https://dcos.io/docs/1.7/administration/installing/cloud/packet/
After a successful install, go to "Universe" in the DC/OS dashboard and install the ES package.
The same issue appears when using these Marathon installation instructions: http://mesos-elasticsearch.readthedocs.io/en/latest/#getting-started

(BTW, it seems to work correctly when installing DC/OS on Google Cloud, so it might be a Packet-specific thing.)

@philwinder
Contributor

Ok, thanks. I can't vouch for the DC/OS installer, as that hasn't been updated for a long time. But the Marathon command should work.

When you say "expected ports", how are you specifying them? By default, ES lets Mesos pick random ports from its pool. You can override this using the elasticsearchPorts option.
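
If you did want to pin the ports, they can be passed when the scheduler is launched, roughly like this (a hedged sketch: the flag spelling follows the framework docs, and the ZooKeeper URL and 9200,9300 values are only example placeholders):

$ docker run mesos/elasticsearch-scheduler --zookeeperMesosUrl zk://master.mesos:2181/mesos --elasticsearchPorts 9200,9300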

@olafmol
Author

olafmol commented May 16, 2016

I don't specify a specific port.

@zsmithnyc

The issue seems to be that Elasticsearch's Java process can't get the local address:

java.net.UnknownHostException: zac-dcos-agent-03: zac-dcos-agent-03: unknown error
    at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
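
That exception typically means Java's InetAddress.getLocalHost() can't resolve the agent's own hostname. A quick check on the agent, plus a common stopgap, might look like this (a sketch; 10.0.0.5 is only a placeholder for the agent's real IP):

$ getent hosts zac-dcos-agent-03                              # no output means the hostname doesn't resolve
$ echo "10.0.0.5 zac-dcos-agent-03" | sudo tee -a /etc/hosts  # stopgap: map the hostname locally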

@zsmithnyc

@philwinder how does this container attempt to get its address? Is it using a metadata service?

@jstabenow

In my case the problem is the statically configured --default.network.publish_host=_non_loopback:ipv4_. I have tested this with DC/OS on Docker, and the executor always picks up the IPv4 address of the spartan interface.
A solution could be --default.network.publish_host=$(hostname -i). Maybe it's possible to add a parameter for this setting, e.g. --executorNetworkPublishHost=_non_loopback:ipv4_.
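
A quick way to see the mismatch on a DC/OS agent might be (a sketch; the spartan interface name and the 198.51.100.x range come from this thread, nothing else is assumed):

$ ip -4 -o addr show spartan | awk '{print $4}'   # the 198.51.100.x address the executor currently publishes
$ hostname -i                                     # what the publish_host override above would use instead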

@jstabenow

@zsmith928
I also had trouble with this package on DC/OS-Docker and have tried to find a solution. It would be nice if you could verify whether it also runs on your system.

Just do:

dcos package repo add universe-jstabenow https://github.com/jstabenow/dcos-packages/archive/version-2.x.zip
dcos package install elasticsearch

Here is my workaround for the wrong "publish_host" on the executor:
https://github.com/jstabenow/docker-images/tree/master/dcos-elasticsearch

I only replace the framework's argument with --default.network.publish_host=$LIBPROCESS_IP

@jstabenow

update: #569

@jbirch

jbirch commented May 28, 2016

Unfortunately, taking @jstabenow's helpful repo for a spin doesn't seem to help us. We're seeing the same thing: Java complains about not knowing what the AWS-supplied hostname ip-10-1-23-254 is, and then fails to bind to the local host.

@jstabenow

jstabenow commented May 29, 2016

hey @jbirch

that sounds like a similar problem.

There are many articles on Google about network problems with Elasticsearch/Java. That's why I added publish_host as a parameter.

In my case, Elasticsearch elected the wrong interface for publish_host.
In your case, it's a problem with resolving the elected interface, so let's play with this parameter.

Can you post the executor log and the output of the following commands when run on your machine "ip-10-1-23-254"?

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="10.1.23.254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="ip-10-1-23-254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=$(hostname -i)

This is what the correct log should look like:

[2016-05-29 12:19:40,859][INFO ][transport                ] [Storm] publish_address {10.1.23.254:9300}, bound_addresses {[::]:9300}
[2016-05-29 12:19:44,093][INFO ][cluster.service          ] [Storm] new_master {Storm}{N93bF9aPT1SEaqsHGsF6Eg}{10.1.23.254}{10.1.23.254:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-05-29 12:19:44,210][INFO ][http                     ] [Storm] publish_address {10.1.23.254:9200}, bound_addresses {[::]:9200}

And can you post the environment variables available inside a running Docker container on your DC/OS cluster?

Hope we can find the problem and the right setting for you.

@jbirch

jbirch commented May 29, 2016

Hey @jstabenow, thanks for taking the time to reply on the weekend to a stranger. I appreciate it.

With respect to your commands:

"10.1.23.19: Comes up and binds to the given IP.
"ip-10-1-23-19: Fails to resolve ip-10-1-23-19, and then fails to start
$(hostname -i):

ERROR: Parameter [fe80::42:f5ff:feb0:2cb1%docker0]does not start with --

"$(hostname -i)":

java.net.UnknownHostException: no such interface eth0 fe80::42:f5ff:feb0:2cb1%docker0 fe80::707a:26ff:feb3:dbb1%spartan fe80::8045:21ff:fe59:a821%veth6d620c6 fe80::a8f5:c6ff:fee7:3af3%veth8af0e4f 10.1.23.19 172.17.0.1 198.51.100.1 198.51.100.2 198.51.100.3

"$LIPPROCESS_IP": Starts up and binds to 198.51.100.1.

The issue here is that I have no trouble starting plain elasticsearch:latest in DC/OS. It'll bind to 198.51.100.1 and start, much the same as if I hadn't provided the --default.network.publish_host argument. My hope was that your package would help with mesos/elasticsearch-scheduler having a bad time.

Regarding an existing env, here's the output of docker inspect --format '{{ .Config.Env }}' 7ef131bf3c5a | tr ' ' '\n' on the Universe-provided weavescope-probe container:

[MARATHON_APP_LABEL_DCOS_PACKAGE_SOURCE=https://universe.mesosphere.com/repo
MARATHON_APP_VERSION=2016-05-24T19:28:35.443Z
HOST=10.1.23.19
MARATHON_APP_RESOURCE_CPUS=0.05
MARATHON_APP_LABEL_DCOS_PACKAGE_REGISTRY_VERSION=2.0
PORT_10102=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_RELEASE=1
MARATHON_APP_DOCKER_IMAGE=weaveworks/scope:0.15.0
MARATHON_APP_LABEL_DCOS_PACKAGE_NAME=weavescope-probe
MARATHON_APP_LABEL_DCOS_PACKAGE_VERSION=0.15.0
MESOS_TASK_ID=weavescope-probe.f4fc1fba-21e5-11e6-b902-e6205eb290e4
PORT=18179
MARATHON_APP_RESOURCE_MEM=256.0
PORTS=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_IS_FRAMEWORK=true
MARATHON_APP_RESOURCE_DISK=0.0
MARATHON_APP_LABELS=DCOS_PACKAGE_RELEASE
DCOS_PACKAGE_SOURCE
DCOS_PACKAGE_REGISTRY_VERSION
DCOS_PACKAGE_VERSION
DCOS_PACKAGE_NAME
DCOS_PACKAGE_IS_FRAMEWORK
MARATHON_APP_ID=/weavescope-probe
PORT0=18179
LIBPROCESS_IP=10.1.23.19

@jstabenow

Hey @jbirch
no problem :-) Please try ${ENV} instead of $ENV

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${LIBPROCESS_IP}
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${HOST}

These two ENV variables should work:

HOST=10.1.23.19
LIBPROCESS_IP=10.1.23.19

@jstabenow

Ah sorry ... this can't work because the container isn't created by Mesos, so there's no ENV ;-)
Please try my ES package again and replace ${LIBPROCESS_IP} with ${HOST}.
But those were supposed to be the same. Strange...

[screenshot: bildschirmfoto 2016-05-30 um 00 14 06]

@philwinder
Contributor

Hi all. Thanks @jstabenow for continuing to help out on this. To answer a previous question:

  • The executors are Elasticsearch. So the executors obtain their IP address according to the Elasticsearch code. AFAIK, it's a typical Java InetAddress call, which gets the first available adapter.

Remember that you can pass your own settings file and that the ES containers can be overridden. So I would oppose any core code changes for something that can already be achieved that way.
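
For example, a custom elasticsearch.yml that pins network.publish_host could be handed to the executors at launch, roughly like this (a sketch assuming the --elasticsearchSettingsLocation flag described in the project docs; the host value, ZooKeeper URL, and file path are placeholders):

$ cat elasticsearch.yml
network.publish_host: 10.1.23.19
$ docker run mesos/elasticsearch-scheduler --zookeeperMesosUrl zk://master.mesos:2181/mesos --elasticsearchSettingsLocation /path/to/elasticsearch.yml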

@jstabenow

Hey @philwinder
No problem. I will close my PR.

@jbirch

jbirch commented May 30, 2016

Hi @philwinder,

We still have the case where mesos/elasticsearch-scheduler, whether installed via Universe or via the instructions at https://mesos-elasticsearch.readthedocs.io/en/latest/#how-to-install-on-marathon, fails to work out of the box, whereas mesos/elasticsearch does work. It looks like this case might be limited to the default resolver settings you get when bringing up the world in AWS, but I think (apropos of no hard data) that it'd be a common configuration.

Note that the thing that fails to do the binding is https://github.com/mesos/elasticsearch/blob/1.0.1/commons/src/main/java/org/apache/mesos/elasticsearch/common/util/NetworkUtils.java:30, not Elasticsearch itself.
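
Since that utility ultimately asks the JVM for the local host address (the InetAddress behaviour mentioned above), the failure can be reproduced outside the framework with a plain resolution check on the box running the scheduler (a sketch; ip-10-1-23-254 stands in for whatever hostname the instance reports):

$ hostname                      # e.g. ip-10-1-23-254
$ getent hosts $(hostname)      # no output here is exactly what trips up the InetAddress lookup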

Noting that there are myriad deployment options for the underlying platform on which mesos/elasticsearch-scheduler can run, I don't want to ask anyone to be in the business of making specific changes to support one particular option when it works generally.

Caveat here being that maybe it's actually totally fine and my environment is just screwed up :)

@philwinder
Contributor

@jbirch I did all my manual testing on AWS, so I'm surprised there's a problem here. But I used vanilla Mesos, not DCOS, so I assume it's some difference there.

Can you post the log that is showing the error? That might help decide what to do.

Thanks, Phil

@jbirch

jbirch commented May 31, 2016

I'm almost certain it's an issue on our end, and isn't indicative of the package itself generally "not working".

I would expect something like dig -tANY $(hostname) @169.254.169.253 +short to work out-of-the-box on any AWS instance with DNS enabled in the VPC. In our case, it doesn't, and I think that's why we eventually fail to run mesos/elasticsearch-scheduler (I'd suspect the default resolver of 198.51.100.1 eventually chains up to it).

Tentatively let's call this one a layer 8 problem and I'll try to get things shored up on our end. It really does look more like "DNS isn't 100%" rather than "mesos/elasticsearch-scheduler has a bug".
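
A quick way to check that chain from an agent is something like this (a sketch; the spartan address 198.51.100.1 and the VPC resolver 169.254.169.253 are taken from this thread):

$ cat /etc/resolv.conf                       # should list the spartan address 198.51.100.1
$ dig $(hostname) @198.51.100.1 +short       # does the DC/OS resolver answer for the AWS hostname?
$ dig $(hostname) @169.254.169.253 +short    # does the VPC resolver itself answer?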
