Releases: mesosphere/marathon

v1.8.222: Remove strict validation of external volume name (#7024)

10 Sep 23:07
86475dd

Changes from 1.8.218 to 1.8.222

External Volume Validation changes

Relaxed name validation

As there are some external volume providers which require options in the volume name, the strict validation of the name on the external volume is now removed.

Because the uniqueness check is based on the volume name, this may lead to some inconsistencies. For the purposes of the uniqueness check, the following volumes are distinct:

"volumes": [
      {
        "external": {
          "name": "name=volumename,option1=value"
        }
      }
    ],

"volumes": [
      {
        "external": {
          "name": "option1=value,name=volumename"
        }
      }
    ],
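The distinction can be illustrated with a minimal Python sketch. It assumes the uniqueness check compares the raw name string, which is what the note above implies. The function names are hypothetical and are not part of Marathon.

```python
def same_name_for_uniqueness(name_a: str, name_b: str) -> bool:
    """Assumption: the uniqueness check compares the raw name strings."""
    return name_a == name_b


def parse_options(name: str) -> dict:
    """Split 'k1=v1,k2=v2' into a dict, roughly as a volume provider might."""
    return dict(pair.split("=", 1) for pair in name.split(","))


a = "name=volumename,option1=value"
b = "option1=value,name=volumename"

# Different strings, so the two volumes count as distinct ...
print(same_name_for_uniqueness(a, b))
# ... even though both names carry the same set of options.
print(parse_options(a) == parse_options(b))
```

In other words, two names that a provider may resolve to the same volume can still pass the uniqueness check if their options are ordered differently.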

Optional uniqueness check

Previously, Marathon would validate that an external volume with the same name is only used once across all apps. This was due to the initial implementation being focused on Rexray+EBS. However, multiple external volume providers now
allow shared access to mounted volumes, so we introduced a way to disable the uniqueness check:

A new field, container.volumes[n].external.shared, defaults to false. If set to true, the same volume name can be used
by multiple containers. The shared flag has to be set to true on all external volumes with the same name; otherwise, a conflict is reported on the volume without the shared=true flag.

  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "external": {
          "size": 5,
          "name": "volumename",
          "provider": "dvdi",
          "shared": "true",
          "options": {
            "dvdi/driver": "pxd",
            "dvdi/shared": "true"
          }
        },
        "mode": "RW",
        "containerPath": "/mnt/nginx"
      }
    ]
  }
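The rule described above can be sketched in Python as follows. This is an illustration only, not Marathon's validation code. The function names and the app ids are hypothetical, and the sketch accepts both the boolean true and the string "true" because the example above writes the flag as a string.

```python
from collections import defaultdict


def is_shared(external: dict) -> bool:
    """Treat boolean True and the string "true" as shared; default is false."""
    return external.get("shared", False) in (True, "true")


def find_conflicts(volumes):
    """volumes: list of (app_id, external_volume_dict) pairs.

    Returns (app_id, volume_name) pairs for volumes whose name is used more
    than once but which do not set the shared flag.
    """
    by_name = defaultdict(list)
    for app_id, external in volumes:
        by_name[external["name"]].append((app_id, external))

    conflicts = []
    for name, users in by_name.items():
        if len(users) < 2:
            continue  # a name used once never conflicts
        for app_id, external in users:
            if not is_shared(external):
                conflicts.append((app_id, name))
    return conflicts


volumes = [
    ("/app-a", {"name": "volumename", "shared": True}),
    ("/app-b", {"name": "volumename"}),  # shared defaults to false
    ("/app-c", {"name": "other"}),
]
conflicts = find_conflicts(volumes)
print(conflicts)
```

Here only /app-b is reported, since it reuses the name without setting the shared flag.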

v1.8.218

25 Jul 15:56

Changes from 1.8.194 to 1.8.218

Revive and Suppress Refactoring

The revive and suppress logic was unified. In the past, Marathon would keep reviving when
an instance with a reservation was expunged (case 1), or it would revive when an instance should be started (case 2). When
no instance should be started, Marathon would suppress offers, which could conflict with case 1. With the refactoring,
a single piece of logic decides whether to revive or suppress and thus avoids the conflict. The change also required changing
the default --min_revive_offers_interval to thirty seconds. This should avoid overriding revive calls with a suppress
too quickly. The --[disable_]suppress_offers flag can switch off suppress calls altogether. This should be used
when Marathon fails to clean up reservations, which requires offers to be sent.

Fixed issues

  • DCOS-54927 - Fixed an issue where two independent deployments could interfere with each other resulting in too many tasks launched and/or possibly a stuck deployment.

Changes from 1.8.180 to 1.8.194

Fixed issues

  • DCOS_OSS-5212 - Fixed an issue that prevented reserved instances created by older Marathon versions from being restarted

  • MARATHON-8623 - Fixed an issue that could cause /v2/deployments to become stale

  • MARATHON-8624 - Fixed issue where the presence of a TASK_UNKNOWN status could cause an API failure

  • DCOS-51375 - Fixed an issue where deployment cancellation could leak instances.

  • DCOS_OSS-5211 - The initial support for volume profiles would match disk resources with a profile, even if no profile was required. This behavior has been adjusted so that disk resources with profiles are only used when those profiles are required, and are not used if the service for which we are matching offers does not require a disk with that profile.

  • MARATHON-8631 - In order to prepare for the general availability of the DC/OS Storage Service (DSS), Marathon will now default to disk type Mount if a persistent volume profileName is configured by the user without specifying the desired disk type. Services like DSS will populate this field to allow users to select the volumes they previously created. Mesos Root disks will not have a profileName set, so the default for persistent volumes that do not specify a profileName is still Root.

  • MARATHON-8422 - Kill unreachable tasks that came back. Marathon could get stuck waiting for terminal events but not issue a kill.
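The disk type default described under MARATHON-8631 above amounts to a small decision rule. The following Python sketch is only an illustration of that rule. The function name and the "my-profile" value are hypothetical, and the string spellings "mount" and "root" are assumptions about how the types might be written.

```python
from typing import Optional


def effective_disk_type(disk_type: Optional[str], profile_name: Optional[str]) -> str:
    """Pick the disk type for a persistent volume.

    An explicitly given type always wins. Otherwise the default is Mount when
    a profileName is set, and Root when it is not.
    """
    if disk_type is not None:
        return disk_type
    return "mount" if profile_name else "root"


print(effective_disk_type(None, "my-profile"))
print(effective_disk_type(None, None))
print(effective_disk_type("path", "my-profile"))
```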

v1.5.15

10 Apr 11:18
8a92920

Introduce global throttling to Marathon health checks

Marathon health checks are a deprecated feature, and customers are strongly recommended to switch to Mesos health checks for scalability reasons. However, we've seen a number of issues where an excessive number of Marathon health checks (HTTP and TCP) would overload parts of Marathon. Hence we introduced a new parameter, --max_concurrent_marathon_health_checks, that defines the maximum number (256 by default) of Marathon health checks (HTTP/S and TCP) that can be executed concurrently at any given moment. Note that setting a large value here and using many services with Marathon health checks will overload Marathon, leading to internal timeouts and unstable behavior.
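The general idea of a global concurrency cap can be sketched with a semaphore. The Python example below is not Marathon's implementation. It only illustrates the technique, and run_checks and fake_check are hypothetical names.

```python
import asyncio


async def run_checks(checks, max_concurrent=256):
    """Run all checks, but never more than max_concurrent at the same time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(check):
        async with semaphore:  # waits here while the cap is reached
            return await check()

    return await asyncio.gather(*(guarded(check) for check in checks))


async def demo():
    running = 0
    peak = 0

    async def fake_check():
        # Stand-in for an HTTP or TCP health check.
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        return True

    results = await run_checks([fake_check] * 10, max_concurrent=3)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), "checks finished; peak concurrency:", peak)
```

With a cap of 3, the ten simulated checks all complete, but at most three are in flight at once.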

Fixed Issues

  • MARATHON-8596 Introduced global throttling to Marathon health checks
  • MARATHON-8575 Fixed a broken migration for app definitions with port mappings protocol "tcp,udp" which is no longer valid and should be "udp,tcp"
  • MARATHON-8566 Fixed a rare bug where deployment was sometimes not immediately visible through the v2/deployments endpoint after creation

Note: The previous 1.5.14 release introduced a regression where an unhealthy instance would not be killed. This will not happen anymore (promise), and we do not recommend using the 1.5.14 release if you use health checks.

v1.5.13

10 Apr 11:16
07a35d6

Marathon 1.4 Compatible /v2/tasks

Marathon 1.5 had a major overhaul of networking which resulted in an unintended change to the reporting of ports in /v2/tasks. In text/plain form, this information is used to configure load balancers and routers. When the container is in host mode, the reported port is the running host port. When the container is in bridge mode, the reported port is the dynamically created host port that will bridge to the internal container port. In USER mode, Marathon 1.4 reported the container port. Regardless of correctness, this feature was used by some customers and needs to be forward ported. Marathon 1.5.13 now provides that ability through the compatibilityMode query parameter on the /v2/tasks endpoint. If compatibilityMode is not specified, the 1.5 version is rendered. If /v2/tasks?compatibilityMode=1.4 is used, it will provide the previous Marathon 1.4 rendering.
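As a small illustration, the two request URLs can be built as shown below. The host marathon:8080 and the helper name tasks_url are placeholders. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode, urljoin


def tasks_url(base, compatibility_mode=None):
    """Build the /v2/tasks URL, optionally with the compatibilityMode parameter."""
    url = urljoin(base, "/v2/tasks")
    if compatibility_mode is not None:
        url += "?" + urlencode({"compatibilityMode": compatibility_mode})
    return url


# Default 1.5 rendering.
print(tasks_url("http://marathon:8080"))
# Marathon 1.4 compatible rendering.
print(tasks_url("http://marathon:8080", "1.4"))
```

A client that consumes the text/plain form would request it with an Accept: text/plain header on either URL.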

Apps names restrictions (breaking change)

From now on, apps whose ids end with "restart", "tasks", or "versions" are no longer valid. Such apps already had broken behavior (for example, it wasn't possible to use a GET /v2/apps endpoint with them), so we made that constraint more explicit. Existing apps with such names will continue working; however, all operations on them (except deletion) will result in an error. Please rename them before upgrading Marathon.
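To find affected apps before upgrading, a check along the following lines may help. It is a sketch that assumes "ends with" refers to the last path segment of the id. The function name and the sample ids are hypothetical.

```python
RESERVED_WORDS = {"restart", "tasks", "versions"}


def uses_reserved_word(app_id: str) -> bool:
    """Assumption: an id is affected when its last path segment is reserved."""
    last_segment = app_id.rstrip("/").rsplit("/", 1)[-1]
    return last_segment in RESERVED_WORDS


app_ids = ["/prod/web", "/prod/tasks", "/tools/restart", "/prod/my-tasks"]
to_rename = [app_id for app_id in app_ids if uses_reserved_word(app_id)]
print(to_rename)
```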

Fixed Issues

  • MARATHON-8493 Fixed precision bug associated with summing pod resources.
  • MARATHON-8498 Fixed secrets validator when changing secret env.
  • MARATHON-8466 Prohibit the use of reserved words in app and pod ids
  • COPS-4483 Provide a backward compatible way to produce container ports for text/plain GET requests against /v2/tasks when using USER networking consistent with Marathon 1.4.
  • MARATHON-8566 - We fixed a race condition that caused v2/deployments to not contain a confirmed deployment after an HTTP 200/201 response was returned.

v1.7.189

30 Jan 21:15
48bfd60

Changes to 1.7.189

Marathon Supports Java 9+

Marathon build tools and dependencies have been adjusted to allow it to be compiled and run with Java 9 and we regularly build and test with Java 11. We currently still target Java 8 binary compatibility.

Fixed Issues

  • MARATHON-8539 - Marathon now responds with the Mesos source field exactly as it is received, fixing an issue where the vendor information was lost.
  • MARATHON-8466 - Marathon restricts the use of "reserved" words as Ids. The following is a list of restricted words: "restart", "tasks", "versions"
  • MARATHON-8498 - Marathon now validates the full app when partial updates are applied.
  • MARATHON-8493 - Fix to a precision bug associated with sum resource needs for Pods.
  • MARATHON-8453 - Marathon now respects the --kill_retry_timeout timeout.
  • MARATHON-8452 - Reduced logging; Marathon only logs zero-value offers if a scalar value is set.
  • MARATHON-8413 - Fix for broken versioning of Apps and Pods associated with changes in Java Timestamps.

v1.7.174

28 Nov 22:08

Changes to 1.7.174

Marathon framework ID generation is now very conservative

Previously, Marathon would automatically request a new framework ID from Mesos if the old one was marked as torn down in Mesos, or if the framework ID record was removed from Zookeeper. This has led to more trouble than it has helped. The new behavior is:

  • If Marathon's framework ID has been torn down in Mesos, or if the failover timeout has been exceeded, Marathon will crash, on launch, with a clear message.

  • If Marathon's framework ID record was deleted from Zookeeper or is otherwise inaccessible, and there are instances defined, Marathon will refuse to create a new Framework ID and crash.

For more information, refer to the framework id docs page.

Minimum Mesos version requirement has been increased to 1.5.0

In previous Marathon versions, we monitored offers as a surrogate terminal task status signal for resident tasks in order to work around a Mesos issue in which we would not receive terminal task status updates for agents that restarted. As of Mesos 1.4.0, this has been resolved, and we have removed this workaround.

There are still some edge cases where Mesos agent metadata is wiped (manually, by an operator) in a way that the agent ID will change, but reservations will be preserved. In these cases, Mesos will report resident tasks as perpetually unreachable. Operators should use the MARK_AGENT_GONE call in such cases to get Mesos to mark the associated resident tasks as terminal, and therefore signal to Marathon that it should try to relaunch the resident task. This call was introduced in Mesos 1.5.0.

Native Packages

We have stopped publishing native packages for operating system versions that are past their end-of-life:

  • Ubuntu Yakkety
  • Ubuntu Wily
  • Ubuntu Vivid

Additionally, we have added support for Debian Stretch.

Docker image now allows user nobody; default user has been changed

Previously, the Marathon Docker container would only run as user root. The packaging has been updated so that the container is now run, by default, as the user nobody.

NOTE This is a breaking change! If you did not specify MARATHON_MESOS_USER before, and did not specify the container user of nobody when launching Marathon in a container before, then add the environment value MARATHON_MESOS_USER=root to the containerized Marathon.

Non-leader/standby Marathon instances respond to /v2/events with a redirect, rather than proxy

Previously, Marathon standby instances would proxy the event stream. This causes an unnecessary increase in event stream drops, as the connection will terminate if either the master or the standby restarts. Further, there have been occasional buffering issues.

Now, when a standby Marathon instance is asked for /v2/events, it responds with a 302 redirect directing the client to the /v2/events resource on the current leader. Clients that consume the event stream should be updated to follow redirect responses.
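For clients that handle HTTP responses manually, the redirect handling might look like the following sketch. The host names standby and leader and the function name next_events_url are hypothetical. Many HTTP client libraries can follow redirects automatically instead.

```python
from urllib.parse import urljoin


def next_events_url(current_url, status, headers):
    """Return the URL to connect to next for the event stream.

    On a 302 with a Location header, resolve the target (absolute or relative)
    against the current URL. Otherwise keep using the current URL.
    """
    location = headers.get("Location")
    if status == 302 and location:
        return urljoin(current_url, location)
    return current_url


standby = "http://standby:8080/v2/events"
print(next_events_url(standby, 302, {"Location": "http://leader:8080/v2/events"}))
print(next_events_url(standby, 200, {}))
```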

Event-proxying has the following deprecation schedule:

  • 1.7.x - Standby Marathon instances return redirect responses. The old behavior of proxying event streams can be brought back with the command-line argument --deprecated_features=proxy_events.
  • 1.8.x - Event stream proxying logic will be completely removed. If --deprecated_features=proxy_events is still specified, Marathon will refuse to launch, with an error.

Default for "max-open-connections" increased for asynchronous standby proxy, now configurable

In some clusters with heavy standby-proxy usage, a limit of 32 max-open-connections was too small. This default has been increased to 64. In addition, the flag --leader_proxy_max_open_connections has been introduced to tune the value further, if needed.

Maintenance Mode Support Production Ready, Now Default

Marathon now declines offers for agents with scheduled maintenance.

Previously, this behavior was enabled by --enable_features maintenance_mode. Operators should remove maintenance_mode from the --enable_features value list, as it now has no effect. In Marathon 1.8.x, including the term maintenance_mode in the --enable_features list will be considered an error.

The flag --disable_maintenance_mode has been introduced. To revert back to the default maintenance mode behavior in Marathon 1.6.x and earlier (ignore), operators can specify --disable_maintenance_mode.

Fixed Issues

  • MARATHON-8409 - You can now launch Marathon in Docker as a non-root user.
  • MARATHON-8017 - Fixed various issues when posting groups with relative ids.
  • MARATHON-7568 - We now redact any Zookeeper credentials from the /v2/info response endpoint.
  • MARATHON-8326 - Pods can be deleted together with persistent volumes, using a new wipe=true query parameter.
  • Updated version of Marathon UI to 1.3.1:
    • MARATHON-8255 - Marathon UI properly shows fetch URLs in the edit dialog, now.

New Exit Codes

Marathon will indicate with an exit code why it stopped itself. See the docs page for a list of all codes and their meanings.

Marathon 1.5.12

02 Nov 21:27

Changes from 1.5.11 to 1.5.12

Default for "kill_retry_timeout" was increased to 30 seconds

Sending frequent kill requests to an agent can in certain cases lead to overloading the Docker daemon (if the tasks are docker containers run by the Docker containerizer). Thirty seconds seems to be a more sensible default here.

Marathon framework ID generation is now very conservative

Previously, Marathon would automatically request a new framework ID from Mesos if the old one was marked as torn down in Mesos, or if the framework ID record was removed from Zookeeper. This has led to more trouble than it has helped. The new behavior is:

  • If Marathon's framework ID has been torn down in Mesos, or if the failover timeout has been exceeded, Marathon will crash, on launch, with a clear message.

  • If Marathon's framework ID record was deleted from Zookeeper or is otherwise inaccessible, and there are instances defined, Marathon will refuse to create a new Framework ID and crash.

For more information, refer to the framework id docs page.

Docker image now allows user nobody

Previously, the Marathon Docker container would only run as user root. The packaging has been updated so that the container can be run as the user nobody. The default user for running the container (and, subsequently, the default value for --mesos_user) has not been changed.

Docker image upgraded to Debian Stretch

The Docker image for Marathon now uses Debian Stretch as a base OS, since Debian Jessie is no longer receiving security updates.

Native Packages

We have stopped publishing native packages for operating system versions that are past their end-of-life:

  • Ubuntu Yakkety
  • Ubuntu Wily
  • Ubuntu Vivid

Additionally, we have added support for Debian Stretch.

Fixed Issues

  • MARATHON-7568 We now redact any Zookeeper credentials from the /v2/info response endpoint.
  • MARATHON-8413 Fixed a bug where versions feature did not work if Marathon was launched using Java 9.
  • MARATHON-8095 Fixed a bug where proxying the PATCH call was impossible due to Java limitations.
  • MARATHON-8430 Readiness checks now work with self-signed certificates.
  • Updated version of Marathon UI to 1.3.1:
    • MARATHON-8255 Marathon UI properly shows fetch URLs in the edit dialog, now.
  • MARATHON-7941 Default for unreachable strategy on PUT /apps now matches POST requests.
  • MARATHON-8084 Fix issue in which POST /v2/apps/{app_id}/restart would not proxy properly.
  • MARATHON-7390 Fix issue in which Marathon would become unresponsive for a long time if Zookeeper DNS cannot be resolved at launch.
  • Fixed a data migration issue in which UNIQUE constraint value was stripped when empty.

v1.6.549

20 Oct 00:33
aabf743

Changes from 1.6.352 to 1.6.549

New Exit Codes

Marathon will indicate with an exit code why it stopped itself. See the docs page for a list of all codes and their meanings.

Native Packages

We have stopped publishing native packages for operating system versions that are past their end-of-life:

  • Ubuntu Yakkety
  • Ubuntu Wily
  • Ubuntu Vivid

Additionally, we have added support for Debian Stretch.

Limit maximum number of running deployments

A new command line flag, --max_running_deployments, was added to limit the maximum number of concurrently running deployments. The default value is 100. Should the user try to submit more updates than this flag allows, an HTTP 403 error is returned with an explanatory error message. We introduced this flag because having lots of running deployments can lead to a significant performance decrease in the failover scenario during the Marathon initialization phase. Note that if you reach the maximum deployment number, you will have to use the ?force=true parameter to cancel an existing deployment.

Zookeeper storage compaction interval

A new command line flag, --storage_compaction_interval, was added to set the Zookeeper storage compaction interval in seconds. The default value is 30 seconds.

Deprecation Mechanism

Marathon has gained a new feature flag: --deprecated_features. For more information, see the docs.

Non-blocking API and Leader Proxying

Previously, when under substantial load, Marathon would time out a deployment initiating request (such as modifying an app) after some time, with "futures timed out". The timeout was not very helpful because Marathon would perform the work requested, regardless. This timeout has been removed. However, note that the client will time out if configured to do so.

To handle the potential increase in concurrent connections, deployment operations and leader request proxying now use nonblocking I/O. The nonblocking I/O proxying logic may have some subtle differences in how responses are handled, including more aggressive rejection of malformed HTTP requests. In the off-chance that this causes an issue in your cluster, the old behavior can be restored with the command line flag --deprecated_features=sync_proxy. sync_proxy is scheduled to be removed in Marathon 1.8.0.

Improved environment variable to command line argument mapping

As part of the fix for MARATHON-8254, the logic for receiving command-line options from environment variables has been reworked. "*" is properly propagated (previously, the glob-expanded result was getting passed), and spaces and new-lines are now preserved.

There's a small change in behavior for environments in which the launcher script is sourced, rather than executed. Unexported environment variables will not be converted into parameters.
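The mapping can be pictured with the Python sketch below. It is not the launcher script's logic. It assumes that variables with a MARATHON_ prefix map to lower-cased flag names, as the MARATHON_MESOS_USER and --mesos_user pair elsewhere in these notes suggests. The function name env_to_args and the MARATHON_EXAMPLE_OPTION variable are hypothetical.

```python
def env_to_args(env):
    """Turn MARATHON_* variables into a list of command-line arguments.

    Assumption: MARATHON_FOO_BAR=value becomes --foo_bar value. Values are
    passed through unchanged, so "*", spaces, and newlines survive because
    no shell expansion is involved.
    """
    prefix = "MARATHON_"
    args = []
    for key in sorted(env):
        if key.startswith(prefix):
            args.extend(["--" + key[len(prefix):].lower(), env[key]])
    return args


env = {
    "MARATHON_MESOS_USER": "root",
    "MARATHON_EXAMPLE_OPTION": "a value with spaces and *",  # hypothetical flag
    "PATH": "/usr/bin",  # ignored: no MARATHON_ prefix
}
print(env_to_args(env))
```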

Optionally allow offer suppress

Marathon can now be configured to suppress offers from Mesos by specifying the flag --suppress_offers. This can improve offer-starvation scenarios in larger clusters at the cost of reservations taking longer to destroy. This is off by default.

New Metrics

Several new metrics have been added to improve detection of load-scenarios known to degrade Marathon's performance:

  • mesosphere.marathon.api.HTTPMetricsFilter.gzippedBytesWritten
  • mesosphere.marathon.api.HTTPMetricsFilter.bytesRead
  • mesosphere.marathon.api.HTTPMetricsFilter.bytesWritten
  • mesosphere.marathon.core.deployment.impl.DeploymentManagerActor.currentDeploymentCount
  • mesosphere.marathon.core.deployment.impl.DeploymentManagerActor.deploymentCount
  • mesosphere.marathon.core.flow.impl.ReviveOffersActor.reviveCount
  • mesosphere.marathon.core.flow.impl.ReviveOffersActor.suppressCount
  • mesosphere.marathon.core.group.impl.GroupManagerImpl.dismissedDeployments
  • mesosphere.marathon.core.group.impl.GroupManagerImpl.queueSize
  • mesosphere.marathon.core.matcher.base.util.OfferOperationFactory.launchOperationCount
  • mesosphere.marathon.core.matcher.base.util.OfferOperationFactory.launchGroupOperationCount
  • mesosphere.marathon.core.matcher.base.util.OfferOperationFactory.reserveOperationCount

Deprecated Features

/v2/schemas

The route /v2/schemas has been deprecated in favor of the RAML specifications. Clients that need to perform local validation of requests can access the RAML specifications under the prefix /public/api. For example, to get the RAML definition for the apps resource, GET http://marathon:8080/public/api/v2/apps.raml.

The route /v2/schemas has the following deprecation schedule:

  • 1.6.x - /v2/schemas will continue to function as normal.
  • 1.7.x - The API will stop responding to /v2/schemas; requests to it will be met with a 404 response. The route can
    be re-enabled with the command-line argument --deprecated_features=json_schemas_resource.
  • 1.8.x - /v2/schemas is scheduled to be completely removed. If --deprecated_features=json_schemas_resource is
    still specified, Marathon will refuse to launch, with an error.

/v2/events

The default response format of the /v2/events is marked as deprecated and will be switched to the /v2/events?plan-format=light in the first 1.7.x release. The following deprecation schedule is planned for this endpoint:

  • 1.6.x - /v2/events will continue to function as normal
  • 1.7.x - The default /v2/events format will be switched to "light". You will still have the ability to use the command-line argument --deprecated_features=api_heavy_events to re-enable the heavy event response.
  • 1.8.x - The /v2/events format will be permanently switched to "light". If --deprecated_features=api_heavy_events is still specified, Marathon will refuse to launch, with an error.

Deprecation Details

The "lightweight" plan format can be already seen using the ?plan-format=light argument. In summary, this format drops the following fields from the deployment-related events in the event stream accessed via /v2/events:

  • plan.original - The current state of the root group
  • plan.target - The target state of the root group
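To preview what consumers will receive, the effect of the light format on a single event can be approximated as below. This is a sketch. Apart from plan.original and plan.target, the event field names in the example are hypothetical.

```python
import copy


def to_light(event: dict) -> dict:
    """Return a copy of the event without plan.original and plan.target."""
    light = copy.deepcopy(event)
    plan = light.get("plan")
    if isinstance(plan, dict):
        plan.pop("original", None)
        plan.pop("target", None)
    return light


heavy_event = {
    "eventType": "deployment_info",  # hypothetical example payload
    "plan": {
        "id": "deployment-1",
        "original": {"id": "/", "apps": []},
        "target": {"id": "/", "apps": [{"id": "/web"}]},
        "steps": [],
    },
}
light_event = to_light(heavy_event)
print(sorted(light_event["plan"]))
```

Consumers that currently read plan.original or plan.target need another source for that state, for example a separate API request.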

Fixed Issues

  • MARATHON-7568 - We now redact any Zookeeper credentials from the /v2/info response endpoint.
  • Updated version of Marathon UI to 1.3.1:
    • MARATHON-8255 - Marathon UI properly shows fetch URLs in the edit dialog, now.
  • MARATHON-8124 Fix issue in which reservations lacking a persistent volume would not be destroyed.
  • MARATHON-7940 Fix connection-pool overflow issues with Marathon HTTP health checks by disabling connection pooling for them.
  • MARATHON-8136 Fix issues involving headers and URI filtering with Marathon HTTP healthchecks.
  • MARATHON-8083 Fix issue with datadog / graphite metric reporters in which several parameters were ignored.
  • MARATHON-8110 Fix issue in which Marathon would fail to accept offers for some resources from newer versions of Mesos.
  • MARATHON-2683 Deployments for run-specs with multiple health-checks now wait for all health checks to succeed.
  • MARATHON-8148 Pod last-failure-reason is now exposed via the API, as is done for apps.
  • MARATHON-8216 Mesos HTTP health checks for non-host networking mode with containerPort=0 now work.
  • MARATHON-8064 Fix migration issue when store caching is disabled
  • MARATHON-8159 Fix migration issue which introduced erroneous taskKillGracePeriodSeconds values
  • MARATHON-8304 Fix rare bug in which Marathon would become unresponsive while connecting to Mesos.
  • MARATHON-7568 Zookeeper credentials are now redacted from logs and the /v2/info response.
  • MARATHON-7390 Fix issue in which Marathon would become unresponsive for a long time if Zookeeper DNS cannot be resolved at launch.
  • MARATHON-8084 Fix issue in which POST /v2/apps/{app_id}/restart would not proxy properly.
  • MARATHON-8326 Pod instances with persistent volumes can now be destroyed.
  • MARATHON-8095 Fix issue in which PATCH HTTP requests were not properly proxied.
  • Fix an issue in which resident tasks sometimes wouldn't be restarted.

v1.4.13

29 Aug 22:52
8d31177

Fixed issues

  • MARATHON-8397 Get rid of unbounded concurrency in the migration code

v1.5.11

11 Jul 01:08

Fixed issues