
Archived: Client server discovery lifecycle

eugene yokota edited this page Sep 10, 2017 · 1 revision

How should the server be discovered, started, and shut down?

Discussion thread on sbt-dev

Some Objectives

  • Clients should auto-start the server as needed
  • If several clients start at once (not unlikely e.g. on login), we want to end up with only one server, at least "almost always in practice," even if we aren't theoretically race-free
  • If a server dies, we want to auto-start a new one
  • We don't want people to need to manually kill orphaned (clientless) servers
  • Server is scoped "per-build" (single instance for each root project directory)
  • Ctrl+C or OS reboot or hotspot crash should not result in people having to do manual lockfile cleanup
  • Potentially controversial, but based on experience: operating system lockfile functionality tends to be platform-specific and flaky (e.g. not working on NFS/AFS, not working between threads, stuff like that), so avoid it if possible. This is likely the main open question (whether to use FileLock).
  • There are more robust platform-specific solutions available, such as D-Bus on Linux, COM or window-based mechanisms on Windows, and launchd on Mac. These are likely a total pain in the ass to use from the JVM, though. Is there any sane-from-JVM way to do a reliable named lock?

Concerns (possibly newly discovered)

  1. Passing sys.props down from forked process to the server. Things like: -Dhttp.proxyHost -Dsbt.version
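As a sketch of how a client might forward such properties when forking the server, the helper below turns a selected set of system properties into `-D` arguments for the server JVM's command line. The names (`PropForwarding`, `propArgs`) are illustrative only, not sbt API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for concern 1: forward selected system properties
// (e.g. http.proxyHost, sbt.version) from the forking client to the server JVM.
public class PropForwarding {
    // Builds "-Dkey=value" arguments for each requested key that is present.
    // Keys that are unset in the client are simply omitted.
    public static List<String> propArgs(Map<String, String> sysProps, List<String> keys) {
        List<String> args = new ArrayList<>();
        for (String key : keys) {
            String value = sysProps.get(key);
            if (value != null) args.add("-D" + key + "=" + value);
        }
        return args;
    }
}
```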

Proposal 1: active.properties

This proposal is intended to "self-heal" and work in practice without asking users to kill processes or delete lockfiles - but nobody will call it elegant. The general idea is that a server shuts itself down if the file pointing to that server is overwritten or deleted.

  • create a directory myproject/.build-server (let's call it SERVERDIR)
  • SERVERDIR/active.properties contains properties url, pid, and version
  • the url is expected to be protocol://127.0.0.1:PPPP always for now (don't listen on 0.0.0.0 since that will include external interfaces). PPPP should be an OS-assigned random port. Protocol is likely http. This url would be a base URL and there may be conventional paths within it.
  • the pid for now is just for use in debugging or manual cleanup by a human
  • the version is build server version in case we need to do something different based on this someday
  • when a client needs a server, it first loads active.properties and tries to ping the url for aliveness
    • if the url is alive, client uses it
    • otherwise, client forks a new build server and starts checking for active.properties to appear at short intervals
    • client gives up after some timeout (long enough for build server to reliably start even on lame computers)
    • whenever a client loses its server connection, it starts over discovering/starting a new server.
  • the "ping" for server liveness should verify that the pinged server goes with the project directory we care about (because the server could be for a different project that just happened to recycle the same url). So the ping could for example contain the canonical path to the active.properties file and fail if the server pinged is for a different path.
  • when the build server starts up, it needs to proceed in this order:
    1. ping any existing active.properties url and exit if the other server is alive
    2. listen on a port (so now we know our url and we are pingable)
    3. take over or fail to take over active.properties
    4. actually start up the sbt engine if it takes over
    5. exit without doing anything if it fails to take over
  • to take over active.properties, the build server does the following:
    1. we should already be pingable
    2. write out active.properties atomically (by creating a temp file and renaming over on unix, and doing whatever one can do on Windows)
    3. exit if active.properties ever changes so it no longer refers to us. (use file watches or just check at intervals.)
      • another server starting up may have pinged us and failed, then we both started listening, then we both wrote active.properties.
      • user may have moved the project directory.
      • user may have deleted active.properties by hand.
  • the build server should wait some reasonable timeframe for the first client to connect, and exit if no client ever connects. It should also exit a (possibly very short) time after the last client disconnects.
  • on clean exit, don't delete active.properties since we're probably just as likely to delete one that doesn't refer to us and cause trouble. Just rely on future pings failing - stale active.properties should be automatically handled.
  • "herd of server starts" problem: when a server exits and has 4 clients, they might all immediately try to restart a new server, which should work out but would launch a lot of pointless JVMs in the meantime. To avoid this, they could each wait a short random interval before restarting, OR they could each be assigned a "restart delay" by the server, OR something else.
  • note that we are not expecting pings to fail due to network problems, since we are on a local machine.
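The client-side discovery step above could be sketched as follows. Assuming the `active.properties` format described (`url`, `pid`, and `version` keys), this hypothetical helper (`ActiveProperties` is not sbt code) parses the file and returns `null` when the file is missing or incomplete - in which case the client should fork a new build server and start polling for the file to appear:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical client-side helper for Proposal 1; names are illustrative.
// Parses SERVERDIR/active.properties into the url, pid, and version values
// described above, returning null when the file is absent or incomplete
// (the client should then fork a new build server and poll).
public class ActiveProperties {
    public static final class ServerInfo {
        public final String url, pid, version;
        public ServerInfo(String url, String pid, String version) {
            this.url = url; this.pid = pid; this.version = version;
        }
    }

    public static ServerInfo fromProperties(Properties props) {
        String url = props.getProperty("url");
        String pid = props.getProperty("pid");
        String version = props.getProperty("version");
        if (url == null || pid == null || version == null) return null;
        return new ServerInfo(url, pid, version);
    }

    public static ServerInfo load(File file) {
        if (!file.isFile()) return null;
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(file)) {
            props.load(in);
        } catch (IOException e) {
            return null;  // unreadable counts the same as absent
        }
        return fromProperties(props);
    }
}
```

After a successful parse the client would still ping the `url` for liveness before using it, as described above.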

Proposal 2: Some other idea here

  • is there some way we can get an actual, reliable lock on a project directory, cross-platform?

The sbt launcher provides a locking facility.

  • currently used in practice to:
    • lock down the Ivy cache
    • lock parts of the sbt boot directory
  • locks access JVM-wide on a file
    • a combination of synchronized plus a file lock
    • only allows one thread across all (cooperating) Java processes
  • deals with some weird implementation details:
    • overlapping file lock exceptions from Java when locking in the same JVM
    • works around OS deadlock detection, which is often spurious in a multi-threaded application
  • won't magically work on NFS or other environments with poor/missing lock support
  • the OS will clean up the lock if the process holding it dies
  • blocking only: there is no tryLock to attempt a lock and return quickly if it is unavailable
  • the API docs for File.createNewFile say it shouldn't be used for file locking (don't know why)
  • could be used for server startup roughly like:

   val f = <project-lock-file>
   withLock(f) {
     val port = read(f)
     tryConnect(port) match {
       case Success(connection) => connection
       case Failure(_) =>
         val newPort = startNewServerProcess()
         write(f, newPort)
     }
   }
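For reference, the raw JDK primitive underneath such a facility is java.nio's FileLock. A minimal sketch of a blocking with-lock helper (this is not the launcher's actual implementation, and the caveats listed above - same-JVM overlap exceptions, NFS unreliability - still apply) might look like:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.util.concurrent.Callable;

// Sketch of an advisory cross-process lock via java.nio FileLock.
// Note: FileLock is per-process; locking the same file twice in one JVM
// throws OverlappingFileLockException unless you also synchronize in-process.
public class ProjectLock {
    public static <T> T withFileLock(File file, Callable<T> body) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            // Blocks until the lock is acquired; the OS releases it
            // automatically if this process dies, so no stale-lock cleanup.
            FileLock lock = channel.lock();
            try {
                return body.call();
            } finally {
                lock.release();
            }
        }
    }
}
```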

Proposal 3: A combination of the above.

Here's a more concrete proposal of how launching a server would work. First, though, let's state some potentially contentious assumptions that lead to the following proposal:

  1. the sbt launcher's locking mechanism is sufficient for us to find a means to avoid double-server-startup.
  2. We want clients to have one means of attempting to fork the sbt server process. This is assumed to be a new "service locator" feature of the sbt-launcher.
  3. Upon network failure, or inability to communicate to the server, clients will attempt to spawn another server. This will only succeed if the server is actually down. Clients are also responsible for restoring any state they need on the server upon reconnect.
  4. The sbt server needs to outlive the original client that requested it, and only shut down when it has no more clients.
  5. The lock file location uniquely identifies the type of server launched. Only one server per lock-file. The launcher makes no assumptions about semantic meaning across launch configurations. Only the lock file.
  6. The protocol exposed by any server MUST be HTTP. While the URI returned may allow users to negotiate, making a HEAD request against the returned URI is guaranteed to succeed and becomes the mechanism of pinging a server for "up".
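Assumption 6's "ping" can be sketched with the JDK's HttpURLConnection, treating any connection failure or non-2xx status as "server down". The helper name here is illustrative:

```java
import java.net.HttpURLConnection;
import java.net.URI;

// Sketch of assumption 6: "pinging" a server is a HEAD request against the
// URI it returned. Any failure (refused, timeout, malformed, non-2xx) means
// the server is treated as down and a new one may be spawned.
public class Ping {
    public static boolean isAlive(URI uri) {
        try {
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(1000);  // local machine, so fail fast
            conn.setReadTimeout(1000);
            int status = conn.getResponseCode();
            conn.disconnect();
            return status >= 200 && status < 300;
        } catch (Exception e) {
            return false;
        }
    }
}
```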

So, first we create a new interface for the sbt launcher, called xsbti.ServerMain:

package xsbti;

/** The main entry point for a launched service.
 * 
 * A class implementing this must:
 * 
 * 1. Expose an HTTP port that clients can connect to, returned via the start method.
 * 2. Accept HTTP HEAD requests against the returned URI. These are used as "ping" messages to ensure
 *    a server is still alive, when new clients connect.
 * 3. Create a new thread to execute its service
 */
public interface ServerMain {
  /**
   * This method should launch one or more new threads which run the service.  After the
   * service has been started, it should return the URI that clients connect to.
   * 
   * @param configuration
   *          The configuration used to launch this service.
   * @return
   *    A URI denoting the Port which clients can connect to.  Note:  only HTTP protocol and 
   *    localhost/127.0.0.1/::1 addresses are supported in the URI. Any other return value will
   *    cause this service to be shutdown.
   */
  public java.net.URI start(ServerConfiguration configuration);
}
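For illustration, a minimal server honoring this contract can be built with the JDK's built-in com.sun.net.httpserver. ServerConfiguration is elided here and TrivialServer is a hypothetical name, but the lifecycle matches the contract: bind 127.0.0.1 on an OS-assigned port, answer HEAD "pings" with 200, and return the URI once started:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

// Hypothetical sketch of a ServerMain-style server (ServerConfiguration elided).
public class TrivialServer {
    private HttpServer server;

    // Bind on 127.0.0.1 with an OS-assigned port, answer every request
    // (including HEAD pings) with an empty 200, and return our URI.
    public URI start() {
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                exchange.sendResponseHeaders(200, -1);  // -1 = no response body
                exchange.close();
            });
            server.start();  // serves on background threads; this call returns
            return URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
        } catch (IOException e) {
            throw new RuntimeException("could not bind server", e);
        }
    }

    public void stop() {
        server.stop(0);
    }

    // Client-side ping against the returned URI; -1 on any failure.
    public static int ping(URI uri) {
        try {
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setRequestMethod("HEAD");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status;
        } catch (Exception e) {
            return -1;
        }
    }
}
```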

The interface has a single method, start, which is responsible for starting the server. Checking whether a server is already alive is done separately, by making an HTTP HEAD request against the URI that start returned.

The sbt launcher can now also support launching servers, via the following pseudocode:

def launchServer(config: ServerConfiguration): Unit = {
  val oldOut = System.out
  System.setErr(getErr(config))
  System.setOut(getOut(config))
  val uri = runServer(config)
  oldOut.println(uri)
  // Let this thread die, server thread will continue running.
}

This is done via a new configuration similar to AppConfiguration, only with additional values. E.g. here's a possible launch configuration for the sbt server:

[scala]
  version: auto

[server]
  org: org.scala-sbt
  name: sbt
  version: ${sbt.version-read(sbt.version)[${{sbt.version}}]}
  class: sbt.xServer
  components: xsbti,extra
  cross-versioned: false
  resources: ${sbt.extraClasspath-}
  lock: ${cwd}/.sbtserver/active.properties

[repositories]
  local
  maven-central

[boot]
  directory: ${sbt.boot.directory-${sbt.global.base-${user.home}/.sbt}/boot/}

[ivy]
  ivy-home: ${sbt.ivy.home-${user.home}/.ivy2/}
  checksums: ${sbt.checksums-sha1,md5}
  override-build-repos: ${sbt.override.build.repos-false}
  repository-config: ${sbt.repository.config-${sbt.global.base-${user.home}/.sbt}/repositories}

The only new features in the above launch configuration are:

  1. using [server] rather than [app] to inform the launcher which interface to expect.
  2. An additional lock attribute that specifies where the service locator can find this server's lock file.

When launching a server, you'll note that the lock attribute is not inspected. Instead, the sbt launcher will also support a "service locator" feature, for which this lock is critical.

The Service locator will be executed by clients via the command line:

java -jar sbt-launch.jar locate:MyServerBootConfigurationFile

This will return the URI of an active launched server. The pseudocode below sketches a possible implementation:

def serviceLocator(config: ServerConfiguration, launcher: Launcher): URI = {
  def isReachable(info: URI): Boolean =
     canMakeHeadRequest(info)

  withLockFile(config.lockFile) {
    readServerInfo(config.lockFile).filter(isReachable) match {
       case Some(info) => info
       case None => 
           val uri = startServer(config)
           writeServerInfo(config.lockFile, uri)
           uri
    }
  }
}

def startServer(config: ServerConfiguration): URI = {
  val bootProps = makeTemporaryBootProps(config)
  val launcherJar = lookUpFromMyself()
  val process = s"java -jar ${launcherJar} @${bootProps}".in(cwd).run
  val uri = fromStdOut(process)
  if (!process.isAlive) fail("Couldn't start server")
  uri
}

Sample code

The following accomplishes:

client: I want the application defined by boot.properties to run in cwd and use version X of protocol N
server-service: the application is running and listening on port P
client:
	val bootProperties: File = ...
	val connectionConfiguration = Map("x" -> "y")
	val dir: File = ...
	val (output: String, err: String, exit: Int) =
		(stdout, stderr, exitCode) of { "java -jar serverProvider.jar bootProperties connectionConfiguration(as key=value args)".in(dir).!!! }
	if(exit == 0)
		println("Server running on port: " + output.toInt)
	else
		println("Server connection failed: " + err)

server-provider:
	def main(args: Array[String])
	{
		val bootProperties: File = new File(args(0))
		val cwd: File = (new File(".")).getAbsoluteFile
		val connectionConfiguration: Map[String, String] =
			args.drop(1).map(_.split("=", 2) match { case Array(key,value) => (key,value) }).toMap
	
		val serverFile = serverFileLocation(bootProperties, cwd)
		val connection = withLockFile( serverFile ) {
			readServerInfo( serverFile ) match {
				case None => startAndConnect(bootProperties, cwd, connectionConfiguration)
				case Some(info) => connect(info, connectionConfiguration)
			}
		}
		System.out.println(connection.port)
		System.exit(0)
	}
	
	def connect(info: ServerInfo, connectionConfiguration: Map[String,String]): ConnectionInfo =
		requestConnection(info, connectionConfiguration) match {
			case Left(NoServerPresent) => startAndConnect(bootProperties, cwd, connectionConfiguration)
			case Left(e) => fail(e.toString)
			case Right(newConnection) => newConnection
		}
	def requestConnection(info: ServerInfo, connectionConfiguration: Map[String, String]): Either[<error>, ConnectionInfo] =
		try {
			openSocket(info.port) match {
				case Left(err) => Left(NoServerPresent)
				case Right(socket) =>
					sendConfiguration(socket, connectionConfiguration)
					Right(ConnectionInfo(receivePort(socket)))
			}
		} catch {
			case e: Exception => Left(e)
		}
			

	def fail(msg: String) {
		System.err.println(msg)
		System.exit(1)
	}
	def startAndConnect(bootProperties: File, cwd: File, connectionConfiguration: Map[String,String]): ConnectionInfo =
		connect(newServer(bootProperties, cwd), connectionConfiguration)
	def newServer(bootProperties: File, cwd: File, connectionConfiguration: Map[String,String]): ServerInfo =
	{
		val process = "java -jar sbt-launch.jar @bootProperties".in(cwd).run
			// this line is not quite right- reading a line from std output would be part of configuring the above line
		val readPort = fromStdOut(process)
		val info = ServerInfo(readPort, "faked")
		if(!process.isAlive) fail("Couldn't start server")
		writeServerInfo(serverFile, info)
		info
	}

	def sendConfiguration(socket: Socket, connectionConfiguration: Map[String,String]): Unit
	def receivePort(socket: Socket): Int
	def withLockFile[T](file: File)(run: => T): T
	def serverFileLocation(bootProperties: File, cwd: File): File
	def writeServerInfo(file: File, info: ServerInfo): Unit
	def readServerInfo(file: File): Option[ServerInfo]
	final case class ServerInfo(port: Int, pid: String)
	final case class ConnectionInfo(port: Int)

server:
	val socket: ServerSocket = localSocket
	System.out.println(socket.getLocalPort)
	... 
	// probably part of an sbt command that is run initially
	while(true) {
		val config: Map[String, String] = receiveConfiguration(socket)
		val port = newClient(config)
		writePort(socket, port)
	}

	def receiveConfiguration(socket: ServerSocket): Map[String,String] =
	{
		// a ServerSocket can accept() repeatedly; each accept() call
		//  returns a new Socket for one connecting client
		val s = socket.accept()
		readConfiguration(s)
	}
	def localSocket: ServerSocket = new ServerSocket(0, 0, InetAddress.getByName(null))
	def readConfiguration(socket: Socket): Map[String,String]
	def writePort(socket: Socket, port: Int): Unit
	def newClient(config: Map[String, String]): Int = {
		val socket: ServerSocket = localSocket
		<start a thread listening on socket, connect it to command processing, etc...>
		socket.getLocalPort
	}