Engine Security

Bjorn Stahl edited this page Jan 21, 2018 · 2 revisions

Security and Threat Model

This page describes the security model and the attack surface that is targeted as a first step towards making Arcan a target in which vulnerabilities are difficult to find and exploit.

Bear in mind that, as stated elsewhere, the commitment is in making Arcan a safe and secure tool for building desktop environments or using as a display server or app framework, not a valid description of the current state. It would be foolish to believe that Arcan is safe and secure in its current state, as the project is still very much a moving target with other priorities to account for as well.

From the current roadmap, the latest branch (0.5) still lacks necessary features, and the attack surface will expand somewhat until completion of the (0.7) one. It is also for these reasons that the default build does NOT include any of the normal mitigations, e.g. canaries, ASLR and so on; debuggability is a much more immediate priority.

Not until the 0.9 release will software security related concerns be the primary focus (full review, all privilege boundaries fuzzable, test coverage for all privilege boundary transitions, red teaming attack surface, testing hardening and exploit mitigation mechanisms using injected vulnerabilities etc.) though that does not imply that we are not currently treading carefully.

The following figure illustrates the primary attack surface and threat model that we concern ourselves with currently, and the rest of this page will be used to describe it in additional detail.

[Figure: attack surface]

  1. Active Appl
  2. Lua Interface
  3. Engine
  4. SHMIF
  5. Sharing
  6. Sandboxing
  7. Mitigation
  8. Wayland/Xorg

Active Appl

We start with the appl layer, which is a partially trusted layer in the sense that it has access to define and modify the current data sharing rules and acts as an information routing policy layer.

This implies that any user-facing privileged operation, like instructing where data should be sent or shared, authentication token retrieval, lock screens, file pickers, ..., is implemented at this level.

Therefore, privilege escalation from the context of the script to the engine is not a vulnerability in itself; the tactics here are to reduce risk and minimize impact. These tactics can be summarised as:

  1. Limit the set of built-in functions to an absolute minimum through a build-time specific configuration, see engine/lua_boostrap.lua that reduces the initial namespace of exposed functions.

  2. Split dynamic loading capabilities so that they are limited to the contents of the appl-specific namespace, and direct write-related options to the appl-temporary namespace. By default, these are still mapped to the same folder, but overriding this by setting an environment variable effectively turns off the ability to dynamically generate and execute Lua code.

  3. Limit paths and filesystem resource loading semantics to only permit certain operations in certain namespaces and rejecting requests to use relative addressing schemes (../../) to traverse outside permitted namespaces.

  4. Process-separate parsers and use a separate namespace for known offenders that need additional observation, primarily the font files used.

There are also a number of scripting features that are designed to allow the policy layer to implement additional security features. For an illustration of such features, look no further than Durden-Security and the article on One Night in Rio: Vacation Photos from Plan9.
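
As an illustration of tactic 3, the following is a minimal sketch of rejecting relative traversal before a resource lookup. It is not the engine's implementation; `resolve_in_namespace`, `NamespaceError` and the example paths are hypothetical names chosen for this sketch, and it deliberately does not resolve symbolic links, since an administrator may use those to remap namespaces on purpose.

```python
import posixpath


class NamespaceError(Exception):
    """Raised when a request would leave its permitted namespace."""


def resolve_in_namespace(root, requested):
    """Map a script-supplied resource name into `root`, refusing traversal.

    Both names are hypothetical; this only illustrates the policy of
    rejecting absolute paths and '..' components outright.
    """
    if requested.startswith("/"):
        raise NamespaceError("absolute paths are not permitted")
    parts = requested.split("/")
    if ".." in parts:
        raise NamespaceError("relative traversal is not permitted")
    # Drop empty and '.' components, then join under the namespace root.
    clean = [p for p in parts if p not in ("", ".")]
    if not clean:
        raise NamespaceError("empty resource name")
    return posixpath.join(root, *clean)
```

A request such as `images/icon.png` resolves under the given root, while `../../etc/passwd` is refused rather than normalized, which keeps the check simple to audit.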

Comments

The effectiveness of 1. is reduced by permitting additional functions at build time, and through the reliance on a third party dependency to implement the VM (Lua and LuaJit respectively).

The effectiveness of 2. and 3. is mostly limited by the lack of sandboxing frameservers via the chainloader, which is set to be fixed in 0.6. It is intended for an administrator to be able to circumvent file system restrictions through remapping namespaces directly during load time, or indirectly with symbolic links and bind/nullfs mounts.

Another risk inherent to this interface is the reliance on a third party to develop the actual code that runs in the VM, particularly in the frameserver event callback implementation, as the data there passes through multiple trust domains. Auditing activities here should therefore first and foremost focus on these event handlers, and then on the functions that explicitly interact with the filesystem.

The bigger concern with unauthorized execution in this context, however, is the possibility for data exfiltration. This can be limited on a case by case basis by providing builds that do not implement certain frameserver archetypes (net and record) and policy models or software firewalls can make sure that the running Arcan process is not permitted to open any network connections.

Another large concern, however, is the possibility of denial of service: this layer is sensitive to programming mistakes that would accidentally put the engine in a live-lock state, as the main context also drives the rendering. The plan is to allow for developer-informed mitigation through separated VM instances with controlled data exchange and synchronization.

Lua Scripting Interface

This category focuses on the interaction between the scripts that comprise the running appl and the engine itself through the mapped Lua interface.

This is a rather large surface area, covering some ~220+ functions, which also includes functions that allocate system resources (target_alloc) and spawn processes (define_recordtarget, launch_avfeed, ...).

A big focus of this interface is aggressively enforcing type and data model: unsupported and unexpected arguments are, for the majority of functions, a terminal state transition.

Comments

There are a number of interfaces provided to test and monitor the information that flows across this border. At build time there is the LUA_TRACE facility, a macro that is added first in each function that bridges C and Lua and that is defined as empty by default. Redefining it permits you to write function-specific tracing or dumping features, like the coverage example that dumps calls and arguments.

A planned change is the INSECURE build mode, which only permits an appl called insecure to be loaded. This appl gets access to new functions that can be used to test PoCs and mitigation strategies for common bug classes, e.g. use-after-free, stack and heap overflow, buffer redirection etc.

A notable issue with the Lua VMs themselves is the standard implementation of print that has to be fixed as it currently leaks information about table and function addresses when printing such members, which is unacceptable.

Engine

There are a few key components in the engine that have strong security implications.

  1. First and foremost is the database, as it provides the storage for model based execution and -- in future releases -- will be the interface for more sensitive information like key-pairs for cryptography on devices where there is no access to a secure enclave. The permitted execution-related tables should be read-only from an engine perspective, and populated by an administrator manually or through some automation tools. It should not be possible for the running appl to add possible execution targets and there is, indeed, no Lua API to do this.

  2. Available parsers: this is a category known to include third party dependencies with a security related history ranging from awful to terrible. The parsers that are currently included as third party dependencies are Lua, sqlite (queries used are static and can be verified as such, all queries are performed as prepared statements), openCTM (compressed triangle mesh, used for 3D models, engine/arcan_3dbase.c), freetype and stbimage (engine/arcan_img.c). These will all be replaced with better solutions or moved to the decode frameserver. The possibility of somehow getting the engine to forward data to PNG/CTM parsers for remote code execution or denial of service (exhausting memory or CPU) is rather high. Currently, there is no mitigation available for that scenario. The Lua attack surface is a design trade-off; it will be made optional when the engine is re-purposed in library form, and until then the points mentioned above in the section on the Lua scripting interface will suffice. Parts of the truetype interface can be controlled by performing font processing in another namespace, thus only permitting administrator-controlled fonts to be loaded.

  3. Internal Parsers: outside the documented interfaces, care has been taken to avoid custom parsers as far as possible. There are two exceptions. One is the format strings used for fonts, which still require the scripting layer to properly escape '\' into '\\'. The font format string function furthermore has a special call mode where you split the string up in parts where expansion is not allowed to be performed. The other is used when passing from higher to lower privilege in the form of the ARCAN_ARG environment variable (the rules are simply: ':' acts as the k/v pair separator, '=' is forbidden in a key, the '=' and the value v are optional, ':' is escaped with '\t', and '\t' is itself forbidden).

  4. Buffer Passing: this is difficult in the sense that using arcan as a display server means that it will need to receive buffers from lesser privileged clients. The current available means for having restricted child processes working against a more privileged display server quickly becomes useless when a child is hostile and denial of service is a valid attack strategy (control and monitor systems). It is possible for an application to starve arcan of GPU resources, or use driver issues to exfiltrate expired buffers. There are functions for regulating which GPU device that each connection controls but this still assumes the surrounding system restricts access to paths like /dev/dri/.

  5. Sensitive buffers: Since arcan can have access to visible and audible content from other processes, which may contain information that is sensitive to the user and useless for debugging, such buffers are marked and tracked separately so that they will not be included in core dumps and other debugging artifacts.

  6. Crash Resilience: This is covered in a lengthier article that can be found as Crash Resilient Wayland Compositing and is a prime protection against loss of work or loss of client identity tracking via denial of service attacks.
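
The ARCAN_ARG rules listed under item 3 above can be sketched as a pair of pack/unpack helpers. This is an illustrative reading of those rules, not the engine's parser: `pack_args` and `unpack_args` are hypothetical names, and rejecting ':' and tab inside keys is an assumption made here because the rules only describe escaping for values.

```python
def pack_args(pairs):
    """Encode (key, value-or-None) pairs into one ':'-separated string."""
    fields = []
    for key, value in pairs:
        # Assumption: keys may contain neither '=', ':' nor tab.
        if not key or any(c in key for c in "=:\t"):
            raise ValueError("invalid key: %r" % (key,))
        if value is None:
            fields.append(key)  # key without '=' and value
            continue
        if "\t" in value:
            raise ValueError("tab is reserved as the escape for ':'")
        fields.append(key + "=" + value.replace(":", "\t"))
    return ":".join(fields)


def unpack_args(packed):
    """Decode a packed string back into a list of (key, value-or-None)."""
    result = []
    for field in packed.split(":"):
        if not field:
            continue  # tolerate empty fields such as a trailing ':'
        key, sep, value = field.partition("=")
        if not key:
            raise ValueError("missing key in field: %r" % (field,))
        # Only the first '=' separates key from value; undo the ':' escape.
        result.append((key, value.replace("\t", ":") if sep else None))
    return result
```

Because tab is forbidden in the input and only ever produced by escaping ':', the two functions round-trip without needing a second escape level, which is what keeps the format simple enough to avoid a real parser.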

For performance reasons, most of the engine is implemented in C code, with all the safety concerns that entails. As previously stated, the build system does not currently enable, or verify the presence of, the normal set of mitigations. It will do so once the other quality, configuration-management and automated-testing work is in place, but until then debuggability takes priority.

The main codebase is almost free of multithreading, and care has been taken to avoid being callback driven (the ffunc_lut.c approach) so a static control-flow graph is entirely possible. There should be few attack surfaces that expose useful pivots or targets for return-to-libc style attacks. Heap- and JIT-spraying is possible from the Lua context, but since that layer acts as the highest level of privilege, it is not reasonable to think of it as a direct threat to the engine as such.

Shmapi

In terms of attack surfaces, this is the more interesting one as it crosses from a lower to a higher privilege level and uses complex connection points (IPC: shared memory, named semaphores, domain sockets).

For this attack surface, some additional background and context is provided in Shared Memory Interface. There are three primary protections in place:

  1. Layout: the placement of structure members is chosen not so much to optimize size against alignment and cache, but so that overflowing common buffer-like fields results in visual or aural anomalies that may be unpleasant, but not in a privilege escalation.

  2. Continuous verification: both parent and child regularly (as part of checking for resize requests) verify the integrity of certain structure members (combined with 1.) like version major/minor, compiler cookie and canaries. Failed verification terminates the connection. In addition, audio and video buffers are placed after the control header, so that an overflow runs into each other or into unbounded memory (crash). This follows the design that partial data corruption in this interface is recoverable.

  3. Event multiplexing and poll-only server-side: the last communication mechanism unfortunately exposes engine internals for input events and other kinds of notifications. It is therefore a priority that a frameserver is not allowed to provide events with a type that doesn't match that of the engine, and that it is not allowed to saturate the main input event queue. This is currently working to the point that (engine/arcan_event.c) events are explicitly copied and masked for their type, and transfers stop when the queue is saturated to a certain level (0.5 of capacity or so). What is missing is additional prioritization so that some frameservers may be permitted to exceed this ratio or have a certain ratio reserved.
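
The behaviour described in item 3 (copy, filter on type, stop at a saturation level) can be sketched as follows. This is a simplified model and not the code in engine/arcan_event.c: `MainQueue`, `transfer`, the dictionary event layout and the exact 0.5 ratio are assumptions for the sketch, and dropping a disallowed event stands in for the engine's type masking.

```python
from collections import deque

SATURATION_RATIO = 0.5  # assumed from "0.5 of capacity or so"


class MainQueue:
    """Stand-in for the engine's main event queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def saturated(self):
        return len(self.events) >= self.capacity * SATURATION_RATIO


def transfer(main, incoming, allowed_types):
    """Move events from a client's `incoming` deque into `main`.

    Returns the number of accepted events. Events that remain in
    `incoming` are simply left for a later polling pass.
    """
    accepted = 0
    while incoming and not main.saturated():
        event = incoming.popleft()
        if event.get("type") not in allowed_types:
            continue  # the client may not emit this type: drop it
        # Copy only the known fields, never the client's object itself.
        main.events.append({"type": event["type"], "data": event.get("data")})
        accepted += 1
    return accepted
```

Since the server only polls, a client that floods its own queue just delays its own events; the missing prioritization mentioned above would correspond to making the ratio a per-client property.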

There's one BIG thing that breaks the notion of isolating connecting clients right now, and it lies in a trade-off that portability with OSX imposed (its Posix.2008 + realtime compliance is not very good). The trade-off is that all IPC primitives are currently set up using fairly predictable names rather than a parachute opening over an authenticated socket. This means you can race or inotify for connection primitives and man-in-the-middle specific clients. The fix will take a while to verify, but is cheap to implement once OSX support becomes expendable, so around 0.9.

Another side is, of course, that if you are running Linux and do not segment client trust on a per-user/group basis, there is an infinite set of opportunities to steal and monitor data via /proc alone. Therefore, we provide tools for digging through your data via that path directly, see src/tools/shmmon.

Data Sharing and input

Any data input routing is explicitly initiated from the script, and there is no direct correlation between the input device and the target so a connection considered malicious could even be fed synthesized / manipulated input like canary passwords and credit cards to people you feel deserve fraud (this is a joke, not a recommendation).

Since the input model covers other devices such as gamepads and sensors, applications that need access to accelerometers and similar nodes do not have to be granted access to input device nodes directly in order to access this data and can therefore also not be used to reconstruct keyboard data. The same also applies for audio.

Sandboxing

This layer is currently omitted as it has not been implemented to any meaningful degree. It is a principal feature for the 0.6 version, and a big component of the initial design is that each frameserver archetype has a chainloader which prepares the sandbox for the archetype being launched. This blends well with syscall filters like pledge. Resource consumption constraints can be extremely strict, and the default set of frameservers should eventually work even without a filesystem and with little more than recvmsg, mmap, read and write as the necessary set of syscalls.

Tunable Mitigation

Due to the split between trusted and untrusted connections in the form of a whitelisted execution database, it is possible to entirely disallow external connections or rate limit them in order to prevent a number of denial-of-service attacks. It is also possible to direct them to a specific render-node so that third party connections do not get access to the same GPU (or any for that matter) as whitelisted ones. Ideally, the graphics layer in the engine itself ('agp') should have a minimal software-only implementation as well, but it is not as much of a priority. A dream scenario would be having an old-school style framebuffer GPU for the main desktop and compositing, VMs with dedicated GPU passthrough and some way of diode-forwarding the passthrough GPU to the main one.

Fonts can be explicitly routed and defined per connection. While primarily a feature to deal with multiple displays that have different densities, it also allows for a visual distinction between privileged and unprivileged connections by supplying privileged connections with fonts that unprivileged connections do not have access to or could know the existence of. This breaks the ability for a surreptitious process to mimic the behavior and style of some trusted component.

Since each non-trusted connection is opened up explicitly and all subsegment requests are default-deny, it is easy to implement rate limiting or even suspending external connections altogether when the set of available resources is low; the durden desktop environment already does this.
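
One way such a policy could look is a token bucket in front of the point where external connections are accepted. The sketch below is only an illustration of the idea and is not taken from durden; `ConnectionGate`, its parameters and the explicit `now` timestamp are hypothetical, and in practice this logic would live in the appl's scripts.

```python
class ConnectionGate:
    """Token bucket deciding whether a new external connection is accepted."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second   # tokens regained per second
        self.burst = burst            # maximum number of stored tokens
        self.tokens = float(burst)
        self.last = 0.0
        self.suspended = False        # set when resources are low

    def allow(self, now):
        """Return True if a connection arriving at time `now` is accepted."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.suspended or self.tokens < 1.0:
            return False  # default-deny: nothing is opened for this request
        self.tokens -= 1.0
        return True
```

Passing the timestamp in explicitly keeps the gate deterministic and easy to test; setting `suspended` models turning external connections off altogether.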

Wayland and Xorg

One of the main reasons behind keeping the implementation of the Wayland protocol in separate processes and not defaulting to special considerations for the Xorg server is, indeed, security. For compatibility with such clients, special bridging processes are run that map client data to SHMIF.

There are multiple design choices in both the Wayland protocol and the server side API you are essentially forced to use, along with ugly corners in the underlying graphics subsystem itself that warrant process separation and a least-privilege syscall filter being applied, especially in the areas of resource allocation, life cycle management and how these can be taken advantage of to perform denial of service, impersonation/man-in-the-middle and side-channel attacks.

Therefore, the recommendation and strategy is to run the bridging service in the single-application exec mode, preferably as a separate user if possible, and to not enable the EGL/dma-buf related protocols for clients that are at risk of being used for staging an attack.