Skip to content

Should it be atomic?

Stephen Dolan edited this page Sep 7, 2018 · 1 revision

As well as ordinary reads and writes, hardware provides atomic operations (e.g. exchange or compare-and-swap) with strong ordering guarantees. It's very tempting to use atomic operations throughout the GC (marking, promotion, etc.). However, atomic operations are much slower than ordinary reads or writes.

This page collects some estimates of the cost of using atomic operations for various purposes. The code snippets on this page are worst-case microbenchmarks, which serve to give an upper bound on the slowdown. There are larger benchmarks here, but be careful about treating them as representative of your program.

Remembered set

The remembered set keeps track of pointers from the major to the minor heap. These need to be updated when objects are promoted out of the minor heap, however, these updates must not conflict with writes made in parallel by other domains. Trunk OCaml need not worry about parallel writes and uses a plain store to do this update, but multicore uses a compare-and-swap.

The worst case is when there are very many words modified on the major heap, but they all point to the same few minor heap objects. That way, updating the remembered set becomes a large part of the work of the GC.

let () =
  let a = Array.make 10000000 (ref 1) in
  let r = ref 42 in
  for i = 0 to Array.length a - 1 do a.(i) <- r done;
  Gc.minor ();
  Printf.printf "%d\n" (!(a.(5354)))

Array.make has a special optimisation, so can't be relied upon to create many remembered set entries. The printf is present to ensure that the array is live across the Gc.minor call.

Changing plain writes to CAS updates for the remembered set, the runtime of this benchmark got 12% longer, with the minor GC phase taking 45% longer.

Promotion

When moving an object from the minor heap to the major heap, a forwarding pointer must be installed. This can be done with a plain write if synchronisation is achieved some other way (e.g. the current scheme of private local heaps), but must be done with a compare-and-swap if multiple domains might try to promote the same object in parallel.

The worst case is when there are very many objects to promote, as in this example:

type t = L | B of t * t
let rec mkt = function 0 -> L | n -> B (mkt (n-1), mkt (n-1))
let () =
  let a = Array.init 22 mkt in
  print_string "ok\n"

For testing, the cas-promotion branch does an (unnecessary) CAS on every promotion. On this benchmark, the runtime got ~1.5% longer, with the minor GC phase taking 5-10% longer.

This has much less than CAS for remembered sets, because for each promotion CAS there is a significant amount of other work to do: a major heap allocation, a copy, and the original minor allocation and initialisation. So, the updates to the header words were always a small proportion of total work, and making them atomic has low impact.

Marking

Multiple domains race to mark the same objects (since objects can be reachable from more than one place), and we could use atomic operations to synchronise the mark bits. (Currently, multicore uses are more subtle racy system, relying on the idempotence of marking).

If we made marking atomic, the worst case program would be when there is a lot of data to mark, e.g.:

type t = L | B of t * t let rec mkt = function 0 -> L | n -> B (mkt (n-1), mkt (n-1)) let () = let a = Array.init 22 mkt in Gc.major (); Gc.full_major (); match a.(10) with B _ -> print_string "ok\n" | L -> assert false

For testing, the atomic-marking branch does an (unnecessary) atomic exchange instead of a store on each mark. On this benchmark, the runtime got 15% longer, and major GC slices to 20% longer.

Just looking at the benchmark, this does not seem all that much worse than the remembered set example (15% vs 12% slowdown). However, this benchmark is much less contrived, and every program with a large major heap would see the ~15% slowdown. The average case and worst case are not far apart for marking.

Mutations

It would simplify the memory model greatly if all writes were atomic, which would mean changing the store in caml_modify to an atomic exchange.

The worst-case is when there are many writes that require little bookkeeping (i.e. do not cross generations), such as this program:

let () =
  let a = Array.make 100000 (ref 1) in
  let r = ref 42 in
  Gc.major ();
  for j = 1 to 1000 do
  for i = 0 to Array.length a - 1 do a.(i) <- r done
  done;
  Printf.printf "%d\n" (!(a.(5354)))

For testing, the xchg-write branch uses exchanges for caml_modify. On this benchmark, the runtime got 80% longer (!).