Collection of IO and GC improvements #2039

Open
Ngoguey42 opened this issue Aug 5, 2022 · 1 comment

Comments

@Ngoguey42
Contributor

Ngoguey42 commented Aug 5, 2022

@metanivek and I brainstormed areas of improvement for GC and the new IO


Benchmark the impact of GC on main process performance

benchmark
status: ongoing, @Ngoguey42

Bench 1: Evaluate the impact of a GC'ed pack store on replay speed. Could be done by comparing a GC-less replay starting from a non-GC'ed store versus a GC-less replay starting from a GC'ed store.

Bench 2: Evaluate the impact of the GC worker on the main store. Could be done by comparing a GC-less replay versus a replay that constantly has GCs running but that never swaps.

Bench 1 will tell us whether "Add lower layer" is worth it (ignoring the fact that the upper and lower layers could live on different disks).

Bench 2 will tell us whether using a sequential traversal that visits pages only once is worth it.

New stats for stats trace

benchmark
status: ongoing, @Ngoguey42

Benchmark the impact of fsync on other processes depending on filesystem

benchmark
status: blocked, needs GC benchmark

Some filesystems may block all operations from all processes during an fsync.


Review uses of finalise

code-correctness
status: TODO

We need to close file descriptors (FDs) on error.

#1957
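
A minimal sketch of the "close on error" pattern, assuming plain Stdlib channels as a stand-in for irmin-pack's IO layer. `read_prefix` and `read_prefix_result` are hypothetical names used only for illustration; the point is that `Fun.protect` runs the cleanup whether or not the body raises.

```ocaml
(* Hypothetical helper: read the first [n] bytes of [path].
   The channel (and its underlying FD) is closed on every exit path. *)
let read_prefix path n =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> really_input_string ic n)

(* Same operation, with the failures turned into values. *)
let read_prefix_result path n =
  match read_prefix path n with
  | s -> Ok s
  | exception End_of_file -> Error `Truncated
  | exception Sys_error msg -> Error (`Io msg)
```

`really_input_string` raises `End_of_file` when the file is shorter than `n`; the `finally` callback still runs before the exception reaches the caller.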

Improve crash consistency

code-correctness
status: large unscheduled work

#2082

Catch decoding errors to reraise/return clean exceptions/errors

code-correctness
status: TODO

Some irmin-pack functions are expected not to raise, but when they use repr's decode_bin we don't catch its exceptions.

Let's check the whole file stack and the GC code for such errors.

See: https://github.com/mirage/irmin/blob/main/src/irmin-pack/unix/traverse_pack_file.ml#L209
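
One possible shape for this, sketched with a stand-in decoder rather than repr's actual API: `decode_exn` below is a hypothetical length-prefixed string decoder that raises `Invalid_argument` on truncated input, and `decode` wraps it so callers get a clean error value. The `` `Decoding_error `` tag is also an assumption.

```ocaml
(* Stand-in for a raising binary decoder (NOT repr's API):
   the first byte is the payload length, the payload follows. *)
let decode_exn (buf : string) : string =
  let len = Char.code buf.[0] in
  String.sub buf 1 len

(* Wrapper that never raises on malformed input. *)
let decode buf =
  match decode_exn buf with
  | v -> Ok v
  | exception Invalid_argument msg -> Error (`Decoding_error msg)
```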


Log to disk all the important activities on a pack store

forensic
status: unscheduled and low priority

We could add a parameter to irmin-pack's repo that could default to journaling=false. We set journaling=true in Tezos.

Maybe using logs.

#1856


Change GC algorithm to perform graph traversal from high to low offsets

reduce-gc-impact-on-main-process, improve-gc-worker-runtime
status: Ongoing, @art-w

#2085
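
To make the idea concrete, here is a rough sketch of a high-to-low traversal. It is not the code from #2085. It assumes that every object only references objects at strictly lower offsets, and it models the pack file as a hypothetical `children` function from an offset to the offsets it references. The pending set is used as a max-priority queue.

```ocaml
module Int_set = Set.Make (Int)

(* [children off] lists the offsets referenced by the object at [off].
   Assumption: every child offset is strictly lower than [off]. *)
let traverse ~(children : int -> int list) ~(roots : int list) : int list =
  let rec loop pending visited =
    match Int_set.max_elt_opt pending with
    | None -> List.rev visited
    | Some off ->
        (* Pop the highest pending offset, then queue its children. *)
        let pending = Int_set.remove off pending in
        let pending =
          List.fold_left (fun acc c -> Int_set.add c acc) pending (children off)
        in
        loop pending (off :: visited)
  in
  loop (Int_set.of_list roots) []
```

Under that assumption, offsets are popped in strictly decreasing order, so each object is read once and no separate "visited" table is needed.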

Change GC algorithm to visit disk pages at most once and disable page-cache for it

reduce-gc-impact-on-main-process, improve-gc-worker-runtime
status: large unscheduled work

See initial work: https://github.com/Ngoguey42/segment_hangzhou/blob/34e300b94e1dbf01ab3f04e7667bbef604ae21e4//traverse.ml

Avoid suffix copies in GC

improve-disk-usage
status: large work, scheduled for Q4

How to not block finalize with unlink

reduce-gc-impact-on-main-process
status: unscheduled

#2091

Remove dead files on store opening

improve-disk-usage
status: Ongoing, @art-w

Filter the LRU instead of completely clearing it.

reduce-gc-impact-on-main-process
status: unscheduled, needs a new benchmark to investigate further

First attempt wasn't conclusive: #1993
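
For reference, filtering could look roughly like the sketch below. It uses a plain `Hashtbl` keyed by offset as a stand-in for the real LRU, and it assumes the filter criterion is "drop entries whose offset falls below the first offset kept by the GC". Both `filter_cache` and `first_live_offset` are hypothetical names, and this says nothing about whether filtering is faster than clearing.

```ocaml
(* Stand-in cache keyed by pack-file offset (not irmin's LRU).
   Entries below [first_live_offset] are dropped, the rest are kept. *)
let filter_cache (cache : (int, 'v) Hashtbl.t) ~(first_live_offset : int) =
  Hashtbl.filter_map_inplace
    (fun off v -> if off >= first_live_offset then Some v else None)
    cache
```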


Add back a non-forking GC

new-gc-use-case
status: unscheduled

#2000

Add lower layer

new-gc-use-case
status: large work, scheduled for Q4/Q1

For archive nodes.


Remove Lwt from the low-level (start/finalise) GC code

improve-code-quality
status: experimental implementation

Keep Lwt in the high-level API.

#2064

Remove exceptions from the low-level (start/finalise) GC code

improve-code-quality
status: partial implementation

Keep exceptions in the high-level API.

Done for GC worker: #2065
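
A small sketch of the intended layering, with made-up names (`error`, `Gc_error`, `finalise_low_level` and `finalise_exn` are illustrative, not irmin-pack's API): the low-level function reports failures as a `result`, and only the high-level wrapper turns them into an exception.

```ocaml
(* Hypothetical error type shared by both layers. *)
type error = [ `Gc_not_running | `Io of string ]

exception Gc_error of error

(* Low level: failures are values, nothing is raised. *)
let finalise_low_level ~(running : bool) : (unit, error) result =
  if running then Ok () else Error `Gc_not_running

(* High level: the result is converted to an exception at the API boundary. *)
let finalise_exn ~running =
  match finalise_low_level ~running with
  | Ok () -> ()
  | Error e -> raise (Gc_error e)
```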

Rename gc.ml to gc_worker.ml and move GC code out of ext.ml to a new gc.ml

improve-code-quality
status: done

#2063

Remove disk-specific functions from irmin-pack/s.ml

improve-code-quality
status: done

#2081
#2084

Move dead header size handling logic from append_only to dispatcher

improve-code-quality
status: to brainstorm

Evaluate where code documentation is missing

improve-code-quality
status: Partially done

E.g. #1960

First batch: #2051

Remove mapping_consumers.

improve-code-quality
status: done

#2062

Improve error handling in new code

improve-code-quality

  • Refine error types by using assert false to prune unreachable branches.
  • Add stack tracking in error monads.
  • Refine the errors in register_dict function.
  • Don't use the result monad in the gc worker routine. Directly raise and catch at the end. See "Raise errors in gc worker" (#2065).
  • Use less Errs.catch.
    • In ext, could be changed.
    • In gc_worker, could be made _exn.
    • In mapping, could be made _exn (because only used in worker).

Make offsets abstract or private

improve-code-quality
status: TODO

#1954
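
As an illustration of the "private" option, a sketch using `int` as a stand-in for the real offset representation. The `Offset` module and its functions are hypothetical. With `private`, other modules can still read an offset as an integer through a coercion, but can only build one through the module's own constructors.

```ocaml
module Offset : sig
  type t = private int

  val of_int : int -> t option
  (** [None] for negative values. *)

  val add : t -> int -> t
  (** [add off len] advances [off] by a non-negative length. *)
end = struct
  type t = int

  let of_int i = if i >= 0 then Some i else None
  let add off len = if len < 0 then invalid_arg "Offset.add" else off + len
end

(* Reading is a zero-cost coercion; writing [(3 : Offset.t)] is a type error. *)
let to_int (off : Offset.t) : int = (off :> int)
```

Making the type fully abstract instead would also hide the coercion, at the cost of needing an explicit accessor everywhere an integer is required.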

Make auto-flushes type safe

improve-code-quality
status: done

#2051 (comment)
#2088

@zshipko
Contributor

zshipko commented Aug 5, 2022

I believe that "Filter the LRU instead of completely clearing it" is what was tried in #1993, but it wasn't necessarily an improvement over just clearing the LRU.
