[QUESTION]: <Snapshot Isolation Testing> #8952

Open
seedoilz opened this issue Aug 16, 2023 · 20 comments
Labels
kind/question Something requiring a response.

Comments

@seedoilz

Question.

Snapshot Isolation Bug

Environment

We use Docker Swarm to set up the Dgraph cluster.

The cluster has 3 nodes and is deployed on three servers, following the official documentation.

Docker version: 20.10.21

Dgraph version: 23.1.0 (latest)

docker-compose.yml

version: "3"
networks:
  dgraph:
services:
  zero_1:
    image: dgraph/dgraph:latest
    volumes:
      - data-volume:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == aws01
    command: dgraph zero --my=zero_1:5080 --replicas 3 --idx 1
  zero_2:
    image: dgraph/dgraph:latest
    volumes:
      - data-volume:/dgraph
    ports:
      - 5081:5081
      - 6081:6081
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == aws02
    command: dgraph zero -o 1 --my=zero_2:5081 --replicas 3 --peer zero_1:5080 --idx 2
  zero_3:
    image: dgraph/dgraph:latest
    volumes:
      - data-volume:/dgraph
    ports:
      - 5082:5082
      - 6082:6082
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == aws03
    command: dgraph zero -o 2 --my=zero_3:5082 --replicas 3 --peer zero_1:5080 --idx 3
  alpha_1:
    image: dgraph/dgraph:latest
    hostname: "alpha_1"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == aws01
    command: dgraph alpha --my=alpha_1:7080 --lru_mb=2048 --zero=zero_1:5080
  alpha_2:
    image: dgraph/dgraph:latest
    hostname: "alpha_2"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == aws02
    command: dgraph alpha --my=alpha_2:7081 --lru_mb=2048 --zero=zero_1:5080 -o 1
  alpha_3:
    image: dgraph/dgraph:latest
    hostname: "alpha_3"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == aws03
    command: dgraph alpha --my=alpha_3:7082 --lru_mb=2048 --zero=zero_1:5080 -o 2
  alpha_4:
    image: dgraph/dgraph:latest
    hostname: "alpha_4"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8083:8083
      - 9083:9083
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == aws04
    command: dgraph alpha --my=alpha_4:7083 --lru_mb=2048 --zero=zero_1:5080 -o 3
  alpha_5:
    image: dgraph/dgraph:latest
    hostname: "alpha_5"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8084:8084
      - 9084:9084
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == aws05
    command: dgraph alpha --my=alpha_5:7084 --lru_mb=2048 --zero=zero_1:5080 -o 4
  alpha_6:
    image: dgraph/dgraph:latest
    hostname: "alpha_6"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8085:8085
      - 9085:9085
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == aws06
    command: dgraph alpha --my=alpha_6:7085 --lru_mb=2048 --zero=zero_1:5080 -o 5
volumes:
  data-volume:

How to send requests

We send requests to Dgraph over HTTP.

Code

import ast
import json

import requests

url = "http://175.27.241.31:8080/mutate?commitNow=true"
headers = {'Content-Type': 'application/json'}


def generate_txn(ops):
    query = "query { "
    query_count = 0
    mutations = []
    for op in ops:
        if op["t"] == "r":
            query_count += 1
            q = "get%d(func: uid(%d)) { value } " % (query_count, op["k"])
            query += q
        else:
            mutations.append({
                "set": {
                    "uid": op["k"],
                    "value": op["v"]
                }
            })
    query += "}"
    if len(mutations) == 0:
        mutations.append({
            "set": {
                "fake_data": 233
            }
        })
    return {
        "query": query,
        "mutations": mutations
    }


def send_request(req_data):
    return requests.post(url, headers=headers, json=req_data)


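def fetch_ts(resp_content):
    # The response body is JSON, so parse it with json.loads;
    # ast.literal_eval fails on JSON literals such as true, false, and null.
    resp_data = json.loads(resp_content.decode("utf-8"))
    ts = resp_data["extensions"]["txn"]
    return ts["start_ts"], ts["commit_ts"]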


def main():
    operations = [{"t": "w", "k": 745, "v": 697}, {"t": "w", "k": 853, "v": 74},
                  {"t": "r", "k": 853}, {"t": "r", "k": 745}]
    request_data = generate_txn(operations)
    response = send_request(request_data)
    start_ts, commit_ts = fetch_ts(response.content)
    txn = {
        "tid": "my unique id",
        "sid": "my session id",
        "ops": operations,
        "sts": {
            "p": start_ts,
            "l": 0
        },
        "cts": {
            "p": commit_ts,
            "l": 0
        }
    }
    print(json.dumps(txn, indent=4))


if __name__ == '__main__':
    main()

How to generate the test cases

We use dbcop to generate the test cases that we run against Dgraph.

Bug

When we check snapshot isolation with our own algorithm, we find that most of the transactions violate it.

I believe our algorithm is correct.

So my question is: which of the steps I have taken is wrong? Thank you in advance.

seedoilz added the kind/question label on Aug 16, 2023

@mangalaman93
Contributor

Hi @seedoilz, could you elaborate on how you identified that snapshot isolation is violated?

@seedoilz
Author

It is a timestamp-based method. All transactions are sorted in ascending order by commit timestamp. The checker then iterates through each transaction and, based on its start timestamp, determines which committed transactions it should have read from, checking that this is consistent with the data it actually read. It also checks whether concurrent transactions have write conflicts.
We are working on a paper on this, but it's only in draft form at the moment, so if you don't mind and are interested, we'd be happy to share it.
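
As a rough illustration of the method described above, here is a minimal Python sketch. It is not the actual checker used in this issue. It assumes each transaction record has the shape printed by the script in the first comment (`ops`, `sts.p`, `cts.p`), that read operations carry the observed value in a `v` field (an assumption, since the request only contains the key), and that a transaction sees exactly the writes committed with a timestamp strictly below its start timestamp.

```python
def write_set(txn):
    """Keys written by a transaction."""
    return {op["k"] for op in txn["ops"] if op["t"] == "w"}


def check_snapshot_isolation(history):
    """Return a list of violation messages for a list of committed transactions."""
    violations = []
    # Sort all transactions in ascending order of commit timestamp.
    committed = sorted(history, key=lambda t: t["cts"]["p"])
    for i, txn in enumerate(committed):
        start = txn["sts"]["p"]

        # Build the snapshot: the latest value per key among transactions
        # that committed before this transaction started.
        snapshot = {}
        for other in committed:
            if other is not txn and other["cts"]["p"] < start:
                for op in other["ops"]:
                    if op["t"] == "w":
                        snapshot[op["k"]] = op["v"]

        # Replay the operations: a read must return the transaction's own
        # earlier write if there is one, otherwise the snapshot value.
        local = {}
        for op in txn["ops"]:
            if op["t"] == "w":
                local[op["k"]] = op["v"]
            else:
                expected = local.get(op["k"], snapshot.get(op["k"]))
                if op.get("v") != expected:
                    violations.append(
                        "%s read key %s = %r, expected %r"
                        % (txn["tid"], op["k"], op.get("v"), expected)
                    )

        # Write conflicts: an earlier-committing transaction that committed
        # after this one started is concurrent with it, so their write sets
        # must not overlap.
        for other in committed[:i]:
            if other["cts"]["p"] > start and write_set(other) & write_set(txn):
                violations.append(
                    "%s and %s are concurrent and write the same key"
                    % (other["tid"], txn["tid"])
                )
    return violations


if __name__ == "__main__":
    history = [
        {"tid": "t1", "ops": [{"t": "w", "k": 1, "v": 10}],
         "sts": {"p": 1}, "cts": {"p": 2}},
        {"tid": "t2", "ops": [{"t": "r", "k": 1, "v": 10}],
         "sts": {"p": 3}, "cts": {"p": 4}},
    ]
    print(check_snapshot_isolation(history))
```

The sketch is quadratic in the number of transactions, which is fine for illustrating the idea but not for a history of 120000 transactions.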

@mangalaman93
Contributor

If you could send us a way to reproduce the issue, we would be happy to look into it. We saw something like this before in #8146, but later found that the issue was in the application code.

@seedoilz
Author

Download the [json file](https://box.nju.edu.cn/f/64a49141b4e44368bf41/) and clone this [code](https://github.com/Tsunaou/dbcdc-runner). Then set up the environment, including Leiningen and Java (which Jepsen needs). After that, run the following command in the root directory of the repository, replacing ${dbcop-workload-path} with the path to that JSON file.

lein run test-all -w rw \
--txn-num 120000 \
--time-limit 43200 \
-r 10000 \
--node dummy-node \
--isolation snapshot-isolation \
--expected-consistency-model snapshot-isolation \
--nemesis none \
--existing-postgres \
--no-ssh \
--database dgraph \
--dbcop-workload-path ${dbcop-workload-path} \
--dbcop-workload

@mangalaman93
Contributor

--lru_mb is an old parameter. What version of Dgraph are you using? The compose file that you are using looks really old to me.

@seedoilz
Author

Sorry, this compose file is actually not the one I used; I accidentally deleted mine. But my compose file was based on the one I posted, so I don't think it matters much.

@seedoilz
Author

This is the compose file I was using. You can adjust it slightly (the node names) and use it.

version: "3.2"
networks:
  dgraph:
services:
  zero:
    image: dgraph/dgraph:latest
    volumes:
      - data-volume:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == VM-0-7-tencentos
    command: dgraph zero --my=zero:5080 --replicas 3
  alpha1:
    image: dgraph/dgraph:latest
    hostname: "alpha1"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == VM-0-7-tencentos
    command: dgraph alpha --my=alpha1:7080 --security whitelist=10.0.0.0/8,172.0.0.0/8,192.168.0.0/16,127.0.0.1 --zero=zero:5080
  alpha2:
    image: dgraph/dgraph:latest
    hostname: "alpha2"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == VM-0-12-tencentos
    command: dgraph alpha --my=alpha2:7081 --security whitelist=10.0.0.0/8,172.0.0.0/8,192.168.0.0/16,127.0.0.1 --zero=zero:5080 -o 1
  alpha3:
    image: dgraph/dgraph:latest
    hostname: "alpha3"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == VM-0-14-tencentos
    command: dgraph alpha --my=alpha3:7082 --security whitelist=10.0.0.0/8,172.0.0.0/8,192.168.0.0/16,127.0.0.1 --zero=zero:5080 -o 2
volumes:
  data-volume:

@seedoilz
Author

@mangalaman93 hello, could you help me?

@mangalaman93
Contributor

Sorry about the delay. The compose file still doesn't look right because the same volume is mounted in all the alphas.

@mangalaman93
Contributor

After removing the volume, I get this null pointer exception. Am I running it right?

INFO [2023-08-25 20:18:08,431] jepsen test runner - jepsen.db Tearing down DB
INFO [2023-08-25 20:18:08,433] jepsen test runner - jepsen.db Setting up DB
INFO [2023-08-25 20:18:08,434] jepsen test runner - jepsen.core Relative time begins now
WARN [2023-08-25 20:18:08,442] main - jepsen.core Test crashed!
java.lang.NullPointerException: null
	at disalg.dbcdc.client$open.invokeStatic(client.clj:56)
	at disalg.dbcdc.client$open.invoke(client.clj:51)
	at disalg.dbcdc.rw.Client.open_BANG_(rw.clj:107)
	at jepsen.core$run_case_BANG_$fn__9727.invoke(core.clj:220)
	at dom_top.core$real_pmap_helper$build_thread__213$fn__214.invoke(core.clj:146)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.lang.Thread.run(Thread.java:829)
WARN [2023-08-25 20:18:08,446] main - jepsen.cli Test crashed
java.lang.NullPointerException: null
	at disalg.dbcdc.client$open.invokeStatic(client.clj:56)
	at disalg.dbcdc.client$open.invoke(client.clj:51)
	at disalg.dbcdc.rw.Client.open_BANG_(rw.clj:107)
	at jepsen.core$run_case_BANG_$fn__9727.invoke(core.clj:220)
	at dom_top.core$real_pmap_helper$build_thread__213$fn__214.invoke(core.clj:146)
	at clojure.lang.AFn.applyToHelper(AFn.java:152)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1990)
	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1990)
	at clojure.lang.RestFn.invoke(RestFn.java:425)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.core$apply.invokeStatic(core.clj:671)
	at clojure.core$bound_fn_STAR_$fn__5818.doInvoke(core.clj:2020)
	at clojure.lang.RestFn.invoke(RestFn.java:397)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.lang.Thread.run(Thread.java:829)
Error parsing edn file 'null': Cannot open <nil> as a Reader.

@seedoilz
Author

No, but I think this null pointer exception is raised because the data volume is not mounted on the host, so Jepsen cannot find the data file.
In addition, on my servers the same volume is mounted in all the alphas, and that works well.

@mangalaman93
Contributor

If you mount the same volume inside all zeros and alphas, they will end up using the same p directory, which is a problem. And why is the test trying to read a file that Dgraph has written? I'm not sure how this compose file is working for you. Am I missing something?
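
To make the concern concrete, here is a small Python sketch that flags named volumes mounted by more than one service. It works on a plain dictionary that mirrors the `services` section of the compose file; the function name `find_shared_volumes` is hypothetical, and loading the YAML itself is left out to keep the example dependency-free.

```python
from collections import defaultdict


def find_shared_volumes(services):
    """Map each volume source to the services using it, if more than one does."""
    users = defaultdict(list)
    for name, spec in services.items():
        for mount in spec.get("volumes", []):
            # "data-volume:/dgraph" -> source is "data-volume"
            source = mount.split(":", 1)[0]
            users[source].append(name)
    return {src: names for src, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    # Mirrors the volume entries of the compose file posted above.
    services = {
        "zero": {"volumes": ["data-volume:/dgraph"]},
        "alpha1": {"volumes": ["data-volume:/dgraph"]},
        "alpha2": {"volumes": ["data-volume:/dgraph"]},
        "alpha3": {"volumes": ["data-volume:/dgraph"]},
    }
    print(find_shared_volumes(services))
```

Giving each service its own volume name would make the function return an empty dictionary.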

@seedoilz
Author

I know why you got the null pointer exception: I forgot to tell you that you need to put this file in dbcdc-runner/resources/.
Of course, if you set a password for Dgraph, you need to change this file a little.

@mangalaman93
Contributor

It is running now, but I do not see any new predicate in the cluster. Is the code writing data into Dgraph?

@seedoilz
Author

Since we cannot access Dgraph via 127.0.0.1, we use the public IP address to operate the database.
So you need to replace 175.27.241.31 in dbcdc-runner/src/disalg/dbcdc/impls/dgraph.clj with your own IP address, or with localhost (127.0.0.1) if you can access the database locally.
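
The Python script in the first comment hardcodes the same address. One way to avoid editing the source for every environment is to read the host from an environment variable, as in the sketch below. The variable name `DGRAPH_HOST` and the helper `mutate_url` are hypothetical and not part of Dgraph or dbcdc-runner.

```python
import os


def mutate_url(host=None, port=8080):
    """Build the /mutate endpoint URL, falling back to DGRAPH_HOST or localhost."""
    host = host or os.environ.get("DGRAPH_HOST", "127.0.0.1")
    return "http://%s:%d/mutate?commitNow=true" % (host, port)


if __name__ == "__main__":
    # Example: DGRAPH_HOST=175.27.241.31 python script.py
    print(mutate_url())
```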

@mangalaman93
Contributor

It is still somehow hitting the 175.27.241.31 IP even after I changed it everywhere and ran lein clean.

@seedoilz
Author

Maybe you forgot to change the IP address in the .edn file that I gave you recently.

> I know why you got the null pointer exception: I forgot to tell you that you need to put this file in dbcdc-runner/resources/.
> Of course, if you set a password for Dgraph, you need to change this file a little.

@mangalaman93
Contributor

That was it. I now see 1000 values for the value predicate. How long is the test configured to run? What did you observe when it failed?

@mangalaman93
Contributor

I am now able to complete a full run of the test, though it fails in the analysis step due to limited memory on my laptop. I am thinking of running it on a bigger machine, but before that, could you share the results of the run from which you concluded that it failed? And how did you reach that conclusion?

@seedoilz
Author

> It is a timestamp-based determination method. All transactions are sorted in ascending order by commit timestamps. Then iterates through each transaction and, based on its start timestamps, determines which committed transactions it should have read, checking for consistency with the data it actually read. It also checks to see if concurrent transactions have write conflicts. We are working on a paper on this, but it's only in draft form at the moment, so if you don't mind and are interested, we'd be happy to share it.

We reached that conclusion by running this jar file:
java -jar TimeKiller.jar --history_path THE_JSON_FILE_PATH --enable_session false

In addition, the passage I quoted above describes how we test snapshot isolation. The code is here: https://github.com/FertileFragrance/TimeKiller
