Increased number of timeouts after DBConnection change in Xandra.Connection #356

harunzengin opened this issue Feb 9, 2024 · 9 comments

@harunzengin
Contributor

After trying out 0.18.1, we noticed an increased number of timeouts, both in our CI tests and locally, as well as in our staging environment.

It is unclear to me why this is the case. I suspected it was somehow a race condition with the stream IDs, since we create them with MapSet.new(1..5000) and fetch them with Enum.at(stream_ids, 1), meaning we always get the stream IDs in the same deterministic order. I tried fetching random IDs instead, but the timeouts are still there, so that is ruled out.
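
For illustration, a minimal sketch of the two selection approaches described above (simplified, not the actual Xandra internals):

    stream_ids = MapSet.new(1..5000)

    # Deterministic: returns the same element for the same set contents every time.
    deterministic_id = Enum.at(stream_ids, 0)

    # Random: picks any of the remaining IDs.
    random_id = Enum.random(stream_ids)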

@whatyouhide
Owner

I tried to fetch random IDs

This shouldn't matter, as the IDs are just IDs; you could even always use the same ones and be fine.

Do you ever get over 5000 concurrent requests, as far as you know? Because I'm looking at the code (which, to be clear, I wrote...) and I don't see any handling for when we reach the maximum number of concurrent requests. Instead, I see my silly self is just doing

    {stream_id, data} =
      get_and_update_in(data.free_stream_ids, fn ids ->
        id = Enum.at(ids, 0)
        {id, MapSet.delete(ids, id)}
      end)

which returns a nil stream ID once the set is empty and, even worse, would result in storing in_flight_requests[nil] 🙈
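
A minimal sketch of what guarding against exhaustion could look like (the error atom and the surrounding plumbing are hypothetical, not the actual fix):

    case Enum.take(data.free_stream_ids, 1) do
      [] ->
        # No free stream IDs left: surface an error instead of checking out nil.
        {:error, :no_free_stream_ids}

      [stream_id] ->
        data = update_in(data.free_stream_ids, &MapSet.delete(&1, stream_id))
        {:ok, stream_id, data}
    end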

@harunzengin
Contributor Author

Do you ever get over the 5000 concurrent requests, as far as you know?

Yeah, this stacktrace in #354 (comment) is exactly what happens when we run out of concurrent requests.

@peixian

peixian commented Feb 23, 2024

@harunzengin we saw this as well when we canary deployed 0.18.0 a few months ago. We're still looking into the root cause though.

@harunzengin
Contributor Author

harunzengin commented Feb 23, 2024

@harunzengin we saw this as well when we canary deployed 0.18.0 a few months ago. We're still looking into the root cause though.

@peixian Cool, please post here if you find out more. I already asked a question on Stack Overflow: https://stackoverflow.com/questions/78035081/concurrent-cassandra-async-writes-leading-for-some-packages-to-get-lost

@peixian

peixian commented Feb 23, 2024

@harunzengin Ah, we're on Scylla Enterprise, so it's a little different. Do you have a repro for your issue, though, like a minimal set of queries or open streams? I can try it on my end to see whether Scylla has the same problem.

I think the thing I'm seeing is slightly different but possibly related (#357)

@whatyouhide
Owner

@harunzengin are you still seeing this? Did you have any feedback from the Scylla community here?

@harunzengin
Contributor Author

@peixian In our case, we insert several hundred times a second into our Cassandra cluster. I guess a minimal reproducible example would be something like this:

{:ok, conn} = Xandra.start_link(nodes: ["localhost"])
query = "insert into keyspace.table(id) values (:id)"
{:ok, prepared} = Xandra.prepare(conn, query)

# Fire 200 concurrent inserts per second, for 100 seconds.
Enum.each(1..100, fn _ ->
  Enum.each(1..200, fn _ ->
    Task.start(fn ->
      case Xandra.execute(conn, prepared, %{"id" => 12}, timeout: 5000) do
        {:error, %Xandra.ConnectionError{action: "execute", reason: :timeout}} -> IO.puts("timeout")
        _ -> :ok
      end
    end)
  end)

  Process.sleep(1000)
end)

As said, compared to Xandra v0.12, this version causes way more timeouts. I also created a Grafana dashboard and deployed it to our staging environment; this is how it looks:

[Screenshot: Grafana dashboard, 2024-05-03 20:32]

However, I have implemented a RetryStrategy, so this is not too bad. Unfortunately, I couldn't find anything about the async protocol causing an increased number of timeouts on Cassandra 4.0.10.
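
For reference, a minimal sketch of retrying on timeout (a hypothetical helper, not the actual RetryStrategy I implemented):

    defmodule MyApp.Retry do
      # Retries Xandra.execute/4 a fixed number of times when it fails with a
      # connection timeout; any other result is returned as-is.
      def execute_with_retry(conn, prepared, params, opts, attempts \\ 3) do
        case Xandra.execute(conn, prepared, params, opts) do
          {:error, %Xandra.ConnectionError{reason: :timeout}} when attempts > 1 ->
            execute_with_retry(conn, prepared, params, opts, attempts - 1)

          other ->
            other
        end
      end
    end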

@peixian

peixian commented May 3, 2024

@harunzengin did you see #362? Changing this value fixed the problem for us.

@harunzengin
Contributor Author

@peixian Yeah, the version I deployed to our staging environment already includes commit 6aedc4b, so it didn't really fix the timeout problem for us.
