Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: inject IOChaos with multiple containers #4332

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

g1eny0ung
Copy link
Member

@g1eny0ung g1eny0ung commented Jan 28, 2024

What problem does this PR solve?

For a long time, we didn't support injecting IOChaos with multiple containers even if we have implemented new unified selectors with the containerNames field.

// TODO: support multiple different container to inject in one pod

What's changed and how it works?

To support injecting IOChaos with multiple containers, I followed below steps:

  1. Currently, the IOChaos server in Chaos Daemon will find toda process based on podiochaos status. It's a one-to-one relationship, so I'm not thinking about making podiochaos support managing multiple toda processes, which would require adding some status fields at least we need to get the containers that have been injected.
  2. An alternative to this is to generate multiple podiochaos based on containers. Because the default container selector will select the first container to inject, we can always get at least one container record from the record controller (if I am right). Then the thing we need to do is generate podiochaos based on container names.
  3. For example, if we have a pod to test named curlpod, and it has two containers named curl1 and curl2, the generated podiochaos will become:
    • curlpod-curl1
    • curlpod-curl2
  4. This is not difficult to do, we used to find podiochaos by the name of the pod, now we just need to add an extra key in the podiochaos manager to discover them. I renamed the Key to PodKey, and added a ChaosKey to find related podiochaos.

Related changes

  • This change also requires further updates to the website (e.g. docs)
  • This change also requires further updates to the UI interface

Cherry-pick to release branches (optional)

This PR should be cherry-picked to the following release branches:

  • release-2.6
  • release-2.5

Checklist

CHANGELOG

Must include at least one of them.

  • I have updated the CHANGELOG.md
  • I have labeled this PR with "no-need-update-changelog"

Tests

Must include at least one of them.

  • Unit test
  • E2E test
  • Manual test

Side effects

  • Breaking backward compatibility

DCO

If you find the DCO check fails, please run commands like below (Depends on the actual situations. For example, if the failed commit isn't the most recent) to fix it:

git commit --amend --signoff
git push --force

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
@g1eny0ung g1eny0ung added type/bug-fix A bug needs to be fixed. chaos/io labels Jan 28, 2024
@g1eny0ung g1eny0ung requested review from STRRL and cwen0 January 28, 2024 22:39
@chaotic-prow
Copy link

chaotic-prow bot commented Jan 28, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from g1eny0ung. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Copy link

codecov bot commented Jan 28, 2024

Codecov Report

Attention: Patch coverage is 0% with 44 lines in your changes are missing coverage. Please review.

Project coverage is 38.70%. Comparing base (f4a070b) to head (d0de8b2).
Report is 9 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4332      +/-   ##
==========================================
+ Coverage   38.68%   38.70%   +0.01%     
==========================================
  Files         169      169              
  Lines       13950    13965      +15     
==========================================
+ Hits         5397     5405       +8     
- Misses       8105     8113       +8     
+ Partials      448      447       -1     
Files Coverage Δ
controllers/common/fx.go 48.17% <0.00%> (-5.53%) ⬇️

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb84633...d0de8b2. Read the comment docs.

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
@g1eny0ung
Copy link
Member Author

/hold

The e2e tests failed. I need to investigate the reason.

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
@STRRL
Copy link
Member

STRRL commented Feb 6, 2024

I have several points about the current design:

  • What's the expected behavior if the user does not declare containers in PodChaos, would Chaos Mesh inject chaos into all container? or the first container?
  • You mentioned that there are 2 way to make the multiple container podchaos, and you used the second one, which use multiple PodIOChaos for each container. IMO, I prefer to use the first one, which means we need to introduce new fields like containers in PodIOChaos. That would prevent the complexity, ann help understanding what PODIOChaos really does.

@g1eny0ung
Copy link
Member Author

I have several points about the current design:

  • What's the expected behavior if the user does not declare containers in PodChaos, would Chaos Mesh inject chaos into all container? or the first container?

The first container. This behavior is the same as the other type of chaos. The default container selector will select the first container if containerNames is not specified.

  • You mentioned that there are 2 way to make the multiple container podchaos, and you used the second one, which use multiple PodIOChaos for each container. IMO, I prefer to use the first one, which means we need to introduce new fields like containers in PodIOChaos. That would prevent the complexity, ann help understanding what PODIOChaos really does.

@STRRL Yes. I actually thought for hours to choose which way to implement but I finally chose the second way. My considerations are based on the following points:

  1. Introducing a new field containers in the spec of PodIOChaos is a painful thing. We need to be compatible with the old logic and again go for this new field, like if users do not specify the containerNames in IOChaos, we will still use the old way to inject. Or we can use containers to replace the old container, this will introduce a breaking change. My goal is to have IOChaos available for injection in multiple containers first, and don't want to think further about destructive changes anytime soon.
  2. Implementing the second way will take less time than the first, and we can document why your IOChaos and PodIOChaos are one-to-many, so I ended up going with the second way.

@STRRL
Copy link
Member

STRRL commented Feb 26, 2024

e2e tests keep failing because this assertion does not get acquired:

By("assert that io delay is effective again")
err = wait.Poll(5*time.Second, 1*time.Minute, func() (done bool, err error) {
chaos := &v1alpha1.IOChaos{}
err = cli.Get(ctx, chaosKey, chaos)
framework.ExpectNoError(err, "get io chaos error")
for _, c := range chaos.GetStatus().Conditions {
if c.Type == v1alpha1.ConditionAllInjected {
if c.Status != corev1.ConditionTrue {
return false, nil
}
} else if c.Type == v1alpha1.ConditionSelected {
if c.Status != corev1.ConditionTrue {
return false, nil
}
}
}
return true, err
})

other test cases does not check this, so they could get passed, which is not expected.

I am still profiling the reason why the status of IOChaos is not updated toInjected.

With the debugger, I noticed these lines of code does not get called by reconcile, I guess that it does not get the latest status of PodIOChaos, or one iteration of reconciliation does not get triggered properly.

if podiochaos.Status.ObservedGeneration >= iochaos.Status.Instances[record.Id] {
return v1alpha1.Injected, nil
}

@STRRL
Copy link
Member

STRRL commented Feb 26, 2024

yeah, I figure it out....

the current design of controllers, records strategy and PodChaos could not make "using multiple Pod Chaos match one Pod". in other words, the PodChaos should keep 1:1 mapping with the pod.

image

@g1eny0ung
Copy link
Member Author

g1eny0ung commented Feb 27, 2024

yeah, I figure it out....

the current design of controllers, records strategy and PodChaos could not make "using multiple Pod Chaos match one Pod". in other words, the PodChaos should keep 1:1 mapping with the pod.

image

@STRRL 👍🏼 Good catch! To correct this behavior, should I add a condition to check if namespacedName-containerName equals objName? Like below:

namespacedName, containerName, err := controller.ParseNamespacedNameContainer(record.Id)
// ...
namespacedNameContainer := k8sTypes.NamespacedName{
	Namespace: namespacedName.Namespace,
	Name:      namespacedName.Name + "-" + containerName,
}

if namespacedName == objName || namespacedNameContainer == objName {
//...
}

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>
@g1eny0ung
Copy link
Member Author

After discussing with @STRRL, we decided to abandon this implementation (an IOChaos corresponds to multiple PodIOChaos). Because it's against the current records strategy and we haven't figured out how to adapt it yet.

Currently, a record only matches a Pod, for example, if you inject a NetworkChaos, it will create a PodNetworkChaos to apply rules into a Pod. Or a PodChaos with a pod-kill action, will kill a Pod (Suppose it is named curl and in the default namespace) and insert a record with id: default/curl. But in this PR, there will be multiple records because multiple PodIOChaos are generated. This is the current shortcoming of Chaos Mesh, and the current records strategy does not accommodate the different kinds of Chaos. If we patch on top of that then there could be a lot of hidden bugs (just like this PR did 😑).

So the next step, we will do:

  1. Keep this PR open, as it's a temporary solution. A flawed implementation is better than none at all. 🧐
  2. Allow IOChaos to go to a specific container for injection. It is now only injected by default into the first container and cannot changed.
  3. Going a step further, PodIOChaos can manage multiple toda processes, enabling the injection of multiple containers.

For anyone who wants to inject IOChaos with multiple containers now, please compile the images based on this PR. The current code passes all the necessary tests, and it's working in my local test environment.

@eedjama
Copy link

eedjama commented Mar 22, 2024

Hi,
the solution should consider solving following cases:

  1. Multiple container mounting the same PV
  2. Multiple container mounting different PVs
  3. One single container mounts different PVs (This might be a special or a subcase?)

Thanks and br
Janine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants