Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-29975: Allow multiple machine networks #6071

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

zaneb
Copy link
Member

@zaneb zaneb commented Mar 11, 2024

Until now, assisted-service has assumed that there would be either exactly one MachineNetwork specified for single-stack clusters, or for dual-stack clusters that there would be exactly one IPv4 and one IPv6 MachineNetwork specified (in that order).

None of these restrictions exist in OpenShift itself, which allows multiple MachineNetworks of each address family. This is necessary to support remote worker nodes on day 1, as well as distributed installations of a kind that are common in UPI deployments (where an external load balancer can easily balance across hosts in separate networks). (OCPBUGS-29975)

This change removes the assumptions about a single MachineNetwork per address family for clusters with UserManagedNetworking enabled, and reverts the change in #4867 that prevents users from specifying more networks at the API level.

(Clusters without UserManagedNetworking will still use Layer 2 reachability checks in the belongs-to-majority-group host validation, which effectively prevents using remote worker nodes when creating these clusters. See OCPBUGS-30730.)

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • Agent-based installer
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

It's confusing that GetMachineNetworksFromBootstrapHost() returns the
existing MachineNetworks in the cluster (and does *not* get them from
the bootstrap host) if they already exist there. Refactor to make the
logic clearer.
The network type is set for all platforms in getBasicInstallConfig().
There is no need to set it again in the none platform provider.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 11, 2024
@openshift-ci-robot
Copy link

@zaneb: This pull request references Jira Issue OCPBUGS-29975, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Until now, assisted-service has assumed that there would be either exactly one MachineNetwork specified for single-stack clusters, or for dual-stack clusters that there would be exactly one IPv4 and one IPv6 MachineNetwork specified (in that order).

None of these restrictions exist in OpenShift itself, which allows multiple MachineNetworks of each address family. This is necessary to support remote worker nodes on day 1, as well as distributed installations of a kind that are common in UPI deployments (where an external load balancer can easily balance across hosts in separate networks). (OCPBUGS-29975)

This change removes the assumptions about a single MachineNetwork per address family for clusters with UserManagedNetworking enabled, and reverts the change in #4867 that prevents users from specifying more networks at the API level.

(Clusters without UserManagedNetworking will still use Layer 2 reachability checks in the belongs-to-majority-group host validation, which effectively prevents using remote worker nodes when creating these clusters. See OCPBUGS-30730.)

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • Agent-based installer
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Mar 11, 2024
@zaneb zaneb changed the title OCPBUGS-29975: Support multiple machine networks OCPBUGS-29975: Allow multiple machine networks Mar 11, 2024
@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 11, 2024
Copy link

openshift-ci bot commented Mar 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zaneb
Once this PR has been reviewed and has the lgtm label, please assign rccrdpccl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zaneb
Copy link
Member Author

zaneb commented Mar 11, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 11, 2024
@openshift-ci-robot
Copy link

@zaneb: This pull request references Jira Issue OCPBUGS-29975, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mhanss

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from mhanss March 11, 2024 01:11
@zaneb
Copy link
Member Author

zaneb commented Mar 11, 2024

Infra issues with the CentOS repos
/retest

Copy link

codecov bot commented Mar 11, 2024

Codecov Report

Attention: Patch coverage is 27.90698% with 62 lines in your changes are missing coverage. Please review.

Project coverage is 68.36%. Comparing base (ad3e304) to head (ff9be84).
Report is 17 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6071      +/-   ##
==========================================
- Coverage   68.40%   68.36%   -0.04%     
==========================================
  Files         239      239              
  Lines       35453    35474      +21     
==========================================
+ Hits        24251    24252       +1     
- Misses       9098     9119      +21     
+ Partials     2104     2103       -1     
Files Coverage Δ
internal/cluster/validator.go 96.15% <100.00%> (+0.44%) ⬆️
internal/host/validator.go 81.73% <100.00%> (ø)
internal/cluster/validations/validations.go 26.07% <0.00%> (+0.20%) ⬆️
internal/provider/external/external.go 0.00% <0.00%> (ø)
internal/provider/external/oci.go 52.38% <0.00%> (ø)
internal/provider/nutanix/installConfig.go 0.00% <0.00%> (ø)
internal/provider/vsphere/installConfig.go 0.00% <0.00%> (ø)
internal/bminventory/inventory.go 71.29% <75.00%> (-0.03%) ⬇️
internal/network/dual_stack_validations.go 0.00% <0.00%> (ø)
internal/network/machine_network_cidr.go 56.88% <45.83%> (+1.21%) ⬆️
... and 1 more

... and 7 files with indirect coverage changes

@zaneb zaneb force-pushed the multiple-machine-networks branch from 68f0945 to b996310 Compare March 11, 2024 09:49
Copy link
Contributor

@ori-amizur ori-amizur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear what are the restrictions here:
Do we allow multiple API VIPs per address family?
Do we have restriction per host-role to belong to a specific machine-network? If yes, this needs to be added as validation.
What if we have only 3 host cluster (only masters). Can such cluster have multiple IPv4 machine-networks?
Can a machine-network be stale (without hosts)?

@@ -21,7 +22,7 @@ const MinSNOMachineMaskDelta = 1
func parseCIDR(cidr string) (ip net.IP, ipnet *net.IPNet, err error) {
ip, ipnet, err = net.ParseCIDR(cidr)
if err != nil {
err = errors.Wrapf(err, "Failed to parse CIDR '%s'", cidr)
err = fmt.Errorf("Failed to parse CIDR '%s': %w", cidr, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this was changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's a pain to import both the standard library "errors" package and the deprecated old-timey hack "github.com/pkg/errors" package in the same file. This is a trivial refactor, the result is the same.

if err != nil {
return err
}

if overlap {
return errors.Errorf("CIDRS %s and %s overlap", aCidrStr, bCidrStr)
return fmt.Errorf("CIDRS %s and %s overlap", aCidrStr, bCidrStr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not change the errors unless necessary

log.WithError(err).Warnf("Verify VIPs")
return common.NewApiError(http.StatusBadRequest, err)
}
}

} else {
primaryMachineNetworkCidr, err = network.CalculateMachineNetworkCIDR(network.GetApiVipById(&targetConfiguration, 0), network.GetIngressVipById(&targetConfiguration, 0), cluster.Hosts, matchRequired)
primaryMachineNetworkCidr, err := network.CalculateMachineNetworkCIDR(network.GetApiVipById(&targetConfiguration, 0), network.GetIngressVipById(&targetConfiguration, 0), cluster.Hosts, matchRequired)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be wrong according the concept presented by this PR, because we may have to machine-network - one per vip and one of both vips.

if err := checkCidrsOverlapping(c.cluster); err != nil {
return ValidationFailure, fmt.Sprintf("CIDRS Overlapping: %s.", err.Error())
if err := network.VerifyNoNetworkCidrOverlaps(c.cluster.ClusterNetworks, c.cluster.MachineNetworks, c.cluster.ServiceNetworks); err != nil {
return ValidationFailure, err.Error()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation error messages need to be agreed upon. They are presented by UI. Here it is not clear what is expected as text of this error message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I did change them, you can see the change in the tests here: 51e76b1#diff-854f83029709261bb4e532a5a6839b51ddeb2beb2f6e921056016a74340d06a3

@@ -600,7 +600,7 @@ func (v *validator) belongsToMachineCidr(c *validationContext) (ValidationStatus
if c.inventory == nil || !network.IsMachineCidrAvailable(c.cluster) {
return ValidationPending, "Missing inventory or machine network CIDR"
}
if !network.IsHostInPrimaryMachineNetCidr(v.log, c.cluster, c.host) {
if !network.IsHostInMachineNetCidrs(v.log, c.cluster, c.host) {
return ValidationFailure, "Host does not belong to machine network CIDRs. Verify that the host belongs to every CIDR listed under machine networks"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message is not true now. Right ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yep, last part is not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording is awkward here, because we still have to belong to both IPv4 and IPv6 CIDRs if both appear in the list. What do you think of "Verify that the host belongs to a CIDR for every IP version listed under machine networks"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be OK. @avishayt what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Verify that the host belongs to a machine network CIDR for every used IP version"?

return models.VipVerificationUnverified, errors.Errorf("%s <%s> cannot be set if Machine Network CIDR is empty", vipName, vip)
}
if !ipInCidr(vip, machineNetworkCidr) {
return models.VipVerificationFailed, errors.Errorf("%s <%s> does not belong to machine-network-cidr <%s>", vipName, vip, machineNetworkCidr)
if machineNetworkCidr == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is irrelevant. The machine network cannot be valid and empty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it can - validMachineNetwork will be true if any machine networks are defined, but machineNetworkCidr will only be set if the VIP is in one of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you don't need the test for valid-network

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maintaining the previous behaviour. If there are no machineNetworks specified, we return VipVerificationUnverified. If there are machineNetworks specified and the VIP isn't in any of them, we return VipVerificationFailed.

I think what's confusing in the diff is that machineNetworkCidr was previously a parameter, and now it's a local variable.

@zaneb zaneb force-pushed the multiple-machine-networks branch from b996310 to c6e01dc Compare March 12, 2024 03:26
None of the subnets specified in any of the machineNetworks,
clusterNetworks, or serviceNetworks should overlap. Validate all of
these combinations, as openshift-installer does, instead of making
assumptions about indices being aligned to address families.
Don't make assumptions about a 1:1 mapping between MachineNetworks and
VIPs.
In the belongs-to-machine-cidr validation, allow the host to be a member
of any MachineNetwork. In a dual-stack cluster, require it to be a
member of both an IPv4 and an IPv6 network.

Previously it was assumed that the only reason for multiple
MachineNetworks to appear was that a dual stack cluster could contain
exactly one IPv4 and one IPv6 MachineNetwork.
Multiple MachineNetworks in the same address family and IPv6-primary
dual-stack clusters are a thing, so relax the dual-stack validation
requirements for machine networks to allow them.
Allow users to specify multiple machine networks of the same address
family. This is a documented and supported feature of OpenShift.

This reverts commit 873dd81.
Don't restrict ourselves to the first machine networking when looking
for an interface on a machine network to set the BootMACAddress.
OpenShift has ~always supported having machines in multiple
machineNetworks, so update the TODO comment to reflect that accounting
for this is already something we need to do to fully support
non-UserManagedNetworking clusters. (UserManagedNetworking clusters use
only the L3 connectivity check.)

See https://issues.redhat.com/browse/OCPBUGS-30730 for more details.
@zaneb zaneb force-pushed the multiple-machine-networks branch from c6e01dc to ff9be84 Compare March 12, 2024 10:12
@zaneb
Copy link
Member Author

zaneb commented Mar 12, 2024

Do we allow multiple API VIPs per address family?

No.

Do we have restriction per host-role to belong to a specific machine-network? If yes, this needs to be added as validation.

No.

What if we have only 3 host cluster (only masters). Can such cluster have multiple IPv4 machine-networks?

Yes (and in fact this is likely to be a common case for platform: none).

Can a machine-network be stale (without hosts)?

I don't see why not.

Copy link

openshift-ci bot commented Mar 12, 2024

@zaneb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-e2e-oci-assisted-4-14 ff9be84 link false /test edge-e2e-oci-assisted-4-14
ci/prow/edge-e2e-nutanix-assisted ff9be84 link false /test edge-e2e-nutanix-assisted

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ori-amizur
Copy link
Contributor

ori-amizur commented Mar 13, 2024

Do we have restriction per host-role to belong to a specific machine-network? If yes, this needs to be added as validation.

No.

So workers do not have to belong to machine-networks of ingress-vips? Maybe we have to verify that we have at least 2 workers belonging to machine-network of ingress VIPs ? If not it will cause a mess. The logic is already complicated and if we don't set rules, users will be confused.

What if we have only 3 host cluster (only masters). Can such cluster have multiple IPv4 machine-networks?

Yes (and in fact this is likely to be a common case for platform: none).

None platform does not have machine-networks at all (At least how we implemented it).
Anyway if we have multiple machine-networks for masters, how can the API VIPs move between these hosts?

Can a machine-network be stale (without hosts)?

I don't see why not.

Why is it needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants