Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flightcontroller: ensure more security group rules #601

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

databus23
Copy link
Member

@databus23 databus23 commented Aug 6, 2021

This PR refactors the EnsureKubernikusRuleInSecurityGroup to support reconciling more then one rule.
It now also creates the rule lazily only if no other broader rules are already covering the wanted rules.

In addition to the podCIDR rule we had, it now also added egress rules for ntp and the cluster apiserver.

This PR create the following rules lazily if there isn't already a rule allowing the traffic:

egress:

  • all, to same security group (pod-to-pod)
  • udp port 53 (dns)
  • udp port 123 (ntp)
  • port 443 keppel.$region.cloud.sap, objectstore-3.$region.cloud.sap, keppel.global.global.cloud.sap, objectstore-3.eu-de-1.cloud.sap (for pulling kube-proxy,kubelet, wormhole, flannel images)

ingress:

  • all, from same security group (pod-to-pod)
  • port 30000 - 32767, from private network (load balancer to members)

TODO:

  • Modify our e2e tests to use a fresh security group without any rules to make sure the e2e tests passes with our enforced rules only.
  • Also we should probaply expand our network tests to include host <-> pod tests and also test the loadbalancer datapath. --> Switch to kube-detective for thoroughly testing all combinitions
  • Add metric for failing reconcilitions
  • remove rule for objectstore.eu-de-1.cloud.sap

func createLoadbalancerSecurityGroup(kluster *v1.Kluster, client *gophercloud.ServiceClient) (*securitygroups.SecGroup, error) {
sg, err := securitygroups.Create(client, securitygroups.CreateOpts{
Name: fmt.Sprintf(LoadBalancerSecurityGroupName, kluster.Spec.Name),
Description: fmt.Sprintf(`Kubernikus: SecurityGroup containing LoadBalancer Self-IPs for cluster "%s"`, kluster.Spec.Name),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Description: fmt.Sprintf(`Kubernikus: SecurityGroup containing LoadBalancer Self-IPs for cluster "%s"`, kluster.Spec.Name),
Description: fmt.Sprintf(`Kubernikus: SecurityGroup containing LoadBalancer VIP and Self-IPs for cluster "%s"`, kluster.Spec.Name),

},
}

if lbGroup != nil && lbGroup.ID != "" {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this defensive programming intentional? Because it seems like we ensure the lbGroup is not nil on top (e.g. we error out if it is). So I would be fine in simplifying this and moving it in the "static" wantedRules above

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be nil, eg. when there is not enough quota in the project for creating an additional security group. This ensures the other rules even if that's the case.

Copy link
Member Author

@databus23 databus23 Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, good point but it looks to me like at the moment if creating the security group fails there is an early return (with en error) and no other security groups are enforced anyways.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is it returning? I think I'm ignoring the error while getting the security group.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this only exiting EnsureLoadbalancerSecurityGroup()? To me it looks like EnsureKubernikusRulesInSecurityGroup() runs anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, sorry I misread the code. Nvm

}
allLbPages, err := loadbalancers.List(c.NetworkClient, loadbalancers.ListOpts{}).AllPages()
if err != nil {
return false, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return false, err
return false, fmt.Errorf("Failed to list load balancers: %w", err)

}
allLoadbalancers, err := loadbalancers.ExtractLoadBalancers(allLbPages)
if err != nil {
return false, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return false, err
return false, fmt.Errorf("Failed to extract load balancers: %w", err)

}
allPortPages, err := ports.List(c.NetworkClient, portListOpts{ProjectID: kluster.Labels["account"], SecurityGroups: lbGroup.ID}).AllPages()
if err != nil {
return false, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return false, err
return false, fmt.Errorf("Failed to list ports: %w", err)

}
allPorts, err := ports.ExtractPorts(allPortPages)
if err != nil {
return false, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return false, err
return false, fmt.Errorf("Failed to extract ports: %w", err)

if !found {
_, err := ports.Update(c.NetworkClient, lb.VipPortID, ports.UpdateOpts{SecurityGroups: &sg}).Extract()
if err != nil {
return false, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return false, err
return false, fmt.Errorf("Failed to add security group %s to port %s: %w", err, lbGroup.ID, lb.VipPortID, err)

sg := []string{lbGroup.ID}
for _, lb := range allLoadbalancers {
if strings.HasPrefix(lb.Name, fmt.Sprintf("kube_service_%s", kluster.Spec.Name)) {
found := false
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to replace the inner loop with

if !portsOfSecurityGroup.Has(lb.VipPortID) {
  //update port 
}

And to this before the whole loop construct:

portsOfSecurityGroup := sets.NewString()
for _, p := range allPorts {
  portsOfSecurityGroup.Insert(p.ID)
}

using https://pkg.go.dev/k8s.io/apimachinery/pkg/util/sets

continue
if lbGroup == nil {
lbGroup, err = createLoadbalancerSecurityGroup(kluster, c.NetworkClient)
if err != nil || lbGroup == nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure the || lbGroup == nil is required. Looking at createLoadbalancerSecurityGroup it returns an error when the create call fails. I don't see how err and lbGroup can be nil at the same time.

return q.String(), err
}

func getLoadbalancerSecurityGroup(kluster *v1.Kluster, client *gophercloud.ServiceClient) (*securitygroups.SecGroup, error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are getting two different security groups in this controller maybe its worthwhile to have a single getSecurityGroupByName(name string, client *gophercloud.ServiceClient) function instead if this special purpose function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense.

Copy link
Member Author

@databus23 databus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now. Would be awesome now also change the e2e test to actually test that all this works:

  • use an empty security group for e2e cluster, cleanup during teardown
  • add loadbalancer happy path url check to test loadbalancer <-> nodes connectivity is working.

@databus23 databus23 requested a review from auhlig as a code owner August 23, 2021 14:01
Copy link
Member Author

@databus23 databus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. One small nitpick but looks good otherwise

@@ -37,6 +39,12 @@ func (s *SetupTests) Run(t *testing.T) {
require.True(t, running, "The Kluster must be Running")
}

func (s *SetupTests) CreateSecurityGroup(t *testing.T) {
sg, err := securitygroups.Create(s.OpenStack.Network, securitygroups.CreateOpts{Name: "kubernikus"}).Extract()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would create a dedicated security group that is named after the current soak test cluster.

@stale
Copy link

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 9, 2022
@stale stale bot removed the wontfix label Apr 21, 2022
@databus23 databus23 force-pushed the securitygroup branch 3 times, most recently from 4b8c704 to b7c6e22 Compare May 12, 2022 11:30
@sandzwerg
Copy link

I haven't checked the code but only your comment but IIRC for UDP you want a port 123 Ingress rule for NTP and port 53 Ingress for DNS as well (because it's stateless so the firewall can't track the state). Also for DNS you might want a TCP port 53 Egress rule as well as DNS now can use both.

@databus23
Copy link
Member Author

databus23 commented Jun 29, 2022

I haven't checked the code but only your comment but IIRC for UDP you want a port 123 Ingress rule for NTP and port 53 Ingress for DNS as well (because it's stateless so the firewall can't track the state).

Are you sure that's the case? Because allowing udp egress 53 and 123 definitlty fixes dns issues and time sync issues. So it seems to me nsx-t is doing some "connection tracking" for udp packets as well (like iptables does, e.g. state ESTABLISHED,RELATED)

Also for DNS you might want a TCP port 53 Egress rule as well as DNS now can use both.

That's a valid point

@sandzwerg
Copy link

At least for NTP a customer had this issue today (not in a kubernikus environment) so I assume that the ingress rules are needed as well, but I haven't tested it myself. Here is the discussion: https://convergedcloud.slack.com/archives/C374AQJ3W/p1656505078940829

* add metrics for security group reconciliation errors
* add kluster event for securitu group reconcilicaiton errors
* e2e:
  * create empty security group for cluster
  * run kube-detective
  * call loadbalancer url
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants