
Support Data Repository Associations for Lustre 2.12 or newer filesystems (e.g. PERSISTENT_2 deployment type) #368

Open

everpeace wants to merge 9 commits into master

Conversation

@everpeace commented Dec 27, 2023

Is this a bug fix or adding new feature?

new feature
fixes #367

What is this PR about? / Why do we need it?

This PR adds support for Data Repository Associations (API reference) for Lustre 2.12 or newer filesystems (e.g. the PERSISTENT_2 deployment type), as shown below:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0d7b5e117ad7b4961
  securityGroupIds: sg-05a37bfe01467059a
  deploymentType: PERSISTENT_2
  perUnitStorageThroughput: "125"
  # User can specify multiple data repository associations like this
  dataRepositoryAssociations: |
    - batchImportMetaDataOnCreate: true
      dataRepositoryPath: s3://ml-training-data-000
      fileSystemPath: /ml-training-data-000
      s3:
        autoExportPolicy:
          events: ["NEW", "CHANGED", "DELETED"]
        autoImportPolicy:
          events: ["NEW", "CHANGED", "DELETED"]
    - batchImportMetaDataOnCreate: true
      dataRepositoryPath: s3://ml-training-data-001
      fileSystemPath: /ml-training-data-001
      s3:
        autoExportPolicy:
          events: ["NEW", "CHANGED", "DELETED"]
        autoImportPolicy:
          events: ["NEW", "CHANGED", "DELETED"]

  # NOTE: These parameters can't be set when using dataRepositoryAssociations,
  #       as the documentation explains: https://docs.aws.amazon.com/fsx/latest/APIReference/API_CreateFileSystemLustreConfiguration.html
  # s3ImportPath: s3://ml-training-data-000
  # s3ExportPath: s3://ml-training-data-000/export
  # autoImportPolicy: NEW_CHANGED
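
A filesystem with these associations is then provisioned dynamically through an ordinary claim. A minimal PersistentVolumeClaim sketch (the claim name and size are illustrative, not part of this PR):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: fsx-claim             # hypothetical name, for illustration only
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc    # the StorageClass defined above
  resources:
    requests:
      storage: 1200Gi         # 1.2 TiB is the smallest FSx for Lustre capacity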

What testing is done?

make test

# with GINKGO_FOCUS=".*fsx-csi-e2e.*PERSISTENT_2.*" set
make test-e2e

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) Dec 27, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: everpeace
Once this PR has been reviewed and has the lgtm label, please assign olemarkus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) Dec 27, 2023
@everpeace changed the title from "Support Data Repository Associations for PERSISTENT_2 deployment type filesystems" to "Support Data Repository Associations for Lustre 2.12 or newer filesystems (e.g. PERSISTENT_2 deployment type)" Dec 27, 2023
@everpeace (Author)

/retest pull-aws-fsx-csi-driver-e2e

@k8s-ci-robot (Contributor)

@everpeace: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test pull-aws-fsx-csi-driver-e2e
  • /test pull-aws-fsx-csi-driver-unit
  • /test pull-aws-fsx-csi-driver-verify

Use /test all to run all jobs.

In response to this:

/retest pull-aws-fsx-csi-driver-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@everpeace (Author)

/test pull-aws-fsx-csi-driver-e2e

@@ -62,6 +64,10 @@ var (
 // disks are found with the same volume name.
 ErrMultiFileSystems = errors.New("Multiple filesystems with same ID")

+// ErrMultiAssociations is an error that is returned when multiple
+// associations are found with the same volume name.
Contributor commented:

If I understand correctly, this would be if there are multiple DRAs with the same association id, not multiple associations with the same volume.

@everpeace (Author) commented Dec 28, 2023

Thanks. Fixed in bba7484.

@everpeace (Author) left a comment:

@jacobwolfaws Thank you for the quick review. I addressed your feedback. PTAL 🙇

@@ -65,6 +65,7 @@ controller:
   - effect: NoExecute
     operator: Exists
     tolerationSeconds: 300
+  provisionerTimeout: 5m
@everpeace (Author) commented Dec 28, 2023

Should we make the default provisioner timeout longer in the helm chart? It often takes more time to prepare an FSx filesystem when it has data repository associations.

A single FSx for Lustre filesystem can have up to 8 data repository associations.

In my experience, it usually takes around 7-10 minutes to make a single data repository association available, even for an empty S3 bucket.
Moreover, the data repository associations on a given filesystem appear to be set up sequentially.

So I think 90 min = 10 min × 8 (data repository associations) + 5 min (FSx filesystem) + a buffer would be safe, because the current CreateVolume operation is synchronous and is not safe when a timeout happens.

What do you think?

Contributor commented:

I think keeping the default timer the same and clearly documenting the need to change the timeout when using DRAs would be the correct move. This ensures consistent behavior for users who aren't using DRAs. Extending it is a one-way door (because reducing the timeout would break compatibility for users who are using a large number of DRAs).

@everpeace (Author) commented:

> Extending it is a one-way door (because reducing the timeout would break compatibility for users who are using a large number of DRAs)

It makes sense.

> the default timer the same + clearly documenting the need to change the timeout if using DRAs would be the correct move

OK, let me add the documentation.
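
For reference, once the chart exposes controller.provisionerTimeout, users provisioning filesystems with many DRAs can raise it at install time. A minimal sketch, assuming the chart's documented helm repository (the release name, namespace, and 90m value are illustrative):

$ helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
$ helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
    --namespace kube-system \
    --set controller.provisionerTimeout=90m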

@everpeace (Author) commented Jan 18, 2024

Addressed in the commits below:

Contributor commented:

I'm not sure about the information for users; it seems like users using DRAs will still be fine in most cases:
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file#csi-error-and-timeout-handling
The CreateVolume call will time out, and subsequent calls will be made with exponential backoff. It's only in the case of a large number of DRAs that this will be an issue.
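
For context, the chart's provisionerTimeout value presumably ends up as the external-provisioner sidecar's --timeout flag, which bounds each CreateVolume gRPC call before the backoff-and-retry described above. A sketch of the relevant sidecar args (the image tag and exact templating are assumptions, not confirmed by this PR):

  - name: csi-provisioner
    image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.2  # illustrative tag
    args:
      - --csi-address=$(ADDRESS)
      - --timeout=5m   # assumed to be templated from controller.provisionerTimeout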

@everpeace (Author)

/test pull-aws-fsx-csi-driver-e2e

@@ -0,0 +1 @@
+CONTROLLER_PROVISIONER_TIMEOUT=5m
Contributor commented:

What's the value of creating a separate file for this vs. putting it in the values.yaml:
https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/charts/aws-fsx-csi-driver/values.yaml#L42-L67

@everpeace (Author) commented Feb 28, 2024

This file is for the kustomize-only manifests; values.yaml is dedicated to the helm chart. I understand this driver supports both kustomize and helm.

In kustomize, injecting a parameter while building manifests requires a bit of a hack. This env file is needed so that kustomize users can change the timeout value. I also updated install.md as below:

https://github.com/everpeace/aws-fsx-csi-driver/blob/suppor-dra/docs/install.md#deploy-driver

# To set CSI controller's provisioner timeout,
# Please follow the instruction
$ cd $(mktemp -d)
$ kustomize init
$ kustomize edit add resource "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.1"
$ kustomize edit add configmap fsx-csi-controller --from-literal=CONTROLLER_PROVISIONER_TIMEOUT=30m --behavior=merge
$ kubectl apply -k .
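
Those commands should leave behind a kustomization.yaml roughly like this (a sketch; generated field ordering may differ):

resources:
- github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.1
configMapGenerator:
- name: fsx-csi-controller
  behavior: merge
  literals:
  - CONTROLLER_PROVISIONER_TIMEOUT=30m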

Contributor commented:

I think we should avoid hacks when possible and this seems like an avoidable instance. If users want to configure their kustomize templates, they can download them, configure them, and deploy them freely. We should follow precedent in terms of implementation, which is to put it in the values.yaml.


 // target file system values
-PollCheckTimeout = 10 * time.Minute
+PollCheckTimeout = 15 * time.Minute
Contributor commented:

If provisionerTimeout < PollCheckTimeout, the provisionerTimeout will always kill the CreateVolume call before the PollCheckTimeout is hit. I don't think incrementing this should make a difference.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Supporting (multiple) data repository association
3 participants