HDDS-8342. AWS S3 Lifecycle Configurations doc #6589

Open: 5 commits to be merged into master

@mohan3d mohan3d commented Apr 25, 2024

What changes were proposed in this pull request?

Design proposal for the data retention (AWS S3 Lifecycle Configurations) feature. Please comment inline on the markdown document to ask questions and post feedback.

What is the link to the Apache JIRA

HDDS-8342

How was this patch tested?

N/A

mohan3d commented Apr 25, 2024

@ivandika3 @xichen01 Please take a look and let me know whether I need to amend or enhance it.

@ivandika3 added the design and documentation labels on Apr 26, 2024

@ivandika3 left a comment

Thanks a lot @mohan3d for the design document. I left a few initial comments. Will review deeper in the following days.

Also, @kerneltime has left some comments in the ticket (https://issues.apache.org/jira/browse/HDDS-8342?focusedCommentId=17841064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17841064). Could you help address them as well?

@ivandika3

@ArafatKhan2198 @SaketaChalamchala @tanvipenumudy Could you help take a look as well when you have time?

@ivandika3 requested a review from @xichen01 on April 26, 2024 08:00
optional uint64 creationTime = 4;
repeated LifecycleRule rules = 5;
optional uint64 objectID = 6;
optional uint64 updateID = 7;
Contributor

Thanks for the doc @mohan3d. Do you think it would make sense to also add a required status field here to indicate whether the configuration is enabled or disabled?
Consequently, we might need definitions for Disabling and Enabling the lifecycle configurations.

Contributor Author

@SaketaChalamchala Yes, it makes sense, and there is already such a flag at the rule level.

Contributor

@SaketaChalamchala The `required bool enabled = 3;` field in LifecycleRule is used to indicate whether the configuration is enabled or disabled.


message DeleteLifecycleConfigurationRequest {
required string volumeName = 1;
required string bucketName = 2;
Contributor

Would accepting an optional LifecycleFilter here and in List and Info configuration requests be useful?

Contributor Author

It might be useful. I was not able to find anything like it on the AWS side, which is why my implementation doesn't have such an optional filter.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetBucketLifecycleConfiguration.html
https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteBucketLifecycle.html

Contributor

AWS S3 only supports deleting the entire lifecycle configuration of a bucket; refer to:
https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteBucketLifecycle.html

Contributor

I see.

## Overview

### Functionality
- User should be able to create/remove/fetch lifecycle configurations for a specific S3 bucket.
Contributor

Do you have any thoughts on what ACL checks would be performed when creating a lifecycle configuration? Would it be restricted to the owners of the keys or to an Ozone administrator?
What would happen if keys with the same prefix have multiple owners? If one of the key owners creates a lifecycle configuration on the prefix, would all of the keys with that prefix be deleted?

Contributor

> Do you have any thoughts on what ACL checks would be performed when creating a lifecycle configuration? Would it be restricted to the owners of the keys or to an Ozone administrator?

Maybe we need the 'WRITE' permission on the bucket being operated on?
If a user has 'WRITE' permission on a bucket, they can already overwrite or delete another user's key in the bucket without going through Lifecycle.

When Lifecycle deletes a key, the key is deleted as long as the rule is met. The delete operation is executed by the OM itself, and the OM is an admin user. If we want to block users from removing or deleting objects from a specific bucket, the bucket owner should not give other users the WRITE permission on that bucket.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketLifecycleConfiguration.html

Contributor Author

When I implemented this, I had no clear idea of how ACLs would work here. If I recall correctly, it was restricted to the owner.

Contributor Author

As I remember, I initially implemented it this way: if the user who set the lifecycle configuration doesn't have the right to delete a key, the key is skipped even though it is eligible for deletion.

Later I changed it to delete the key anyway. I need to check the code to answer you accurately.

Contributor

What would happen when the 'WRITE' permission is revoked from a user? Would that trigger all the lifecycle configurations owned by the user to be disabled?

Contributor

@ivandika3 May 30, 2024

As said by @mohan3d and @xichen01, for simplicity's sake we can first restrict the management of bucket lifecycle configurations to the bucket owner, since the bucket owner will not change for the lifetime of the bucket.

Note: Bucket lifecycle configurations will need to be deleted before the bucket can be deleted.

Contributor

@mohan3d @xichen01 We need to check the applicability to Native ACL and Ranger ACL (e.g. whether bucket ownership applies in Ranger as well). We need to comply with both ACL models.

- A background retention service is responsible for scheduling and executing tasks at specified intervals.
- The Retention Manager retrieves lifecycle configurations associated with buckets.
- Then assigns each lifecycle configuration (attached to a bucket) to a threadpool (Configurable) for further processing.
- Each task will iterate through keys of a specific bucket and issue deletion request for eligible keys.
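
The four steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the Ozone implementation: `run_retention_pass`, `list_keys`, `is_eligible`, `delete_key` and `pool_size` are hypothetical names standing in for the Retention Manager's internals and for the configurable thread pool size. A background service would call `run_retention_pass` once per configured interval.

```python
from concurrent.futures import ThreadPoolExecutor


def run_retention_pass(configs, list_keys, is_eligible, delete_key, pool_size=4):
    """Run one retention pass with one task per bucket configuration.

    All callables are hypothetical stand-ins for OM internals:
      list_keys(volume, bucket)       -> iterable of key names
      is_eligible(config, key)        -> True if a rule says the key has expired
      delete_key(volume, bucket, key) -> issues a deletion request
    Returns the total number of deletion requests issued.
    """

    def process(config):
        volume, bucket = config["volume"], config["bucket"]
        issued = 0
        # Each task iterates through the keys of exactly one bucket.
        for key in list_keys(volume, bucket):
            if is_eligible(config, key):
                delete_key(volume, bucket, key)
                issued += 1
        return issued

    # pool_size stands in for the configurable thread pool mentioned above.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return sum(pool.map(process, configs))
```

Bounding the pool keeps the number of buckets scanned concurrently under operator control, which is one possible answer to the scaling question; how large buckets are paged or throttled is left open here.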
Contributor

@mohan3d, could you elaborate on what the final operation on a key will be when it is covered by multiple rules, for example when the rules conflict or specify different expiration conditions?

Contributor Author

AWS S3 optimizes for cost, which means that whichever decision reduces the cost is applied. In the case you mentioned, the shorter expiration will be honored.

Further details: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex5

Contributor

This simply means that any rule that matches will be executed. Currently, the only "action" in our Lifecycle is to delete, so when checking the specified key, if any rule matches, then the key will be deleted.
This is also the rule for AWS S3 Lifecycle.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex5

Contributor

Let's specify the multiple rules conflict resolution explicitly in the design document.
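
As a starting point for that section, the resolution discussed in this thread (any matching enabled rule triggers deletion, so the shortest expiration effectively wins) could be sketched like this. It is an illustration only: the `Rule` class and its `expiration_days` field are assumptions, since the proto shown in this PR only exposes `id`, `prefix` and `enabled`, and measuring age in whole days is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    prefix: str           # an empty prefix matches every key
    enabled: bool
    expiration_days: int  # hypothetical field, not taken from the PR's proto


def effective_expiration_days(rules: List[Rule], key_name: str) -> Optional[int]:
    """Shortest expiration among enabled rules whose prefix matches the key."""
    matching = [
        r.expiration_days
        for r in rules
        if r.enabled and key_name.startswith(r.prefix)
    ]
    return min(matching) if matching else None


def should_expire(rules: List[Rule], key_name: str, age_days: int) -> bool:
    """A key expires as soon as any matching enabled rule is satisfied."""
    limit = effective_expiration_days(rules, key_name)
    return limit is not None and age_days >= limit
```

For example, with `Rule("", True, 7)` and `Rule("sub/", True, 14)`, a key `sub/x` that is 8 days old is deleted, because the broader 7-day rule already matches it.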

@ChenSammi

@mohan3d, thanks for working on this. The design document looks straightforward. Could you fill it in with more details, such as the new table format, how to handle scale, a rule example, and what the final decision will be if multiple rules are defined? BTW, I prefer a new table too.

mohan3d commented May 14, 2024

Thanks @ChenSammi for the review. I am not able to reply to your latest comment directly, so I will respond here.

> @mohan3d, thanks for working on this. The design document looks straightforward. Could you fill it in with more details, such as the new table format, how to handle scale, a rule example, and what the final decision will be if multiple rules are defined? BTW, I prefer a new table too.

Sure, I will add the new table format shortly, along with more details on how the retention manager is designed. (This should help us understand how it can scale, and it is also a good opportunity to get some thoughts from the community.)

> how to handle scale, a rule example, and what the final decision will be if multiple rules are defined

This was answered in an earlier comment.

@Tejaskriya left a comment

@mohan3d thanks for the document! I had a couple of questions; please see the inline comments.

- The lifecycle configurations will be executed periodically.
- Depending on the rules of the lifecycle configuration there could be different actions or even multiple actions.
- At the moment only expiration is supported (keys get deleted).
- The lifecycle configurations supports all buckets not only S3 buckets.
Contributor

S3 supports lifecycle configurations only for non-directory buckets. In that case, translating the same for Ozone, we would be supporting these only for object-store buckets and not for FSO buckets.
Do we plan to do as above or consider FSO buckets as well?
Also, how do we plan to handle legacy buckets? As we have the ozone.om.enable.filesystem.paths config to have flexibility on bucket behaviour.

Contributor Author

@Tejaskriya I hadn't considered this case. Maybe @ivandika3 and @xichen01 can help more on this.

Contributor

For FSO buckets, due to the directory structure, whether an object can be deleted depends on its subdirectories, so Lifecycle cannot perform as expected.

For the legacy buckets, I think we can support it and there is no need to distinguish between legacy buckets and OBS buckets, because the deletion operations that Lifecycle can perform do not exceed what legacy buckets can do. (But I think support for legacy buckets is a feature that can be discussed)

message LifecycleRule {
optional string id = 1;
optional string prefix = 2;
required bool enabled = 3;
Contributor Author

@SaketaChalamchala Here is the status flag (enabled or not).

mohan3d commented May 26, 2024

@ChenSammi @SaketaChalamchala @xichen01 I forgot to submit my comments earlier, so they were stuck in pending status. Please take a look at them.

- **Maximum Rules**: The table can store up to 1000 rules per lifecycle configuration.
- **Validation**: The configuration is considered valid if:
  - The `volume`, `bucket`, and `owner` are not blank.
  - The number of rules is between 1 and 1000.
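
The validation conditions listed above could be checked along these lines. This is a hedged sketch rather than the proposed implementation: `validate_lifecycle_configuration` and `MAX_RULES` are hypothetical names, and only the two conditions quoted above are covered.

```python
MAX_RULES = 1000  # limit quoted in the design text above


def validate_lifecycle_configuration(volume, bucket, owner, rules):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    # volume, bucket and owner must all be non-blank strings.
    for field, value in (("volume", volume), ("bucket", bucket), ("owner", owner)):
        if value is None or not value.strip():
            errors.append(f"{field} must not be blank")
    # A configuration needs at least one rule and at most MAX_RULES.
    if not 1 <= len(rules) <= MAX_RULES:
        errors.append(f"number of rules must be between 1 and {MAX_RULES}")
    return errors
```

Returning every error at once, instead of failing on the first, lets a client fix a rejected configuration in one round trip; whether the real API should behave that way is an open design choice.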
Contributor

Do we plan on introducing a property for setting the number of rules to be stored per lifecycle configuration?

Contributor Author

Not really, each lifecycle configuration will have a list of rules. Hence there is no need to explicitly maintain the count of rules.

Contributor

We can limit it to 1000 rules per lifecycle configuration as per AWS documentation (https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html#intro-lifecycle-rule-id).

Contributor

Understood, thank you for clarifying @ivandika3 and @mohan3d.

Contributor

@mohan3d Please help to add this in the design docs.

@xichen01

@mohan3d

> At the time of implementing this I had no imagination on how the ACLs will work on this. If I recall it was restricted for the owner.

Yes. If we restrict setting a lifecycle to the bucket owner, we can circumvent the permission issue when the lifecycle service deletes keys, because the bucket owner has ALL permission on the bucket's resources.

@ivandika3 left a comment

A few comments from the community meeting. Please help address them when you have time.

  • We need to specify the permission requirements of the AWS lifecycle configuration for both Native ACL and Ranger.
    • For example, since Ranger does not seem to have a concept of "bucket owner", what permission does a Ranger user need in order to create and delete lifecycle configurations, and what permission does the Ranger user need in order to delete the keys?
  • Please provide more scenarios of conflicting rules (e.g. a rule on the root that expires after 7 days and a rule on a subdirectory that expires after 14 days: the keys under the subdirectory will be deleted).
    • We can take some from the AWS documentation.
    • @xichen01 Could you also help add the conflict resolution rules for tag, prefix, etc.?
