Improve authorization feature #409

Open · LaPetiteSouris opened this issue Sep 2, 2022 · 8 comments

@LaPetiteSouris
Contributor

What

As mentioned in the docs, a few issues remain in the authorization feature (which is experimental):

  • (1) Authz is not synced across the different nodes of the cluster.
  • (2) Adding/removing a specific permission is done manually.

Why

I want to create this issue to discuss possible routes for improvement on these 2 points.

IMHO, point (1) is more urgent than point (2). If (1) is implemented, then (2) can be considered acceptable (i.e. it would be no more painful than manually editing JSON-based IAM policies on AWS anyway).

@tylertreat
Member

For syncing permissions across nodes, a few options come to mind that might be worth exploring:

  1. Use a Casbin watcher to sync permissions, e.g. we could use a NATS watcher (see the sketch at the end of this comment). There may be questions around ensuring consistency with this, e.g. in the case of a dropped message. I have not looked at the Watcher interface at all, so I'm not sure exactly how it works.
  2. Apply permissions through Raft. We already use Raft for control plane operations (e.g. creating streams, joining consumer groups, etc.). This has the benefit that it ensures consistency across the cluster.
  3. Use an internal Liftbridge stream to apply permission changes. We use internal streams for things like cursors and the activity stream.

I am not quite sure yet which is the best approach, if any. We probably want to define what the permissions API should look like / how it should work from a user standpoint.
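
For reference, Casbin's Watcher interface in the Go library is small (SetUpdateCallback, Update, Close), so a NATS watcher is not much code. A rough, unvetted sketch of what one might look like — the subject name is arbitrary, and note the fire-and-forget delivery that raises the consistency question above:

```go
// Rough sketch of a NATS-backed Casbin watcher. Assumes
// github.com/casbin/casbin/v2/persist and github.com/nats-io/nats.go.
package authz

import (
	"github.com/casbin/casbin/v2/persist"
	nats "github.com/nats-io/nats.go"
)

// natsWatcher broadcasts policy-change notifications over a NATS subject.
// Delivery is fire-and-forget: there is no guarantee every node sees every
// update, which is the consistency concern mentioned above.
type natsWatcher struct {
	conn     *nats.Conn
	sub      *nats.Subscription
	subject  string
	callback func(string)
}

var _ persist.Watcher = (*natsWatcher)(nil)

func NewNATSWatcher(url, subject string) (persist.Watcher, error) {
	nc, err := nats.Connect(url)
	if err != nil {
		return nil, err
	}
	w := &natsWatcher{conn: nc, subject: subject}
	w.sub, err = nc.Subscribe(subject, func(m *nats.Msg) {
		// Invoked when another node announces a change; the callback
		// typically reloads the enforcer's policy.
		if w.callback != nil {
			w.callback(string(m.Data))
		}
	})
	if err != nil {
		nc.Close()
		return nil, err
	}
	return w, nil
}

// SetUpdateCallback registers the function Casbin calls on remote updates.
func (w *natsWatcher) SetUpdateCallback(cb func(string)) error {
	w.callback = cb
	return nil
}

// Update notifies peers that the local policy changed.
func (w *natsWatcher) Update() error {
	return w.conn.Publish(w.subject, []byte("policy-updated"))
}

func (w *natsWatcher) Close() {
	w.sub.Unsubscribe()
	w.conn.Close()
}
```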

@LaPetiteSouris
Contributor Author

Taking each option in turn:

  • (1) The Casbin watcher for NATS is not well maintained, and watcher code is not kept in the main Casbin repo. That means the whole operation of Liftbridge would depend on something rather uncertain, and it would be a single point of failure for the whole cluster. I would hesitate to use it.
  • (3) Using a Liftbridge stream would work. However, for persisting authorization policy I would argue that the traffic is much lower than in normal use cases, i.e. it does not change a thousand times per second. The throughput of changes is low, but the consistency requirement is high: a mismatched authorization policy is not something we can accept. This makes a Liftbridge stream a less-than-ideal candidate IMHO.
  • (2) Raft provides strong consistency by nature. Replication through Raft is slow, but since policies do not change often, the tradeoff in favor of strong consistency makes more sense to me. Moreover, as authorization sits at the control level of Liftbridge (just like metadata), I think Raft is a strong candidate.

@LaPetiteSouris
Contributor Author

LaPetiteSouris commented Sep 25, 2022

A high-level proposal

From the user standpoint

[diagram: sync_authz-Page-3.drawio]

  • The admin would interact solely with the CLI on the leader node (i.e. the metadata leader). Any change in permissions would be performed via the CLI, which means Liftbridge should expose the Casbin Management API to allow an admin to add/modify/delete a policy (see the sketch after this list).
  • If successful, the change is persisted in Raft as a log message. The choice of a log message is in line with Liftbridge being a log-based system, and it makes the whole thing easier to reason about.
  • A server/node would have 2 layers to handle an authorization change: one to sync with the Raft messages, the other to update the policy.csv which stores the policies.
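
A rough sketch of the leader-side flow for the first bullet. All names here (PolicyChange, AddACLRequest, proposeRaftOp) are hypothetical placeholders, not existing Liftbridge APIs:

```go
package authz

import (
	"context"
	"encoding/json"
)

// Hypothetical shape of a policy change replicated through Raft.
// None of these names exist in Liftbridge today.
type PolicyOp int

const (
	AddPolicy PolicyOp = iota
	RemovePolicy
)

type PolicyChange struct {
	Op   PolicyOp
	Rule []string // e.g. {"subject", "stream", "action"}
}

// AddACLRequest is what the CLI would send to the metadata leader.
type AddACLRequest struct {
	Subject, Stream, Action string
}

// Server stands in for the Liftbridge metadata server; proposeRaftOp is a
// placeholder for proposing an entry to the existing Raft log and waiting
// for it to commit.
type Server struct {
	proposeRaftOp func(ctx context.Context, data []byte) error
}

// AddACL handles a policy change coming from the CLI on the leader.
func (s *Server) AddACL(ctx context.Context, req *AddACLRequest) error {
	change := PolicyChange{
		Op:   AddPolicy,
		Rule: []string{req.Subject, req.Stream, req.Action},
	}
	data, err := json.Marshal(change)
	if err != nil {
		return err
	}
	// Block until the entry is committed (replicated to a quorum) so the
	// admin gets a consistent success/failure answer.
	return s.proposeRaftOp(ctx, data)
}
```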

The procedure to modify a policy, from the admin standpoint, would look like:

[diagram: sync_authz-Page-2]

Edge cases:

  • Server loses leadership while the log is being committed to Raft: the commit would fail and the admin would have to retry using the CLI on the new leader server.

From the server standpoint

[diagram: sync_authz-Page-1.drawio]

In general, the server should load the local policy.csv into a Casbin instance and apply the changes received from Raft. Once a change is persisted, it should ACK the Raft replication message. This is to ensure strong consistency.

Finally, it sends a SIGHUP signal so that the new changes take effect.
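
In code, the apply side could look roughly like this, assuming a hashicorp/raft-style FSM (Snapshot/Restore omitted) and reusing the hypothetical PolicyChange from the sketch above:

```go
package authz

import (
	"encoding/json"

	"github.com/casbin/casbin/v2"
	"github.com/hashicorp/raft"
)

// authzFSM applies committed policy changes to the local Casbin instance.
// Returning from Apply acks the entry back to the Raft layer.
type authzFSM struct {
	enforcer *casbin.Enforcer // loaded from model.conf + policy.csv
}

func (f *authzFSM) Apply(log *raft.Log) interface{} {
	var change PolicyChange
	if err := json.Unmarshal(log.Data, &change); err != nil {
		return err
	}
	rule := change.Rule
	var err error
	switch change.Op {
	case AddPolicy:
		_, err = f.enforcer.AddPolicy(rule[0], rule[1], rule[2])
	case RemovePolicy:
		_, err = f.enforcer.RemovePolicy(rule[0], rule[1], rule[2])
	}
	if err != nil {
		return err
	}
	// Persist to the file adapter (policy.csv) before acknowledging, so a
	// committed change is never lost on restart.
	return f.enforcer.SavePolicy()
}
```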

Edge cases:

  • The change received from Raft is inconsistent with the local policy.csv: this may happen if, for instance, the CSV has been edited manually. In this case, the admin needs to manually edit policy.csv to resolve the conflict; only then can the server apply the Raft messages correctly.
  • As policy.csv remains accessible to outside (non-Raft) processes, inconsistency always stays a possibility.

@tylertreat Sorry for the long comment, I just tried to give as much detail as possible for the discussion. WDYT?

@tylertreat
Member

I agree with your proposal to use the existing Raft system to handle authz syncing. The existing Raft system handles a lot of the concerns you mention such as checking leader status, acking replication, etc. which should simplify the replication component. Liftbridge also handles forwarding Raft requests to the metadata leader if they are sent to a follower.

The CSV file being exposed is a legitimate concern and was one of the concerns that came to mind for me. I'm not sure if there is much we can do there though other than warn the operator to not manually modify that file?

@tylertreat
Member

I don't love the idea of having to modify a CSV file to apply authz changes. In theory we don't even need the CSV file unless we are wanting to apply static policies? Seems like the Raft operation could update the Casbin instance directly, which would also negate the need for a SIGHUP?
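
Roughly what I have in mind, reusing the hypothetical names from the sketches above — the apply path mutates the enforcer in memory, with no file write and no signal:

```go
// Sketch: the Raft apply updates the Casbin enforcer directly in memory.
// No policy.csv write and no SIGHUP; the enforcer is the source of truth
// and the Raft log (plus snapshots) provides durability.
func (f *authzFSM) Apply(log *raft.Log) interface{} {
	var change PolicyChange
	if err := json.Unmarshal(log.Data, &change); err != nil {
		return err
	}
	switch change.Op {
	case AddPolicy:
		_, err := f.enforcer.AddPolicy(change.Rule[0], change.Rule[1], change.Rule[2])
		return err
	case RemovePolicy:
		_, err := f.enforcer.RemovePolicy(change.Rule[0], change.Rule[1], change.Rule[2])
		return err
	}
	return nil
}
```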

@LaPetiteSouris
Contributor Author

LaPetiteSouris commented Oct 14, 2022

> I don't love the idea of having to modify a CSV file to apply authz changes. In theory we don't even need the CSV file unless we are wanting to apply static policies? Seems like the Raft operation could update the Casbin instance directly, which would also negate the need for a SIGHUP?

Indeed. The CSV is used by Casbin to store data, but since with Raft we are pretty certain that the data is persistent at all times, we can ditch the CSV logic.

However, I still think the Casbin implementation requires a CSV file to be stored somewhere (I need to confirm that part), but even if that is the case, as long as we "hide" the CSV file under /tmp for example, we can protect the integrity of the authz data.
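
If I read the Go Casbin API correctly (still to be confirmed), an enforcer can even be constructed from the model alone, with policies kept purely in memory:

```go
package authz

import "github.com/casbin/casbin/v2"

// newMemoryEnforcer builds a Casbin enforcer from the model alone; policies
// live only in memory and would come exclusively from the Raft apply path.
// (To be confirmed: that casbin.NewEnforcer accepts a model with no adapter.)
func newMemoryEnforcer(modelPath string) (*casbin.Enforcer, error) {
	e, err := casbin.NewEnforcer(modelPath)
	if err != nil {
		return nil, err
	}
	// Example: a rule added dynamically instead of read from policy.csv.
	if _, err := e.AddPolicy("alice", "foo-stream", "read"); err != nil {
		return nil, err
	}
	return e, nil
}
```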

Side effect:

  • Server crashes: upon restart, the server has to reload/re-apply all Raft changes before serving requests, which may increase boot time. The same applies to a normal server startup. A Raft snapshot of the policy set could bound this; see the sketch below.
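
Sketch of the snapshot idea, assuming a hashicorp/raft-style FSM; policySnapshot is a hypothetical type. A restarting server would restore the latest snapshot and replay only the Raft log tail after it:

```go
package authz

import (
	"encoding/json"

	"github.com/hashicorp/raft"
)

// policySnapshot captures the in-memory policy set at snapshot time
// (e.g. the rules held by the Casbin enforcer).
type policySnapshot struct {
	rules [][]string
}

// Persist writes the rules to the snapshot sink provided by Raft.
func (s *policySnapshot) Persist(sink raft.SnapshotSink) error {
	if err := json.NewEncoder(sink).Encode(s.rules); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}

// Release is a no-op; nothing to clean up for an in-memory copy.
func (s *policySnapshot) Release() {}
```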

Also, the SIGHUP logic should be dropped, as it creates extra maintenance complexity and is potentially a source of race conditions (e.g. the Casbin instance being asked to reload while it is being synced with Raft).

@tylertreat
Member

I like the idea of "hiding" the CSV as an implementation detail, though I suppose that breaks existing users and makes it more difficult to implement statically defined policies. Perhaps we can support both statically defined policies and "dynamic" policies applied via the Raft system?

@LaPetiteSouris
Contributor Author

LaPetiteSouris commented Oct 24, 2022

> I like the idea of "hiding" the CSV as an implementation detail, though I suppose that breaks existing users and makes it more difficult to implement statically defined policies. Perhaps we can support both statically defined policies and "dynamic" policies applied via the Raft system?

I have not spent a lot of time looking into implementation details, but I think supporting both (Raft-based and CSV-based) should be possible.

In fact, the current authz configuration allows switching between the two methods quite easily:

```yaml
tls:
  key: ./configs/certs/server/server-key.pem
  cert: ./configs/certs/server/server-cert.pem
  client.auth.enabled: true
  client.auth.ca: ./configs/certs/ca-cert.pem
  client.authz.enabled: true
  client.authz.model: ./configs/authz/model.conf
  client.authz.policy: ./configs/authz/policy.csv
```

client.authz.policy configures where the policy file is persisted. We can imagine that if client.authz.policy is not present, the authorization data is persisted via Raft instead (see the sketch below).
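
On the implementation side, the switch could be a simple branch on the presence of that setting (sketch; the function and parameter names are made up):

```go
package authz

import (
	"github.com/casbin/casbin/v2"
	fileadapter "github.com/casbin/casbin/v2/persist/file-adapter"
)

// newEnforcer picks the policy backend from the config: a CSV file adapter
// when client.authz.policy is set (static, current behaviour), otherwise an
// in-memory enforcer fed by the Raft apply path.
func newEnforcer(modelPath, policyPath string) (*casbin.Enforcer, error) {
	if policyPath != "" {
		// Static, file-backed policies (a deprecation warning could go here).
		return casbin.NewEnforcer(modelPath, fileadapter.NewAdapter(policyPath))
	}
	// Dynamic policies: in-memory only, populated via Raft.
	return casbin.NewEnforcer(modelPath)
}
```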

Also, a small DeprecationWarning may be useful for the CSV-based authorization feature.
