metalearn swarm implementation #27

Open · cosmicBboy (Owner) opened this issue Aug 22, 2020 · 0 comments
basic logic for generating suggestions (a sketch follows the list):

  • for each suggestion batch, generate n random initial points
  • for each initial point, the agent generates m action adjustments as
    a function of the hyperparameter values, the previous adjustments,
    and the previous reward
  • this procedure results in n x m candidates
  • rank them by predicted value and select the top n_suggestions
  • update the controller based on the rewards of the selected candidates
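
Here's a minimal sketch of that loop. The `sample_initial_points`, `agent.adjust`, and `value_estimate` names are hypothetical stand-ins, not metalearn APIs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_initial_points(n, dim):
    """Stub: n random initial points in hyperparameter space."""
    return rng.uniform(size=(n, dim))

def generate_suggestions(agent, value_estimate, n, m, n_suggestions,
                         prev_reward, dim):
    candidates = []
    for point in sample_initial_points(n, dim):
        prev_adjustment = np.zeros(dim)
        for _ in range(m):
            # each adjustment is a function of the current hyperparameter
            # values, the previous adjustment, and the previous reward
            adjustment = agent.adjust(point, prev_adjustment, prev_reward)
            candidates.append(point + adjustment)
            prev_adjustment = adjustment
    # rank the n x m candidates by predicted value, keep the top n_suggestions
    candidates.sort(key=value_estimate, reverse=True)
    return candidates[:n_suggestions]
```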

implement swarm logic for generating suggestions (sketches follow the list):

  • create a local agent for each of n randomly initialized points
  • global agent is a metalearning agent that determines which
    candidate to select from the m actions for each of n agents.
  • given the suggestion, anchor observation, and adjustment action,
    the global agent produces a value estimate and a probability
    between 0 and 1 that decides whether or not to include the candidate.
  • vanilla version can be an FFN that produces estimates sequentially
    until n_suggestions have been generated (see the first sketch after
    this list).
  • the metalearning version would be an RNN that decodes the rewards
    and actions, stopping when n_suggestions have been generated (see
    the second sketch after this list).
  • critic loss function would be the error between the reward and the
    value estimate.
  • actor loss function would be the log probability of the "select
    candidate" action.
  • crazy idea: maybe use the global agent to further train the local
    agents by feeding them the reward estimates as the reward for
    candidates that were not selected. This could introduce a lot of
    bias into the system if the global agent's value estimates are far
    off.
  • idea: to "refresh" the random initial points, randomly perturb them
    every t turns (see the last sketch below).
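
A rough sketch of the vanilla global agent, assuming PyTorch. The `GlobalAgent` / `select_candidates` names, the network shapes, and the advantage weighting in the actor loss are all assumptions for illustration, not anything from the metalearn codebase:

```python
import torch
import torch.nn as nn

class GlobalAgent(nn.Module):
    """FFN that maps (suggestion, anchor observation, adjustment action)
    to a value estimate and a "select candidate" probability."""

    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.value_head = nn.Linear(hidden_dim, 1)   # critic
        self.select_head = nn.Linear(hidden_dim, 1)  # actor

    def forward(self, suggestion, anchor_obs, action):
        h = self.body(torch.cat([suggestion, anchor_obs, action], dim=-1))
        value = self.value_head(h).squeeze(-1)
        p_select = torch.sigmoid(self.select_head(h)).squeeze(-1)
        return value, p_select

def select_candidates(agent, candidates, anchor_obs, actions, n_suggestions):
    """Sequentially decide whether to include each candidate, stopping once
    n_suggestions are selected (assumes enough get selected in one pass)."""
    selected, log_probs, values = [], [], []
    for cand, act in zip(candidates, actions):
        value, p_select = agent(cand, anchor_obs, act)
        if torch.bernoulli(p_select).item() == 1:
            selected.append(cand)
            log_probs.append(torch.log(p_select + 1e-8))
            values.append(value)
            if len(selected) == n_suggestions:
                break
    return selected, torch.stack(log_probs), torch.stack(values)

def losses(rewards, values, log_probs):
    # critic loss: squared (reward - value estimate) error
    advantage = rewards - values
    critic_loss = advantage.pow(2).mean()
    # actor loss: negative log probability of the "select candidate" action;
    # weighting by the advantage is a standard actor-critic choice, not
    # something specified above
    actor_loss = -(log_probs * advantage.detach()).mean()
    return critic_loss, actor_loss
```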
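
And a sketch of the metalearning variant, where a recurrent hidden state carries previously decoded rewards and actions across candidates. Again, names and shapes are assumptions; it would slot into the same `select_candidates` loop above, threading `h` through the calls:

```python
import torch
import torch.nn as nn

class RecurrentGlobalAgent(nn.Module):
    """GRU-cell variant of GlobalAgent: the hidden state lets the selection
    decision condition on previously processed rewards and actions.
    Expects batched inputs of shape (batch, dim)."""

    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(input_dim, hidden_dim)
        self.value_head = nn.Linear(hidden_dim, 1)
        self.select_head = nn.Linear(hidden_dim, 1)

    def forward(self, suggestion, anchor_obs, action, prev_reward, h):
        x = torch.cat([suggestion, anchor_obs, action, prev_reward], dim=-1)
        h = self.cell(x, h)
        value = self.value_head(h).squeeze(-1)
        p_select = torch.sigmoid(self.select_head(h)).squeeze(-1)
        return value, p_select, h
```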
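
The refresh idea could be as simple as the following; gaussian noise and the scale are assumptions:

```python
import numpy as np

def refresh_points(points, turn, t, scale=0.05, rng=None):
    """Every t turns, add random noise to the anchor points."""
    if rng is None:
        rng = np.random.default_rng()
    if turn % t == 0:
        return points + rng.normal(scale=scale, size=points.shape)
    return points
```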