Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement experimental DataCollector API 2 #2024

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rht
Copy link
Contributor

@rht rht commented Feb 4, 2024

This is a sketch of another data collector API as discussed on Matrix.org. The implementation is very short, so as to give an idea of how it works, and to experiment with it.

Highlights:

  • Everything is an attribute, including the custom groups (custom groups are needed to avoid repetition of specifying a particular group)
  • The data collector object now tracks only attributes of model, and so its implementation is very simple

Downside: to group-by the data collection by group is not automatic, since the organization is completely flat, with no information about the group. OTOH, doing group-by in #2013 is very trivial. no longer the case. Already implemented.

Copy link

github-actions bot commented Feb 4, 2024

Performance benchmarks:

Model Size Init time [95% CI] Run time [95% CI]
Schelling small 🔵 +0.0% [-0.5%, +0.5%] 🔵 -0.3% [-0.6%, -0.1%]
Schelling large 🔵 -0.2% [-1.5%, +1.0%] 🔵 +1.1% [-0.4%, +2.7%]
WolfSheep small 🔵 +1.5% [+1.1%, +1.8%] 🔵 +0.2% [+0.0%, +0.3%]
WolfSheep large 🔵 +0.1% [-0.6%, +0.9%] 🔴 +5.1% [+3.9%, +6.3%]
BoidFlockers small 🔵 -0.0% [-0.8%, +0.9%] 🔵 -1.0% [-1.4%, -0.5%]
BoidFlockers large 🔵 -0.1% [-0.6%, +0.6%] 🔵 +0.1% [-0.7%, +1.0%]

@rht rht changed the title feat: Implement experimental DataCollector API feat: Implement experimental DataCollector API 2 Feb 4, 2024
@Corvince
Copy link
Contributor

Corvince commented Feb 4, 2024

Thank you very much for doing this, the data collector API looks quite clean to me. It's good to have this here on GitHub and not only on matrix.

However I am not sure opening up more and more PRs is a good way to bring the discussion forward. The implementation is missing a lot of parts, so that it's not even working on its own. I think this is intentional, but this way I don't see the added value of a PR. The API could have been presented in #1944, now the discussion is just spread further and further apart.

Hope this doesn't sound too negative, but please just post the API in #1944, so we can reach consensus before running into implementations. I think we are on the right track

@rht
Copy link
Contributor Author

rht commented Feb 4, 2024

The API could have been presented in #1944, now the discussion is just spread further and further apart.

I don't see the problem here. The link to this PR could be quickly added in #1944 or #1997, if you are concerned about keeping track of implementations/API variations. This PR is a summary of the discussion on Matrix, and would be too noisy to add the entire code to #1944, which is already too lengthy for anyone to read. The point I am bringing forward in this PR is the way the implementation sketch, not just the API, i.e. model.measure and model.measure.value, and group reuse. And PR's have inline comments, something that can't be done in discussions.

@rht
Copy link
Contributor Author

rht commented Feb 4, 2024

Added 2nd commit where the data collection is grouped by the groups.

@quaquel
Copy link
Contributor

quaquel commented Feb 4, 2024

Rather than open another PR, I just put it here for discussion.

In my datacollection branch, I started implementing my view of data collection as outlined in #1944. Note that this is still far from complete, but it works. For example, I haven't included the ideas of Group and Measures in this yet. Nor is the to_dataframe implementation entirely where I want it to be, but multiindex always trips me up.

The short version of the API is given below. See 2 examples for details of how it currently works.

    model = EpsteinCivilViolence()

    citizens = model.get_agents_of_type(Citizen)
    cops = model.get_agents_of_type(Cop)

    dc = DataCollector(model, [
        collect_from("n_quiescent", citizens, "condition",
                     lambda d: len([e for e in d if e["condition"] == "Quiescent"])),
        collect_from("n_active", cops, "condition",
                     lambda d: len([e for e in d if e["condition"] == "Active"])),
        collect_from("jailed", citizens, "jail_sentence",
                     lambda d: len([e for e in d if e["jail_sentence"] > 0])),
        collect_from("data", citizens, ["jail_sentence", "arrest_probability"])
    ])

    dc.collect_all()
    for _ in range(10):
        model.step()
        dc.collect_all()

    print(dc.n_quiescent.to_dataframe().head())
    print(dc.data.to_dataframe().head())

Note that the branch also contains some pub/sub stuff. This you can completely ignore.

@tpike3 tpike3 added the feature Release notes label label Feb 4, 2024
@rht
Copy link
Contributor Author

rht commented Feb 5, 2024

citizens = model.get_agents_of_type(Citizen)

Needs to be a group, which is already agreed upon, because of agent creation/death/group identity change.

collect_from("n_quiescent", citizens, "condition", lambda d: len([e for e in d if e["condition"] == "Quiescent"]))

I disagree, as stated in #1944 (comment). Summary: there should be only one obvious way to select agents, i.e. via AgentSet. And the selections should have been defined in Group.

On the Boltzmann wealth example: https://github.com/quaquel/mesa/blob/9256e6c1521857cef3d0bc20a10b44e2b13dbb5f/mesa/experimental/datacollection/examples/boltzmann.py#L96-L99, as a user, I find having to have a distinct AgentSet* class to be too complex.

Meta:

Rather than open another PR, I just put it here for discussion.

I think another PR would be better, because it looks like you want to repurpose ideas from #1933 and #1947 into Measure. A PR allows extensive inline comments.

@quaquel
Copy link
Contributor

quaquel commented Feb 5, 2024

@rht Please read my original message carefully. Your response annoys me.

  1. Note that the branch also contains some pub/sub stuff. This you can completely ignore.

  2. For example, I haven't included the ideas of Group and Measures in this yet.

@rht
Copy link
Contributor Author

rht commented Feb 5, 2024

I saw the posted working branch with the expectation that it is going to be reviewed. Regarding with pub/stuff, the distinction is not clear. I'm not sure if AgentSet* is a remnant of pub/sub stuff or not, and not sure if you complained about this part.

@quaquel
Copy link
Contributor

quaquel commented Feb 5, 2024

Ok. I clearly indicated that it was WIP and not yet complete. Given that you repeatedly said that you want to push forward with this, rather than wait until I have time to complete this branch, I shared it as a source of inspiration. You make 3 points, two of which (Group/Measure, and pub/sub) were covered explicitly in my message.

That leaves one relevant point. On the AgentSetCollector, I disagree. There is a clear distinction in behavior between collecting data from an AgentSet and collecting data from another object. So, it makes sense to have separate classes for this. How to expose this to the user is, of course, a relevant concern, which is why I added the collect_from factory function.

It might be that once I add Measure and Group, the AgentSetObserver becomes completely redundant, but we'll see.

@quaquel
Copy link
Contributor

quaquel commented Feb 5, 2024

Let's return to this PR and the broader discussion on data collection.

I am trying to understand the difference between Group and Measure.

Conceptually, I understand a Measure as some observable model state at a particular time instant that is a function of internal model objects (e.g., agents, space, etc.). It also seems that a Measure is not otherwise used by the model itself internally because, in that case, it could have been implemented as a simple attribute. So, Gini in the Boltzman Wealth Model would be a Measure because it is not used internally.

Conceptually, I understand a Group as a collection of agents that meet some particular selection criterion at a particular time instant. Key to the notion of a Group seems to be that membership is dynamic over time. So, in the Epstein model, citizens and cops are not groups because their membership does not change over time. In Wolf/Sheep, however, wolves and sheep might be groups because their membership changes over time (although you can always fall back on model.get_agents_of_type). Quiescent citizens, Active citizens, or Arrested citizens, however, would all be groups because membership in these groups changes over time and, importantly, membership is dependent on agent attribute values.

Is this in line with everyone else's understanding?

Then, looking at the current Group and Measure code provided in this PR, I failed to square it with my conceptual understanding. The current code for both classes does basically the same thing: apply a callable to an object. Am I missing something, or is the current implementation just a very early draft?

@rht
Copy link
Contributor Author

rht commented Feb 6, 2024

I admit I was being too impatient with the data collection progress. Regardless, thank you for reviewing the draft (I accidentally opened it as a non-draft).

I agree regarding with the distinction between a measure and an attribute. A measure is not necessary to advance the system state.

The current code for both classes does basically the same thing: apply a callable to an object. Am I missing something, or is the current implementation just a very early draft?

I consider the citizens and cops in the Epstein model to be a group. My definition of a group is somewhat different: an AgentSet that is a result of a model.agents.select. This way, the definition is not dependent on a case by case basis. If the group happens to be static, then there should be a way to cache the selection output so that it can be retrieved quickly next time.
As such, I use callable so that I can cover both static and dynamic groups.
Another reason why static AgentSet needs to be defined as a group, is the point 8 in the summary #1944 (comment), that it needs to be named so that it users can retrieves all measures associated with a group.

Regarding with getting the measure as a callable, there is a performance problem: model.measure.value at a given step may be computed several times, once for visualization, and another for data collection, which is wasteful. This could be solved by caching, with the simulations steps as the cache key.

@quaquel
Copy link
Contributor

quaquel commented Feb 6, 2024

As such, I use callable so that I can cover both static and dynamic groups.

That's a helpful clarification. I have to think a bit more about how to implement this. But in essence, a group is an AgentSet, so I am not sure we need a separate class for the group. We do however need some way of capturing the combination of the name and the optional callable that returns the group. This might be a a class or possibly just a simple function.

Regarding with getting the measure as a callable, there is a performance problem: model.measure.value at a given step may be computed several times, once for visualization, and another for data collection, which is wasteful. This could be solved by caching, with the simulations steps as the cache key.

I was playing with Measure this morning, and I was thinking exactly the same thing. That is, I was thinking of adding an _updated flag of some kind that is based on the time of the model. So you only update the measure once each timestep. This is also easily extendable in the future if we decide to add pub/sub.

@Corvince
Copy link
Contributor

Corvince commented Feb 6, 2024

That's a helpful clarification. I have to think a bit more about how to implement this. But in essence, a group is an AgentSet, so I am not sure we need a separate class for the group. We do however need some way of capturing the combination of the name and the optional callable that returns the group. This might be a a class or possibly just a simple function.

How I think about it is that a static group is just an AgentSet, while a dynamic group is something that returns an AgentSet. So we should be able to handle both equally with respect to data collection, but I think its a nice little distinction.

Btw, is it possible to have something like

static_group = model.agents.select(...)
dynamic_group = Group(...)

static_group => AgentSet
dynamic_group => AgentSet

What I mean is calling both groups by their name should return an AgentSet. Is that somehow possible? Maybe we can subclass AgentSet in Group, but keep the dynamic nature? Just tossing in ideas here.

I think that way we could get away with data collector only handling two types of value: Model attributes/measures and AgentSets/Groups or more abstract objects and sequences of objects

Regarding with getting the measure as a callable, there is a performance problem: model.measure.value at a given step may be computed several times, once for visualization, and another for data collection, which is wasteful. This could be solved by caching, with the simulations steps as the cache key.

I was playing with Measure this morning, and I was thinking exactly the same thing. That is, I was thinking of adding an _updated flag of some kind that is based on the time of the model. So you only update the measure once each timestep. This is also easily extendable in the future if we decide to add pub/sub.

I would be cautious with using steps as a cache key. I would expect measure to always return the current value of the measure. It depends on how much statistics we want to provide for measure, but while the idea is to use it primarily for data collection, I wouldn't exclude them from being used inside the model logic. And some measure might change within-step. I know it also depends on how you define step, but so far this is completely inside the hands of the users. For example in some of my models I used step 0 as a calibration step where the model was doing some things until an equilibrium was reached and only than the "real" steps started. Using measures to, well, measure the equilibrium would have been nice, but wouldn't be possible if they are cached to the model step.

@rht
Copy link
Contributor Author

rht commented Feb 7, 2024

And some measure might change within-step. I know it also depends on how you define step, but so far this is completely inside the hands of the users. For example in some of my models I used step 0 as a calibration step where the model was doing some things until an equilibrium was reached and only than the "real" steps started. Using measures to, well, measure the equilibrium would have been nice, but wouldn't be possible if they are cached to the model step.

I see the problem. There is no way to prevent the user from using the measures as if they were internal attributes that are changed several times within a step. One way to do it is to not cache by default, only to cache if the user wants to optimize for performance, and the user may specify their own cache key, just like a Solara's component's dependencies.

@rht
Copy link
Contributor Author

rht commented Feb 7, 2024

Here is an example of the dependencies for Solara components:

components_matplotlib.SpaceMatplotlib(
model, agent_portrayal, dependencies=[current_step.value]
)
. The object components_matplotlib.SpaceMatplotlib is updated only when the dependencies change.

@quaquel
Copy link
Contributor

quaquel commented Feb 7, 2024

What I mean is calling both groups by their name should return an AgentSet. Is that somehow possible? Maybe we can subclass AgentSet in Group, but keep the dynamic nature? Just tossing in ideas here.

We are thinking along similar lines. The issue with attribute-based access while still applying some logic within the attribute access is solvable through properties. And, as we discussed in the pub/sub context, this can be done dynamically. So yes, I believe it is possible to make this work.

I would be cautious with using steps as a cache key. I would expect measure to always return the current value of the measure. It depends on how much statistics we want to provide for measure, but while the idea is to use it primarily for data collection, I wouldn't exclude them from being used inside the model logic. And some measure might change within-step.

I was also concerned about this issue. One solution would be to have some force_update keyword argument that you can pass when doing a get on a measure. This would allow default updating to happen only once each tick, while still offering an override if desired. The drawback is that this might be a bit tricky for users to understand and thus trick them up easily.

The other option is to use pub/sub. This resolves the entire problem because Measure would always reflect the current underlying information on which it operates. To be clear: I favor first developing this new data collection mechanism without pub/sub, while designing it such that we shift to it later if we decide to add it. In fact, it might be an easy way for me to show its value as reqeusted/suggested by @ewout in #1947.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Release notes label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants