
Handling outputs from the same model with different likelihoods / score functions #1403

Open · ben18785 opened this issue Oct 11, 2021 · 29 comments

@ben18785 (Collaborator) commented Oct 11, 2021

PKPD models will likely need outputs from the same model to have different likelihoods, e.g. one output follows a normal noise model while another follows a log-normal one. At the moment, this sort of thing is not possible in PINTS without lots of manual hacking. Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.

I wanted to open the floor on this. Here's a potential proposal.

We create a MasterProblemCollection with the following methods:

  • initialised with a forward model and values as usual. It is also (crucially) initialised with some way of indexing which outputs correspond to which SubProblem (see below). At a simple level, this could be a list [1,1,1,2,2,2,2,3,3,3,3,3], which would indicate that the first three forward model outputs correspond to subproblem 1, the next four to subproblem 2 and the last five to subproblem 3. It is also initialised with a list of time lists: one list of times for each SubProblem.
  • evaluate takes an argument index (in the above example, this would be either 1, 2 or 3) which specifies which output set to return. The first time evaluate is called, the result is cached so that the model is only solved once -- not once for each output set.
  • n_outputs also takes index and returns the relevant number of outputs for the SubProblem
  • times takes index and returns the appropriate time list

We also create a SubOutputProblem:

  • initialised with a MasterProblemCollection and an index
  • evaluate calls MasterProblemCollection.evaluate(index)
  • and so on for n_outputs, times, etc. (a rough sketch of both classes follows below)
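Roughly, I'm imagining something like the sketch below. All names and signatures here are hypothetical; it assumes the usual pints.ForwardModel.simulate(parameters, times) interface returning an (n_times, n_outputs) array, and a very naive cache on the last parameter vector.

import numpy as np

class MasterProblemCollection(object):
    # Sketch only: one forward model, several output sets, each with its own
    # times and data. index_list maps each model output (column) to an output
    # set, e.g. [1, 1, 2] for outputs 0, 1 -> set 1 and output 2 -> set 2.

    def __init__(self, model, times_list, values_list, index_list):
        self._model = model
        self._times_list = [np.asarray(t) for t in times_list]
        self._values_list = [np.asarray(v) for v in values_list]
        self._index_list = np.asarray(index_list)
        self._all_times = np.unique(np.concatenate(self._times_list))
        self._cached_parameters = None
        self._cached_output = None

    def evaluate(self, index, parameters):
        # Solve once over the union of all times, cache, then return only the
        # rows (times) and columns (outputs) belonging to output set `index`
        parameters = np.asarray(parameters)
        if (self._cached_parameters is None
                or not np.array_equal(parameters, self._cached_parameters)):
            self._cached_output = np.asarray(
                self._model.simulate(parameters, self._all_times))
            self._cached_parameters = parameters
        rows = np.isin(self._all_times, self._times_list[index - 1])
        cols = self._index_list == index
        return self._cached_output[np.ix_(rows, cols)]

    def times(self, index):
        return self._times_list[index - 1]

    def values(self, index):
        return self._values_list[index - 1]

    def n_outputs(self, index):
        return int(np.sum(self._index_list == index))

class SubProblem(object):
    # Sketch only: exposes one output set through the usual problem methods

    def __init__(self, collection, index):
        self._collection = collection
        self._index = index

    def evaluate(self, parameters):
        return self._collection.evaluate(self._index, parameters)

    def times(self):
        return self._collection.times(self._index)

    def values(self):
        return self._collection.values(self._index)

    def n_outputs(self):
        return self._collection.n_outputs(self._index)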

Example use case

User has a forward model with three outputs and would like to model the first two using a GaussianLogLikelihood and the third with a LogNormalLogLikelihood.

They would do:

times_1 = [1, 2, 4, 6]
times_2 = [3, 5]
times = [times_1, times_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times, values, index_list)  # values: list of data arrays, one per output set

subproblem_1 = SubProblem(master, 1)
subproblem_2 = SubProblem(master, 2)

loglikelihood_1 = GaussianLogLikelihood(subproblem_1)
loglikelihood_2 = LogNormalLogLikelihood(subproblem_2)
logprior_1 = ...
logprior_2 = ...

logposterior_1 = LogPosterior(loglikelihood_1, logprior_1)
logposterior_2 = LogPosterior(loglikelihood_2, logprior_2)

The user could then create a wrapper LogPosterior class that just returns the sum of logposterior_1 and logposterior_2, which they could use for inference (see the sketch below).
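For instance (a hypothetical sketch, not an existing PINTS class; it assumes both log-posteriors are defined over the same parameter vector, which would need some care since each likelihood typically adds its own noise parameters):

class SumOfLogPosteriors(object):
    # Hypothetical wrapper: sums log-posteriors sharing one parameter vector

    def __init__(self, log_posteriors):
        self._log_posteriors = list(log_posteriors)
        self._n_parameters = self._log_posteriors[0].n_parameters()

    def n_parameters(self):
        return self._n_parameters

    def __call__(self, parameters):
        return sum(f(parameters) for f in self._log_posteriors)

logposterior = SumOfLogPosteriors([logposterior_1, logposterior_2])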

@martinjrobins @MichaelClerx @chonlei (and anyone else!) interested to hear thoughts on this

@MichaelClerx (Member) commented Oct 12, 2021

PKPD models will likely need outputs from the same model with different likelihoods. E.g. one output follows a normal model; another follows a log-normal.

I don't understand what you mean here. The PKPD model would be an ODE model, right? So where do the distributions come in?

Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.

Just for my intuition, can you give me examples of why you'd want or need to do this?

@martinjrobins (Member)

Just a note to say we will definitely need this for the pkpdapp inference. Different outputs of the ODE model will have different noise models, hence the need for different distributions on different outputs. The different outputs might also be measured at different times: e.g. if you are measuring the concentration of the drug as well as the size of a tumour, these wouldn't necessarily be at the same times.

@ben18785 (Collaborator, Author)

Thanks @MichaelClerx

Yes, PKPD models are just ODE models. What I mean is that, for these types of model, when you want to do inference, it often seems to be the case that some of the outputs would be modelled using Gaussian distributions and others with log-normal distributions.

Re: different outputs having different measurement times, I can give you a good example from epidemiology, but there are lots of cases in PKPD modelling too, it seems, due to irregular measurement of different quantities (e.g. tumour size and inflammation markers). For SEIR-type modelling, it is often the case that some outputs (e.g. cases and deaths) are measured regularly (e.g. daily) whereas others are measured less regularly (e.g. the proportion of the population who are seropositive).

@MichaelClerx (Member)

Thanks both! Worth keeping #1372 in mind too I guess

@MichaelClerx (Member) commented Oct 12, 2021

Caching sounds useful whether or not we have a split analysis of the model output, so... separate ticket?
I'm assuming you want to evaluate all times in a single simulation? Then I'd say there are two obvious ways:

  1. Like you suggest, construct some kind of SuperProblem from SubProblems (let's shy away from "master" for now?). Then this problem would combine the requirements of individual subproblems, run a single sim (or retrieve a cached output), and then split the output up again. E.g. it would have to combine your times_1 = [1, 2, 4, 6] and times_2 = [3, 5] into times=[1,2,3,4,5,6] and then be clever enough to split the resulting data sets again.

  2. Instead, we could also consider filtering. E.g. have a class ProblemFilter(problem, filter) where filter is something like a boolean array of the same size as the expected output, or a lambda that the user provides. (Or to make it friendlier you could have explicit filters for times and for outputs, but that wouldn't be quite as flexible?) Then you'd (1) create a MultiSeriesProblem as usual (but with caching=True), (2) create a bunch of filters, and (3) create a composed likelihood (as in your example). A rough sketch of the filter idea is below.
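For illustration, the filter argument might look something like this (ProblemFilter is hypothetical, and an (n_times, n_outputs) simulation output is assumed):

import numpy as np

n_times, n_outputs = 6, 3
mask = np.zeros((n_times, n_outputs), dtype=bool)
mask[:, :2] = True  # keep outputs 0 and 1 at every time point

# sub_problem_a = ProblemFilter(problem, mask)                # boolean filter
# sub_problem_b = ProblemFilter(problem, lambda y: y[:, 2])   # lambda filter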

@ben18785 (Collaborator, Author)

Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!

Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?

@ben18785 (Collaborator, Author)

@MichaelClerx I'm going to give my suggestion a go here as I need to make progress for the PKPD app. But I look forward to your thoughts on the PR!

@MichaelClerx (Member)

Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!

Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?

Hold on! Can I try explaining the filter thing a bit?

@ben18785 (Collaborator, Author)

Sure!

@MichaelClerx (Member)

Basically, in your proposal you start by creating a superproblem that then needs to combine some sub problems, run a sim, and split its output again.
I'm wondering why you don't cut out the first step, and just create methods to split a problem's output into sub outputs?

So you'd have an ordinary MultiSeriesProblem, but then you'd make sub problems based on its output, and finally combine those sub problems into a single likelihood (although you could hide output as well, using this method)

@MichaelClerx (Member) commented Oct 18, 2021

All the splitting code would need to do is, e.g.

def evaluate(self, parameters):
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = all_values[self._numpy_filter]  # boolean/index mask
    return selected_values

or

def evaluate(self, parameters):
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = self._very_flexible_filter(all_values)  # user-supplied callable
    return selected_values

@MichaelClerx (Member)

The thinking behind this is:

  • we've already got ways to combine problems into a single function
  • having it work with any old multi (or single) series problem is more flexible than requiring a special type
  • this system would potentially allow for more complicated filters than in your original proposal (the user just passes in either a numpy filter or a lambda function)
  • the efficiency comes from using caching in the problem class separately, so the code we write to do that will be useful to others as well

@ben18785 (Collaborator, Author)

Ok, thanks. So would the MultiSeriesProblem take all the times where all outputs were evaluated? E.g.:

times_1 = [1, 2, 4, 6] # output 1
times_2 = [3, 5] # output 2
times_all = [1,2,3,4,5,6]
problem = MultiOutputProblem(model, times_all, values_all)

# construct sub problems which are then used to construct individual log-likelihoods
subproblem_1 = SubProblem(problem, filter_1)
...

How would you handle values_all? In the current MultiOutputProblem this needs to be a numpy array of size n_times by n_outputs. But in the inference problems I'm thinking about we won't have observations for all times for all outputs.

@MichaelClerx (Member)

Oooh that's a good point, I was thinking about the simulation much more than the data 😅

Ideally I guess you'd have a None in the data set to denote a missing value, but you may need to use a NaN so that it fits into a numpy array of doubles...

But if we start allowing that, then all the likelihoods etc. would need to do some clever filtering? Or... do we document that NaN may appear in a multi-output series problem's values, but that we expect the user to construct a clever score/likelihood so that this isn't a problem?
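Purely as an illustration of the NaN route (not existing behaviour), the data array and the mask a "clever" likelihood would need might look like:

import numpy as np

times_all = np.array([1, 2, 3, 4, 5, 6])
values_all = np.array([
    [2.4,    np.nan],   # t=1: only output 1 observed
    [3.4,    np.nan],   # t=2
    [np.nan, -0.3],     # t=3: only output 2 observed
    [5.5,    np.nan],   # t=4
    [np.nan,  0.5],     # t=5
    [5.5,    np.nan],   # t=6
])  # shape (n_times, n_outputs)

observed = ~np.isnan(values_all)  # the filtering each likelihood would apply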

@MichaelClerx (Member)

So would the MultiSeriesProblem take all the times where all outputs were evaluated?

Yup!

@MichaelClerx (Member)

How were you planning to handle data in the super/sub problem set-up?

@ben18785 (Collaborator, Author)

So the way I'd handle data would be the same way as for times:

## times and values for first output
times_1 = [1, 2, 4, 6]
values_1 = [2.4, 3.4, 5.5, 5.5]

## times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

## combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times_all, values_all, index_list)

So, we could put nans in the numpy arrays, although this feels a little clunky. I do wonder though whether this filtering means users are more likely to make mistakes -- specifically if their filters don't grab the right values?

@MichaelClerx (Member)

Is index_list correct in that example?

@ben18785 (Collaborator, Author)

No, good spot. For this case, it would just be index_list=[1,2].

If you wanted to create subproblems which corresponded to multioutputs, I guess you could do:

## times and values for two-dimensional outputs
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5],[4.5,......]]

## times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

## combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times_all, values_all, index_list)

then the index list I gave would make more sense.

@MichaelClerx (Member)

The efficiency bit does make this tricky!
I'm really tempted to say you should solve it on the ForwardModel level, not the Problem level, so that you end up with

subproblem1 = Problem(submodel1, times1, values1)
subproblem2 = Problem(submodel2, times2, values2)

but I can't think of a generic way to split a forwardmodel into 2 submodels that recognise they need to use a fixed time set and then filter

@MichaelClerx (Member)

model = MyNiceModel()
model_with_fixed_times = FixedTimesModel(model, times)
submodel1 = FilteredModel(model_with_fixed_times, times, outputs)  # can check its times are a subset of the parent model's times
submodel2 = FilteredModel(...)

yuck!

@ben18785 (Collaborator, Author)

Interesting -- to me, because the need for all of this stems from the inference problem (i.e. having different likelihoods for different outputs), I would be more tempted to solve it at that level than at the forward-model level...?

@MichaelClerx (Member) commented Oct 18, 2021

I think the root problem isn't so much different outputs, but different time series. We've gone from [(t1, z1), (t2, z2), ..., (tN, zN)] (where z is a scalar or a vector) to [[(t11, z11), (t12, z12), ..., (t1N, z1N)], [(t21, z21), (t22, z22), ..., (t2M, z2M)], ...]

To me that means we now have a list of Problems, for which we can define a list of likelihoods which we can then combine into a single likelihood using existing code.
But we've added the question of how to write a forward model that will have its output split over multiple Problems

@MichaelClerx (Member)

I suppose you were saying the same when you called it a ProblemCollection :D

@MichaelClerx (Member)

OK I'm coming round to your original proposal :D

Maybe we can drop index_list though, by assuming the outputs don't overlap, and that the user has control over the forwardmodel, and so can ensure that the outputs are ordered as e.g. [0, 0, 1] instead of [0, 1, 0]?

Then it could be:

# Time series for measurable A (2d)
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5],[4.5,......]]

# Time series for measurable B (1d)
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

# Combine all
collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)

which you'd implement as e.g.

def __init__(self, forward_model, *args):
    if len(args) < 2:
        raise ValueError('Must supply at least one time series.')
    if len(args) % 2 != 0:
        raise ValueError('Must supply times and values for each time series.')
    timeses = []
    valueses = []
    index_list = []
    for i in range(len(args) // 2):
        times_i = np.asarray(args[2 * i])
        values_i = np.asarray(args[2 * i + 1])  # shape (n_times, n_outputs_i)
        timeses.append(times_i)
        valueses.append(values_i)
        index_list.extend([i] * values_i.shape[1])
    times = np.unique(np.concatenate(timeses))

that kind of thing?

@ben18785 (Collaborator, Author)

That could work. But how would you point a particular SubProblem to a particular subset of outputs? Would you do:

collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)

## pertaining to times_1, values_1
problem_1 = SubProblem(collection, 1)

## pertaining to times_2, values_2
problem_2 = SubProblem(collection, 2)

I'd have thought keeping an explicit list of indices would make it safer than the implicitness above?

@ben18785 (Collaborator, Author)

I'd also prefer to call it a SharedProblemCollection or (as I had above) ProblemCollection since it's not actually going to be of type pints.Problem.

@MichaelClerx (Member)

I'd have thought keeping an explicit list of indices would make it safer than the implicitness above?

I just thought it'd be easier to have the problems all be disjoint.

You'd get a subproblem from the collection, I think:

problem_1 = collection.subproblem(0)
problem_2 = collection.subproblem(1)
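A sketch of how that accessor could work, assuming disjoint, consecutively ordered output columns as above and values stored as (n_times, n_outputs_i) arrays (the SubProblem constructor shown here is hypothetical):

def subproblem(self, i):
    # Series i owns the i-th consecutive block of output columns, so no
    # index_list is needed
    start = sum(v.shape[1] for v in self._valueses[:i])
    stop = start + self._valueses[i].shape[1]
    return SubProblem(
        collection=self,
        times=self._timeses[i],
        values=self._valueses[i],
        output_columns=slice(start, stop),
    )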

@ben18785 (Collaborator, Author)

I like this!

problem_1 = collection.subproblem(0)
problem_2 = collection.subproblem(1)
