
Handling outputs from the same model with different likelihoods / score functions #1403

Open · ben18785 opened this issue Oct 11, 2021 · 29 comments

@ben18785 (Collaborator) commented Oct 11, 2021

PKPD models will likely need outputs from the same model to have different likelihoods, e.g. one output follows a normal noise model while another follows a log-normal one. At the moment, this sort of thing is not possible in PINTS without lots of manual hacking. Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.

I wanted to open the floor on this. Here's a potential proposal.

We create a MasterProblemCollection with the following methods:

  • initialised with a forward model and values as usual. It is also (crucially) initialised with some way of indexing which outputs correspond to which SubProblem (see below). At a simple level, this could be a list [1,1,1,2,2,2,2,3,3,3,3,3], which would indicate that the first three forward model outputs correspond to subproblem 1, the next four to subproblem 2 and the last five to subproblem 3. It is also initialised with a list of time lists: one list of times for each SubProblem.
  • evaluate takes an argument index (in the above example, this would be either 1, 2 or 3) which specifies which output set to return. The first time evaluate is called, the result is cached so that the model is only solved once -- not once for each output set.
  • n_outputs also takes index and returns the relevant number of outputs for the SubProblem
  • times takes index and returns the appropriate time list

We also create a SubOutputProblem:

  • initialised with a MasterProblemCollection and an index
  • evaluate calls MasterProblemCollection.evaluate(index)
  • and so on for n_outputs, times, etc. (a rough sketch of both classes follows below)
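Roughly, I'm imagining something like the sketch below. All names and signatures here are hypothetical; it assumes the usual pints.ForwardModel.simulate(parameters, times) interface returning an (n_times, n_outputs) array, and a very naive cache on the last parameter vector.

import numpy as np

class MasterProblemCollection(object):
    # Sketch only: one forward model, several output sets, each with its own
    # times and data. index_list maps each model output (column) to an output
    # set, e.g. [1, 1, 2] for outputs 0, 1 -> set 1 and output 2 -> set 2.

    def __init__(self, model, times_list, values_list, index_list):
        self._model = model
        self._times_list = [np.asarray(t) for t in times_list]
        self._values_list = [np.asarray(v) for v in values_list]
        self._index_list = np.asarray(index_list)
        self._all_times = np.unique(np.concatenate(self._times_list))
        self._cached_parameters = None
        self._cached_output = None

    def evaluate(self, index, parameters):
        # Solve once over the union of all times, cache, then return only the
        # rows (times) and columns (outputs) belonging to output set `index`
        parameters = np.asarray(parameters)
        if (self._cached_parameters is None
                or not np.array_equal(parameters, self._cached_parameters)):
            self._cached_output = np.asarray(
                self._model.simulate(parameters, self._all_times))
            self._cached_parameters = parameters
        rows = np.isin(self._all_times, self._times_list[index - 1])
        cols = self._index_list == index
        return self._cached_output[np.ix_(rows, cols)]

    def times(self, index):
        return self._times_list[index - 1]

    def values(self, index):
        return self._values_list[index - 1]

    def n_outputs(self, index):
        return int(np.sum(self._index_list == index))

class SubProblem(object):
    # Sketch only: exposes one output set through the usual problem methods

    def __init__(self, collection, index):
        self._collection = collection
        self._index = index

    def evaluate(self, parameters):
        return self._collection.evaluate(self._index, parameters)

    def times(self):
        return self._collection.times(self._index)

    def values(self):
        return self._collection.values(self._index)

    def n_outputs(self):
        return self._collection.n_outputs(self._index)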

Example use case

User has a forward model with three outputs and would like to model the first two using a GaussianLogLikelihood and the third with a LogNormalLogLikelihood.

They would do:

times_1 = [1, 2, 4, 6]
times_2 = [3, 5]
times = [times_1, times_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times, values, index_list)  # values: list of data arrays, one per output set

subproblem_1 = SubProblem(master, 1)
subproblem_2 = SubProblem(master, 2)

loglikelihood_1 = GaussianLogLikelihood(subproblem_1)
loglikelihood_2 = LogNormalLogLikelihood(subproblem_2)
logprior_1 = ...
logprior_2 = ...

logposterior_1 = LogPosterior(loglikelihood_1, logprior_1)
logposterior_2 = LogPosterior(loglikelihood_2, logprior_2)

The user could then create a wrapper LogPosterior class that just returns the sum of logposterior_1 and logposterior_2, which they could use for inference (see the sketch below).
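For instance (a hypothetical sketch, not an existing PINTS class; it assumes both log-posteriors are defined over the same parameter vector, which would need some care since each likelihood typically adds its own noise parameters):

class SumOfLogPosteriors(object):
    # Hypothetical wrapper: sums log-posteriors sharing one parameter vector

    def __init__(self, log_posteriors):
        self._log_posteriors = list(log_posteriors)
        self._n_parameters = self._log_posteriors[0].n_parameters()

    def n_parameters(self):
        return self._n_parameters

    def __call__(self, parameters):
        return sum(f(parameters) for f in self._log_posteriors)

logposterior = SumOfLogPosteriors([logposterior_1, logposterior_2])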

@martinjrobins @MichaelClerx @chonlei (and anyone else!) interested to hear thoughts on this

@MichaelClerx (Member) commented Oct 12, 2021

PKPD models will likely need outputs from the same model with different likelihoods. E.g. one output follows a normal model; another follows a log-normal.

I don't understand what you mean here. The PKPD model would be an ODE model, right? So where do the distributions come in?

Similarly, it is not possible to natively handle an inference problem with multiple outputs where those outputs are measured at different times.

Just for my intuition, can you give me examples of why you'd want or need to do this?

@martinjrobins (Member)

Just a note to say we will definitely need this for the pkpdapp inference. Different outputs of the ODE model will have different noise models, hence the need for different distributions on different outputs. The different outputs might also be measured at different times: e.g. if you are measuring the concentration of the drug as well as the size of a tumour, these wouldn't necessarily be at the same times.

@ben18785 (Collaborator, Author)

Thanks @MichaelClerx

Yes, PKPD models are just ODE models. What I mean is that, for these types of model, when you want to do inference, it often seems to be the case that some of the outputs would be modelled using Gaussian distributions and others with log-normal distributions.

Re: different outputs having different measurement times, I can give you a good example from epidemiology, but there are lots of cases in PKPD modelling too, it seems, due to irregular measurement of different quantities (e.g. tumour size and inflammation markers). For SEIR-type modelling, it is often the case that some outputs (e.g. cases and deaths) are measured regularly (e.g. daily) whereas others are measured less regularly (e.g. the proportion of the population who are seropositive).

@MichaelClerx (Member)

Thanks both! Worth keeping #1372 in mind too I guess

@MichaelClerx (Member) commented Oct 12, 2021

Caching sounds useful whether or not we have a split analysis of the model output, so... separate ticket?
I'm assuming you want to evaluate all times in a single simulation? Then I'd say there are two obvious ways:

  1. Like you suggest, construct some kind of SuperProblem from SubProblems (let's shy away from "master" for now?). Then this problem would combine the requirements of individual subproblems, run a single sim (or retrieve a cached output), and then split the output up again. E.g. it would have to combine your times_1 = [1, 2, 4, 6] and times_2 = [3, 5] into times=[1,2,3,4,5,6] and then be clever enough to split the resulting data sets again.

  2. Instead, we could also consider filtering. E.g. have a class ProblemFilter(problem, filter) where filter is something like a boolean array of the same size as the expected output, or a lambda that the user provides. (Or to make it friendlier you could have explicit filters for times and for outputs, but that wouldn't be quite as flexible?) Then you'd (1) create a MultiSeriesProblem as usual (but with caching=True), (2) create a bunch of filters, and (3) create a composed likelihood (as in your example). A rough sketch of the filter idea is below.
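For illustration, the filter argument might look something like this (ProblemFilter is hypothetical, and an (n_times, n_outputs) simulation output is assumed):

import numpy as np

n_times, n_outputs = 6, 3
mask = np.zeros((n_times, n_outputs), dtype=bool)
mask[:, :2] = True  # keep outputs 0 and 1 at every time point

# sub_problem_a = ProblemFilter(problem, mask)                # boolean filter
# sub_problem_b = ProblemFilter(problem, lambda y: y[:, 2])   # lambda filter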

@ben18785 (Collaborator, Author)

Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!

Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?

@ben18785 (Collaborator, Author)

@MichaelClerx I'm going to give my suggestion a go here as I need to make progress for the PKPD app. But I look forward to your thoughts on the PR!

@MichaelClerx (Member)

Yep, spot on with what I was thinking re: 1 and, yes, happy to shy away from "master"!

Re: filters, I'm not sure I get ya? Also, currently an issue with MultiSeriesProblem is that this doesn't allow different time arrays for each output, so I guess that'd need to be re-engineered?

Hold on! Can I try explaining the filter thing a bit?

@ben18785 (Collaborator, Author)

Sure!

@MichaelClerx (Member)

Basically, in your proposal you start by creating a superproblem that then needs to combine some sub problems, run a sim, and split its output again.
I'm wondering why you don't cut out the first step, and just create methods to split a problem's output into sub outputs?

So you'd have an ordinary MultiSeriesProblem, but then you'd make sub problems based on its output, and finally combine those sub problems into a single likelihood (although you could hide output as well, using this method)

@MichaelClerx (Member) commented Oct 18, 2021

All the splitting code would need to do is, e.g.

def evaluate(self, parameters):
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = all_values[self._numpy_filter]  # boolean/index mask
    return selected_values

or

def evaluate(self, parameters):
    all_values = self._parent_problem.evaluate(parameters)
    selected_values = self._very_flexible_filter(all_values)  # user-supplied callable
    return selected_values

@MichaelClerx (Member)

The thinking behind this is:

  • we've already got ways to combine problems into a single function
  • having it work with any old multi (or single) series problem is more flexible than requiring a special type
  • this system would potentially allow for more complicated filters than in your original proposal (the user just passes in either a numpy filter or a lambda function)
  • the efficiency comes from using caching in the problem class separately, so the code we write to do that will be useful to others as well

@ben18785 (Collaborator, Author)

Ok, thanks. So would the MultiSeriesProblem take all the times where all outputs were evaluated? E.g.:

times_1 = [1, 2, 4, 6] # output 1
times_2 = [3, 5] # output 2
times_all = [1,2,3,4,5,6]
problem = MultiOutputProblem(model, times_all, values_all)

# construct sub problems which are then used to construct individual log-likelihoods
subproblem_1 = SubProblem(problem, filter_1)
...

How would you handle values_all? In the current MultiOutputProblem this needs to be a numpy array of size n_times by n_outputs. But in the inference problems I'm thinking about we won't have observations for all times for all outputs.

@MichaelClerx (Member)

Oooh that's a good point, I was thinking about the simulation much more than the data 😅

Ideally I guess you'd have a None in the data set to denote a missing value, but you may need to use a NaN so that it fits into a numpy array of doubles...

But if we start allowing that, then all the likelihoods etc. would need to do some clever filtering? Or... do we document that NaN may appear in a multi-output series problem's values, but that we expect the user to construct a clever score/likelihood so that this isn't a problem?
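Purely as an illustration of the NaN route (not existing behaviour), the data array and the mask a "clever" likelihood would need might look like:

import numpy as np

times_all = np.array([1, 2, 3, 4, 5, 6])
values_all = np.array([
    [2.4,    np.nan],   # t=1: only output 1 observed
    [3.4,    np.nan],   # t=2
    [np.nan, -0.3],     # t=3: only output 2 observed
    [5.5,    np.nan],   # t=4
    [np.nan,  0.5],     # t=5
    [5.5,    np.nan],   # t=6
])  # shape (n_times, n_outputs)

observed = ~np.isnan(values_all)  # the filtering each likelihood would apply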

@MichaelClerx (Member)

So would the MultiSeriesProblem take all the times where all outputs were evaluated?

Yup!

@MichaelClerx (Member)

How were you planning to handle data in the super/sub problem set-up?

@ben18785 (Collaborator, Author)

So the way I'd handle data would be the same way as for times:

## times and values for first output
times_1 = [1, 2, 4, 6]
values_1 = [2.4, 3.4, 5.5, 5.5]

## times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

## combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times_all, values_all, index_list)

So, we could put nans in the numpy arrays, although this feels a little clunky. I do wonder though whether this filtering means users are more likely to make mistakes -- specifically if their filters don't grab the right values?

@MichaelClerx (Member)

Is index_list correct in that example?

@ben18785 (Collaborator, Author)

No, good spot. For this case, it would just be index_list=[1,2].

If you wanted to create subproblems which corresponded to multioutputs, I guess you could do:

## times and values for two-dimensional outputs
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5],[4.5,......]]

## times and values for second output
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

## combine all
times_all = [times_1, times_2]
values_all = [values_1, values_2]
index_list = [1, 1, 2]
master = MasterProblemCollection(forward_model, times_all, values_all, index_list)

then the index list I gave would make more sense.

@MichaelClerx (Member)

The efficiency bit does make this tricky!
I'm really tempted to say you should solve it on the ForwardModel level, not the Problem level, so that you end up with

subproblem1 = Problem(submodel1, times1, values1)
subproblem2 = Problem(submodel2, times2, values2)

but I can't think of a generic way to split a forwardmodel into 2 submodels that recognise they need to use a fixed time set and then filter

@MichaelClerx (Member)

model = MyNiceModel()
model_with_fixed_times = FixedTimesModel(model, times)
submodel1 = FilteredModel(model_with_fixed_times, times, outputs)  # can check its times are a subset of the parent model's times
submodel2 = FilteredModel(...)

yuck!

@ben18785 (Collaborator, Author)

Interesting -- to me, because the need for all of this stems from the inference problem (i.e. having different likelihoods for different outputs), I would be more tempted to solve it at that level than at the forward-model level...?

@MichaelClerx (Member) commented Oct 18, 2021

I think the root problem isn't so much different outputs, but different time series. We've gone from [(t1, z1), (t2, z2), ..., (tN, zN)] (where z is a scalar or a vector) to [[(t11, z11), (t12, z12), ..., (t1N, z1N)], [(t21, z21), (t22, z22), ..., (t2M, z2M)], ...]

To me that means we now have a list of Problems, for which we can define a list of likelihoods which we can then combine into a single likelihood using existing code.
But we've added the question of how to write a forward model that will have its output split over multiple Problems

@MichaelClerx (Member)

I suppose you were saying the same when you called it a ProblemCollection :D

@MichaelClerx (Member)

OK I'm coming round to your original proposal :D

Maybe we can drop index_list though, by assuming the outputs don't overlap, and that the user has control over the forwardmodel, and so can ensure that the outputs are ordered as e.g. [0, 0, 1] instead of [0, 1, 0]?

Then it could be:

# Time series for measurable A (2d)
times_1 = [1, 2, 4, 6]
values_1 = [[2.4, 3.4, 5.5, 5.5],[4.5,......]]

# Time series for measurable B (1d)
times_2 = [3, 5]
values_2 = [-0.3, 0.5]

# Combine all
collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)

which you'd implement as e.g.

def __init__(self, forward_model, *args):
    if len(args) < 2:
        raise ValueError('Must supply at least one time series.')
    if len(args) % 2 != 0:
        raise ValueError('Must supply times and values for each time series.')
    timeses = []
    valueses = []
    index_list = []
    for i in range(len(args) // 2):
        times_i = np.asarray(args[2 * i])
        values_i = np.asarray(args[2 * i + 1])  # shape (n_times, n_outputs_i)
        timeses.append(times_i)
        valueses.append(values_i)
        index_list.extend([i] * values_i.shape[1])
    times = np.unique(np.concatenate(timeses))

that kind of thing?

@ben18785 (Collaborator, Author)

That could work. But how would you point a particular SubProblem to a particular subset of outputs? Would you do:

collection = SharedModelProblem(forward_model, times_1, values_1, times_2, values_2)

## pertaining to times_1, values_1
problem_1 = SubProblem(collection, 1)

## pertaining to times_2, values_2
problem_2 = SubProblem(collection, 2)

I'd have thought keeping an explicit list of indices would make it safer than the implicitness above?

@ben18785 (Collaborator, Author)

I'd also prefer to call it a SharedProblemCollection or (as I had above) ProblemCollection since it's not actually going to be of type pints.Problem.

@MichaelClerx (Member)

I'd have thought keeping an explicit list of indices would make it safer than the implicitness above?

I just thought it'd be easier to have the problems all be disjoint.

You'd get a subproblem from the collection, I think:

problem_1 = collection.subproblem(0)
problem_2 = collection.subproblem(1)
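A sketch of how that accessor could work, assuming disjoint, consecutively ordered output columns as above and values stored as (n_times, n_outputs_i) arrays (the SubProblem constructor shown here is hypothetical):

def subproblem(self, i):
    # Series i owns the i-th consecutive block of output columns, so no
    # index_list is needed
    start = sum(v.shape[1] for v in self._valueses[:i])
    stop = start + self._valueses[i].shape[1]
    return SubProblem(
        collection=self,
        times=self._timeses[i],
        values=self._valueses[i],
        output_columns=slice(start, stop),
    )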

@ben18785 (Collaborator, Author)

I like this!

problem_1 = collection.subproblem(0)
problem_2 = collection.subproblem(1)
