
ntk_mean outputs NaN when the number of training samples is increased #198

Open · zhangbububu opened this issue Jan 28, 2024 · 3 comments
Labels: question (Further information is requested)


zhangbububu commented Jan 28, 2024

Hi, I've run into a confusing problem:

import jax
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

# Without this, JAX arrays stay float32 and .astype(jnp.float64) is a no-op.
jax.config.update('jax_enable_x64', True)

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(do_stabilize=True),
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(do_stabilize=True),
    stax.Dense(1, W_std=1.5, b_std=0.05)
)

s = 10
l = jnp.pi * -s
r = jnp.pi * s
N_tr = 100
N_te = 5
train_xs = jnp.linspace(l, r, N_tr).reshape(-1, 1).astype(jnp.float64)
# Cast the full sum (the original cast applied only to the second term).
train_ys = (jnp.sin(train_xs) + jnp.sin(2 * train_xs)).astype(jnp.float64)
test_xs = jnp.linspace(l, r, N_te).reshape(-1, 1).astype(jnp.float64)

predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, train_xs,
                                                      train_ys, diag_reg=1e-4)
ntk_mean, ntk_covariance = predict_fn(x_test=test_xs, get='ntk',
                                      compute_cov=True)
print(f'{N_tr=}, {ntk_mean=}')



If I increase the number of training samples (N_tr), I get an all-NaN ntk_mean.


romanngg (Contributor) commented:

I think the reason is that this 1D function is hard to fit with a Relu kernel; sampling only 15 points makes for a simpler training objective, so it can be fit with a lower diagonal regularizer. You can avoid the NaNs by increasing diag_reg, which I did below, but as you can see it's a poor fit in any case. (The NTK prediction is orange, with 1000 test points sampled.)

1000 training points, diag_reg=1e-2: [plot]
100 training points, diag_reg=1e-3: [plot]
15 training points, diag_reg=1e-4: [plot]
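
For reference, the only change relative to the snippet above is the diag_reg argument passed to nt.predict.gradient_descent_mse_ensemble (a minimal sketch; 1e-2 corresponds to the 1000-training-point plot):

# A larger diag_reg adds more jitter to the diagonal of the train-train
# kernel, which keeps the linear solve well-conditioned as N_tr grows.
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, train_xs, train_ys, diag_reg=1e-2)
ntk_mean, ntk_covariance = predict_fn(x_test=test_xs, get='ntk',
                                      compute_cov=True)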

@romanngg added the question (Further information is requested) label on Jan 28, 2024
zhangbububu (Author) commented Jan 29, 2024

@romanngg

Thank you very much for your careful answer.

I am currently running similar experiments. Can you suggest some ways to make the NTK fit complex time series better?

romanngg (Contributor) commented:

I guess for this particular example, knowing your training targets, a periodic nonlinearity would fit better (stax.Sin(), diag_reg=1e-4):

[plot of the Sin-nonlinearity fit]
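
Concretely, the swap amounts to replacing the Relu layers in the original snippet with Sin (a sketch, keeping the original widths and weight/bias standard deviations):

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Sin(),  # periodic nonlinearity
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Sin(),
    stax.Dense(1, W_std=1.5, b_std=0.05)
)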

Otherwise, trying different architectures and plotting predictions or draws from the prior would be a good way to build intuition for what works best. Note that for time series data of shape [batch_size, time_duration, n_features], I imagine you may want to use a 1D convolution (stax.Conv / stax.ConvLocal) over the time_duration axis, to incorporate time locality into your model; a sketch follows below.
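
As a rough sketch of that suggestion (the widths, filter shape, padding, and the Flatten-plus-Dense readout here are illustrative assumptions, not tested settings):

# Inputs assumed to have shape [batch_size, time_duration, n_features];
# a length-1 filter_shape tuple makes the convolution 1D over time.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Conv(128, filter_shape=(5,), padding='SAME', W_std=1.5, b_std=0.05),
    stax.Relu(),  # conv slides over the time_duration axis
    stax.Conv(128, filter_shape=(5,), padding='SAME', W_std=1.5, b_std=0.05),
    stax.Relu(),
    stax.Flatten(),  # collapse the time axis before the readout
    stax.Dense(1, W_std=1.5, b_std=0.05)
)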
