
Error: No automatic parser available for 7b/. #315

simon-tarr opened this issue Oct 1, 2018 · 13 comments

@simon-tarr

What does the error Error: No automatic parser available for 7b/. mean?

I frequently see this after a job has completed. When it occurs I have to rebuild the pool in order to resume analysis.

@simon-tarr
Author

simon-tarr commented Oct 1, 2018

Today this error isn't just appearing frequently - it's happening every hour or so, without fail. Has something changed somewhere on the Azure backend to cause this error? My pools are booting fine and I'll have a few results returned, then bam: Error: No automatic parser available for 7b/.

Really struggling to get anything done with Azure and this package at the moment, especially in tandem with issue #314!

@brnleehng
Collaborator

Hi Simon,

The error comes from httr, the HTTP request package. It happens when httr's content parser isn't able to parse the HTTP response. For these jobs, are some of the tasks failing? If you download the results straight from Azure Storage, are you getting the correct output?

Nothing has changed in the rAzureBatch package since August...
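For reference, here's a minimal sketch (not rAzureBatch's actual call) of where that message originates: httr's content(x, as = "parsed") picks a parser based on the response's Content-Type header and stops with exactly this error when the type isn't one it recognises (here the garbled "7b/"):

library(httr)

# content(as = "parsed") dispatches on the response's Content-Type header
resp <- GET("https://httpbin.org/json")
headers(resp)[["content-type"]]      # e.g. "application/json"
str(content(resp, as = "parsed"))    # parses fine

# If the header came back as a malformed type such as "7b/", httr's parse_auto()
# would stop with: Error: No automatic parser available for 7b/.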

Thanks,
Brian

@simon-tarr
Author

simon-tarr commented Oct 1, 2018

Hi Brian, none of the tasks are failing, no. At least not according to the output in the R console. I see this:

Id: job2018
chunkSize: 130
enableCloudCombine: TRUE
packages: 
	dismo; ncdf4; raster; 
githubPackages: 
	simon-tarr/NicheMapR; 
errorHandling: pass
wait: TRUE
autoDeleteJob: TRUE

Submitting tasks (511/511)
Submitting merge task. . .
Job Preparation Status: Package(s) being installed
Waiting for tasks to complete. . .
| Progress: 100.00% (511/511) | Running: 0 | Queued: 0 | Completed: 511 | Failed: 0 ||
Tasks have completed. Merging results.. Completed.
Error: No automatic parser available for 7b/.

Could this happen if my internet connection drops at the same moment the package is attempting to download the results? My uni internet connection is far too stable for this to be the issue all the time, though. Today, for example, I have received this error about 15 times.

@simon-tarr
Author

simon-tarr commented Oct 1, 2018

Some extra information: I first noticed the error ~2 weeks ago. It happened just the once, so I shrugged it off thinking it was a one-off. Over the last 3-4 days, however, more than ~30% of my jobs have resulted in this error. Today? I'd estimate that most, if not all, have stopped at some point with this error.

Is it possible that it's something to do with the way I'm writing my code? I've essentially wrapped my Azure code/loop within another loop that iterates through all the species I'm attempting to analyse. Something like this:

  for (i in 1:nrow(all_my_species)){ # Outer loop is by species

    inner_decade_loop<-list()

    for(j in 1:length(no_years)){ # doAzureParallel runs within the function "run_models"
      inner_decade_loop[[j]]<-run_models(decade=decade[[j]], model=model, rcp=rcp, global_points = global_points, mass=working_species[i,4], ctmax=working_species[i,2], ctmin=working_species[i,3], shape=working_species[i,5])
    }

    # Turns each decade list into a matrix, merges xy coordinates, renames columns, rounds to two decimal places
    abc<-lapply(inner_decade_loop, function(decade) matrix(unlist(decade), byrow = T, ncol=21))
    abc<-lapply(abc, function(merge_coords) cbind(global_points, merge_coords))
    abc<-lapply(abc,"colnames<-", c("x", "y", "mean_tb", "sd_tb", "mean_shade", "sd_shade",
                                    "mean_solar", "sd_solar", "mean_dep", "sd_dep", "mean_air", "sd_air",
                                    "mean_subtemp", "sd_subtemp", "mean_skytemp", "sd_skytemp",
                                    "mean_wind", "sd_wind", "mean_relhum", "sd_relhum",
                                    "act_hrs", "min_tb", "max_tb"))
    abc<-lapply(abc,round,2)
    species[[i]]<-abc

    for (k in 1:length(abc)){ # Writes outputs
      write.csv(abc[[k]], file=paste0(species_snakecase[[i]], "_", rcp, "_", model, "_", decade[[k]], ".csv"), row.names = FALSE)
    }
  }
The inner loop will run fine (say, 50 times). Then, for apparently no reason, I'll get the 7b error.

@simon-tarr
Author

Hi Brian,

Is there anything else that I can do to troubleshoot this? It's driving me crazy. Every single one of my loops (with very rare exceptions - see below) is crashing at some point or another with this error. None of the tasks have failed within the job (and neither have any of the tasks failed in any of the previous 15-20 iterations that have run). Azure will just pick a seemingly random job and fail with the error after the merge task has completed for that job.

Given our other discussion in #314, I thought I'd try running it on a much smaller subset of my data so that I could rule out other issues, i.e. RAM/storage - still the same problem on Standard F16v2 machines, which should be more than capable of running the pared-down analysis.

The problem is made stranger because sometimes I can get a "stable" pool which will never crash - it'll happily run through all of the models and not error once (but this is very rare these days). Given that there's essentially no difference between "stable" pools and those which error constantly, I really don't understand what's wrong. The only thing I can really think of is that the 7b error is correlated with the number of preempted nodes in a pool. Perhaps the preempted nodes don't restart paused tasks properly, despite the R console reporting no task errors? Could this possibly result in a 7b error?

I realise that in the absence of error logs it's really hard to diagnose problems like this but, given that I have no task errors prior to the 7b error, there's not really anything I can provide you with, to the best of my knowledge. Having said this, I'm hoping that maybe we could bounce around some ideas for different things I could try, to see if I can find a more stable configuration of settings.

Many thanks,
Simon

@brnleehng
Collaborator

brnleehng commented Oct 2, 2018

In the inner loop, you are running a doAzureParallel foreach loop? This looks fine for now...
You are setting wait to TRUE for doAzureParallel?

Based on the output,
"Tasks have completed. Merging results.. Completed."

That error text makes me think this has to do with an HTTP request (https://github.com/r-lib/httr/blob/master/R/content-parse.r#L40-L42 for reference). That's where the current error is displayed.

There are only two requests that we make after this output: job termination and getting the job results.
My thinking is that it has to do with getting the job results, because there's no output about job termination failing.

Can you run traceback()? This will give you a stack trace of where the error occurred.

traceback()

Can you verify that the output of the merge-result.rds is correct/valid?
Go to your Azure Storage Account > Click on blobs > Search for your job id container > Go to results folder
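Once you have merge-result.rds downloaded locally, something like this would confirm the blob is a valid RDS file and holds the expected number of results (a rough check; the file path is whatever you saved it as):

merged <- readRDS("merge-result.rds")   # path to the downloaded blob
length(merged)                          # should match the number of foreach iterations (511 above)
str(merged[[1]])                        # inspect the first task's result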

What region are you running in? I'm assuming 2 weeks ago, you weren't getting preempted a lot? Were you also using a different VM size?

Thanks,
Brian

@simon-tarr
Author

simon-tarr commented Oct 2, 2018

Hi Brian,

The inner loop is running doAzureParallel, yes. Wait is set to TRUE.

I've copied below the results of the traceback immediately after the 7b error occurs:

7: stop("No automatic parser available for ", mt$complete, ".", 
       call. = FALSE)
6: parse_auto(raw, type, encoding, ...)
5: httr::content(response, content = "parsed")
4: e$fun(obj, substitute(ex), parent.frame(), e$data)
3: foreach(i = 1:nrow(global_points), .options.azure = opt, github = c("simon-tarr/NicheMapR"), 
       .packages = c("dismo", "ncdf4", "raster"), .errorhandling = "pass") %dopar% 
       {
           library(NicheMapR)
           library(raster)
           library(ncdf4)
           micro <- NicheMapR::micro_global(loc = global_points[i, 
               ], decade = decade, rcp = rcp, model = model, Usrhyt = 0.01, 
               runshade = 1)
           ecto <- NicheMapR::ectotherm(amass = mass, ctmax = ctmax, 
               ctmin = ctmin, VTMAX = ctmax, VTMIN = ctmin, TBASK = ctmin, 
               TEMERGE = ctmin, TPREF = 0.75 * ctmax, lometry = shape, 
               ABSMAX = 0.85, ABSMIN = 0.85, dayact = 1, nocturn = 1, 
               crepus = 1, CkGrShad = 1, burrow = 1, climb = 1, 
               shdburrow = 0, mindepth = 2, maxdepth = 4, minshade = 0, 
               maxshades = micro$MAXSHADES, nyears = micro$nyears, 
               REFL = micro$REFL, DEP = micro$DEP, metout = micro$metout, 
               shadmet = micro$shadmet, soil = micro$soil, shadsoil = micro$shadsoil, 
               soilmoist = micro$soilmoist, shadmoist = micro$shadmoist, 
               humid = micro$humid, shadhumid = micro$shadhumid, 
               soilpot = micro$soilpot, shadpot = micro$shadpot, 
               RAINFALL = micro$RAINFALL, ectoin = rbind(as.numeric(micro$ALTT), 
                   as.numeric(micro$REFL)[1], micro$longlat[1], 
                   micro$longlat[2]))
           mean_tb <- colMeans(matrix(ecto$environ[, 5], nrow = 24))
           mean_shade <- colMeans(matrix(ecto$environ[, 6], nrow = 24))
           mean_solar <- colMeans(matrix(ecto$environ[, 7], nrow = 24))
           mean_depth <- colMeans(matrix(ecto$environ[, 8], nrow = 24))
           mean_airtemp <- colMeans(matrix(ecto$environ[, 10], nrow = 24))
           mean_subtemp <- colMeans(matrix(ecto$environ[, 11], nrow = 24))
           mean_skytemp <- colMeans(matrix(ecto$environ[, 12], nrow = 24))
           mean_wind <- colMeans(matrix(ecto$environ[, 13], nrow = 24))
           mean_relhum <- colMeans(matrix(ecto$environ[, 14], nrow = 24))
           ecto <- cbind(mean(mean_tb), sd(mean_tb), mean(mean_shade), 
               sd(mean_shade), mean(mean_solar), sd(mean_solar), 
               mean(mean_depth), sd(mean_depth), mean(mean_airtemp), 
               sd(mean_airtemp), mean(mean_subtemp), sd(mean_subtemp), 
               mean(mean_skytemp), sd(mean_skytemp), mean(mean_wind), 
               sd(mean_wind), mean(mean_relhum), sd(mean_relhum), 
               ecto$yearout[, 9], ecto$yearout[, 13], ecto$yearout[, 
                   14])
       } at #3
2: run_models(decade = decade[[j]], model = model, rcp = rcp, global_points = global_points, 
       mass = working_species[i, 4], ctmax = working_species[i, 
           2], ctmin = working_species[i, 3], shape = working_species[i, 
           5]) at #26
1: run_nichemapr(taxon = "reptiles", rcp = "common", model = "ipsl", 
       species.start = 106, species.end = 120, decade.start = 1871, 
       decade.end = 2001, no_vms = 32, vm_cores = 16, iteration = 2)

With regards to locating merge-result.rds, I can't seem to find a container for the job that failed for me on this occasion - I'm not sure where it has gone. (EDIT - The cluster got deleted without me realising, so that's why I can't find the job result. I'll have to wait until this error happens again and make sure that I don't automatically delete the cluster.) I don't think that there's an issue merging the results, on the whole.

In my use case, the inner loop iterates through a number of decades (35 iterations to be exact). After the 35th iteration, I then chop out all the data that I want and save it to a list, before writing the contents of this big list to individual CSVs (n=35). After our discussion yesterday I thought that maybe iterating 35 times was creating too much data to be kept in memory - I therefore reduced this to ~10 decades. The 7b error still appears. I can confirm that after the 35th iteration (if the loop has got that far) the expected results are written to 35 individual CSVs as expected - there's nothing wrong with the data as far as I can tell, provided I get that far without an individual job throwing a 7b error.

With regards to region: I'm working from West Europe and I haven't changed the VM type/size in that time. I've always been using Standard F16v2 as it's the most economical for this kind of analysis. I'm not sure how often I was getting pre-empted in the past. When it first started happening it didn't occur to me at the time to check how many pre-empted nodes were present when a job failed.

Thanks,
Simon

@simon-tarr
Author

I've had a thought - is there a limit on the number of active R sessions (i.e. WAIT=TRUE for multiple R sessions) that can be running doAzureParallel on a single workstation? I've noticed that if I have two R sessions running, I can easily run two lots of models without any 7b errors. As soon as I increase the sessions to 3+, the 7b error starts almost immediately.

@brnleehng
Collaborator

There shouldn't be a limit on the number of active R sessions. I tried testing this (although with a simple workflow) and I wasn't able to reproduce the issue.

Are you running multiple R sessions with this script?

There is a limit on the number of active jobs you can have, but that would produce an error at job creation time, not at the end of the job.

Also, what does your cluster configuration file look like? Are you using maxTasksPerNode > 1?

@simon-tarr
Author

I'm running the same script in each session, yes - I just change the species I'm analysing. So session 1 might be running species 1:5, session 2 running 6:10, etc. I broke up the species in this way so that I could run 8 x 512-core clusters to speed up the analysis overall. (After a lot of tinkering I found that 512 cores gave me the best speed-to-cost ratio, given how long each iteration takes to run, how long it takes to submit the tasks, etc.)

I have set maxTasksPerNode to equal the number of cores on the VM so in my case it's set to 16 as I'm running F16v2 VMs.
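For reference, each session registers its own 512-core cluster along these lines (a sketch - the actual settings live in my cluster.json, with field names as per the doAzureParallel docs):

library(doAzureParallel)

# cluster.json (roughly): the F16v2 VM size mentioned above, maxTasksPerNode = 16,
# 32 low-priority nodes, plus the CRAN/GitHub packages shown in the job summary
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)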

@brnleehng
Collaborator

Hey Simon,

Are you able to reproduce this issue without running this specific workload? If I can have a similar example, I will try to reproduce this error. I was not able to reproduce it with a simple case of 4 R sessions running doAzureParallel

Thanks,
Brian

@simon-tarr
Author

Hi Brian,

I have a pretty big deadline approaching at the moment so I don't have time to rework all my code to test another scenario, I'm afraid. From the testing I have carried out with this specific workload, though, I think it's possibly a problem with the Standard F16v2 VMs. About 30% of pools created with these VMs will complete successfully - the other 70% will fail with the 7b error. In each case, the resource utilisation (in terms of RAM, CPU and HDD space required) remains almost constant - perhaps ±2% between runs. Unless I'm really at the edge in terms of resources and some iterations of my workflow send these VMs over the edge, I just don't know.

There's one possible reason why the amount of compute resources might not be the issue - the models still crash with the 7b error when I drastically reduce the size of the analysis. My default is to run over 35 decades' worth of data. I can still get the 7b error if I run it over 10 or even 5 decades. In these cases, each VM should have more than enough resources to carry out the work, yet still fail.

I've instead moved over to the E-series machines and have had not a single issue with the 7b error. This would suggest that the 7b error is perhaps due to a lack of resources on the F16v2 machines? Although your explanation of what this error means doesn't seem to tie into what I said in the last paragraphs (but I'm very ignorant of the technical side of things, so who knows!)

Anyway, for now, everything is working with the E-series machines and that's the most important thing! Although we might be unable to fix this issue, it's perhaps something to keep in the back of your mind if anything similar crops up in the future with someone else? Sorry I couldn't have been more help in pinpointing the exact cause.

Thanks,
Simon

@simon-tarr
Author

> Hey Simon,
>
> Are you able to reproduce this issue without running this specific workload? If I can have a similar example, I will try to reproduce this error. I was not able to reproduce it with a simple case of 4 R sessions running doAzureParallel
>
> Thanks,
> Brian

Hi Brian,

So this issue has returned, having been absent for the past few weeks. I thought back in October that it was possibly linked to the VM size I was using (F-series), where the core-to-memory ratio is 1:2 and there was insufficient memory in some cases for the jobs to complete successfully.

I therefore upped the ratio by using E-series machines and the problem seemed to have gone away. However, it has returned since I started using a custom docker image. The image builds fine locally and it installs on nodes with no errors. Within a pool, 10s or even 100s of jobs can complete successfully before the 7b error occurs. Sometimes it'll appear after only one or two iterations of the loop. The image is at arcalis/nichemapr:latest.
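The only change to my setup was pointing the cluster config at that image - roughly like this (a sketch; I'm assuming the containerImage field described in the doAzureParallel Docker docs):

# cluster.json now also contains (assumption, per the doAzureParallel docs):
#   "containerImage": "arcalis/nichemapr:latest"
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)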

Clearly my hunch of there being some issue with using F-series machines was incorrect but the issue is nowhere near as bad as it was back in October. I'd say only about 25% of booted clusters fail with the 7b error.

I realise that you've asked for another example with a different workflow but this current analysis is literally my entire life at the moment and so I have no other examples to provide. Given the complexity of what I'm doing, I wonder if it's some sort of interaction between the scale of the jobs I'm running, the fact that I have to load in my own resource files and that I have to merge some pretty hefty dataframes at the end of every job.

I will see if I can generate a large-scale reproducible example but I guess the rough parameters for my analysis would be:

  • Boot a pool of 8 x E64s_v3 nodes (low-priority) while loading in 10 large resource files (typically 100MB each).
  • Run a model/calculation 70,000 times using the resource files as input.
  • Iterate the 70,000 model runs 9 times for a species. There would need to be at least 5 species. It would look like a nested loop:
for (i in 1:5){
    for (k in 1:9){
         # 70,000 model runs here
    }
}
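In doAzureParallel terms a rough skeleton would be something like this (a sketch only - credentials.json, cluster.json and the model call are placeholders, not my actual analysis):

library(doAzureParallel)

setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")   # 8 x E64s_v3 low-priority nodes, resource files attached
registerDoAzureParallel(cluster)

opt <- list(enableCloudCombine = TRUE, chunkSize = 130)

for (i in 1:5) {        # species
  for (k in 1:9) {      # iterations per species
    res <- foreach(j = 1:70000, .options.azure = opt, .errorhandling = "pass") %dopar% {
      # placeholder for one model run using the resource files on the node
      sqrt(j)
    }
  }
}

stopCluster(cluster)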

For now I can fudge my way through this analysis by rebooting the cluster when it messes up (as time-consuming as this is) but I do think it's an issue with long-term consequences that will probably need addressing at some point in the future. I'm pretty sure my analysis is small fry compared to what others after me will likely be running via Batch, so I should imagine the error will appear for others, too... if it hasn't already.

Cheers,
Simon
