Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phangorn distance functions crash R with segmentation fault when alignment contains missing values. #125

Open
ejurga opened this issue Feb 17, 2022 · 2 comments

Comments

@ejurga
Copy link

ejurga commented Feb 17, 2022

When a sequence in a DNAbin or phyDat alignment contains a missing value in place of a nucleotide, and the alignment is processed with a distance function from phangorn (e.g., dist.ml), R crashes from a sementation fault.

On linux (Ubuntu 18.04.5) R exits with the following error:

Process was terminated

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I understand that technically, an alignment probably shouldn't contain missing values, but the function vcfR::vcfR2DNAbin does output some NA's. Either way, these are valid values in a phyDat object and a DNAbin object, so probably phangorn's distance metrics should be able to handle them. At the very least, the behaviour is really annoying, and was pretty difficult to get to the root cause of.

Here is a MWE that is reproducable on my machine.

# Just make a sample sequence
make_seq <- function(size=size){
  nucl <- c('A', 'T', 'G', 'C')
  sample(nucl, replace = TRUE, size=size)
}

# Make a DNAbin object
fake_alignment <- function(size, n=10, add_na=FALSE){

  l = list()
  for (i in c(1:n)){
    seq <- c(make_seq(size=size))
    if (add_na==TRUE){ # Add an NA to the end of the alignment
      seq <- c(seq, NA)
    }
    l[[i]] <- seq
    }
  m <- do.call('rbind', l)
  ape::as.DNAbin(m)
}

# This works!
dna <- fake_alignment(n=2, size=1000, add_na=FALSE)
dist <- phangorn::dist.ml(dna)

# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
dna_with_na <- fake_alignment(n=2, size=1000, add_na=TRUE)
dist <- phangorn::dist.ml(dna_with_na
@KlausVigo
Copy link
Owner

Hi @ejurga,
can you send me a short example with vcfR::vcfR2DNAbin code which produces this error?
What would be the proper or expected handling of NA's? Removing or coding as "?" or "-" ?
Thanks,
Klaus

@emmanuelparadis
Copy link

Hi,
This is a bug in ape: the conversion from "character" to "DNAbin" does not consider that some values could be NA.
You can fix this bug 'by hand' since as.DNAbin.character() is not hidden. So, do the usual fix(as.DNAbin.character), then insert this line:

x[is.na(x)] <- "?"

right at the top of the body. Save and close.
Another possibility is to do this manipulation before calling as.DNAbin().
Best,
Emmanuel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants