Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fastaref subcommand. #748

Open
wants to merge 13 commits into
base: develop
Choose a base branch
from
Open

Conversation

ThomasHickman
Copy link

@ThomasHickman ThomasHickman commented Dec 4, 2017

This adds a fastaref subcommand, which creates a fasta file containing references for a given file, obtained by looking at SQ lines. Note: this depends on samtools/htslib#589 getting merged, so the travis build will fail until that PR gets merged.

Usage: samtools fastaref [options] <in.bam> <out.fq>

Options:
  -k LIST  Output the specified keys into the fasta file (separated by commas).  [LN,AH,AN,AS,M5,SP]
  -l INT   Maximum length of outputted lines [60]

@ThomasHickman
Copy link
Author

ThomasHickman commented Dec 4, 2017

To fit the format of all the other samtools commands, the usage is now:

samtools fastaref [options] <in.bam>

Options:
  -o FILE  Output file name [stdout]
  -k LIST  Output the specified keys into the fasta file (separated by commas)  [LN,AH,AN,AS,M5,SP]
  -l INT   Maximum length of outputted lines [60]

Copy link
Contributor

@valeriuo valeriuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review the comments and don't hesitate to ask, if you are looking for a particular functionality.

bam_fastaref.c Outdated
#include <fcntl.h>
#include <unistd.h>

#include <htslib/sam.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, HTSlib header files are first served from the current folder, then from the include path. So you should use double quotes ("htslib/sam.h") instead of angle brackets.

bam_fastaref.c Outdated
*
* Updates samFieldPosition to be the position after the sam field.
*
* @returns 1 if successful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean returns 0 if successful.

bam_fastaref.c Outdated
*
* @returns 1 if successful
*/
int get_sam_field(char** samFieldPosition, char** samField){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you return samField as a result?

bam_fastaref.c Outdated
*
* This updates sequenceText after reading it.
*/
int getInfoForSQLine(char** sequenceText, char** SN, char** M5, kstring_t* restHeader, FastarefOptions *options){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you just want to parse the @SQ line, you should look into the functions sam_hdr_parse (htslib/sam.c), sam_hdr_parse_ (htslib/cram/sam_header.c) and bam_header_to_cram (htslib/cram/cram_samtools.c).
Unfortunately, we don't have an unified and homogeneous header API (it's high on our TO-DO list), but there is plenty of functionality scattered around, which you can use to implement your feature. There is no need to rewrite it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valeriuo There doesn't seem to be any functions in htslib/cram/sam_header.c to get all headers which a specific tag (it looks like sam_hdr_find only returns the first header item). Would the expected thing to do be to expose the whole of the SAM_hdr data structure to bam_fastaref.c, or add a function to sam_header.c to expose this functionality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThomasHickman, check sam_hdr_find_key function (sam_header.c), with an example in cram_io.c:1869. Whatever is in sam_header.h is already exposed by HTSlib and you can use it in samtools.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly that function isn't exposed. Only the htslib/.h files are considered as external API. The cram/.h are internal and shouldn't be used outside htslib. (Although in one case I did, out of sheer frustration!)

This is doubly annoying because we do have a header API, namely in the cram subdirectory, but not one blessed by htslib as OK to use. This header API was derived from io_lib and arrived when I ported over CRAM, simply because I needed one for my own work. We do really need to write down a list of requirements and then evaluate what we have to see if the current code fits it or whether we want to start afresh / amend it.

@ThomasHickman
Copy link
Author

ThomasHickman commented Dec 6, 2017

I've created the function parseSQLine, which generates a khash containing the fields of the SQ line and changed the code according to proposed changes

bam_fastaref.c Outdated
#include "htslib/cram.h"
#include "sam_header.h"

KHASH_INIT(s2s, char*, char*, 1, kh_str_hash_func, kh_str_hash_equal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use KHASH_MAP_INIT_STR instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used this as KHASH_MAP_INIT_STR specifies const char * as the type of the keys. The keys which I specify are dynamically allocated, so need to be freed and therefore need to be declared as a char *

Copy link
Contributor

@valeriuo valeriuo Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but we usually go around this limitations with strdup, for setting the key, and explicit cast to (char *) when freeing the key memory space. Take a look at bedidx.c:207 and bedidx:252. The idea is that keys in a map should be constant.

bam_fastaref.c Outdated
char* outFileName; // NULL if this is to output to stdout
} FastarefOptions;

void destroy_malloced_s2s_khash(khash_t(s2s)* khash){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KHASH_INIT macro already defines a type (khash_s2s_t, in this case). Use this instead of khash_t(s2s).
Also, the a name like khash_s2s_destroy would probably be more appropriate, since we tend to put the action at the end.

bam_fastaref.c Outdated
}

char* key = malloc(3);
memcpy(key, fieldStart, 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check key for nullness.

bam_fastaref.c Outdated
key[2] = '\0';

char* value = malloc(valueLength + 1);
memcpy(value, fieldStart + 3, valueLength);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check value for nullness.

bam_fastaref.c Outdated
const char* fieldEnd;

while(1){
fieldEnd = strpbrk(fieldStart, "\t\n\0");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strpbrk doesn't check for the string terminator (\0), so you can remove it from the list.

bam_fastaref.c Outdated
value[valueLength] = '\0';

int ret;
int keyPointer = kh_put(s2s, out_khash, key, &ret);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually the value pointer.

bam_fastaref.c Outdated
int numOutputKeys;
char* samFileName;
char* outFileName; // NULL if this is to output to stdout
} FastarefOptions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is preference for lower snake case in the names of structs. Also, since it's a type, it should be suffixed with _t (fastaref_options_t)

bam_fastaref.c Outdated
while(1){
fieldEnd = strpbrk(fieldStart, "\t\n\0");

int valueLength = fieldEnd - fieldStart - 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code relies on the fact that the header text is terminated with a \n\0 sequence, but you should check the result of strpbrk for nullness, as a good practice.

bam_fastaref.c Outdated
int atStartOfFile = 1;

while(1){
for(; !(*readPosition == '\0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is a bit hard to follow, but I am guessing you are trying to find the beginning of @SQ lines. Why don't you use strstr function?

bam_fastaref.c Outdated

*numKeys = (keysStrLength + 1) / 3;
char** outputKeys = malloc(*numKeys * sizeof(char*));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the result of malloc.

Using strsep, which in it's use here is less efficient, but clearer than the previous implementation. Also this modifies the file text and input strings to form data structures rather than mallocing them separately
If this is present, this fails to compile in gcc 4.8
bam_fastaref.c Outdated
char* field;
kh_s2s_t* khash = kh_init(s2s);

while((field = strsep(&parsePosition, "\t")) != NULL){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, strsep is not C99 compliant, which is the standard we are following. The C99 alternative is strtok, but it's not recommended due to its side effects. Since you are only looking for tab delimiters ('\t'), why don't you use a simple for loop?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrandall suggested that I used a c function here, but I don't think either me or him realised that this was not C99 compliant or that you are targeting C99. I'll revert back to the previous implementation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an implementation of strsep instead, as it keeps code readability, and doesn't change what I've written too much

bam_fastaref.c Outdated
|| fputs(SN, outFile) < 0){
print_error("fastaref", "failed to write to output file");
char* line;
while((line = strsep(&readPosition, "\n")) != NULL){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment regarding strsep, as above.

bam_fastaref.c Outdated
FILE* outFile = NULL;
int returnCode = 0;
bam_hdr_t* header = NULL;
hFILE* ref = NULL;
khash_t(s2s)* khash = kh_init(s2s);
kh_s2s_t* khash = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please give a proper name to khash.

bam_fastaref.c Outdated
kh_s2s_t* parseSQLine(char* sqLine){
char* parsePosition = sqLine + 4; // 4 = len("@SQ\t")
char* field;
kh_s2s_t* khash = kh_init(s2s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find the code that frees the memory allocated to khash.

Creating a replacement function, to keep the readability of strsep, such that it is portable
Copy link
Contributor

@valeriuo valeriuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your solution is almost ready. We will wait, however, for samtools/htslib#589 to be cleaned up and merged before we do this one.

bam_fastaref.c Outdated
/**
* Replacement in C99 for strsep
*/
char* _strsep(char **stringp, const char *delim){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make it static.

bam_fastaref.c Outdated
char* readPosition = header->text;

outFile = options->outFileName ? fopen(options->outFileName, "w") : stdout;
if(outFile == NULL){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be consistent with your coding style. In the if statement above, you leave spaces on both sides of the parenthesis, which I think is the better looking style. You should apply it to the rest of your code.

bam_fastaref.c Outdated
goto cleanup;
}

char* readPosition = header->text;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your parsing logic operates directly on the text string of the header structure. Not only does _strsep change the original string, but it could make the initial text pointer NULL, if it doesn't find a delimiter. You would then leak memory.
I suggest making a copy of the header text (with strdup) and work on the copy. You should also save the initial pointer of the copy and use it at the end, when freeing the memory allocated to the copy.
Thus, you decouple your logic from the header layer.

Based on a modified version of:
IndentWidth: 4
PointerAlignment: Left
AllowShortIfStatementsOnASingleLine: true
PenaltyExcessCharacter: 0
PenaltyBreakString: 10
BreakBeforeBinaryOperators: NonAssignment
BraceWrapping:
  BeforeElse: false
, where the above is a clang-format style file
@valeriuo
Copy link
Contributor

The code looks OK and I was able to build it. However, we'll wait for samtools/htslib#589 to be merged in order to do some testing, before merging this PR in develop.

@jmarshall
Copy link
Member

The man page section added just says

samtools fastaref [options] <in.bam>

Generates a fasta reference file, by reading SQ lines.

It's really unclear what this new command actually does. Presumably it uses the CRAM reference-finding machinery to output the reference bases in FASTA form, looking them up based on M5 tags in the @SQ headers? What if your BAM file doesn't have M5 tags?

It would be great if the man page described all this.

The options list has a -k/-o typo. And what you're calling keys everything else in the user interface calls (header) tags.

@ThomasHickman
Copy link
Author

ThomasHickman commented Mar 26, 2018

@jmarshall

It's really unclear what this new command actually does. Presumably it uses the CRAM reference-finding machinery to output the reference bases in FASTA form, looking them up based on M5 tags in the @sq headers? What if your BAM file doesn't have M5 tags?

This is an accurate description of what this is trying to do. If the reference finding code fails to find a reference (maybe due to their being no SQ lines), this command fails with an error printed to stdout. I could have a look at changing the documentation to be a bit clearer along with fixing samtools/htslib#589 when I get time to have a look at this.

@tseemann
Copy link

@ThomasHickman will this extract an embedded reference inside a CRAM file?

@ThomasHickman
Copy link
Author

@tseemann That sounds like an interesting idea, but this doesn't support extracting an embedded reference from a cram at the moment - this just uses the M5 field in the SQ line to query the REF_PATH, REF_CACHE or query https://www.ebi.ac.uk/ena/cram/md5/

@tseemann
Copy link

tseemann commented Mar 30, 2018

The reason I ask is that it isn't actually possible to do right now.

See #723 and #720

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants