-
Notifications
You must be signed in to change notification settings - Fork 572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fastaref subcommand. #748
base: develop
Are you sure you want to change the base?
Conversation
To fit the format of all the other samtools commands, the usage is now:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review the comments and don't hesitate to ask, if you are looking for a particular functionality.
bam_fastaref.c
Outdated
#include <fcntl.h> | ||
#include <unistd.h> | ||
|
||
#include <htslib/sam.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, HTSlib header files are first served from the current folder, then from the include path. So you should use double quotes ("htslib/sam.h"
) instead of angle brackets.
bam_fastaref.c
Outdated
* | ||
* Updates samFieldPosition to be the position after the sam field. | ||
* | ||
* @returns 1 if successful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean returns 0 if successful.
bam_fastaref.c
Outdated
* | ||
* @returns 1 if successful | ||
*/ | ||
int get_sam_field(char** samFieldPosition, char** samField){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you return samField
as a result?
bam_fastaref.c
Outdated
* | ||
* This updates sequenceText after reading it. | ||
*/ | ||
int getInfoForSQLine(char** sequenceText, char** SN, char** M5, kstring_t* restHeader, FastarefOptions *options){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you just want to parse the @SQ
line, you should look into the functions sam_hdr_parse
(htslib/sam.c
), sam_hdr_parse_
(htslib/cram/sam_header.c
) and bam_header_to_cram
(htslib/cram/cram_samtools.c
).
Unfortunately, we don't have an unified and homogeneous header API (it's high on our TO-DO list), but there is plenty of functionality scattered around, which you can use to implement your feature. There is no need to rewrite it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valeriuo There doesn't seem to be any functions in htslib/cram/sam_header.c
to get all headers which a specific tag (it looks like sam_hdr_find
only returns the first header item). Would the expected thing to do be to expose the whole of the SAM_hdr
data structure to bam_fastaref.c
, or add a function to sam_header.c
to expose this functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ThomasHickman, check sam_hdr_find_key
function (sam_header.c
), with an example in cram_io.c:1869
. Whatever is in sam_header.h
is already exposed by HTSlib and you can use it in samtools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly that function isn't exposed. Only the htslib/.h files are considered as external API. The cram/.h are internal and shouldn't be used outside htslib. (Although in one case I did, out of sheer frustration!)
This is doubly annoying because we do have a header API, namely in the cram subdirectory, but not one blessed by htslib as OK to use. This header API was derived from io_lib and arrived when I ported over CRAM, simply because I needed one for my own work. We do really need to write down a list of requirements and then evaluate what we have to see if the current code fits it or whether we want to start afresh / amend it.
I've created the function |
bam_fastaref.c
Outdated
#include "htslib/cram.h" | ||
#include "sam_header.h" | ||
|
||
KHASH_INIT(s2s, char*, char*, 1, kh_str_hash_func, kh_str_hash_equal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use KHASH_MAP_INIT_STR
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used this as KHASH_MAP_INIT_STR
specifies const char *
as the type of the keys. The keys which I specify are dynamically allocated, so need to be freed and therefore need to be declared as a char *
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, but we usually go around this limitations with strdup
, for setting the key, and explicit cast to (char *)
when freeing the key memory space. Take a look at bedidx.c:207 and bedidx:252. The idea is that keys in a map should be constant.
bam_fastaref.c
Outdated
char* outFileName; // NULL if this is to output to stdout | ||
} FastarefOptions; | ||
|
||
void destroy_malloced_s2s_khash(khash_t(s2s)* khash){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KHASH_INIT
macro already defines a type (khash_s2s_t
, in this case). Use this instead of khash_t(s2s)
.
Also, the a name like khash_s2s_destroy
would probably be more appropriate, since we tend to put the action at the end.
bam_fastaref.c
Outdated
} | ||
|
||
char* key = malloc(3); | ||
memcpy(key, fieldStart, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check key
for nullness.
bam_fastaref.c
Outdated
key[2] = '\0'; | ||
|
||
char* value = malloc(valueLength + 1); | ||
memcpy(value, fieldStart + 3, valueLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check value
for nullness.
bam_fastaref.c
Outdated
const char* fieldEnd; | ||
|
||
while(1){ | ||
fieldEnd = strpbrk(fieldStart, "\t\n\0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strpbrk
doesn't check for the string terminator (\0
), so you can remove it from the list.
bam_fastaref.c
Outdated
value[valueLength] = '\0'; | ||
|
||
int ret; | ||
int keyPointer = kh_put(s2s, out_khash, key, &ret); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually the value pointer.
bam_fastaref.c
Outdated
int numOutputKeys; | ||
char* samFileName; | ||
char* outFileName; // NULL if this is to output to stdout | ||
} FastarefOptions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is preference for lower snake case in the names of structs. Also, since it's a type, it should be suffixed with _t
(fastaref_options_t
)
bam_fastaref.c
Outdated
while(1){ | ||
fieldEnd = strpbrk(fieldStart, "\t\n\0"); | ||
|
||
int valueLength = fieldEnd - fieldStart - 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code relies on the fact that the header text is terminated with a \n\0
sequence, but you should check the result of strpbrk
for nullness, as a good practice.
bam_fastaref.c
Outdated
int atStartOfFile = 1; | ||
|
||
while(1){ | ||
for(; !(*readPosition == '\0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is a bit hard to follow, but I am guessing you are trying to find the beginning of @SQ
lines. Why don't you use strstr
function?
bam_fastaref.c
Outdated
|
||
*numKeys = (keysStrLength + 1) / 3; | ||
char** outputKeys = malloc(*numKeys * sizeof(char*)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check the result of malloc
.
Using strsep, which in it's use here is less efficient, but clearer than the previous implementation. Also this modifies the file text and input strings to form data structures rather than mallocing them separately
If this is present, this fails to compile in gcc 4.8
bam_fastaref.c
Outdated
char* field; | ||
kh_s2s_t* khash = kh_init(s2s); | ||
|
||
while((field = strsep(&parsePosition, "\t")) != NULL){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, strsep
is not C99 compliant, which is the standard we are following. The C99 alternative is strtok
, but it's not recommended due to its side effects. Since you are only looking for tab delimiters ('\t'), why don't you use a simple for
loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrandall suggested that I used a c function here, but I don't think either me or him realised that this was not C99 compliant or that you are targeting C99. I'll revert back to the previous implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an implementation of strsep
instead, as it keeps code readability, and doesn't change what I've written too much
bam_fastaref.c
Outdated
|| fputs(SN, outFile) < 0){ | ||
print_error("fastaref", "failed to write to output file"); | ||
char* line; | ||
while((line = strsep(&readPosition, "\n")) != NULL){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment regarding strsep
, as above.
bam_fastaref.c
Outdated
FILE* outFile = NULL; | ||
int returnCode = 0; | ||
bam_hdr_t* header = NULL; | ||
hFILE* ref = NULL; | ||
khash_t(s2s)* khash = kh_init(s2s); | ||
kh_s2s_t* khash = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please give a proper name to khash
.
bam_fastaref.c
Outdated
kh_s2s_t* parseSQLine(char* sqLine){ | ||
char* parsePosition = sqLine + 4; // 4 = len("@SQ\t") | ||
char* field; | ||
kh_s2s_t* khash = kh_init(s2s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find the code that frees the memory allocated to khash
.
Creating a replacement function, to keep the readability of strsep, such that it is portable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your solution is almost ready. We will wait, however, for samtools/htslib#589 to be cleaned up and merged before we do this one.
bam_fastaref.c
Outdated
/** | ||
* Replacement in C99 for strsep | ||
*/ | ||
char* _strsep(char **stringp, const char *delim){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make it static.
bam_fastaref.c
Outdated
char* readPosition = header->text; | ||
|
||
outFile = options->outFileName ? fopen(options->outFileName, "w") : stdout; | ||
if(outFile == NULL){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please be consistent with your coding style. In the if
statement above, you leave spaces on both sides of the parenthesis, which I think is the better looking style. You should apply it to the rest of your code.
bam_fastaref.c
Outdated
goto cleanup; | ||
} | ||
|
||
char* readPosition = header->text; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your parsing logic operates directly on the text
string of the header structure. Not only does _strsep
change the original string, but it could make the initial text pointer NULL, if it doesn't find a delimiter. You would then leak memory.
I suggest making a copy of the header text (with strdup
) and work on the copy. You should also save the initial pointer of the copy and use it at the end, when freeing the memory allocated to the copy.
Thus, you decouple your logic from the header layer.
Based on a modified version of: IndentWidth: 4 PointerAlignment: Left AllowShortIfStatementsOnASingleLine: true PenaltyExcessCharacter: 0 PenaltyBreakString: 10 BreakBeforeBinaryOperators: NonAssignment BraceWrapping: BeforeElse: false , where the above is a clang-format style file
The code looks OK and I was able to build it. However, we'll wait for samtools/htslib#589 to be merged in order to do some testing, before merging this PR in develop. |
The man page section added just says
It's really unclear what this new command actually does. Presumably it uses the CRAM reference-finding machinery to output the reference bases in FASTA form, looking them up based on It would be great if the man page described all this. The options list has a |
This is an accurate description of what this is trying to do. If the reference finding code fails to find a reference (maybe due to their being no |
@ThomasHickman will this extract an embedded reference inside a CRAM file? |
@tseemann That sounds like an interesting idea, but this doesn't support extracting an embedded reference from a cram at the moment - this just uses the M5 field in the SQ line to query the |
This adds a fastaref subcommand, which creates a fasta file containing references for a given file, obtained by looking at SQ lines. Note: this depends on samtools/htslib#589 getting merged, so the travis build will fail until that PR gets merged.