small code change for use of the cdd2cog.pl script with COG20 #14

Open
dspeth opened this issue Jul 1, 2021 · 0 comments
Hi!

Not really an issue, but I'm putting this here in case the original author doesn't have time to modify the cdd2cog.pl script for use with the updated COG20 database. This would more properly be done via a pull request, but I figure more people might see an open issue. I also realise this is a "band-aid" style fix, but it might be useful to someone nonetheless.

If you're interested in using the cdd2cog.pl script with the COG20 database (available at https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/), only a small change is required. The body of the script works as is, but the information previously parsed from the fun.txt and whog files now has to be parsed from files with slightly different names and formats.

fun.txt is replaced by fun-20.tab
whog is replaced by cog-20.def.tab
Both of these files can be downloaded from the link above.
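
As a rough illustration of the two layouts, the sketch below splits one sample line from each file at the tabs. The sample lines are illustrative only: they are written to match the column order the modified subroutine relies on (category letter in column 1 and description in column 3 for fun-20.tab; COG ID, category letters and description in columns 1 to 3 for cog-20.def.tab), so check them against your downloaded copies.

```perl
use strict;
use warnings;

# Illustrative sample lines (assumed layout, not copied from the actual files)
my $fun_line = "J\tFCCCFC\tTranslation, ribosomal structure and biogenesis";
my $cog_line = "COG0001\tH\tGlutamate-1-semialdehyde aminotransferase\tHemL";

my (%Fun, %Whog);

# fun-20.tab: column 1 = category letter, column 3 = description
my @fun = split(/\t/, $fun_line);
$Fun{$fun[0]} = {'desc' => $fun[2], 'count' => 0};

# cog-20.def.tab: column 1 = COG ID, column 2 = category letters, column 3 = description
my @cog = split(/\t/, $cog_line);
$Whog{$cog[0]} = {'function' => $cog[1], 'desc' => $cog[2]};

print "$fun[0]: $Fun{J}{desc}\n";
print "COG0001 [$Whog{COG0001}{function}]: $Whog{COG0001}{desc}\n";
```

The hash layout is the same one the original script builds from fun.txt and whog, which is why the rest of cdd2cog.pl does not need to change.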

To retrieve the relevant information from these files, the parse_cdd_cog subroutine at the very end of cdd2cog.pl needs to be modified. For clarity and ease of copy/pasting, I have pasted the entire subroutine here. The original code is still in place, but commented out. The 4 added lines are found under the comments # code to parse fun-20.tab file and # code to parse cog-20.def.tab.

After the modifications are made, the script can be run using:
cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun-20.tab -w cog-20.def.tab
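
Before running the command, it may be worth checking that the downloaded files really are tab-delimited with at least three columns, since the modified parsing code assumes this. A minimal sketch follows; count_short_lines is a hypothetical helper and not part of cdd2cog.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of cdd2cog.pl):
# count lines with fewer than $min_cols tab-separated fields
sub count_short_lines {
    my ($fh, $min_cols) = @_;
    my $short = 0;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split(/\t/, $line); # split line at the tabs
        $short++ if @fields < $min_cols;
    }
    return $short;
}

# Pass the file names on the command line, e.g. fun-20.tab cog-20.def.tab
for my $file (@ARGV) {
    open(my $fh, "<", $file) or die "Cannot open '$file': $!\n";
    my $short = count_short_lines($fh, 3);
    close $fh;
    print "$file: $short line(s) with fewer than 3 columns\n";
}
```

A non-zero count suggests the file is not in the expected format (or contains blank lines), in which case the hashes built by the subroutine would hold undefined entries.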

Hope this is helpful!

###############
# Subroutines #
###############

### Subroutine to parse the 'cddid.tbl', 'fun' and 'whog' file contents and store in hash structures
sub parse_cdd_cog {

    ### 'cddid.tbl'
    open (my $cddid_fh, "<", "$CDDid_File");
    print "\nParsing CDDs '$CDDid_File' file ...\n"; # status message
    while (<$cddid_fh>) {
        chomp;
        my @line = split(/\t/, $_); # split line at the tabs
        if ($line[1] =~ /^COG\d{4}$/) { # search for COG CD accessions in cddid
            $CDDid{$line[0]} = $line[1]; # hash to store info; $line[0] = PSSM-Id
        }
    }
    close $cddid_fh;

    ### 'fun.txt'
    open (my $fun_fh, "<", "$Fun_File");
    print "Parsing COGs '$Fun_File' file ...\n"; # status message
    while (<$fun_fh>) {
        chomp;
	
        # code to parse fun-20.tab file
        my @line = split(/\t/, $_); # split line at the tabs
        $Fun{$line[0]} = {'desc' => $line[2], 'count' => 0}; # anonymous hash in hash
        # $line[0] = single-letter functional category, $line[2] = description of functional category
        # count used to find functional categories not present in the query proteins for final overall assignment statistics

        # code and comments to parse original fun.txt file
        #$_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
        #if (/^\[(\w)\]\s*(.+)$/) {
        #    $Fun{$1} = {'desc' => $2, 'count' => 0}; # anonymous hash in hash
        #    # $1 = single-letter functional category, $2 = description of functional category
        #    # count used to find functional categories not present in the query proteins for final overall assignment statistics

        #}
    }
    close $fun_fh;

    ### 'whog'
    open (my $whog_fh, "<", "$Whog_File");
    print "Parsing COGs '$Whog_File' file ...\n"; # status message
    while (<$whog_fh>) {
        chomp;
        # code to parse cog-20.def.tab
        my @line = split(/\t/, $_); # split line at the tabs
        $Whog{$line[0]} = {'function' => $line[1], 'desc' => $line[2]}; # anonymous hash in hash
        # $line[1] = single-letter functional categories, maximal four per COG in COG20 (COG5032 no longer exists)
        # $line[0] = COG#, $line[2] = COG protein description

        # code and comments to parse the original whog file
        #$_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
        #if (/^\[(\w+)\]\s*(COG\d{4})\s+(.+)$/) {
        #    $Whog{$2} = {'function' => $1, 'desc' => $3}; # anonymous hash in hash
        #    # $1 = single-letter functional categories, maximal five per COG (only COG5032 with five)
        #    # $2 = COG#, $3 = COG protein description
        #}
    }
    close $whog_fh;

    return 1;
}