There seem to be some minor issues with downloading pre-generated databases using databases. #253

Open
TigerWindWood opened this issue Mar 19, 2024 · 2 comments

Comments

@TigerWindWood
Contributor

Thank you for developing such an excellent tool. Due to network issues, I first downloaded the required database files (such as afdb_swissprot.tar.gz and afdb_swissprot.version) on Windows from https://foldseek.steineggerlab.workers.dev/. Then, on Linux, I ran the command foldseek databases Alphafold/Swiss-Prot afdb_swissprot tmp with the downloaded files placed in the tmp directory. However, the command still tried to re-download afdb_swissprot.tar.gz. Checking the download.sh script, I found the cause on line 130: the existence check tests for alphafold_swissprot.tar.gz, while the download and extraction steps use afdb_swissprot.tar.gz. After I manually added a file named alphafold_swissprot.tar.gz to tmp, the command ran successfully. The same kind of mismatch affects the afdb_proteome download (line 121), where the check tests for alphafolddb.tar.gz.
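The manual workaround described above can be sketched as a small helper. This is only an illustration: `stage_swissprot` is a hypothetical name, not part of foldseek, and the step that copies the `.version` file is an assumption based on the script reading `${TMP_PATH}/version`, which is otherwise only written inside the skipped download branch.

```sh
# Hypothetical helper (not part of foldseek): prepare a tmp directory that
# already holds afdb_swissprot.tar.gz so that download.sh skips the download.
# Usage: stage_swissprot <tmp-dir>
stage_swissprot() {
    tmp="$1"

    # The pre-downloaded archive must already be in the tmp directory.
    if [ ! -f "$tmp/afdb_swissprot.tar.gz" ]; then
        echo "Error: $tmp/afdb_swissprot.tar.gz not found" >&2
        return 1
    fi

    # download.sh only tests whether this filename exists, so an empty
    # marker file is enough to skip the download branch.
    if [ ! -f "$tmp/alphafold_swissprot.tar.gz" ]; then
        : > "$tmp/alphafold_swissprot.tar.gz"
    fi

    # Assumption: the script later moves "$tmp/version", so provide it from
    # the pre-downloaded .version file when it is missing.
    if [ ! -f "$tmp/version" ] && [ -f "$tmp/afdb_swissprot.version" ]; then
        cp "$tmp/afdb_swissprot.version" "$tmp/version"
    fi
}
```

After `stage_swissprot tmp`, running `foldseek databases Alphafold/Swiss-Prot afdb_swissprot tmp` should extract the local archive instead of fetching it again.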

```sh
#!/bin/sh -e
fail() {
    echo "Error: $1"
    exit 1
}

notExists() {
    [ ! -f "$1" ]
}

hasCommand () {
    command -v "$1" >/dev/null 2>&1
}

ARR=""
push_back() {
    # shellcheck disable=SC1003
    CURR="$(printf '%s' "$1" | awk '{ gsub(/'\''/, "'\''\\'\'''\''"); print; }')"
    if [ -z "$ARR" ]; then
        ARR=''\'$CURR\'''
    else
        ARR=$ARR' '\'$CURR\'''
    fi
}

STRATEGY=""
if hasCommand aria2c; then STRATEGY="$STRATEGY ARIA"; fi
if hasCommand curl; then STRATEGY="$STRATEGY CURL"; fi
if hasCommand wget; then STRATEGY="$STRATEGY WGET"; fi
if [ "$STRATEGY" = "" ]; then
    fail "No download tool found in PATH. Please install aria2c, curl or wget."
fi

downloadFile() {
    URL="$1"
    OUTPUT="$2"
    set +e
    for i in $STRATEGY; do
        case "$i" in
        ARIA)
            FILENAME=$(basename "${OUTPUT}")
            DIR=$(dirname "${OUTPUT}")
            aria2c --max-connection-per-server="$ARIA_NUM_CONN" --allow-overwrite=true -o "$FILENAME" -d "$DIR" "$URL" && return 0
            ;;
        CURL)
            curl -L -o "$OUTPUT" "$URL" && return 0
            ;;
        WGET)
            wget -O "$OUTPUT" "$URL" && return 0
            ;;
        esac
    done
    set -e
    fail "Could not download $URL to $OUTPUT"
}

downloadFileList() {
    URL="$1"
    OUTPUT_DIR="$2"
    INPUT_FILE="$OUTPUT_DIR/input.txt"
    downloadFile "$URL" "$INPUT_FILE"
    set +e
    for i in $STRATEGY; do
        case "$i" in
        ARIA)
            aria2c -c --max-connection-per-server="$ARIA_NUM_CONN" --allow-overwrite=true --dir="$OUTPUT_DIR" --input-file="$INPUT_FILE" && return 0
            ;;
        CURL)
            (cd "$OUTPUT_DIR"; xargs -n 1 curl -C - -L -O < "$INPUT_FILE") && return 0
            ;;
        WGET)
            wget --continue -P "$OUTPUT_DIR" --input-file="$INPUT_FILE" && return 0
            ;;
        esac
    done
    set -e
    rm -f "$OUTPUT/input.txt"
    fail "Could not download $URL to $OUTPUT"
}

# check number of input variables
[ "$#" -ne 3 ] && echo "Please provide <selection> <outDB> <tmp>" && exit 1;
[ ! -d "$3" ] && echo "tmp directory $3 not found!" && mkdir -p "$3";

SELECTION="$1"
OUTDB="$2"
TMP_PATH="$3"

INPUT_TYPE=""
case "${SELECTION}" in
    "Alphafold/UniProt")
        if notExists "${TMP_PATH}/afdb.tar.gz"; then
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb.tar.gz" "${TMP_PATH}/afdb.tar.gz"
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb.version" "${TMP_PATH}/version"
        fi
        tar xvfz "${TMP_PATH}/afdb.tar.gz" -C "${TMP_PATH}"
        push_back "${TMP_PATH}/afdb"
        INPUT_TYPE="FOLDSEEK_DB"
    ;;
    "Alphafold/UniProt50-minimal")
        if notExists "${TMP_PATH}/afdb50.tar.gz"; then
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb50.tar.gz" "${TMP_PATH}/afdb50.tar.gz"
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb50.version" "${TMP_PATH}/version"
        fi
        tar xvfz "${TMP_PATH}/afdb50.tar.gz" -C "${TMP_PATH}"
        push_back "${TMP_PATH}/afdb50"
        INPUT_TYPE="FOLDSEEK_DB"
    ;;
    "Alphafold/UniProt50")
        if notExists "${TMP_PATH}/afdb50.tar.gz"; then
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb50.tar.gz" "${TMP_PATH}/afdb50.tar.gz"
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb50clusearch.tar.gz" "${TMP_PATH}/afdb50clusearch.tar.gz"
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb50.version" "${TMP_PATH}/version"
        fi
        tar xvfz "${TMP_PATH}/afdb50.tar.gz" -C "${TMP_PATH}"
        tar xvfz "${TMP_PATH}/afdb50clusearch.tar.gz" -C "${TMP_PATH}"
        push_back "${TMP_PATH}/afdb50"
        INPUT_TYPE="FOLDSEEK_DB"
    ;;
    "Alphafold/Proteome")
        if notExists "${TMP_PATH}/alphafolddb.tar.gz"; then
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb_proteome.tar.gz" "${TMP_PATH}/afdb_proteome.tar.gz"
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb_proteome.version" "${TMP_PATH}/version"
        fi
        tar xvfz "${TMP_PATH}/afdb_proteome.tar.gz" -C "${TMP_PATH}"
        push_back "${TMP_PATH}/afdb_proteome"
        INPUT_TYPE="FOLDSEEK_DB"
    ;;
    "Alphafold/Swiss-Prot")
        if notExists "${TMP_PATH}/alphafold_swissprot.tar.gz"; then
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb_swissprot.tar.gz" "${TMP_PATH}/afdb_swissprot.tar.gz"
            downloadFile "https://foldseek.steineggerlab.workers.dev/afdb_swissprot.version" "${TMP_PATH}/version"
        fi
        tar xvfz "${TMP_PATH}/afdb_swissprot.tar.gz" -C "${TMP_PATH}"
        push_back "${TMP_PATH}/afdb_swissprot"
        INPUT_TYPE="FOLDSEEK_DB"
    ;;
    "ESMAtlas30")
        downloadFileList "https://raw.githubusercontent.com/facebookresearch/esm/main/scripts/atlas/v0/highquality_clust30/foldseekdb.txt" "${TMP_PATH}/"
        printf "v0 %s\n" "$(date "+%s")" > "${TMP_PATH}/version"
        push_back "${TMP_PATH}/highquality_clust30"
        INPUT_TYPE="FOLDSEEK_DB"
    ;;
    "PDB")
        if notExists "${TMP_PATH}/pdb.tar.gz"; then
            downloadFile "https://foldseek.steineggerlab.workers.dev/pdb100.tar.gz" "${TMP_PATH}/pdb.tar.gz"
            downloadFile "https://foldseek.steineggerlab.workers.dev/pdb100.version" "${TMP_PATH}/version"
        fi
        tar xvfz "${TMP_PATH}/pdb.tar.gz" -C "${TMP_PATH}"
        push_back "${TMP_PATH}/pdb"
        INPUT_TYPE="FOLDSEEK_DB"
    ;;
esac

if notExists "${OUTDB}.dbtype"; then
    case "${INPUT_TYPE}" in
        "FOLDSEEK_DB")
            eval "set -- $ARR"
            IN="${*}"
            for SUFFIX in ".source" "_mapping" "_taxonomy"; do
                if [ -e "${IN}_seq${SUFFIX}" ]; then
                    mv -f -- "${IN}_seq${SUFFIX}" "${OUTDB}_seq${SUFFIX}"
                fi
                if [ -e "${IN}${SUFFIX}" ]; then
                    mv -f -- "${IN}${SUFFIX}" "${OUTDB}${SUFFIX}"
                fi
            done

            for SUFFIX in "" "_ss" "_h" "_ca"; do
                if [ -e "${IN}_seq${SUFFIX}.dbtype" ]; then
                    # shellcheck disable=SC2086
                    "${MMSEQS}" mvdb "${IN}_seq${SUFFIX}" "${OUTDB}_seq${SUFFIX}" || fail "mv died"
                fi
                # shellcheck disable=SC2086
                "${MMSEQS}" mvdb "${IN}${SUFFIX}" "${OUTDB}${SUFFIX}" || fail "mv died"
            done

            if [ -e "${IN}_clu.dbtype" ]; then
                # shellcheck disable=SC2086
                "${MMSEQS}" mvdb "${IN}_clu" "${OUTDB}_clu" || fail "mv died"
            fi
        ;;

    esac
fi

if [ -n "${TAXONOMY}" ] && notExists "${OUTDB}_mapping"; then
    case "${INPUT_TYPE}" in
        "FOLDSEEK_DB")
            eval "set -- $ARR"
            IN="${*}"
            mv -f -- "${IN}_mapping" "${OUTDB}_mapping"
            mv -f -- "${IN}_taxonomy" "${OUTDB}_taxonomy"
            if [ -e "${IN}_seq.dbtype" ]; then
                mv -f -- "${IN}_seq_mapping" "${OUTDB}_seq_mapping"
                mv -f -- "${IN}_seq_taxonomy" "${OUTDB}_seq_taxonomy"
            fi
        ;;
    esac
fi

if notExists "${OUTDB}.version"; then
    mv -f "${TMP_PATH}/version" "${OUTDB}.version"
fi

if [ -n "${REMOVE_TMP}" ]; then
    rm -f "${TMP_PATH}/download.sh"
fi
```
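The smallest fix is to make each existence check test the same filename that is downloaded and extracted: afdb_proteome.tar.gz on line 121 and afdb_swissprot.tar.gz on line 130. One way to keep the two names from drifting apart again is to derive both from a single argument, as in the sketch below. `fetchArchive` is a hypothetical helper that is not in download.sh, and `downloadFile` is replaced by a logging stand-in so the sketch runs on its own.

```sh
# Default only for this sketch; download.sh sets TMP_PATH from its third argument.
TMP_PATH="${TMP_PATH:-tmp}"
BASE_URL="https://foldseek.steineggerlab.workers.dev"

# Same definition as in download.sh.
notExists() {
    [ ! -f "$1" ]
}

# Stand-in for the script's downloadFile: it only logs the request
# instead of fetching anything.
downloadFile() {
    echo "would download $1 -> $2"
}

# Hypothetical helper: the existence check and the download target are both
# built from the same NAME, so they cannot disagree.
fetchArchive() {
    NAME="$1"
    ARCHIVE="${TMP_PATH}/${NAME}.tar.gz"
    if notExists "${ARCHIVE}"; then
        downloadFile "${BASE_URL}/${NAME}.tar.gz" "${ARCHIVE}"
        downloadFile "${BASE_URL}/${NAME}.version" "${TMP_PATH}/version"
    fi
}
```

With this shape, the "Alphafold/Swiss-Prot" branch would call `fetchArchive afdb_swissprot` before extracting, and a pre-downloaded `${TMP_PATH}/afdb_swissprot.tar.gz` would be picked up without any extra marker file.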

@milot-mirdita
Member

Thank you for catching this issue. Would you like to fix it and submit a pull request?

@TigerWindWood
Contributor Author

Thank you for your reply. I'm happy to fix the code and submit a pull request.
