Situation
Probably you use blast and sometimes you need to run it with a huge number of reads. Usually, it works in a sequentially way. There are different ways to improve the runtime and obtain more accurated results, i.e using parameteres like:
- evalue - max_target_seqs - perc_identity
or increasing the threads availables:
- num_threads
But the most efficient way is to split the query fasta file in as many pieces as you could run in parallel in your cluster/local cpu.
Solution
Steps:
#split fasta query file
pyfasta split - n { blastJobs } { input . query };
#create database from the reference fasta file
makeblastdb - in { input . base } - dbtype { params . dbtype } - out blastemp / blastDB ;
#blast each split query fasta
blastn - db blastemp / blastDB - query { input . fastas } - outfmt 6 - out { output } - num_threads { params . blasthreads }
#merge results
cat { input } > { output }
The implementation is easier using workflows tools like snakemake or nextflow . Here there is a solution of snakemake, to use in a single cpu, multi-core. A version to use on a cluster is available on my github .
import subprocess , sys
from os.path import join , basename
# Globals ---------------------------------------------------------------------
#database fasta file
FASTA_DB = config [ "fadb" ]
#query fasta file
FASTA_QUERY = config [ "faqy" ]
#output prefix from query fasta file
prefix = basename ( FASTA_QUERY ). split ( "." )[ 0 ]
#nucl or prot
dbtype = config [ "dbtype" ]
if dbtype == "nucl" :
rblast = "blastn"
else :
rblast = "blastp"
#number of blast jobs to run (split the query.fasta file).
blastJobs = 5
#threads for each blast run.
blasthreads = 12
SAMPLESX = []
for i in range ( 0 , blastJobs ):
SAMPLESX . append ( prefix + '.%01d' % i )
rule all :
input :
prefix + '.blast.tsv'
rule splitFASTA :
input :
base = FASTA_DB ,
query = FASTA_QUERY ,
output :
trimmed = expand ( 'blastemp/{sample}.fasta' , sample = SAMPLESX ),
datab = 'blastemp/blastDB.nsq'
params :
prefix = prefix ,
dbtype = dbtype
shell : """
pyfasta split -n {blastJobs} {input.query};
makeblastdb -in {input.base} -dbtype {params.dbtype} -out blastemp/blastDB;
mv {params.prefix}.*.fasta blastemp/ && mv {params.prefix}.fasta.* blastemp/;
"""
FASTA_DIR = 'blastemp/'
SAMPLES = glob_wildcards ( join ( FASTA_DIR , '{sample,[^/]+}.fasta' ))
PATTERN = '{sample}.fasta'
rule blast :
input :
fastas = join ( FASTA_DIR , PATTERN ),
datab = 'blastemp/blastDB.nsq'
output :
'blastemp/{sample}.tsv'
params :
blasthreads = blasthreads ,
blast_type = rblast
shell : """
{params.blast_type} -db blastemp/blastDB -query {input.fastas} -outfmt 6 -out {output} -num_threads {params.blasthreads}
"""
#add -evalue 0.01
rule catblast :
input :
expand ( "blastemp/{sample}.tsv" , sample = SAMPLESX )
output :
prefix + '.blast.tsv'
shell : """
cat {input} > {output}
rm -r blastemp/
"""