Inputs
There are a few different ways to input proteins and/or genomes into the Nextflow workflow.
Local files
Genomes
To run the pipeline on GenBank files that are already downloaded or otherwise local (e.g. .gbk or .gbff), provide the path to the files via --local_genbank. The path can be a glob pattern.
Note
If you are entering parameters on the command line and using a glob, enclose the path/glob in single quotes to prevent the shell from expanding it before Nextflow receives it.
Note
The GenBank files must already contain protein sequences (SocialGene doesn't currently do any gene/ORF prediction).
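One way to check this requirement before a run is to look for /translation qualifiers, which is where GenBank flat files store annotated protein sequences. The following is only a sketch: both function names are hypothetical helpers, not part of SocialGene, and the check assumes standard GenBank flat-file formatting.

```shell
# Hypothetical helper (not part of SocialGene): print the number of
# /translation qualifiers, i.e. annotated protein sequences, in one file.
count_translations() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # helper usable in scripts that run with "set -e".
  grep -c '^[[:space:]]*/translation=' "$1" || true
}

# Hypothetical helper: print each file passed as an argument that
# contains no /translation qualifiers at all.
list_files_without_proteins() {
  for gbk in "$@"; do
    if [ "$(count_translations "$gbk")" -eq 0 ]; then
      echo "$gbk"
    fi
  done
}
```

For example, `list_files_without_proteins /path/to/genbank/files/*.gbk` would print any files that lack protein sequences. The glob is deliberately left unquoted there, because the shell itself has to expand it for the helper.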
nextflow run \
socialgene/sgnf \
--local_genbank '/path/to/genbank/files/*.gbk' \
...
...
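To see why the quoting matters, the short demonstration below compares what a program receives for a quoted and an unquoted glob. It is a sketch only: count_args is a hypothetical stand-in for any command, and the files are empty placeholders in a temporary directory.

```shell
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

# Create a throwaway directory with three empty placeholder files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/a.gbk" "$demo_dir/b.gbk" "$demo_dir/c.gbk"

# Unquoted glob: the shell expands it into one argument per matching file.
unquoted_count="$(count_args "$demo_dir"/*.gbk)"

# Single-quoted glob: the pattern is passed through as one literal argument.
quoted_count="$(count_args "$demo_dir"'/*.gbk')"

echo "unquoted: $unquoted_count argument(s)"
echo "quoted:   $quoted_count argument(s)"

rm -rf "$demo_dir"
```

The unquoted form reports 3 arguments and the quoted form reports 1, which is why the quoted form is the one to pass to --local_genbank.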
Proteins
You can also input non-genomic proteins using local protein FASTA files (e.g. .faa). These will be connected to "nucleotide" and "assembly" nodes with a filename identifier.
Provide the path to the files via --local_fasta. The path can be a glob pattern.
Note
If you are entering parameters on the command line and using a glob, enclose the path/glob in single quotes to prevent the shell from expanding it before Nextflow receives it.
nextflow run \
socialgene/sgnf \
--local_fasta '/path/to/protein/fasta/files/*.fasta' \
...
...
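Before starting a run, it can help to see how many protein records each FASTA file holds. The sketch below counts header lines (those beginning with ">"). Both function names are hypothetical helpers rather than anything SocialGene provides.

```shell
# Hypothetical helper: count FASTA records (header lines starting with ">").
count_fasta_records() {
  # grep -c exits non-zero when the count is 0; "|| true" hides that status.
  grep -c '^>' "$1" || true
}

# Hypothetical helper: print "<count><TAB><file>" for each file given.
summarize_fasta_files() {
  for fasta in "$@"; do
    printf '%s\t%s\n' "$(count_fasta_records "$fasta")" "$fasta"
  done
}
```

For example, `summarize_fasta_files /path/to/protein/fasta/files/*.fasta` prints one line per file. The glob is left unquoted in that call so that the shell expands it for the helper.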
Retrieve genomes from NCBI
ncbi-genome-download (preferred)
The Nextflow workflow includes Kai Blin's ncbi-genome-download tool, which can be used to retrieve genomes from NCBI. To use it, set the Nextflow workflow parameter ncbi_genome_download_command, which simply passes an argument string to the ncbi-genome-download command. See the tool's website for examples.
For example, the following will download and run the workflow on all "Paraburkholderia acidicola" genomes available within GenBank.
nextflow run \
socialgene/sgnf \
--ncbi_genome_download_command 'bacteria --section genbank --genera "Paraburkholderia acidicola"' \
...
...
Warning
It is very easy to download a LOT of genomes/data with this tool. To get an idea of how many genomes a query might return, you can first do an interactive search of a taxon, etc, here: https://www.ncbi.nlm.nih.gov/datasets/genome
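Another way to preview a query is the tool's own --dry-run option, which checks which files would be downloaded without fetching them. The sketch below assumes ncbi-genome-download is also installed locally, outside the workflow. preview_ngd_query and ngd_args are hypothetical names used only for illustration, and how the workflow itself handles the argument string may differ.

```shell
# The same argument string that would be given to
# --ncbi_genome_download_command.
ngd_args='bacteria --section genbank --genera "Paraburkholderia acidicola"'

# Hypothetical helper: run the query in dry-run mode and print the matches.
# eval is used so that the inner double quotes in the string are honored.
preview_ngd_query() {
  if ! command -v ncbi-genome-download >/dev/null 2>&1; then
    echo "ncbi-genome-download not found on PATH" >&2
    return 127
  fi
  eval "ncbi-genome-download --dry-run $1"
}

# Usage (contacts NCBI, so it is left commented out here):
# preview_ngd_query "$ngd_args"
```

If the preview lists far more assemblies than expected, narrow the query before passing it to the workflow.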
NCBI datasets
NCBI has a new-ish command line tool for downloading genomes, called NCBI datasets. A command may be passed to datasets download by using the Nextflow pipeline parameter ncbi_datasets_command.
e.g. For all assemblies within the genus Micromonospora you could use: ncbi_datasets_command = 'genome taxon "micromonospora"'
e.g. For the strain Micromonospora sp. B006 you could use: ncbi_datasets_command = 'genome accession GCF_003408515.1'
nextflow run \
socialgene/sgnf \
--ncbi_datasets_command 'genome taxon "micromonospora"' \
...
...
Warning
It is very easy to download a LOT of genomes/data with this tool. To get an idea of how many genomes a query might return, you can first do an interactive search of a taxon, etc, here: https://www.ncbi.nlm.nih.gov/datasets/genome
Related to the above warning, the download step of the workflow may take some time, depending on your internet speed and the number of genomes to be downloaded.
HMM models
Prebuilt models
The Nextflow workflow is able to download HMM models from any or all of the following: ["antismash","amrfinder","bigslice","classiphage", "ipresto","pfam","prism","resfams","tigrfam","virus_orthologous_groups"].
These can be selected by passing a comma-separated string to the hmmlist parameter:
e.g. --hmmlist 'resfams,antismash'
You can also use --hmmlist all to use all models from all of the databases SocialGene knows about.
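A typo in the comma-separated string is easy to make, so a quick pre-check against the database names listed above may be useful. This is a minimal sketch: unknown_hmm_sources and VALID_HMM_SOURCES are hypothetical names, and the workflow may do its own validation that behaves differently (for example regarding whitespace or letter case).

```shell
# Database names copied from the list in this section.
VALID_HMM_SOURCES="antismash amrfinder bigslice classiphage ipresto pfam prism resfams tigrfam virus_orthologous_groups"

# Hypothetical helper: print every name in a comma-separated --hmmlist value
# that is not in the list above. Prints nothing if all names are recognized.
unknown_hmm_sources() {
  # "all" is accepted as-is.
  [ "$1" = "all" ] && return 0
  echo "$1" | tr ',' '\n' | while read -r name; do
    case " $VALID_HMM_SOURCES " in
      *" $name "*) ;;          # recognized, print nothing
      *) echo "$name" ;;       # not recognized
    esac
  done
}
```

For example, `unknown_hmm_sources 'resfams,antismash'` prints nothing, while `unknown_hmm_sources 'pfam,pfma'` prints pfma.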
Note
Depending on your location and internet speed, downloading the models can take some time.
Because they don't change, HMM models are downloaded once to --outdir_download_cache and stored there for reuse between workflow runs. Where possible, SocialGene pulls versioned HMM models. The versions used can be modified using workflow parameters found here.
Custom models
To use your own HMM model, pass its path with the --custom_hmm_file parameter:
e.g. --custom_hmm_file '/path/to/my/hmm.hmm'
The file should be a valid HMMER HMM model.
For info on HMMs see: https://www.ebi.ac.uk/training/online/courses/pfam-creating-protein-families/
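A rough sanity check on a custom model file is sketched below. It assumes the file is in HMMER3 text format, where each profile begins with a header line starting with "HMMER3/" and ends with a line containing "//". looks_like_hmmer3 is a hypothetical helper and is not a full validation of the format.

```shell
# Hypothetical helper: succeed (exit status 0) if the file starts with a
# HMMER3 header line and contains at least one "//" record terminator.
looks_like_hmmer3() {
  head -n 1 "$1" | grep -q '^HMMER3/' && grep -q '^//' "$1"
}

# Usage:
# looks_like_hmmer3 /path/to/my/hmm.hmm && echo "looks like a HMMER3 file"
```

If HMMER is installed, running its hmmstat program on the file is a more thorough check, since it has to parse every profile.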