Precomputed Databases hosted on AWS S3
A separate tutorial explains how to download and use the databases on an AWS EC2 instance.
Instructions for download and use
Install the latest version of the AWS CLI using the instructions on the AWS CLI website.
Using Neo4j Community Edition
First, download one of the database dump files from the AWS S3 bucket (files listed below). For example, to download the Micromonospora database dump file to the current directory:
aws s3 cp s3://socialgene-open-data/2023_v0.4.1/micromonospora/neo4j_db_micromonospora_base.dump .
Then follow the instructions for restoring from a full database dump/backup to rehydrate the Neo4j database. Note that the instructions use Docker, so you will need to have Docker installed or manually install Neo4j.
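As a rough illustration of the restore step, the sketch below wraps a Docker-based load in a shell function. It is not the project's documented procedure. It assumes the dump was written by Neo4j 5.x, where `neo4j-admin database load` reads `<database>.dump` from the directory given by `--from-path`. The function name `restore_dump`, the `DRY_RUN` switch, and the image argument are hypothetical; pass an image whose Neo4j version matches the dump.

```shell
# Hypothetical helper, not the project's documented procedure.
# Assumes Neo4j 5.x tooling (`neo4j-admin database load`).
restore_dump() (
  dump_file="$1"   # absolute path to the downloaded .dump file
  data_dir="$2"    # absolute path to a host directory used as Neo4j's /data
  image="$3"       # Docker image that provides neo4j-admin (caller's choice)

  # Mount the dump as /dumps/neo4j.dump so it loads into the default "neo4j"
  # database without copying or renaming the file on the host.
  set -- docker run --rm \
    --volume "$data_dir:/data" \
    --volume "$dump_file:/dumps/neo4j.dump:ro" \
    "$image" \
    neo4j-admin database load neo4j --from-path=/dumps --overwrite-destination=true

  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"      # print the command instead of running it
  else
    "$@"
  fi
)
```

Running it with `DRY_RUN=1` first prints the exact `docker run` command, which makes it easy to check the mounts before loading a large dump.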
Using Neo4j Enterprise Edition
Ensure you have a valid license for Neo4j Enterprise Edition, then install the latest version of Neo4j Enterprise Edition using the instructions on the Neo4j website.
Follow the directions at https://neo4j.com/docs/operations-manual/current/backup-restore/restore-backup/#restore-cloud-storage for restoring a database directly from cloud storage.
Files Explanation
All files
The following files are included in the AWS S3 bucket:
.
└── 2023_v0.4.1
    ├── actinomycetota
    │   ├── neo4j_db_actinomycetota_base.dump
    │   └── run_info
    │       ├── execution_report_2023-12-02_06-51-28.html
    │       ├── execution_timeline_2023-12-02_06-51-28.html
    │       ├── execution_trace_2023-12-02_06-51-28.txt
    │       ├── params_2023-12-02_15-33-12.json
    │       └── pipeline_dag_2023-12-02_06-51-28.html
    ├── documentation
    │   ├── structure.md
    │   └── summary.md
    ├── hmm_models
    │   ├── hmminfo.gz
    │   ├── socialgene_nr_hmms_file_with_cutoffs_1_of_1.hmm.gz
    │   └── socialgene_nr_hmms_file_without_cutoffs_1_of_1.hmm.gz
    ├── md5checksums.txt
    ├── methods_comparison
    │   ├── methods_comparison.dump
    │   └── run_info
    │       ├── execution_report_2024-06-20_14-56-41.html
    │       ├── execution_timeline_2024-06-20_14-56-41.html
    │       ├── execution_trace_2024-06-20_14-56-41.txt
    │       ├── params_2024-06-20_15-10-51.json
    │       └── pipeline_dag_2024-06-20_14-56-41.html
    ├── micromonospora
    │   ├── neo4j_db_micromonospora_base.dump
    │   └── run_info
    │       ├── execution_report_2023-12-03_11-49-37.html
    │       ├── execution_timeline_2023-12-03_11-49-37.html
    │       ├── execution_trace_2023-12-03_11-49-37.txt
    │       ├── params_2023-12-03_16-54-34.json
    │       └── pipeline_dag_2023-12-03_11-49-37.html
    ├── refseq
    │   ├── command_to_build_neo4j_database_with_docker.sh
    │   ├── import
    │   │   ├── antismash_results.jsonl.gz
    │   │   ├── genomic_info
    │   │   │   ├── 2687a0b8f048c5081fbfe919b52c1727.assemblies.gz
    │   │   │   ├── 51f1cf08d20aa569b0822d6c2cf859c9.assembly_to_taxid.gz
    │   │   │   ├── 8bb70ed3e5f7ee8f8d740f2184207c19.locus_to_protein.gz
    │   │   │   ├── b3ed6e17dba5be04143622d89f77e7dd.loci.gz
    │   │   │   └── d29bdf974a8769a329af5cc5dc5f91c6.assembly_to_locus.gz
    │   │   ├── goterms
    │   │   │   ├── 2068e8e87a576280156e1ec92161d019.goterm_edgelist
    │   │   │   ├── e4916a1a9abc084c587c6172f7509118.goterms
    │   │   │   └── versions.yml
    │   │   ├── hmm_info
    │   │   │   ├── 20263cb059a54d1c773a2e7a23b2c073.sg_hmm_nodes
    │   │   │   └── 527acf97d838e23ff39ae8df6d8261a2.all.hmminfo
    │   │   ├── mmseqs2_cluster
    │   │   │   └── 4e75f1c51535471c3225a8bd78dd2c32.mmseqs2_results_cluster.tsv.gz
    │   │   ├── neo4j_headers
    │   │   │   ├── assembly.header
    │   │   │   ├── assembly_to_locus.header
    │   │   │   ├── assembly_to_taxid.header
    │   │   │   ├── go_to_go.header
    │   │   │   ├── goterms.header
    │   │   │   ├── hmm_source.header
    │   │   │   ├── hmm_source_relationships.header
    │   │   │   ├── locus.header
    │   │   │   ├── locus_to_protein.header
    │   │   │   ├── mmseqs2.header
    │   │   │   ├── parameters.header
    │   │   │   ├── protein_ids.header
    │   │   │   ├── protein_to_go.header
    │   │   │   ├── protein_to_hmm_header.header
    │   │   │   ├── sg_hmm_nodes.header
    │   │   │   ├── taxid.header
    │   │   │   ├── taxid_to_taxid.header
    │   │   │   ├── tigrfam_mainrole.header
    │   │   │   ├── tigrfam_role.header
    │   │   │   ├── tigrfam_subrole.header
    │   │   │   ├── tigrfam_to_go.header
    │   │   │   ├── tigrfam_to_role.header
    │   │   │   ├── tigrfamrole_to_mainrole.header
    │   │   │   ├── tigrfamrole_to_subrole.header
    │   │   │   └── versions.yml
    │   │   ├── parameters
    │   │   │   ├── d89feb1a1348b51bb4bcf295af700f51.socialgene_parameters.gz
    │   │   │   └── versions.yml
    │   │   ├── parsed_domtblout
    │   │   │   ├── 40dcdeb59968818c8ef3fffa35971947.parseddomtblout.gz
    │   │   │   └── versions.yml
    │   │   ├── protein_info
    │   │   │   ├── c04b2d3997942e7cb5ac7c292aa73afb.protein_to_go.gz
    │   │   │   └── ecb0e13c7f82cc39004f9e318bdecd98.protein_ids.gz
    │   │   ├── taxdump_process
    │   │   │   ├── 842f6c6514f6c81e4ca6a30ce7ec9772.nodes_taxid.gz
    │   │   │   ├── e1973763d9fb63342e7169968a572b7c.taxid_to_taxid.gz
    │   │   │   └── versions.yml
    │   │   └── tigrfam_info
    │   │       ├── 2deedec3965b082a1536f2a7612820d7.tigrfamrole_to_mainrole.gz
    │   │       ├── 3a5edc8721c146058e104678b250fff2.tigrfam_subrole.gz
    │   │       ├── 4da23e7a3d5bb06e3cf41d8a398aeb99.tigrfam_to_go.gz
    │   │       ├── 5ad847d09afb9716bc3c155cba2f89f3.tigrfam_mainrole.gz
    │   │       ├── 93ad0a4066afbbf2ceb20a21a73e7178.tigrfam_role.gz
    │   │       ├── a0d6530f10e4593fd79ba286a407ac90.tigrfamrole_to_subrole.gz
    │   │       ├── fc03559179b378cda7d37ba580601588.tigrfam_to_role.gz
    │   │       └── versions.yml
    │   ├── neo4j_db_refseq_base.dump
    │   └── run_info
    │       ├── params_2023-11-30_18-13-52.json
    │       └── pipeline_dag_2023-12-01_13-24-10.html
    ├── refseq_antismash_bgcs
    │   ├── neo4j_db_refseq_antismash_bgcs_base.dump
    │   └── run_info
    │       ├── execution_report_2023-12-11_15-30-04.html
    │       ├── execution_timeline_2023-12-11_15-30-04.html
    │       ├── execution_trace_2023-12-11_15-30-04.txt
    │       ├── params_2023-12-12_06-00-22.json
    │       └── pipeline_dag_2023-12-11_15-30-04.html
    └── streptomyces
        ├── neo4j_db_streptomyces_base.dump
        └── run_info
            ├── execution_report_2023-12-02_18-46-19.html
            ├── execution_timeline_2023-12-02_18-46-19.html
            ├── execution_trace_2023-12-02_18-46-19.txt
            ├── params_2023-12-03_15-35-49.json
            └── pipeline_dag_2023-12-02_18-46-19.html
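The bucket includes `md5checksums.txt`, which suggests downloads can be checked for corruption. The sketch below is one possible way to do that. It assumes the file uses the common `md5sum` layout (one `<digest>  <path>` pair per line), which has not been confirmed against the actual file, and that GNU `md5sum` is available. The function name `verify_md5` is hypothetical.

```shell
# Hypothetical helper. ASSUMPTION: md5checksums.txt uses the md5sum layout
# "<32-hex-digest>  <path>" per line; check the real file before relying on this.
verify_md5() (
  checksum_file="$1"
  target="$2"
  name=$(basename "$target")

  # Expected digest: first field of the first line whose path ends in the file name.
  expected=$(awk -v n="$name" '{
      p = $2
      sub(/^\*/, "", p)    # md5sum marks binary mode with a leading "*"
      sub(/.*\//, "", p)   # keep only the base name of the listed path
      if (p == n) { print $1; exit }
  }' "$checksum_file")

  if [ -z "$expected" ]; then
    echo "no checksum entry for $name" >&2
    return 2
  fi

  actual=$(md5sum "$target" | awk '{ print $1 }')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $name"
  else
    echo "MISMATCH: $name" >&2
    return 1
  fi
)
```

For example, after copying `s3://socialgene-open-data/2023_v0.4.1/md5checksums.txt` to the current directory with `aws s3 cp`, a call such as `verify_md5 md5checksums.txt neo4j_db_micromonospora_base.dump` would report whether the local dump matches its listed digest.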
Database files
The included database dumps and disk space requirements are described in precomputed_databases/2023_v0.4.1/general.
The paths to just the dumps are:
- All RefSeq:
s3://socialgene-open-data/2023_v0.4.1/refseq/neo4j_db_refseq_base.dump
- All RefSeq Actinomycetota:
s3://socialgene-open-data/2023_v0.4.1/actinomycetota/neo4j_db_actinomycetota_base.dump
- All RefSeq Streptomyces:
s3://socialgene-open-data/2023_v0.4.1/streptomyces/neo4j_db_streptomyces_base.dump
- All RefSeq Micromonospora:
s3://socialgene-open-data/2023_v0.4.1/micromonospora/neo4j_db_micromonospora_base.dump
- All RefSeq antiSMASH-7.0 BGCs:
s3://socialgene-open-data/2023_v0.4.1/refseq_antismash_bgcs/neo4j_db_refseq_antismash_bgcs_base.dump
- Three genomes used for protein similarity method comparisons:
s3://socialgene-open-data/2023_v0.4.1/methods_comparison/methods_comparison.dump
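Because most of the dump paths above share one naming pattern, a small lookup function can save retyping them. The sketch below encodes only the six paths listed above. The function names `dump_uri` and `download_dump` and the variables `SG_BUCKET` and `SG_RELEASE` are hypothetical, introduced here for illustration.

```shell
# Bucket and release are taken from the dump paths listed above.
SG_BUCKET="s3://socialgene-open-data"
SG_RELEASE="2023_v0.4.1"

# Print the S3 URI of a database dump, given its directory name.
dump_uri() {
  case "$1" in
    refseq|actinomycetota|streptomyces|micromonospora|refseq_antismash_bgcs)
      printf '%s/%s/%s/neo4j_db_%s_base.dump\n' "$SG_BUCKET" "$SG_RELEASE" "$1" "$1"
      ;;
    methods_comparison)
      # This dump does not follow the neo4j_db_<name>_base.dump pattern.
      printf '%s/%s/methods_comparison/methods_comparison.dump\n' "$SG_BUCKET" "$SG_RELEASE"
      ;;
    *)
      echo "unknown database: $1" >&2
      return 1
      ;;
  esac
}

# Download one dump into a target directory (default: current directory).
download_dump() (
  uri=$(dump_uri "$1") || return 1
  aws s3 cp "$uri" "${2:-.}"
)
```

For example, `download_dump streptomyces` copies the Streptomyces dump to the current directory, and `dump_uri refseq` only prints the path so it can be passed to other tools.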
HMM models
The files in 2023_v0.4.1/hmm_models are the less-redundant HMM models used to annotate proteins in each of the 2023_v0.4.1 databases. Therefore, the socialgene_nr_hmms_file_with_cutoffs_1_of_1.hmm.gz and socialgene_nr_hmms_file_without_cutoffs_1_of_1.hmm.gz files are the same for all of the databases. These files are required for functions like the SocialGene BGC search. The hmminfo.gz file is a gzipped file containing the metadata for the less-redundant HMM models.
Flat files
The TSV flat files included in the 2023_v0.4.1/refseq/import directory may be useful for building custom databases (including non-Neo4j databases) or for other analyses. The tables below map each flat file to its column header file. All of the flat files are gzip compressed even if the .gz extension is not present in the filename.
The paths in the tables below start with the import directory, which is located in the refseq database directory (2023_v0.4.1/refseq/import).
neo4j_type | neo4j_label | neo4j_header_path | flat_file_path |
---|---|---|---|
node | tigrfam_mainrole | import/neo4j_headers/tigrfam_mainrole.header | import/tigrfam_info/*.tigrfam_mainrole.* |
node | tigrfam_subrole | import/neo4j_headers/tigrfam_subrole.header | import/tigrfam_info/*.tigrfam_subrole.* |
node | parameters | import/neo4j_headers/parameters.header | import/parameters/*.socialgene_parameters.* |
node | hmm | import/neo4j_headers/sg_hmm_nodes.header | import/hmm_info/*.sg_hmm_nodes.* |
node | assembly | import/neo4j_headers/assembly.header | import/genomic_info/*.assemblies.* |
node | hmm_source | import/neo4j_headers/hmm_source.header | import/hmm_info/*.hmminfo.* |
node | tigrfam_role | import/neo4j_headers/tigrfam_role.header | import/tigrfam_info/*.tigrfam_role.* |
node | goterm | import/neo4j_headers/goterms.header | import/goterms/*.goterms.* |
node | protein | import/neo4j_headers/protein_ids.header | import/protein_info/*.protein_ids.* |
node | taxid | import/neo4j_headers/taxid.header | import/taxdump_process/*.nodes_taxid.* |
node | nucleotide | import/neo4j_headers/locus.header | import/genomic_info/*.loci.* |
neo4j_type | neo4j_label | neo4j_header_path | flat_file_path |
---|---|---|---|
relationship | GO_ANN | import/neo4j_headers/tigrfam_to_go.header | import/tigrfam_info/*.tigrfam_to_go.* |
relationship | SUBROLE_ANN | import/neo4j_headers/tigrfamrole_to_subrole.header | import/tigrfam_info/*.tigrfamrole_to_subrole.* |
relationship | MMSEQS2 | import/neo4j_headers/mmseqs2.header | import/mmseqs2_cluster/*.mmseqs2_results_cluster.tsv.* |
relationship | ANNOTATES | import/neo4j_headers/protein_to_hmm_header.header | import/parsed_domtblout/*.parseddomtblout.* |
relationship | IS_TAXON | import/neo4j_headers/assembly_to_taxid.header | import/genomic_info/*.assembly_to_taxid.* |
relationship | ROLE_ANN | import/neo4j_headers/tigrfam_to_role.header | import/tigrfam_info/*.tigrfam_to_role.* |
relationship | ENCODES | import/neo4j_headers/locus_to_protein.header | import/genomic_info/*.locus_to_protein.* |
relationship | SOURCE_DB | import/neo4j_headers/hmm_source_relationships.header | import/hmm_info/*.hmminfo.* |
relationship | TAXON_PARENT | import/neo4j_headers/taxid_to_taxid.header | import/taxdump_process/*.taxid_to_taxid.* |
relationship | PROTEIN_TO_GO | import/neo4j_headers/protein_to_go.header | import/protein_info/*.protein_to_go.* |
relationship | ASSEMBLES_TO | import/neo4j_headers/assembly_to_locus.header | import/genomic_info/*.assembly_to_locus.* |
relationship | MAINROLE_ANN | import/neo4j_headers/tigrfamrole_to_mainrole.header | import/tigrfam_info/*.tigrfamrole_to_mainrole.* |
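To inspect a flat file next to its column names, the header file can be printed ahead of the first few decompressed rows. The sketch below assumes the `.header` files are plain text (only the flat files are described as gzip compressed). The function name `preview_flat_file` is hypothetical.

```shell
# Hypothetical helper: print a header file followed by the first rows of the
# matching flat file. ASSUMPTION: header files are plain, uncompressed text.
preview_flat_file() (
  header="$1"      # e.g. import/neo4j_headers/protein_ids.header
  flat="$2"        # e.g. import/protein_info/<hash>.protein_ids.gz
  rows="${3:-5}"   # number of data rows to show (default: 5)

  cat "$header"
  # Feed the file through stdin so gzip ignores the file name; this also works
  # for flat files that are gzip compressed but lack a .gz extension.
  gzip -dc < "$flat" | head -n "$rows"
)
```

For example, from inside 2023_v0.4.1/refseq, `preview_flat_file import/neo4j_headers/protein_ids.header import/protein_info/ecb0e13c7f82cc39004f9e318bdecd98.protein_ids.gz` prints the protein node columns followed by five rows.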