Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation #483

Open
HaidYi wants to merge 77 commits into nf-core:dev from HaidYi:rundbcan

Conversation

@HaidYi

@HaidYi commented Jul 2, 2025

PR checklist

Closes #481.

The main changes include:

  • Like the other screening tools, added a dedicated subworkflow (subworkflows/dbcan.nf) to support run_dbcan screening.
  • Added an annotation step for generating .gff files and added aliases of the current modules (e.g., PYRODIGAL_GFF), so the gbk input column may also take a .gff file. Feel free to change this part, as it may need some tweaks to both the pipeline and the documentation.
  • Other utilities:
    • CI/CD, testing profiles for dbcan, modules.config, etc.
    • documentation: README and output docs

Changes needed from the maintainers:

  • Add a changelog entry for this change in the next release version.
  • Add the dbcan screening step to the schematic workflow.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/funcscan branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@HaidYi self-assigned this Jul 2, 2025
@HaidYi added the enhancement (Improvement for existing functionality) label Jul 2, 2025
@nf-core-bot
Member

nf-core-bot commented Jul 2, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

Collaborator

@jasmezz left a comment

What a great addition! @HaidYi, I really appreciate your effort; your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

  • Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
    • This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate
    • file tree in output docs
    • test names
    • nextflow_schema.json ...
  • The database download takes very long because of the low download rate (>2 GB at a rate of ~1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
  • Adding the manual dbCAN database download (via Bioconda) to the respective section in the usage docs.
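As a rough sketch, the renaming proposed above could be done in modules.config roughly like this (the `withName` selector and config structure are assumptions based on the module names mentioned in this PR, not the actual file):

```nextflow
process {
    withName: RUNDBCAN_EASYCGC {
        publishDir = [
            // top level is the purpose (cazyme), then the step-specific subdirectory
            path: { "${params.outdir}/cazyme/cgc" },
            mode: params.publish_dir_mode,
        ]
    }
}
```

The `cazyme_annotation` and `substrate` entries would follow the same pattern with their respective paths.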

Comment on lines +35 to +36
dbcan_skip_cgc = true // skip cgc as .gbk is used
dbcan_skip_substrate = true // skip substrate as .gbk is used
Collaborator

If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
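For illustration, such a samplesheet could look like the following (only the `gff` column is the proposed addition; the other column names are illustrative and may not match the pipeline's actual pre-annotated samplesheet schema):

```csv
sample,fasta,protein,gbk,gff
sample_1,sample_1.fasta,sample_1.faa,sample_1.gbk,sample_1.gff
sample_2,sample_2.fasta,sample_2.faa,sample_2.gbk,sample_2.gff
```

The existing `gbk` column would keep feeding the other subworkflows, while the optional `gff` column would feed the CAZyme subworkflow.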

docs/output.md Outdated
- `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
- `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
- `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
- `cgc`
Collaborator

Many of the files in the cgc and substrate sections seem duplicated. Maybe we don't need to store those that are already created in the cazyme step? We can control this in modules.config (e.g., see the RGI_MAIN entry).

Author

@jasmezz Thank you for reviewing the code. I will revise it based on your comments.

Member

@jfy133 left a comment

Really good first PR @HaidYi! Clean, and pretty much all of my comments are minor/just polishing.

Some additional points beyond my direct comments:

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Member

We should probably add gff files!

You can generate them from a normal funcscan run, and make a PR against the funcscan branch of nf-core/test-datasets, which holds the files and an updated samplesheet for the next funcscan version.

Author

Yes, currently the CAZyme screening can only use .gff files in the pipeline. To use pre-annotated input, I generated the .gff files with pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Member

Can this be updated now you have the file?

docs/output.md Outdated
| ├── deepbgc/
| ├── gecco/
| └── hmmsearch/
├── dbcan/
Member

The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory for each tool (in this case dbcan), and within that each of the different output directories.

docs/output.md Outdated

- `dbcan/`
- `cazyme`
- `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation
Member

You're missing the <sample.id> sample subdirectory underneath the tool name (according to your modules.config).

.join(ch_gffs_for_rundbcan)
.multiMap { meta, faa, gff ->
    faa: [meta, faa]
    gff: [meta, gff, 'prodigal']
Member

Is the gff always from prodigal? Or is this a dummy value?

Author

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If the gff is generated in the pipeline, it is always prodigal. But if a pre-annotated one is provided, it could be either NCBI_prok, JGI, NCBI_euk, or prodigal. This makes things complicated. An easier way would be to define a CLI parameter for this option, but it's hard to handle the mixed case in a batch without modifying the samplesheet.
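A per-sample alternative could carry the annotation source in the samplesheet instead of the CLI. A sketch, assuming a hypothetical `gff_type` samplesheet column propagated into `meta` (channel names other than `ch_gffs_for_rundbcan` are invented for illustration):

```nextflow
ch_input_for_rundbcan = ch_faas_for_rundbcan
    .join(ch_gffs_for_rundbcan)
    .multiMap { meta, faa, gff ->
        faa: [meta, faa]
        // use the per-sample annotation source if declared, otherwise assume prodigal
        gff: [meta, gff, meta.gff_type ?: 'prodigal']
    }
```

This would keep a single channel shape while letting mixed batches declare NCBI_prok, JGI, NCBI_euk, or prodigal per row.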

HaidYi and others added 4 commits July 16, 2025 19:24
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@HaidYi
Author

HaidYi commented Jul 17, 2025

@jfy133 Thank you for the comments and suggestions. I will fix all the problems one by one. As I don't want this PR to break the other screening steps, I will do more comprehensive testing, which may take more time. I will let you know when I have fixed all the issues.

@HaidYi
Author

HaidYi commented Dec 12, 2025

@jfy133 Thank you for pointing this out. I contacted the tool author about this issue: the problem is that the server at UNL has low upload bandwidth. The authors have just been approved to host the data on AWS S3, so they will update the nf-core rundbcan_database module once they finish the transition of the database to S3.

I think the timeout problem in the tests will then be resolved automatically when I pull the newest module. I will keep you in the loop.

@jfy133
Member

jfy133 commented Dec 12, 2025

Ok! Let's see if it helps 👍👍

@jfy133
Member

jfy133 commented Dec 17, 2025

@HaidYi I will check again after the holidays, but I just had a thought: it may also be an idea to ask the developer to make a mini database anyway. It could be useful for other cases too; it just needs to include a couple of gene sequences so there is something compatible with running db_can (even if the output is nonsense).

@HaidYi
Author

HaidYi commented Jan 7, 2026

@jfy133 Happy new year! I hope you had a great holiday. Thanks to @Xinpeng021001's work, the db_can tool has moved its database hosting from a local server at the university to Amazon S3, supported by the AWS Open Data Sponsorship Program, and has released a new version (v5.2.2) to reflect this change.

So, the next step is to update the dbcan nf-core module and solve the slow database download problem in this PR as well. Will keep you posted on progress. Thanks.

@jfy133
Member

jfy133 commented Jan 7, 2026

Wonderful, and thank you @Xinpeng021001! Much appreciated!

I'll keep an eye on this PR for updates (I just resolved a docs conflict) :)

@HaidYi
Author

HaidYi commented Feb 4, 2026

@jfy133 I updated the rundbcan module to use AWS for database downloading (nf-core/modules#9768). This PR no longer suffers from the long database download times. Please review again.

@HaidYi requested a review from jfy133 on February 4, 2026 16:04
Member

@jfy133 left a comment

OK we are ALMOST DONE @HaidYi 🎉! Thank you for your patience!

Here are the last points/questions (summarising some of the specific comments too), but otherwise the code looks great. I've checked against our pipeline conventions (now on dev here), and you're already following them 💪:

Conceptual

  1. Can you confirm there are no db_can <subcmd> options/arguments that we should expose to the user via a pipeline parameter? E.g., for run_dbcan the --mode or --methods parameters? Or for cgc_finder the --use_distance parameter?

Code

  1. test_preannotated_cazyme.conf: You are missing an nf-test file in tests/ and its snapshot for the new test config

Documentation

  1. usage.md: missing documentation in the samplesheet section about the new gff column
  2. nextflow_schema.json: missing the long-form helptext(s) describing when you would want to skip the cgc and substrate detection
  3. CHANGELOG.md: missing a changelog entry for the PR; also please make sure to add the version of db_can as a new dependency (i.e., the previous version column can be empty)
  4. README.md: don't forget to add yourself to the `credits` list!
  5. nextflow.config: don't forget to add yourself to the manifest section as a contributor!
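For the schema helptext point, such an entry could look roughly like this (the `help_text` wording is illustrative, not taken from the pipeline):

```json
"dbcan_skip_cgc": {
    "type": "boolean",
    "description": "Skip CGC detection during the dbCAN screening.",
    "help_text": "CGC (CAZyme Gene Cluster) detection requires gene coordinates from a GFF file. Skip this step when only pre-annotated GBK files are supplied in the samplesheet, or when you are only interested in per-protein CAZyme family annotation."
}
```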

Comment on lines +35 to +36
dbcan_skip_cgc = false // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
dbcan_skip_substrate = false // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet
Member

Unless the GBK/GFF files are mutually exclusive as input to funcscan, I would argue it would make sense to include the GFF file in the samplesheet_preannotated.csv samplesheet.

But it would be nice if, in another test profile (maybe test_cazyme_prokka), you also tested the dbcan_skip_cgc and dbcan_skip_substrate functionality?
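A minimal sketch of what such a profile might contain, with the skip flags enabled (the file name comes from the comment above; input/database parameters and everything else a real test config needs are omitted here and assumed to mirror the existing test profiles):

```nextflow
// conf/test_cazyme_prokka.config (hypothetical)
params {
    annotation_tool      = 'prokka'
    run_cazyme_screening = true
    // exercise the skip functionality
    dbcan_skip_cgc       = true
    dbcan_skip_substrate = true
}
```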

},
"dbcan_skip_cgc": {
"type": "boolean",
"description": "Skip CGC during the dbCAN screening.",
Member

Still missing

@@ -0,0 +1,37 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
Member

This file is still missing a tests/test.nf.test file and associated snapshot

Member

As you've added a new optional column to the samplesheet, you need to add a description of it near the top of this page in the relevant section.

@jfy133
Member

jfy133 commented Mar 14, 2026

@nf-core-bot fix linting

@jfy133
Member

jfy133 commented Mar 14, 2026

Hrm, the test failure error is:

> [7a/0497d8] Submitted process > NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)
    > ERROR ~ Error executing process > 'NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)'
    > 
    > Caused by:
    >   Process exceeded running time limit (1h)
    > 
    > 
    > Command executed:
    > 
    >   run_dbcan easy_substrate \
    >       --mode protein \
    >       --db_dir dbcan_db \
    >       --input_raw_data sample_3_pyrodigal.faa \
    >       --output_dir . \
    >       --input_gff sample_3_pyrodigal.gff \
    >       --gff_type prodigal \
    >   
    >   
    >   mv overview.tsv             sample_3_overview.tsv
    >   mv dbCAN_hmm_results.tsv    sample_3_dbCAN_hmm_results.tsv
    >   mv dbCANsub_hmm_results.tsv sample_3_dbCANsub_hmm_results.tsv
    >   mv diamond.out              sample_3_diamond.out
    >   mv cgc.gff                  sample_3_cgc.gff
    >   mv cgc_standard_out.tsv     sample_3_cgc_standard_out.tsv
    >   mv diamond.out.tc           sample_3_diamond.out.tc
    >   mv STP_hmm_results.tsv      sample_3_STP_hmm_results.tsv
    >   mv total_cgc_info.tsv       sample_3_total_cgc_info.tsv
    >   mv CGC.faa                  sample_3_CGC.faa
    >   mv PUL_blast.out            sample_3_PUL_blast.out
    >   mv substrate_prediction.tsv sample_3_substrate_prediction.tsv
    >   mv synteny_pdf/             sample_3_synteny_pdf/
    >   if [ -f TF_hmm_results.tsv ]; then
    >       mv TF_hmm_results.tsv   sample_3_TF_hmm_results.tsv
    >   fi
    >   
    >   cat <<-END_VERSIONS > versions.yml
    >   "NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE":
    >       dbcan: $(echo $(run_dbcan version) | cut -f2 -d':' | cut -f2 -d' ')
    >   END_VERSIONS
    > 
    > Command exit status:
    >   -
    > 
    > Command output:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Command wrapper:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Work dir:
    >   /home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/work/7a/0497d8588d4be4dfe1db4f5e81ce36
    > 
    > Container:
    >   quay.io/biocontainers/dbcan:5.2.6--pyhdfd78af_0
    > 
    > Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
    > 
    >  -- Check '/home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/meta/nextflow.log' file for details
    > Execution cancelled -- Finishing pending tasks before exit
    > ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

Do you expect EASYSUBSTRATE to be so slow @HaidYi ?

@HaidYi
Author

HaidYi commented Mar 15, 2026

> Hrm, the test failure error is: […full error log quoted above…]
>
> Do you expect EASYSUBSTRATE to be so slow @HaidYi ?

I don't think so. @Xinpeng021001, could you look into why the substrate prediction on this test dataset takes so long to finish?

@Xinpeng021001
Member

The substrate process shouldn't take such a long time. I'm reviewing the error and will reply asap.
