Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation #483

Open
HaidYi wants to merge 77 commits into nf-core:dev from HaidYi:rundbcan

Conversation

@HaidYi

@HaidYi commented Jul 2, 2025

PR checklist

Closes #481.

The main changes include:

  • Like the other screening tools, added a dedicated subworkflow (subworkflows/dbcan.nf) to support run_dbcan screening.
  • Added an annotation step for generating .gff files and added aliases of the current modules (e.g., PYRODIGAL_GFF), so the gbk input column may also take a .gff file. Feel free to change this part, as it may need some tweaks to both the pipeline and the documentation.
  • Other utilities:
    • CI/CD, testing profiles for dbcan, modules.config, etc.
    • documentation: README and output docs

Changes needed from the maintainers:

  • Add a changelog entry for this change in the next release version.
  • Add the dbcan screening step to the schematic workflow.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/funcscan branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@HaidYi self-assigned this Jul 2, 2025
@HaidYi added the enhancement (Improvement for existing functionality) label Jul 2, 2025
@nf-core-bot
Member

nf-core-bot commented Jul 2, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

Collaborator

@jasmezz left a comment

What a great addition! @HaidYi, I really appreciate your effort; your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

  • Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
    • This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate
    • file tree in output docs
    • test names
    • nextflow_schema.json ...
  • The database download takes very long because of the low download rate (>2 GB at a rate of ~1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
  • Adding the manual dbCAN database download (via Bioconda) to the respective section in the usage docs.
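As a rough sketch, the renaming proposed above could be done in modules.config roughly like this (the `withName` selector and config structure are assumptions based on the module names mentioned in this PR, not the actual file):

```nextflow
process {
    withName: RUNDBCAN_EASYCGC {
        publishDir = [
            // top level is the purpose (cazyme), then the step-specific subdirectory
            path: { "${params.outdir}/cazyme/cgc" },
            mode: params.publish_dir_mode,
        ]
    }
}
```

The `cazyme_annotation` and `substrate` entries would follow the same pattern with their respective paths.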

Comment on lines +35 to +36
dbcan_skip_cgc = true // skip cgc as .gbk is used
dbcan_skip_substrate = true // skip substrate as .gbk is used
Collaborator

If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
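For illustration, such a samplesheet could look like the following (only the `gff` column is the proposed addition; the other column names are illustrative and may not match the pipeline's actual pre-annotated samplesheet schema):

```csv
sample,fasta,protein,gbk,gff
sample_1,sample_1.fasta,sample_1.faa,sample_1.gbk,sample_1.gff
sample_2,sample_2.fasta,sample_2.faa,sample_2.gbk,sample_2.gff
```

The existing `gbk` column would keep feeding the other subworkflows, while the optional `gff` column would feed the CAZyme subworkflow.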

docs/output.md Outdated
- `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
- `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
- `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
- `cgc`
Collaborator

Many of the files in the cgc and substrate sections seem duplicated. Maybe we don't need to store those that are already created in the cazyme step? We can control this in modules.config (e.g., see the RGI_MAIN entry).

Author

@jasmezz Thank you for reviewing the code. I will revise it based on your comments.

Member

@jfy133 left a comment

Really good first PR @HaidYi! Clean, and pretty much all of my comments are minor/just polishing.

Some additional points beyond my direct comments:

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Member

We should probably add gff files!

You can generate them from a normal funcscan run, and make a PR against the funcscan branch of nf-core/test-datasets, which holds the files and an updated samplesheet for the next funcscan version.

Author

Yes, currently the CAZyme screening can only use .gff files in the pipeline. To use pre-annotated input, I generated the .gff files with pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Member

Can this be updated now you have the file?

docs/output.md Outdated
| ├── deepbgc/
| ├── gecco/
| └── hmmsearch/
├── dbcan/
Member

The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory for each tool (in this case dbcan), and within that each of the different output directories.

docs/output.md Outdated

- `dbcan/`
- `cazyme`
- `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation
Member

You're missing the <sample.id> sample subdirectory underneath the tool name (according to your modules.config).

.join(ch_gffs_for_rundbcan)
.multiMap { meta, faa, gff ->
    faa: [meta, faa]
    gff: [meta, gff, 'prodigal']
Member

Is the gff always from prodigal? Or is this a dummy value?

Author

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If the gff is generated in the pipeline, it is always prodigal. But if a pre-annotated one is provided, it could be either NCBI_prok, JGI, NCBI_euk, or prodigal. This makes things complicated. An easier way would be to define a CLI parameter for this option, but it's hard to handle the mixed case in a batch without modifying the samplesheet.
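A per-sample alternative could carry the annotation source in the samplesheet instead of the CLI. A sketch, assuming a hypothetical `gff_type` samplesheet column propagated into `meta` (channel names other than `ch_gffs_for_rundbcan` are invented for illustration):

```nextflow
ch_input_for_rundbcan = ch_faas_for_rundbcan
    .join(ch_gffs_for_rundbcan)
    .multiMap { meta, faa, gff ->
        faa: [meta, faa]
        // use the per-sample annotation source if declared, otherwise assume prodigal
        gff: [meta, gff, meta.gff_type ?: 'prodigal']
    }
```

This would keep a single channel shape while letting mixed batches declare NCBI_prok, JGI, NCBI_euk, or prodigal per row.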

HaidYi and others added 4 commits July 16, 2025 19:24
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@HaidYi
Author

HaidYi commented Jul 17, 2025

@jfy133 Thank you for the comments and suggestions. I will fix all the problems one by one. As I don't want this PR to break the other screening steps, I will do more comprehensive testing, which may take more time. I will let you know when I have fixed all the issues.

@HaidYi
Author

HaidYi commented Dec 12, 2025

@jfy133 Thank you for pointing this out. I contacted the tool author about this issue: the problem is that the server at UNL has low upload bandwidth. The authors have just been approved to host the data on AWS S3, so they will update the nf-core rundbcan_database module once they finish the transition of the database to S3.

I think the timeout problem in the tests will then be resolved automatically when I pull the newest module. I will keep you in the loop.

@jfy133
Member

jfy133 commented Dec 12, 2025

Ok! Let's see if it helps 👍👍

@jfy133
Member

jfy133 commented Dec 17, 2025

@HaidYi I will check again after the holidays, but I just had a thought: it may also be an idea to ask the developer to make a mini database anyway. It could be useful for other cases too; it just needs to include a couple of gene sequences so there is something compatible with running db_can (even if the output is nonsense).

@HaidYi
Author

HaidYi commented Jan 7, 2026

@jfy133 Happy new year! I hope you had a great holiday. Thanks to @Xinpeng021001's work, the db_can tool has moved its database hosting from a local server at the university to Amazon S3, supported by the AWS Open Data Sponsorship Program, and has released a new version (v5.2.2) to reflect this change.

So, the next step is to update the dbcan nf-core module and solve the slow database download problem in this PR as well. Will keep you posted on progress. Thanks.

@jfy133
Member

jfy133 commented Jan 7, 2026

Wonderful, and thank you @Xinpeng021001! Much appreciated!

I'll keep an eye on this PR for updates (I just resolved a docs conflict) :)

@HaidYi
Author

HaidYi commented Feb 4, 2026

@jfy133 I updated the rundbcan module to use AWS for database downloading (nf-core/modules#9768). This PR no longer suffers from the long database download times. Please review again.

@HaidYi requested a review from jfy133 on February 4, 2026 16:04
Member

@jfy133 left a comment

OK we are ALMOST DONE @HaidYi 🎉! Thank you for your patience!

Here are the last points/questions (summarising some of the specific comments too), but otherwise the code looks great. I've checked against our pipeline conventions (now on dev here), and you're already following them 💪:

Conceptual

  1. Can you confirm there are no db_can <subcmd> options/arguments that we should expose to the user via a pipeline parameter? E.g., for run_dbcan the --mode or --methods parameters? Or for cgc_finder the --use_distance parameter?

Code

  1. test_preannotated_cazyme.conf: You are missing an nf-test file in tests/ and its snapshot for the new test config

Documentation

  1. usage.md: missing documentation in the samplesheet section about the new gff column
  2. nextflow_schema.json: missing the long-form helptext(s) describing when you would want to skip the cgc and substrate detection
  3. CHANGELOG.md: missing a changelog entry for the PR; also please make sure to add the version of db_can as a new dependency (i.e., the previous version column can be empty)
  4. README.md: don't forget to add yourself to the `credits` list!
  5. nextflow.config: don't forget to add yourself to the manifest section as a contributor!
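For the schema helptext point, such an entry could look roughly like this (the `help_text` wording is illustrative, not taken from the pipeline):

```json
"dbcan_skip_cgc": {
    "type": "boolean",
    "description": "Skip CGC detection during the dbCAN screening.",
    "help_text": "CGC (CAZyme Gene Cluster) detection requires gene coordinates from a GFF file. Skip this step when only pre-annotated GBK files are supplied in the samplesheet, or when you are only interested in per-protein CAZyme family annotation."
}
```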

Comment on lines +35 to +36
dbcan_skip_cgc = false // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
dbcan_skip_substrate = false // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet
Member

Unless the GBK/GFF files are mutually exclusive as input to funcscan, I would argue it would make sense to include the GFF file in the samplesheet_preannotated.csv samplesheet.

But it would be nice if, in another test profile (maybe test_cazyme_prokka), you also tested the dbcan_skip_cgc and dbcan_skip_substrate functionality?
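A minimal sketch of what such a profile might contain, with the skip flags enabled (the file name comes from the comment above; input/database parameters and everything else a real test config needs are omitted here and assumed to mirror the existing test profiles):

```nextflow
// conf/test_cazyme_prokka.config (hypothetical)
params {
    annotation_tool      = 'prokka'
    run_cazyme_screening = true
    // exercise the skip functionality
    dbcan_skip_cgc       = true
    dbcan_skip_substrate = true
}
```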

},
"dbcan_skip_cgc": {
"type": "boolean",
"description": "Skip CGC during the dbCAN screening.",
Member

Still missing

@@ -0,0 +1,37 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
Member

This file is still missing a tests/test.nf.test file and associated snapshot

Member

As you've added a new optional column to the samplesheet, you need to add a description of it near the top of this page in the relevant section.

@jfy133
Member

jfy133 commented Mar 14, 2026

@nf-core-bot fix linting

@jfy133
Member

jfy133 commented Mar 14, 2026

Hrm, the test failure error is:

> [7a/0497d8] Submitted process > NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)
    > ERROR ~ Error executing process > 'NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)'
    > 
    > Caused by:
    >   Process exceeded running time limit (1h)
    > 
    > 
    > Command executed:
    > 
    >   run_dbcan easy_substrate \
    >       --mode protein \
    >       --db_dir dbcan_db \
    >       --input_raw_data sample_3_pyrodigal.faa \
    >       --output_dir . \
    >       --input_gff sample_3_pyrodigal.gff \
    >       --gff_type prodigal \
    >   
    >   
    >   mv overview.tsv             sample_3_overview.tsv
    >   mv dbCAN_hmm_results.tsv    sample_3_dbCAN_hmm_results.tsv
    >   mv dbCANsub_hmm_results.tsv sample_3_dbCANsub_hmm_results.tsv
    >   mv diamond.out              sample_3_diamond.out
    >   mv cgc.gff                  sample_3_cgc.gff
    >   mv cgc_standard_out.tsv     sample_3_cgc_standard_out.tsv
    >   mv diamond.out.tc           sample_3_diamond.out.tc
    >   mv STP_hmm_results.tsv      sample_3_STP_hmm_results.tsv
    >   mv total_cgc_info.tsv       sample_3_total_cgc_info.tsv
    >   mv CGC.faa                  sample_3_CGC.faa
    >   mv PUL_blast.out            sample_3_PUL_blast.out
    >   mv substrate_prediction.tsv sample_3_substrate_prediction.tsv
    >   mv synteny_pdf/             sample_3_synteny_pdf/
    >   if [ -f TF_hmm_results.tsv ]; then
    >       mv TF_hmm_results.tsv   sample_3_TF_hmm_results.tsv
    >   fi
    >   
    >   cat <<-END_VERSIONS > versions.yml
    >   "NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE":
    >       dbcan: $(echo $(run_dbcan version) | cut -f2 -d':' | cut -f2 -d' ')
    >   END_VERSIONS
    > 
    > Command exit status:
    >   -
    > 
    > Command output:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Command wrapper:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Work dir:
    >   /home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/work/7a/0497d8588d4be4dfe1db4f5e81ce36
    > 
    > Container:
    >   quay.io/biocontainers/dbcan:5.2.6--pyhdfd78af_0
    > 
    > Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
    > 
    >  -- Check '/home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/meta/nextflow.log' file for details
    > Execution cancelled -- Finishing pending tasks before exit
    > ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

Do you expect EASYSUBSTRATE to be so slow @HaidYi ?

@HaidYi
Author

HaidYi commented Mar 15, 2026

> Hrm, the test failure error is: […full error log quoted above…]
>
> Do you expect EASYSUBSTRATE to be so slow @HaidYi ?

I don't think so. @Xinpeng021001, could you look into why the substrate prediction on this test dataset takes so long to finish?

@Xinpeng021001
Member

The substrate process shouldn't take such a long time. I'm reviewing the error and will reply asap.
