Workflow Parameters

The workflow parameters should be included in a configuration file, an example of which can be found at https://raw.githubusercontent.com/mriffle/nf-skyline-dia-ms/main/resources/pipeline.config

The parameters in this file should be changed to indicate the locations of your data, the options you’d like to use for the software included in the workflow, and the capabilities and configuration for the system on which you are running the workflow steps.

The configuration file is roughly organized as:

params {
...
}

profiles {
...
}

mail {
...
}
  • The params section includes locations of data and configuration options for a specific run of the workflow.

  • The profiles section includes parameters that describe the capabilities of the systems that run the steps of the workflow. For example, if running on your local system, this will include things like how many cores and how much RAM may be used by the steps of the workflow. This will not need to be changed for each run of the workflow.

  • The mail section includes configuration options for sending email. This is optional and only necessary if you wish to send emails when the workflow completes. This will not need to be changed for each run of the workflow.

Below is a complete description of all parameters that may be included in these sections.

Note

This workflow can process files stored in PanoramaWeb. When specifying directories or file locations, any paths that begin with https:// will be interpreted as being PanoramaWeb locations.

For example, to process raw files stored in PanoramaWeb, you would have the following in your pipeline.config file:

quant_spectra_dir = 'https://panoramaweb.org/_webdav/path/to/@files/RawFiles/'

Here, https://panoramaweb.org/_webdav/path/to/@files/RawFiles/ is the WebDAV URL of the folder on the Panorama server.

The params Section

Parameters for the params section

Req?

Parameter Name

Description

spectral_library

The path to the spectral library to use. May be a dlib, elib, blib, speclib (DIA-NN), tsv (DIA-NN), or other format supported by EncyclopeDIA or DIA-NN. If a Carafe library is being generated, the Carafe spectral library will override this parameter. This parameter is required for EncyclopeDIA. If omitted when using DIA-NN, DIA-NN will be run in library-free mode. This parameter is ignored when running Cascadia.

fasta

The path to the background FASTA file to use. This parameter is required, except when running Cascadia.

quant_spectra_dir

The path to the directory containing the raw data to be quantified. If using narrow-window DIA and GPF to generate a chromatogram library, this is the location of the wide-window data to be searched using the chromatogram library.

quant_spectra_glob

Which files in this directory to use. Default: *.raw

quant_spectra_regex

Use this regex instead of quant_spectra_glob to select files in quant_spectra_dir. If set, quant_spectra_glob must be set to null. Default: null.

files_per_quant_batch

Randomly select n files per batch in quant_spectra_dir. If null all the files in quant_spectra_dir are used. Default is null.

chromatogram_library_spectra_dir

If you are creating a chromatogram library using GPF and narrow window DIA, this is the path to the directory containing the narrow-window raw data.

chromatogram_library_spectra_glob

Which files in this directory to use. Default: *.raw

chromatogram_library_spectra_regex

Use this regex instead of chromatogram_library_spectra_glob to select files in chromatogram_library_spectra_dir. If set, chromatogram_library_spectra_glob must be set to null. Default: null.

use_vendor_raw

If supported by the search_engine, skip the MSCONVERT step to generate mzMLs and use vendor raw files for the search and to generate the Skyline document. Default is false.

vendor_raw_copy

If use_vendor_raw is set to true, Nextflow will attempt to use hard links to the raw file, which is required by vendor libraries. However, this is not supported in all environments. If hard links are not supported, set this to true to create physical copies of the files instead of hard links. This will use extra space. Default is false.

files_per_chrom_lib

Randomly select n files in chromatogram_library_spectra_dir to use to build the chromatogram library. If null, all the files in chromatogram_library_spectra_dir are used. Default is null.

random_file_seed

The seed used to randomly select files for the files_per_chrom_lib and files_per_quant_batch parameters. A seed is used so that if the workflow is re-run the same sequence of files will be randomly selected each time. Default is 12.

search_engine

Must be set to either 'encyclopedia', 'diann', 'cascadia', or null. If set to 'cascadia', chromatogram_library_spectra_dir, chromatogram_library_spectra_glob, and EncyclopeDIA-specific parameters will be ignored. If set to null, the workflow will skip the search step and generate Skyline document(s) using spectral_library, fasta, and files in quant_spectra_dir. Default: 'encyclopedia'.

replicate_metadata

Metadata annotations for each raw or mzML file. Can be in tsv or csv format. See the Providing replicate metadata section for details of how the file should be formatted. If a metadata file is specified, it will be used to add annotations to the final Skyline document and can be used to color PCA plots in the QC report by specifying the qc_report.color_vars parameter. If this parameter is set to null, the Skyline document annotation step is skipped.

email

The email address to which a notification should be sent upon workflow completion. If no email is specified, no email will be sent. To send email, you must configure mail server settings (see below).
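As a rough illustration of how the parameters above combine, a params section for a DIA-NN run that selects files by regex might look like the sketch below. Every path, the regex, and the email address are placeholders invented for this example, and the regex assumes that file names (not full paths) are matched.

```groovy
params {
    // Placeholder locations; replace with the paths to your own files.
    spectral_library  = '/path/to/library.tsv'
    fasta             = '/path/to/background.fasta'
    quant_spectra_dir = '/path/to/wide_window_raw_files/'

    // Select files with a regex instead of the default glob.
    // quant_spectra_glob must be null whenever quant_spectra_regex is set.
    quant_spectra_glob  = null
    quant_spectra_regex = 'sample_[0-9]+[.]raw$'   // hypothetical pattern

    // Randomly pick 10 files per batch; the seed keeps the selection
    // identical if the workflow is re-run (12 is the documented default).
    files_per_quant_batch = 10
    random_file_seed      = 12

    search_engine = 'diann'

    // Optional: only has an effect if the mail section is configured.
    email = 'you@example.com'
}
```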

params.pdc

Parameters for getting raw files and metadata from the Proteomics Data Commons. All parameters in this section are optional.

Parameter Name

Description

pdc.study_id

When this option is set, raw files and metadata will be downloaded from the PDC. Default: null.

pdc.gene_level_data

A tsv file mapping gene names to NCBI gene IDs and gene metadata. Required for PDC gene reports. Default: null.

pdc.n_raw_files

If this option is set, only n raw files are downloaded. This is useful for testing but otherwise should be null.

pdc.client_args

Additional command line arguments passed to PDC_client. Default is null.

pdc.s3_download

If set to true, download raw files through an S3 transfer instead of over HTTPS. This option will only work if the workflow execution environment is configured to directly access PDC AWS infrastructure. Default is false.
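A minimal sketch of a test run that pulls a small number of files from the PDC is shown below. The study ID is a made-up placeholder, and the dotted names simply mirror the parameter names in the table above.

```groovy
params {
    // Hypothetical study identifier; substitute the ID of the study you need.
    pdc.study_id = 'PDC000000'

    // Download only two raw files while testing.
    // Set this back to null for a full run.
    pdc.n_raw_files = 2

    // Keep the default HTTPS download unless your environment has
    // direct access to the PDC AWS infrastructure.
    pdc.s3_download = false
}
```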

params.carafe

Parameters for Carafe. All parameters in this section are optional.

Parameter Name

Description

carafe.spectra_file

The raw or mzML file used by Carafe to generate the final spectral library. If set to null, Carafe is skipped. Default: null.

carafe.peptide_results_file

The path to a DIA-NN tsv or parquet precursor report file. If this parameter is set, the DIA-NN search will be skipped and this file used. Default: null (run DIA-NN).

carafe.carafe_fasta

FASTA file used by Carafe to generate final spectral library. If null, params.fasta is used.

carafe_cli_options

Command line options to pass to Carafe. Note: Do not set the se, lf_type, -db, -i, or -o parameters; these are handled by the workflow. The default is -min_pep_charge 2 -max_pep_charge 3. See the Carafe GitHub page for details on available parameters.

carafe.diann_fasta

The FASTA file used by the DIA-NN search in the Carafe subworkflow. If not set, either params.carafe_fasta or params.fasta will be used. Default: null.
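The following sketch shows one way the Carafe parameters might be set so that Carafe builds the spectral library from a single file. The path is a placeholder, and the null values are written out only to make the documented fallbacks explicit.

```groovy
params {
    // Placeholder path to the raw or mzML file Carafe should use.
    // Leaving this null skips Carafe entirely.
    carafe.spectra_file = '/path/to/carafe_input_file.mzML'

    // null: fall back to params.fasta for the Carafe library.
    carafe.carafe_fasta = null

    // null: run the DIA-NN search rather than reusing an existing
    // precursor report file.
    carafe.peptide_results_file = null
}
```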

params.msconvert

Parameters for msconvert. All parameters in this section are optional.

Parameter Name

Description

msconvert.do_demultiplex

If starting with raw files, this is the value used by msconvert for the do_demultiplex parameter. Default: true.

msconvert.do_simasspectra

If starting with raw files, this is the value used by msconvert for the do_simasspectra parameter. Default: true.

msconvert.mz_shift_ppm

If starting with raw files, msconvert will shift all mz values by n ppm when converting to mzML. If null the mz values are not shifted. Default: null.
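For illustration, the msconvert options could be set as in the sketch below. The 5 ppm shift is an arbitrary example value, and it is an assumption of this example that the parameter accepts a plain number.

```groovy
params {
    // Documented defaults, written out explicitly.
    msconvert.do_demultiplex  = true
    msconvert.do_simasspectra = true

    // Illustrative value only: shift all m/z values by 5 ppm during
    // conversion to mzML. Leave as null to apply no shift.
    msconvert.mz_shift_ppm = 5
}
```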

params.diann

When using DIA-NN, the chromatogram_library_spectra_dir parameter can optionally be used to create a subset library. The files in chromatogram_library_spectra_dir are searched first, using either the spectral library specified by params.spectral_library or a predicted library generated in the workflow by Carafe or DIA-NN. The resulting subset library, containing only those precursors identified in the first search, is then used to search the files in quant_spectra_dir.

Parameters for DIA-NN. All parameters in this section are optional.

Parameter Name

Description

diann.search_params

The parameters passed to DIA-NN when it is run. Default: '--qvalue 0.01'. Note: Do not set the --fasta, --lib, --threads, --use-quant, --gen-spec-lib, --reanalyse, --rt-profiling, or --id-profiling parameters. These parameters are handled by the DIANN_QUANT and DIANN_MBR processes.

diann.fasta_digest_params

Parameters used when generating a predicted spectral library with DIA-NN. Note: Do not set the --fasta, --predictor, --gen-spec-lib, --fasta-search, or --out-lib parameters. These parameters are handled by the DIANN_BUILD_LIB process.

Default is: '--cut \'K*,R*,!*P\' --unimod4 --missed-cleavages 1 --min-pep-len 8 --min-pr-charge 2 --max-pep-len 30'
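The subset-library behavior described at the start of this section could be configured roughly as follows. All paths are placeholders, and diann.search_params is simply left at its documented default.

```groovy
params {
    search_engine = 'diann'

    // Placeholder path; omit spectral_library to let DIA-NN run in
    // library-free mode instead.
    spectral_library = '/path/to/library.speclib'

    // Files searched first to build the subset library.
    chromatogram_library_spectra_dir = '/path/to/subset_library_files/'

    // Files searched afterwards using the subset library.
    quant_spectra_dir = '/path/to/quant_files/'

    // Documented default. Do not add flags that the workflow sets itself.
    diann.search_params = '--qvalue 0.01'
}
```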

params.encyclopedia and params.cascadia

Parameters for EncyclopeDIA and Cascadia. All parameters in this section are optional.

Parameter Name

Description

encyclopedia.chromatogram.params

If you are generating a chromatogram library for quantification, these are the command-line options passed to EncyclopeDIA during the chromatogram generation step. Default: '-enableAdvancedOptions -v2scoring'. If you do not wish to pass any options to EncyclopeDIA, this must be set to ''.

encyclopedia.quant.params

The command-line options passed to EncyclopeDIA during the quantification step. Default: '-enableAdvancedOptions -v2scoring'. If you do not wish to pass any options to EncyclopeDIA, this must be set to ''.

encyclopedia.save_output

EncyclopeDIA generates many intermediate files that are subsequently processed by the workflow to generate the final results. These intermediate files may be large. If this is set to 'true', these intermediate files will be saved locally in your results directory. Default: 'false'.

cascadia.use_gpu

If set to true, Cascadia will attempt to use the GPU(s) installed on the system where it is running. Do not set to true unless a GPU is available, otherwise an error will be generated. Default: false.
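As a sketch, an EncyclopeDIA run that keeps the default chromatogram options, passes no options at the quantification step, and saves intermediate files might be configured as below. Writing save_output as a Boolean is an assumption of this example, made on the basis that it behaves like the other true/false options.

```groovy
params {
    search_engine = 'encyclopedia'

    // Documented default for the chromatogram library step.
    encyclopedia.chromatogram.params = '-enableAdvancedOptions -v2scoring'

    // An empty string means "pass no options to EncyclopeDIA".
    encyclopedia.quant.params = ''

    // Keep the (possibly large) intermediate files in the results directory.
    encyclopedia.save_output = true
}
```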

params.skyline

Parameters for the params.skyline section. All parameters in this section are optional.

Parameter Name

Description

skyline.skip

If set to true, will skip the creation of a Skyline document. Default: false.

skyline.document_name

The base of the file name of the generated Skyline document. If set to 'human_dia', the output file name would be human_dia.sky.zip. Note: If importing into PanoramaWeb, this is also the name that appears in the list of imported Skyline documents on the project page. Default: final.

skyline.skyr_file

Path(s) (local file system or Panorama WebDAV) to a .skyr file, which is a Skyline report template. Any reports specified in the .skyr file will be run automatically as the last step of the workflow and the results saved in your results directory and (if requested) uploaded to Panorama. The report template(s) can be a single string, or for multiple .skyr files can be given as a list of strings. For example: '/path/to/report.skyr' for a single file, or ['/path/to/report_1.skyr', '/path/to/report_2.skyr'] for multiple files.

skyline.template_file

The Skyline template file used to generate the final Skyline file. By default, a pre-made Skyline template file suitable for EncyclopeDIA or DIA-NN will be used. Specify a file location here to use your own template. Note: The file names in the .zip file must match the name of the zip file itself. E.g., my-skyline-template.zip must contain my-skyline-template.sky.

skyline.protein_parsimony

If true, protein parsimony is performed in Skyline. If false the protein assignments given by the search engine are used as protein groups. Default is false.

skyline.fasta

The fasta file to use as a background proteome in Skyline. If null the same fasta file (params.fasta) used for the DIA search is used. Default is null.

skyline.group_by_gene

If true, when protein parsimony is performed in Skyline protein groups are formed by gene instead of by protein. Default is false.

skyline.minimize

If true, the size of the final Skyline document is minimized. Chromatograms for isotopic peaks that are not in the document are removed from the skyd file and a minimal spectral library is generated by removing spectra that are not in the document. Default is false.

skyline.use_hardlinks

On systems that allow it, setting this to true allows the use of cached Skyline workflow steps and may improve performance on subsequent runs. Note: some systems do not allow this, which will result in an error. Default: false.
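The Skyline options above might be combined as in the following sketch, which names the output document, runs two report templates, and groups proteins by gene. The .skyr paths reuse the placeholder paths from the skyline.skyr_file description.

```groovy
params {
    // Output file will be human_dia.sky.zip.
    skyline.document_name = 'human_dia'

    // A list of report templates; a single string also works for one file.
    skyline.skyr_file = ['/path/to/report_1.skyr', '/path/to/report_2.skyr']

    // Perform protein parsimony in Skyline and form groups by gene.
    skyline.protein_parsimony = true
    skyline.group_by_gene     = true

    // Shrink the final document by dropping unused chromatograms and spectra.
    skyline.minimize = true
}
```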

params.qc_report and params.batch_report

Parameters for QC and batch reports. All parameters in this section are optional.

Parameter Name

Description

qc_report.skip

If set to true, will skip the creation of the QC report. Default: true.

qc_report.normalization_method

Normalization method to use for plots in QC and batch report(s). This option applies to both the QC and batch reports. Available options are DirectLFQ and median. Default is median.

qc_report.imputation_method

Method to use to impute missing precursor peak areas for plots in QC and batch report(s). This option applies to both the QC and batch reports. The only available option is KNN. If set to null, imputation of peak areas is not performed. Default is null.

qc_report.standard_proteins

List of protein names in Skyline document to plot retention times for.

For example: ['iRT', 'sp|P00924|ENO1_YEAST']

If null, the standard protein retention time plot is skipped. Default is null.

qc_report.color_vars

List of metadata variables to color PCA plots by.

For example: ['sample_type', 'strain']

This option applies to both the QC and batch reports. If null, only a single PCA plot colored by file acquisition order is generated. Default is null.

qc_report.export_tables

Export tsv files containing normalized precursor and protein quantities? Default is false.

batch_report.skip

If set to true, will skip the creation of the batch report. Default: true.

batch_report.batch1

Metadata key for batch level 1. If null, the project name in documents is used as the batch variable.

batch_report.batch2

Metadata key for batch level 2. A second batch level is only supported with limma as the batch correction method.

batch_report.covariate_vars

Metadata key(s) to use as covariates for batch correction. If null, no covariates are used.

batch_report.control_key

Metadata key indicating replicates which are controls for CV plots. If null, all replicates are used in CV distribution plot.

batch_report.control_values

Metadata value(s) mapping to control_key indicating whether a replicate is a control.

batch_report.plot_ext

File extension for standalone plots. If null, no standalone plots are produced.
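Because both reports are skipped by default, they have to be switched on explicitly. A minimal sketch that enables them, using only option values named in the table above, could look like this:

```groovy
params {
    // Both reports are skipped unless skip is set to false.
    qc_report.skip    = false
    batch_report.skip = false

    // These two options apply to both the QC and batch reports.
    qc_report.normalization_method = 'DirectLFQ'
    qc_report.imputation_method    = 'KNN'

    // Plot retention times for these proteins (example list from the table).
    qc_report.standard_proteins = ['iRT', 'sp|P00924|ENO1_YEAST']

    // Also write tsv files of normalized precursor and protein quantities.
    qc_report.export_tables = true

    // null: use the project name in the documents as the batch variable.
    batch_report.batch1 = null
}
```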

params.panorama

Parameters for uploading pipeline results to PanoramaWeb. All parameters in this section are optional.

Parameter Name

Description

panorama.upload

Whether or not to upload results to PanoramaWeb. Default: false.

panorama.upload_url

The WebDAV URL of a directory in PanoramaWeb to which to upload the results. Note that panorama.upload must be set to true to upload results.

panorama.import_skyline

If set to true, the generated Skyline document will be imported into PanoramaWeb’s relational database for inline visualization. The import will appear in the parent folder for the panorama.upload_url parameter, and will have the name used for the skyline.document_name parameter. Default: false. Note: panorama.upload must be set to true and skyline.skip must be set to false to use this feature.
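A sketch of the settings needed to upload results and import the Skyline document is shown below. The WebDAV URL is a placeholder modeled on the PanoramaWeb example near the top of this page, not a real folder.

```groovy
params {
    // Uploading must be enabled for either of the other options to matter.
    panorama.upload = true

    // Placeholder WebDAV URL of the destination folder in PanoramaWeb.
    panorama.upload_url = 'https://panoramaweb.org/_webdav/path/to/@files/results/'

    // Importing requires that a Skyline document is actually generated.
    panorama.import_skyline = true
    skyline.skip            = false
}
```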

Running the workflow in multi-batch mode

The workflow can be run in multi-batch mode if the params.search_engine supports it. Currently, the only search engine option that supports multi-batch mode is 'diann'.

To activate multi-batch mode, params.quant_spectra_dir must be a Map in which each key-value pair is a batch name and the path to the MS files for that batch. For example:

params {
  quant_spectra_dir = ['Plate_1': '<path to mzML/raw files>',
                       'Plate_2': '<path to mzML/raw files>']
}

Note: mzML/raw file names cannot be duplicated across batches. If there are duplicate file names, the DIANN_MBR process will fail.

Differences in result files in multi-batch mode

  • A separate Skyline document is generated for each batch and prefixed with the batch name.

    • For example, if params.skyline.document_name is 'human_dia' and using the batches in the example above, 2 documents would be generated.

      1. Plate_1_human_dia.sky.zip

      2. Plate_2_human_dia.sky.zip

  • Any optional Skyline reports will be generated separately for each document.

  • A separate QC report is generated for each Skyline document.

  • If results are uploaded to PanoramaWeb, any mzML files generated in the workflow are put into a separate subdirectory for each batch.

Providing replicate metadata

The replicate_metadata file can be a tsv or csv file where the first column has the header Replicate. The values in the Replicate column should exactly match the names of the mzML or raw files that will be in the Skyline document. The headers of subsequent columns are the names of each metadata variable and the values in each column are the annotations corresponding to each replicate.

Example replicate metadata file format

Replicate        sample_type  strain
replicate_1.raw  test         BALB/cJ
replicate_2.raw  test         C57BL/6J
replicate_3.raw  IBQC         Pool
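To connect a metadata file like the one above to the rest of the workflow, the relevant parameters might be set as in this sketch. The file path is a placeholder, and treating the IBQC rows as controls is only an illustrative choice for this example.

```groovy
params {
    // Placeholder path to a tsv or csv file whose first column is Replicate.
    replicate_metadata = '/path/to/replicate_metadata.tsv'

    // Color PCA plots by the two metadata columns in the example file.
    qc_report.color_vars = ['sample_type', 'strain']

    // Illustrative: use replicates whose sample_type is IBQC as the
    // controls for CV plots in the batch report.
    batch_report.control_key    = 'sample_type'
    batch_report.control_values = ['IBQC']
}
```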

The profiles Section

The example configuration file includes this profiles section:

profiles {

    // "standard" is the profile used when the steps of the workflow are run
    // locally on your computer. These parameters should be changed to match
    // your system resources (that you are willing to devote to running
    // workflow jobs).
    standard {
        params.max_memory = '8.GB'
        params.max_cpus = 4
        params.max_time = '240.h'

        params.mzml_cache_directory = '/data/mass_spec/nextflow/nf-skyline-dia-ms/mzml_cache'
        params.panorama_cache_directory = '/data/mass_spec/nextflow/panorama/raw_cache'
    }
}

These parameters describe the capability of your local computer for running the steps of the workflow. Below is a description of each parameter:

Parameters for the profiles/standard section

Req?

Parameter Name

Description

params.max_memory

The maximum amount of RAM that may be used by steps of the workflow. Default: 8 gigabytes.

params.max_cpus

The number of cores that may be used by the workflow. Default: 4 cores.

params.max_time

The maximum amount of time a step in the workflow may run before it is stopped and an error is generated. Default: 240 hours.

params.mzml_cache_directory

When msconvert converts a RAW file to mzML, the mzML file is cached for future use. This specifies the directory in which the cached mzML files are stored.

params.panorama_cache_directory

If the RAW files to be processed are in PanoramaWeb, the RAW files will be downloaded to and cached in this directory for future use.

The process Section

In Nextflow the default compute resources allocated to a process can be adjusted in the process section using the withName selector. The following processes will dynamically adjust the requested memory and run time to fit the number and size of the files being processed. Nextflow will try to allocate resources using the formulas below up to the maximum values specified by params.max_memory, params.max_time and params.max_cpus.

Default resources for processes with custom labels

Process                    CPUs  Memory                                                                                  Walltime
DIANN_QUANT                8     Maximum of 16 GB and 2 times the sum of the sizes of the MS and spectral library files  2 hours
DIANN_MBR                  32    Maximum of 32 GB and 2 times the sum of the MS file sizes                               10 minutes times the number of MS files
BLIB_BUILD_LIBRARY         2     Maximum of 8 GB and 1.5 times the size of the precursor report file                     2 hours
ENCYCLOPEDIA_SEARCH_FILE   8     16 GB                                                                                   4 hours
ENCYCLOPEDIA_CREATE_ELIB   32    Maximum of 32 GB and 4 times the number of MS files                                     24 hours
SKYLINE_ADD_LIB            8     Maximum of 8 GB and 10 times the spectral library size                                  4 hours
SKYLINE_IMPORT_MS_FILE     8     Maximum of 8 GB and the sum of the MS file and skyline template with spectral library   2 hours
SKYLINE_MERGE_RESULTS      32    Maximum of 8 GB and 1.5 times the sum of the sizes of the .skyd files                   8 hours
SKYLINE_ANNOTATE_DOCUMENT  8     Maximum of 8 GB and 1.5 times the size of the skyline zip file                          4 hours
SKYLINE_RUN_REPORTS        8     Maximum of 8 GB and 1.5 times the size of the skyline zip file                          4 hours
MERGE_REPORTS              2     Maximum of 8 GB and the sum of the sizes of the precursor reports                       8 hours
FILTER_IMPUTE_NORMALIZE    8     Maximum of 8 GB and 2 times the size of the batch database                              4 hours
GENERATE_QC_QMD            2     Maximum of 8 GB and 2 times the size of the batch database                              2 hours
GENERATE_BATCH_REPORT      2     Maximum of 8 GB and 2 times the size of the batch database                              4 hours
EXPORT_TABLES              2     Maximum of 8 GB and 2 times the size of the batch database                              2 hours
RENDER_QC_REPORT           2     Maximum of 8 GB and 2 times the size of the batch database                              2 hours
EXPORT_GENE_REPORTS        2     Maximum of 8 GB and 2 times the size of the batch database                              2 hours

In most cases there is no need for users to adjust the default values. One instance where adjusting these parameters could be useful is to select the AWS batch queue to be used for a specific process. The DIANN_MBR process downloads all MS files to a single EC2 instance. In cases where large numbers of files are being processed the available disk space on the default EC2 instance might not be sufficient to hold all the MS files. The DIANN_MBR process can be set to run in a queue with more disk space by adding the following to the pipeline config.

process {
    withName:DIANN_MBR {
        queue = "nextflow_basic_ec2_1tb"
    }
}

The resource requirements allocated to a process can be fully customized by adding a withName selector to the process section of the pipeline config file. For example, to override the default memory and wall time for DIANN_MBR you could add the following to the pipeline config:

process {
    withName:DIANN_MBR {
        memory = 248.GB
        time = 48.h
    }
}

The mail Section

This is a more advanced and entirely optional set of parameters. When the workflow completes, it can optionally send an email to the address specified above in the params section. For this to work, the following parameters must be changed to match the settings of your email server. You may need to contact your IT department to obtain the appropriate settings.

The example configuration file includes this mail section:

mail {
    from = 'address@host.com'
    smtp.host = 'smtp.host.com'
    smtp.port = 587
    smtp.user = 'smtp_user'
    smtp.password = 'smtp_password'
    smtp.auth = true
    smtp.starttls.enable = true
    smtp.starttls.required = false
    mail.smtp.ssl.protocols = 'TLSv1.2'
}

Below is a description of each parameter:

Parameters for the mail section

Req?

Parameter Name

Description

from

The email address from which the email should be sent.

smtp.host

The internet address (host name or IP address) of the email SMTP server.

smtp.port

The port on the host to connect to. Most likely will be 587.

smtp.user

If authentication is required, this is the username.

smtp.password

If authentication is required, this is the password.

smtp.auth

Whether or not (true or false) authentication is required.

smtp.starttls.enable

Whether or not to enable TLS support.

smtp.starttls.required

Whether or not TLS is required.

smtp.ssl.protocols

SSL protocol to use for sending SMTP messages.