Analysis options

The analysis source is provided in the module proteoTorch.analyze. Available options are listed below.

Main Options

The following post-processing options are available when calling ProteoTorch from the command line.

  • --pin: input file in PIN format

  • --method: machine learning classifier to use during semi-supervised learning

    • Method 0: LDA

    • Method 1: linear SVM, solver TRON

    • Method 2: linear SVM, solver L2-SVM-MFN (Percolator’s solver)

    • Method 3: DNN (deep multi-layer perceptron, default value)

  • --output_dir: where to write result files. Default = model_output/<data_file_name>/<time_stamp>/

  • --tdc: Use target-decoy competition to assign q-values (true/false). Default = true

  • --numThreads: Number of CPU threads to use for parallelizable computations. Default = 1

  • --initDirection: If >= 0, specifies which feature to use as initial PSM scores during semi-supervised learning. If = -1, automatically find and use the most discriminative feature. Default = -1

  • --q: q-value tolerance when estimating positive training samples. Default = 0.01

  • --verbose: Verbosity. Default = 1

  • --output_per_iter_granularity: Number of iterations between successive writes of recalibrated PSM scores. Default = 5

  • --write_output_per_iter: Write recalibrated PSM scores after every output_per_iter_granularity iterations (true/false). Default = true

  • --maxIters: Number of semi-supervised learning iterations to run. Default = 20

  • --seed: Random seed when partitioning PSMs into cross-validation bins. Default = 1
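To make the role of --tdc and --q more concrete, the following is a minimal sketch of how q-values are commonly estimated from target and decoy scores. It is not ProteoTorch's own implementation: the function name tdc_qvalues is hypothetical, the sketch assumes higher scores are better, that competition has already kept one PSM per spectrum, and that the false discovery rate is estimated as decoys / targets with no further correction. Tied scores are not treated specially.

```python
def tdc_qvalues(scores, is_target):
    """Return one q-value per PSM, in input order (illustrative sketch only)."""
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Estimated FDR at each rank: decoys seen so far / targets seen so far.
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: the smallest FDR at this rank or any lower (more permissive) rank.
    qvals = [0.0] * len(scores)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvals[order[rank]] = running_min
    return qvals


scores = [5.0, 4.0, 3.0, 2.0, 1.0]
is_target = [True, True, False, True, False]
qvals = tdc_qvalues(scores, is_target)

# Targets within the tolerance (here 0.01, the documented default of --q)
# would be the candidates for positive training samples.
accepted = [i for i, q in enumerate(qvals) if is_target[i] and q <= 0.01]
```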

Deep learning options

  • --dnn_optimizer: DNN training algorithm to use (sgd or Adam). Default = Adam

  • --dnn_num_epochs: Number of epochs to train DNN. Default = 50

  • --deepq: DNN q-value tolerance when estimating positive training samples. Default = 0.07

  • --dnn_lr: DNN learning rate. Default = 0.001

  • --dnn_lr_decay: Total learning rate reduction over all epochs (dnn_lr_decay / dnn_num_epochs is applied after each epoch). Default = 0.02

  • --dnn_num_layers: Number of hidden DNN layers. Default = 3

  • --dnn_layer_size: Number of neurons per hidden layer. Default = 200

  • --starting_dropout_rate: Dropout rate for first iteration. Default = 0.5

  • --dnn_dropout_rate: Dropout rate for iterations > 1. Default = 0.0

  • --dnn_gpu_id: GPU ID to use for the DNN model (will switch to CPU mode if no GPU is found or CUDA is not installed). Default = 0

  • --dnn_label_smoothing_0: Label smoothing for training class 0 (decoys). Default = 0.99

  • --dnn_label_smoothing_1: Label smoothing for training class 1 (targets within q-value tolerance). Default = 0.99

  • --dnn_train_qtol: AUC q-value tolerance to measure validation performance. Default = 0.1

  • --false_positive_loss_factor: Multiplicative factor to weight false positives during training. Default = 4.0

  • --deepInitDirection: Produce initial PSM scores by training a large ensemble DNN to speed up training convergence (true/false). Default = true if method=3

  • --deep_direction_ensemble: Number of DNN ensembles to train during deep initial direction search. Default = 30

  • --load_previous_dnn: Start iterations from a previously saved model (true/false). Default = false

  • --previous_dnn_dir: Previous output directory containing trained DNN weights.
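As an illustration of how options from both lists might be combined, the sketch below assembles a command line in Python. Several details are assumptions rather than documented behavior: that the analysis can be launched as "python -m proteoTorch.analyze", that boolean options accept the literal strings true and false, and the input file name psms.pin, which is hypothetical. The helper build_command is likewise not part of ProteoTorch.

```python
import sys


def build_command(pin_path, **options):
    """Assemble an argument list from the documented flag names (sketch only)."""
    # Assumption: the module is runnable with "python -m proteoTorch.analyze".
    cmd = [sys.executable, "-m", "proteoTorch.analyze", "--pin", pin_path]
    for name, value in options.items():
        # Assumption: boolean flags are spelled "true" / "false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd += ["--" + name, str(value)]
    return cmd


cmd = build_command(
    "psms.pin",          # hypothetical input file
    method=3,            # DNN classifier
    numThreads=4,
    q=0.01,
    tdc=True,
    dnn_num_epochs=50,
    deepq=0.07,
)
print(" ".join(cmd))

# To actually launch the analysis, the list could be passed to
# subprocess.run(cmd, check=True) once the assumptions above are verified.
```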

Note on parallelization

Within each iteration of the algorithm, nested cross-validation (CV) is performed. If a DNN classifier is selected (i.e., --method 3), the CV folds are run sequentially. This safeguards against the GPU running out of memory and ProteoTorch crashing during analysis.

When an SVM is selected (i.e., --method 1 or --method 2), the CV folds are run in parallel using the number of CPU threads specified by --numThreads.
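The scheduling described above can be sketched roughly as follows. This is only an illustration: run_folds and fake_train are hypothetical names, a thread pool stands in for whatever parallel mechanism ProteoTorch actually uses, and because the text does not state how LDA (--method 0) is scheduled, the sketch simply runs it sequentially.

```python
from concurrent.futures import ThreadPoolExecutor


def run_folds(train_fold, folds, method, num_threads=1):
    """Run train_fold on every CV fold and return results in fold order."""
    if method in (1, 2):
        # SVM classifiers: folds are independent, so spread them over threads.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return list(pool.map(train_fold, folds))
    # DNN (method 3): one fold at a time, so only one model occupies GPU memory.
    # Method 0 also falls through to here (an assumption of this sketch).
    return [train_fold(fold) for fold in folds]


def fake_train(fold):
    """Stand-in for training a classifier on one fold."""
    return fold * 2


folds = [0, 1, 2]
svm_results = run_folds(fake_train, folds, method=2, num_threads=3)
dnn_results = run_folds(fake_train, folds, method=3)
```

Both calls return results in fold order, so downstream code does not need to know which scheduling path was taken.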