The performance of the CatPred model generated by reproduce_training.sh #39

@zunyun-Gong

Thank you very much for sharing this excellent work and for providing such a detailed manual. I have some questions about reproducing the results. I downloaded CatPred-DB from GitHub and followed the instructions in reproduce_training.sh to reproduce CatPred for the benchmark. However, the model performance is not as good as reported in your paper. The command line is:
python train.py --protein_records_path kcat-random_trainvaltest.json --data_path removeX_random_kcat_data/catpred_db.kcat.random_trainval.csv --separate_test_path removeX_random_kcat_data/catpred_db.kcat.random_test.csv --separate_val_path removeX_random_kcat_data/catpred_db.kcat.random_val.csv --dataset_type regression --smiles_columns reactant_smiles --target_columns log10kcat_max --loss_function mve --seq_embed_dim 36 --seq_self_attn_nheads 6 --add_esm_feats --save_dir outdir --extra_metrics mae mse r2 --epochs 30 --batch_size 16 > capred_log
The evaluated result is:
Model 0 best validation rmse = 0.730834 on epoch 29
MoleculeModel(
  (softplus): Softplus(beta=1.0, threshold=20.0)
  (encoder): MPN(
    (encoder): ModuleList(
      (0): MPNEncoder(
        (dropout): Dropout(p=0.0, inplace=False)
        (act_func): ReLU()
        (W_i): Linear(in_features=147, out_features=300, bias=False)
        (W_h): Linear(in_features=300, out_features=300, bias=False)
        (W_o): Linear(in_features=433, out_features=300, bias=True)
      )
    )
  )
  (seq_embedder): Embedding(21, 36, padding_idx=20)
  (rotary_embedder): RotaryEmbedding()
  (multihead_attn): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=36, out_features=36, bias=True)
  )
  (attentive_pooler): AttentivePooling(
    (linear1): Linear(in_features=1316, out_features=1316, bias=True)
    (tanh): Tanh()
    (linear2): Linear(in_features=1316, out_features=1, bias=True)
    (softmax): Softmax(dim=1)
  )
  (readout): Sequential(
    (0): Dropout(p=0.0, inplace=False)
    (1): Linear(in_features=1616, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=300, out_features=2, bias=True)
  )
)
Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "seq_embedder.weight".
Loading pretrained parameter "rotary_embedder.freqs".
Loading pretrained parameter "multihead_attn.in_proj_weight".
Loading pretrained parameter "multihead_attn.in_proj_bias".
Loading pretrained parameter "multihead_attn.out_proj.weight".
Loading pretrained parameter "multihead_attn.out_proj.bias".
Loading pretrained parameter "attentive_pooler.linear1.weight".
Loading pretrained parameter "attentive_pooler.linear1.bias".
Loading pretrained parameter "attentive_pooler.linear2.weight".
Loading pretrained parameter "attentive_pooler.linear2.bias".
Loading pretrained parameter "readout.1.weight".
Loading pretrained parameter "readout.1.bias".
Loading pretrained parameter "readout.4.weight".
Loading pretrained parameter "readout.4.bias".
Moving model to cuda
Creating protein model
Model 0 test rmse = 1.103250
Model 0 test mae = 0.755246
Model 0 test mse = 1.217161
Model 0 test r2 = 0.556943
Ensemble test rmse = 1.103250
Ensemble test mae = 0.755246
Ensemble test mse = 1.217161
Ensemble test r2 = 0.556943
The R2 is lower than the value reported in your paper. Furthermore, the reproduced UniKP model has the same problem. The command line is:
python external/UniKP/UniKP_Kcat_v2.py > unikp_log && python ./external/UniKP/parse_logs.py unikp_log unikp_log_parsed
I only changed the input file, as mentioned above. The result is:
Test,R2_mean,R2_stderr,MAE_mean,MAE_stderr,p1mag_mean,p1mag_stderr
Heldout,0.5827,0.0006,0.7502,0.0005,74.96,0.17
CLUSTER_99,0.3192,0.0017,1.0461,0.0015,61.03,0.41
CLUSTER_80,0.2658,0.002,1.0613,0.0017,61.43,0.44
CLUSTER_60,0.2152,0.0022,1.1326,0.0021,57.8,0.49
CLUSTER_40,0.1321,0.0025,1.2235,0.0023,54.91,0.54
The performance of the reproduced UniKP model on the CLUSTER_* test sets is also lower than the values reported in Supplementary Table 8.
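As a cross-check on the numbers above, here is a minimal sketch of how RMSE, MAE, MSE, R2 and p1mag can be computed from log10 kcat targets and predictions. The function name `regression_metrics` and the toy values are hypothetical. The p1mag definition (percentage of predictions within one log10 unit of the target) is an assumption about what parse_logs.py reports, not something confirmed from the repository.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MSE, R2 and p1mag for log10-scale targets.

    p1mag is ASSUMED here to mean the percentage of predictions whose
    absolute error is below one log10 unit (one order of magnitude).
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    p1mag = 100.0 * sum(abs(e) < 1.0 for e in errors) / n
    return {"rmse": math.sqrt(mse), "mae": mae, "mse": mse, "r2": r2, "p1mag": p1mag}


# Toy values for illustration only; these are not CatPred data.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.5, 4.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Sanity check on the reported log: the test RMSE squared matches the test MSE.
assert math.isclose(1.103250 ** 2, 1.217161, abs_tol=1e-5)
```

The reported test RMSE and MSE are consistent with each other, so the gap to the paper is unlikely to come from a metric mix-up in the CatPred log itself. Because R2 depends on the variance of the targets in the test set, it can shift when the test set changes even if the absolute errors stay similar.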

Regarding the input data, I only removed the entries that contain "X" in the enzyme sequences. Could you provide more details about the model training and data processing? This would help me improve the model's performance and use it correctly.
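For clarity, the filtering step described above can be sketched roughly as follows. This is only an illustration: the column name `sequence` and the helper `drop_ambiguous_sequences` are hypothetical (the real CatPred-DB column may be named differently), and the rows are toy data. Only `reactant_smiles` and `log10kcat_max` are taken from the training command.

```python
import csv
import io


def drop_ambiguous_sequences(rows, seq_column="sequence"):
    """Keep only rows whose enzyme sequence contains no 'X' residue.

    seq_column is a HYPOTHETICAL column name used for illustration.
    """
    return [row for row in rows if "X" not in row[seq_column].upper()]


# Toy CSV standing in for one of the split files; not real CatPred-DB data.
toy_csv = io.StringIO(
    "sequence,reactant_smiles,log10kcat_max\n"
    "MKTAYIAK,CCO,1.2\n"
    "MKXAYIAK,CCO,0.7\n"
    "MSLLTEVE,CC(=O)O,-0.3\n"
)
rows = list(csv.DictReader(toy_csv))
kept = drop_ambiguous_sequences(rows)
print(f"kept {len(kept)} of {len(rows)} rows")
```

Counting the rows before and after filtering for each of the train, validation and test files shows how far the filtered splits differ from the released ones, which may matter when comparing against the published metrics.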

Thank you!
