Performance of the CatPred model generated by reproduce_training.sh #39

Thank you very much for sharing this excellent work and for providing such a detailed manual. I have a problem reproducing the reported results. I downloaded CatPred-DB from GitHub and followed the instructions in reproduce_training.sh to reproduce the CatPred benchmark. However, the model's performance is not as good as reported in your paper. The command line is:
`python train.py --protein_records_path kcat-random_trainvaltest.json --data_path removeX_random_kcat_data/catpred_db.kcat.random_trainval.csv --separate_test_path removeX_random_kcat_data/catpred_db.kcat.random_test.csv --separate_val_path removeX_random_kcat_data/catpred_db.kcat.random_val.csv --dataset_type regression --smiles_columns reactant_smiles --target_columns log10kcat_max --loss_function mve --seq_embed_dim 36 --seq_self_attn_nheads 6 --add_esm_feats --save_dir outdir --extra_metrics mae mse r2 --epochs 30 --batch_size 16 > capred_log `
The evaluation output is `Model 0 best validation rmse = 0.730834 on epoch 29
MoleculeModel(
  (softplus): Softplus(beta=1.0, threshold=20.0)
  (encoder): MPN(
    (encoder): ModuleList(
      (0): MPNEncoder(
        (dropout): Dropout(p=0.0, inplace=False)
        (act_func): ReLU()
        (W_i): Linear(in_features=147, out_features=300, bias=False)
        (W_h): Linear(in_features=300, out_features=300, bias=False)
        (W_o): Linear(in_features=433, out_features=300, bias=True)
      )
    )
  )
  (seq_embedder): Embedding(21, 36, padding_idx=20)
  (rotary_embedder): RotaryEmbedding()
  (multihead_attn): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=36, out_features=36, bias=True)
  )
  (attentive_pooler): AttentivePooling(
    (linear1): Linear(in_features=1316, out_features=1316, bias=True)
    (tanh): Tanh()
    (linear2): Linear(in_features=1316, out_features=1, bias=True)
    (softmax): Softmax(dim=1)
  )
  (readout): Sequential(
    (0): Dropout(p=0.0, inplace=False)
    (1): Linear(in_features=1616, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=300, out_features=2, bias=True)
  )
)
Loading pretrained parameter "encoder.encoder.0.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.0.W_i.weight".
Loading pretrained parameter "encoder.encoder.0.W_h.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.weight".
Loading pretrained parameter "encoder.encoder.0.W_o.bias".
Loading pretrained parameter "seq_embedder.weight".
Loading pretrained parameter "rotary_embedder.freqs".
Loading pretrained parameter "multihead_attn.in_proj_weight".
Loading pretrained parameter "multihead_attn.in_proj_bias".
Loading pretrained parameter "multihead_attn.out_proj.weight".
Loading pretrained parameter "multihead_attn.out_proj.bias".
Loading pretrained parameter "attentive_pooler.linear1.weight".
Loading pretrained parameter "attentive_pooler.linear1.bias".
Loading pretrained parameter "attentive_pooler.linear2.weight".
Loading pretrained parameter "attentive_pooler.linear2.bias".
Loading pretrained parameter "readout.1.weight".
Loading pretrained parameter "readout.1.bias".
Loading pretrained parameter "readout.4.weight".
Loading pretrained parameter "readout.4.bias".
Moving model to cuda
Creating protein model
Model 0 test rmse = 1.103250
Model 0 test mae = 0.755246
Model 0 test mse = 1.217161
Model 0 test r2 = 0.556943
Ensemble test rmse = 1.103250
Ensemble test mae = 0.755246
Ensemble test mse = 1.217161
Ensemble test r2 = 0.556943` 
The test R2 (0.557) is lower than the value reported in your paper. Furthermore, the reproduced UniKP model has the same problem.
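To rule out a difference in how the metrics are defined, here is a minimal sketch of how the four logged metrics could be recomputed from a list of predictions. It assumes the standard definitions (R2 as the coefficient of determination); I have not confirmed that `train.py` computes them the same way. The function name `regression_metrics` and the toy values are only for illustration. As a sanity check, the logged values are at least internally consistent: 1.103250 squared is about 1.217161, which matches the logged mse.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MSE and R2 for two equal-length lists of floats.

    Standard textbook definitions; an assumption, not taken from train.py.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": math.sqrt(mse), "mae": mae, "mse": mse, "r2": r2}


# Toy values, not taken from the CatPred test set.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

If the values recomputed this way from saved test predictions differ from the logged ones, the gap may come from the metric definition rather than from the model.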
The command is `python external/UniKP/UniKP_Kcat_v2.py > unikp_log && python ./external/UniKP/parse_logs.py unikp_log unikp_log_parsed`, and I only changed the input file, as described above. The result is:

| Test | R2_mean | R2_stderr | MAE_mean | MAE_stderr | p1mag_mean | p1mag_stderr |
| --- | --- | --- | --- | --- | --- | --- |
| Heldout | 0.5827 | 0.0006 | 0.7502 | 0.0005 | 74.96 | 0.17 |
| CLUSTER_99 | 0.3192 | 0.0017 | 1.0461 | 0.0015 | 61.03 | 0.41 |
| CLUSTER_80 | 0.2658 | 0.002 | 1.0613 | 0.0017 | 61.43 | 0.44 |
| CLUSTER_60 | 0.2152 | 0.0022 | 1.1326 | 0.0021 | 57.8 | 0.49 |
| CLUSTER_40 | 0.1321 | 0.0025 | 1.2235 | 0.0023 | 54.91 | 0.54 |

The performance of the reproduced UniKP model on the `CLUSTER_*` datasets is also lower than the values you report in Supplementary Table 8.

<img width="1020" height="375" alt="Image" src="https://github.com/user-attachments/assets/51035dfe-7c97-4eee-9e5c-1885bb027e46" />
Regarding the input data, I only removed the entries whose enzyme sequences contain "X".
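For clarity, this kind of filter can be sketched as follows. This is an illustrative sketch and not the exact script I ran: the column name `sequence` is a placeholder, the toy rows are made up, and in the real data the sequences may live in the protein records JSON rather than in the CSV. Only the target column name `log10kcat_max` is taken from the command above.

```python
import csv
import io


def drop_sequences_with_x(rows, seq_column="sequence"):
    """Keep only rows whose sequence contains no 'X' residue.

    The check is case-insensitive, so a lowercase 'x' is also treated as unknown.
    `seq_column` is a hypothetical column name.
    """
    return [row for row in rows if "X" not in row[seq_column].upper()]


# Toy CSV content standing in for the real data file.
toy_csv = io.StringIO(
    "sequence,log10kcat_max\n"
    "MKTAY,1.2\n"
    "MKXAY,0.3\n"
    "GGSLV,-0.5\n"
)
rows = list(csv.DictReader(toy_csv))
kept = drop_sequences_with_x(rows)
print(len(rows), "rows before filtering,", len(kept), "rows after")
```

If your preprocessing handles "X" differently (for example, by keeping such entries), the train, validation, and test sets I used would differ from yours, which might explain part of the gap.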
Could you provide more details about the model training and data processing? This would help me improve the model’s performance and use it correctly.

Thank you!
 

