Hey! As the title says, I was trying to reproduce the results from the paper. I hooked the repo up to lm-eval and ran HumanEval with the same configuration given in the appendix, but all variants came out about 10 points lower in accuracy. Could you release the evaluation code? Much appreciated.