Fix to only transform raw data when requested. #268
pritamdodeja wants to merge 1 commit into tensorflow:master from
When main is invoked with read_raw_data_for_training set to False, common.transform_data was still being called on the raw train and test data. This fix moves the transformation into the block where read_raw_data_for_training is True. The scenario is that the data has already been preprocessed and the user wishes to re-use that preprocessed data.
Thanks for the PR! It's true that transforming the raw data, serializing and writing the results is unnecessary when transform/examples/census_example_common.py Lines 210 to 258 in 2ac89ab We can even always call
The behavior of the default usage won't change, as so To address your questions, yes, this would be the second invocation of
I am pretty new to GitHub and distributed development, so apologies if I'm not structuring my questions and suggestions properly. Thanks!
Scenario
When pre-processed data is already available and the user wants to re-use it by passing read_raw_data_for_training=False to main, the flow was still calling common.transform_data on the raw data. This caused WriteTransformFn to fail because artifacts already existed at the output location, and it also recomputed statistics and other analysis results unnecessarily.
Fix
This fix moves the common.transform_data invocation into the block where read_raw_data_for_training is True, so it runs only when the raw data is being processed for the first time.
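The control flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual code in the PR: run_pipeline, the nested transform_data stub, and train_and_evaluate are hypothetical stand-ins, and only the read_raw_data_for_training flag and the idea of guarding the transform call come from the description.

```python
from typing import List


def run_pipeline(read_raw_data_for_training: bool) -> List[str]:
    """Hypothetical stand-in for main; returns the steps it ran, in order."""
    steps: List[str] = []

    def transform_data() -> None:
        # Stand-in for common.transform_data (assumed here to analyze the
        # raw data and write transform artifacts to disk).
        steps.append("transform_data")

    def train_and_evaluate() -> None:
        # Hypothetical downstream step that consumes the transformed data.
        steps.append("train_and_evaluate")

    if read_raw_data_for_training:
        # Only transform when the raw data is processed for the first time.
        transform_data()

    # With the flag set to False, previously written artifacts are re-used.
    train_and_evaluate()
    return steps


if __name__ == "__main__":
    print(run_pipeline(True))
    print(run_pipeline(False))
```

With the guard in place, the transform step is skipped entirely when the flag is False, so nothing attempts to write over existing artifacts.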