Conversation
fealho left a comment:
In general I think this looks good. @pvk-developer @amontanez24 what do you think?
    ctgan.sample(1, 'discrete', "d")
    ...
    def test_ctgan_data_transformer_params():
I think you should also add a performance test, something simple just to make sure that our results are not worse than before because of this change.
I'm not sure about this one. Do you mean a performance test of the Gaussian mixture model or of CTGAN? In terms of speed or accuracy?
Accuracy for CTGAN. Basically, just a test to make sure the changes don't break the code. So something like changing your continuous column to be a normal distribution, instead of random, then sample from the model (after you fit) and make sure the samples loosely follow a normal distribution.
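For reference, a minimal sketch of such a check, assuming the CTGANSynthesizer fit/sample API referenced elsewhere in this PR; the epoch count and the tolerances are made-up placeholders, not values from the codebase:

```python
import numpy as np
import pandas as pd

from ctgan import CTGANSynthesizer  # class name as referenced in this PR


def test_ctgan_on_normal_continuous_column():
    # Continuous column drawn from N(0, 1), plus a small discrete column.
    data = pd.DataFrame({
        'continuous': np.random.normal(size=1000),
        'discrete': np.random.choice(['a', 'b'], size=1000),
    })

    model = CTGANSynthesizer()
    model.fit(data, discrete_columns=['discrete'], epochs=300)

    sampled = model.sample(1000)

    # Loose sanity checks: the sampled column should roughly follow N(0, 1).
    assert abs(sampled['continuous'].mean()) < 0.3        # placeholder tolerance
    assert abs(sampled['continuous'].std() - 1.0) < 0.3   # placeholder tolerance
```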
ctgan/synthesizers/ctgan.py (outdated):

    - def fit(self, train_data, discrete_columns=tuple(), epochs=None):
    + def fit(self, train_data, discrete_columns=tuple(), epochs=None,
    +         data_transformer_params={}):
The data_transformer_params should be moved to the __init__ and assigned as self.data_transformer_params. (Use deepcopy if needed.)
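A minimal sketch of the suggested refactor, not the actual CTGAN code; only the pieces relevant to this comment are shown, and it assumes DataTransformer (from the ctgan/data_transformer.py module discussed below) accepts its options as keyword arguments:

```python
from copy import deepcopy

from ctgan.data_transformer import DataTransformer  # module path taken from the diff below


class CTGANSynthesizer:

    def __init__(self, data_transformer_params=None):
        # Default to None rather than {} to avoid a mutable default argument,
        # and keep our own copy so later mutations by the caller cannot leak in.
        self._data_transformer_params = deepcopy(data_transformer_params or {})

    def fit(self, train_data, discrete_columns=tuple(), epochs=None):
        # The transformer is now configured from the params stored at __init__ time,
        # so the fit signature stays unchanged.
        self._transformer = DataTransformer(**self._data_transformer_params)
        self._transformer.fit(train_data, discrete_columns)
        # ... rest of the training loop unchanged ...
```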
ctgan/data_transformer.py (outdated):

    def _fit_continuous(self, column_name, raw_column_data):
        """Train Bayesian GMM for continuous column."""
        if self._max_gm_samples <= raw_column_data.shape[0]:
            raw_column_data = np.random.choice(raw_column_data,
I think that for this kind of line breaking, this indentation is better:

    raw_column_data = np.random.choice(
        raw_column_data,
        size=self._max_gm_samples,
        replace=False
    )
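Putting that suggestion back into the surrounding method, the guarded subsampling would read roughly as below (a sketch only; self._max_gm_samples and the docstring come from the diff above, and the method is shown outside its class for brevity):

```python
import numpy as np


def _fit_continuous(self, column_name, raw_column_data):
    """Train Bayesian GMM for continuous column."""
    # Subsample only when the column has at least self._max_gm_samples rows,
    # so fitting the mixture model stays tractable on large datasets.
    if self._max_gm_samples <= raw_column_data.shape[0]:
        raw_column_data = np.random.choice(
            raw_column_data,
            size=self._max_gm_samples,
            replace=False
        )
    # ... fit the Bayesian GMM on raw_column_data as before ...
```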
@fealho @pvk-developer

@npatki not sure what you want to do with this?
Meanwhile, the library code has changed, so the PR should be updated (for example, the …).

Also, I wonder whether ClusterBasedNormalizer could be optionally replaced by a power transform, which might be faster (although it might impact the quality of the generated data); see sdv-dev/RDT#613.
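To illustrate the kind of drop-in alternative being suggested (this is scikit-learn's PowerTransformer, used here purely as an example; it is not the RDT implementation discussed in sdv-dev/RDT#613):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed continuous column; sklearn expects a 2D array.
column = np.random.lognormal(size=1000).reshape(-1, 1)

# Yeo-Johnson (the default method) also accepts non-positive values, unlike Box-Cox.
pt = PowerTransformer()
normalized = pt.fit_transform(column)        # output is roughly standard normal
restored = pt.inverse_transform(normalized)  # back to the original scale
```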
This PR solves issue #7. It allows two things:

- Limiting the number of rows used to fit the Bayesian GMM for continuous columns (max_gm_samples).
- Passing parameters to the DataTransformer through CTGANSynthesizer.fit (and so changing other parameters such as max_clusters); see the usage sketch below.
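A hedged usage sketch of what the second point enables, based on the fit signature shown in the diff above; note that passing data_transformer_params at fit time is what this PR proposes (one reviewer suggests moving it to __init__ instead), and the column names and max_clusters value are made up for illustration:

```python
import numpy as np
import pandas as pd

from ctgan import CTGANSynthesizer  # class name as used in this PR's description

train_data = pd.DataFrame({
    'continuous': np.random.normal(size=500),
    'discrete': np.random.choice(['a', 'b'], size=500),
})

ctgan = CTGANSynthesizer()
ctgan.fit(
    train_data,
    discrete_columns=['discrete'],
    # Forwarded to DataTransformer; max_clusters is the example given in the description.
    data_transformer_params={'max_clusters': 5},
)
samples = ctgan.sample(100)
```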