From ee4ad522f5f7ef9d908e4a3ec297774fb05f1837 Mon Sep 17 00:00:00 2001
From: Louis Moureaux
Date: Mon, 6 Apr 2026 21:29:41 +0200
Subject: [PATCH] Add initial dataset publication docs

---
 content/tutorials/dataset_publication.md | 148 +++++++++++++++++++++++
 mkdocs.yml                               |   1 +
 2 files changed, 149 insertions(+)
 create mode 100644 content/tutorials/dataset_publication.md

diff --git a/content/tutorials/dataset_publication.md b/content/tutorials/dataset_publication.md
new file mode 100644
index 0000000..b6ddf11
--- /dev/null
+++ b/content/tutorials/dataset_publication.md
@@ -0,0 +1,148 @@
+Publishing datasets
===================

The machine learning group encourages the publication of open datasets focusing
on specific tasks of interest to both CMS and the wider ML community. They can
serve as the basis for community challenges or simply support the findings of ML
papers. In general, consider publishing a dataset if:

- You have at least one use case in mind, ideally related to a CMS publication;
- You can identify a few non-CMS groups who would likely use it;
- A similar dataset doesn't exist yet.

If the three conditions above are met, releasing a public dataset is **strongly
encouraged.** It is also possible in other cases.

The formal publication procedure is documented by the [Open Data Team]. The goal
of this page is to help you put together a dataset and useful documentation to
maximize its impact. We assume that the dataset is built from simulated and/or
data events; guidelines for publishing ML models can be found [here][ML models].

[Open Data Team]: https://twiki.cern.ch/twiki/bin/viewauth/CMS/DPOAMLSampleReleaseGuidelines
[ML models]: ??


Publication format
------------------

Any easily readable file format can be used. This includes ROOT, HDF5, and text
files. Avoid formats that require a specific library or tool to read, e.g.,
NumPy and Pickle files.
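As an illustration, the sketch below writes a small HDF5 file with `h5py`. The variable names (`jet_pt`, etc.) and values are placeholders, not a CMS convention; the point is that attributes and compression make the file self-documenting and compact:

```python
# Sketch: a small, self-documenting HDF5 file written with h5py.
# All names and values here are illustrative placeholders.
import h5py
import numpy as np

with h5py.File("example_dataset.h5", "w") as f:
    # Event identifiers (see below): essential for tracking the contents
    # of overlapping datasets and for reproducibility.
    f.create_dataset("run", data=np.array([1, 1, 1], dtype=np.uint32))
    f.create_dataset("luminosityBlock", data=np.array([10, 10, 11], dtype=np.uint32))
    f.create_dataset("event", data=np.array([1001, 1002, 2001], dtype=np.uint64))

    # Physics content, compressed to keep the file compact.
    jet_pt = f.create_dataset(
        "jet_pt",
        data=np.array([45.2, 102.7, 33.1], dtype=np.float32),
        compression="gzip",
    )
    # Attributes make the file self-documenting: anyone inspecting it
    # sees the units and the meaning of the variable.
    jet_pt.attrs["units"] = "GeV"
    jet_pt.attrs["description"] = "Transverse momentum of the leading jet"
```

A reader can then discover the structure with nothing more than `h5ls` or a few lines of Python, without any project-specific code.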
The ideal file format is:

- **Easy to use:** The required preprocessing should be reduced to a minimum.
- **Self-documenting:** Inspecting the file should be sufficient to understand
  its contents.
- **Compact:** People are going to download it; there is no need to waste space
  on their precious disk quota.

If a similar dataset has previously been published, try to reuse the same layout
as much as possible. People will thank you if they don't need to adjust their
code.

Regardless of the file format, always include the luminosity block and event
number (and, for collision data, the run number). This can be done in a separate
file. It is essential for tracking the contents of overlapping datasets and
improves reproducibility.

The file contents should be **thoroughly documented,** even if they replicate
existing datasets. Every dataset sharing platform has a short README file. This
is the entry point of your documentation. It should contain answers to the
following questions:

### Which task is the dataset meant to be used for?

You should write 1--2 sentences about the intended use of the data.


### What do the files contain?

Is it data or simulation? Which physics processes were simulated? Do the files
contain events, jets, detector hits, or something else?

Just like in a paper, it is very important to define every variable. Do not
assume that the reader knows anything about CMS. Even a variable name as simple
as `pT` needs to be explained.

An effective way to do this is to include a table of definitions, referring to a
paper for longer explanations. A complementary way is to provide a simple
example Python notebook performing the main task for which the dataset is
intended. You should ideally do both.

A good test of the completeness of the documentation is to ask a student or
colleague to use the dataset (the less they know about what you are doing, the
better).
Any question they ask is a sign that something is missing.


### How were the files produced?

A dataset is only truly open if it can be extended. For instance, a theorist may
want to check what their BSM model does in your phase space, or someone may want
to include additional variables. Supporting these use cases requires extensive
documentation of how the dataset was produced.

Usually, some steps of the production will have been done centrally. For CMS
open data, this is documented in the dataset record itself
([example](https://opendata.cern.ch/record/75601)). We suggest using a similar
format; contact the [DPOA Team] to get the required information. Of course, if
the source dataset is already (even partially) open, you can simply link to it.

The scripts used to create the final files should also be published, even if the
source dataset is not public. Someone with knowledge of CMS data formats can use
them to understand the file format; this includes most AI assistants. The text
description of the variables will inevitably contain ambiguous statements, maybe
even inaccuracies. Code generally will not.

The code can be included in the record itself, uploaded separately, or linked
from a Git repository. When republishing code, make sure that the license allows
it. Code without a license cannot legally be copied!

[DPOA Team]: https://twiki.cern.ch/twiki/bin/view/CMS/DataPreservationOpenAccess


### Links to papers and other datasets

A public dataset is a scientific "byproduct". Funding agencies recognize it as a
kind of publication, like patents and outreach materials. As such, it should
follow scientific best practices, including citations.

Citation standards for datasets are not as high as for papers. Since datasets
are usually published alongside a paper, a reference to that paper goes a long
way.

There is an exception for source datasets.
If your dataset is derived from
central datasets that are also available as open data, this should be encoded in
the record. The best way to do so depends on the platform:

- The CERN [Open Data Portal] only supports hyperlinks. The paper and source
  datasets should be linked in the description.
- [Zenodo] additionally supports [Dublin Core] structured metadata. The paper
  can be linked using the `Is supplement to` relationship. Source datasets
  should be encoded with the `Is derived from` relationship, using the DOI found
  on the corresponding Open Data page.

[Dublin Core]: https://www.dublincore.org/specifications/dublin-core/


Where to publish?
-----------------

The two main options are the CERN [Open Data Portal] and [Zenodo]. Which one to
choose depends on your target community, but also on the format of your dataset,
its size, the number of files, and the expected access pattern.

The **Open Data Portal** is best suited for:

- MiniAOD, NanoAOD, and other ROOT-based formats. The Portal already contains a
  large collection of MiniAOD and NanoAOD files, and it supports XRootD access.
- Large datasets. Zenodo has a default limit of 50 GB per record; the Portal
  allows virtually any size.
- Datasets whose community already uses the Portal.

For smaller datasets, the choice between **Zenodo** and the Open Data Portal
will primarily depend on the target community. Zenodo is marginally easier to
use and is better at tracking relationships to papers and other datasets. Prefer
Zenodo if your dataset is a compilation of events from many other datasets
(e.g., different physics processes) and at least some of them have already been
released.
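As a sketch of how these relationships look in practice, the snippet below builds the metadata for a Zenodo deposit with `related_identifiers` entries. All DOIs, titles, and names are placeholders; the field and relation names follow Zenodo's deposit API:

```python
# Sketch: Zenodo deposit metadata encoding dataset relationships.
# All identifiers below are placeholders, not real DOIs.
metadata = {
    "metadata": {
        "title": "Example derived CMS dataset",
        "upload_type": "dataset",
        "description": "Events derived from CMS open data; see related identifiers.",
        "creators": [{"name": "Doe, Jane", "affiliation": "CMS"}],
        "related_identifiers": [
            # The paper this dataset supplements ("Is supplement to").
            {"identifier": "10.0000/placeholder-paper", "relation": "isSupplementTo"},
            # The open-data source dataset ("Is derived from"), using the DOI
            # from the corresponding Open Data record.
            {"identifier": "10.0000/placeholder-source", "relation": "isDerivedFrom"},
        ],
    }
}
# This dict would then be sent to the deposit API, e.g. with requests:
#   requests.put(f"https://zenodo.org/api/deposit/depositions/{deposit_id}",
#                json=metadata, params={"access_token": TOKEN})
```

The same relationships can also be entered by hand in the Zenodo web form when creating the record.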
+ +[Open Data Portal]: https://opendata.cern.ch +[Zenodo]: https://zenodo.org diff --git a/mkdocs.yml b/mkdocs.yml index f074500..70c5b49 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -130,6 +130,7 @@ nav: - ml.cern.ch: resources/gpu_resources/cms_resources/ml_cern_ch.md - Tutorials: - NN in CMS: tutorials/nn_in_cms.md + - Dataset publication: tutorials/dataset_publication.md - Guides: - Software environments: - LCG environments: software_envs/lcg_environments.md