This document is essentially a summary of my thoughts and plans, as well as the result of multiple discussions in our group on goals, non-goals and general requirements of the new serialization library.
Scribe is a general-purpose serialization library with a focus on large numerical datasets.
Scribe does not specify an on-disk file format. Instead it can use a selection of well-known formats like `.json` and `.hdf5`. Scribe specifies data-schemas that describe the layout of the data. This schema is independent of the file-format, thus it can in principle be used to convert between file-formats (though this is not the main focus of the library). A simple schema might look like this:

    {
      "schema_name": "MyData",
      "schema_description": "A longer, human description goes here",

      // most schemas are expected to be dictionaries at top level
      "type": "dict",
      "items": [
        {
          "key": "foo",
          "type": "int32"
        },
        {
          "key": "bar",
          "optional": true,
          "type": "array",
          "rank": 1,
          "elements": { "type": "float64" }
        }
      ]
    }

Scribe can be used in multiple modes:

- As a command line tool. For example

      scribe validate schema.json data.hdf5

  checks whether a data-file adheres to a given schema.
- As a C++ library:

      #include <fmt/core.h>
      #include "scribe/scribe.h"

      int main()
      {
          auto schema = Scribe::Schema("my_schema.json");
          Scribe::Tome data = Scribe::read("data.hdf5", schema);
          fmt::print("{}\n", data["foo"].get<int>());
      }

  Note here that `Scribe::Tome` is a generic runtime data-object which can hold arbitrary amounts of nested data, similar to `nlohmann::json`.
- As a C++ code generator. Calling `scribe codegen-cpp myschema.json` will generate a header file

      struct MyData
      {
          int32_t foo;
          std::vector<double> bar;

          static MyData read(std::string_view filename);
          void write(std::string_view filename);
      };

  as well as implementations for the read/write functions.

- `json-schema` is purely for validation, whereas Scribe also does data read/write.
- `json-schema` can only describe json data, not more low-level formats. For example it cannot reasonably distinguish between single- and double-precision floating point numbers.
- `json-schema` lacks an efficient understanding of multi-dimensional homogeneous numerical arrays, which are crucial to our usecases in scientific computations.

- `ProtoBuffers` are mainly concerned with reading and writing its own (platform-independent, but otherwise opaque) binary format. Scribe is concerned with describing existing data in well-known formats like `hdf5` and `json`.
- While large numerical arrays can in principle be stored efficiently in a ProtoBuffer (using the `repeated` keyword), multi-dimensional arrays are not easy to implement. Scribe aims for an interface that should feel very natural to experienced users of `hdf5` or NumPy's `ndarray`.

Every schema has the fields:

- `type`, which must be either
  - one of the builtin types `boolean`, `string`, `int{8,16,32,64}`, `uint{8,16,32,64}`, `float{32,64}`, `complex{64,128}`, `array`, `dict`, `variant`
  - one of the place-holders `none` (which is a schema that always rejects), or `any` (which accepts arbitrary data).
  - a sub-schema, identified by name, prefixed by `$`
  - if `type` is not present, it defaults to `any`
- Optional meta-data
  - `schema_name`: This can be used to identify the schema for re-use.
  - `schema_description`: human-readable description without any specific semantics
  - TBD: additional meta-data like "example", "default", "author", "version" are possible. Needs more discussion, as some of these could carry semantic meaning, while others don't.
- Nested types (`array`, `dict`, and `variant`) contain a schema that describes each of its constituents.
- Optionally, additional constraints that narrow the range of allowed values (e.g. min/max values for numbers, shape/size of arrays).
- Optionally, file-format specific hints on storage. E.g., `chunk_size` for hdf5 files.

Numeric types are identified by `"type"` being one of `int{8,16,32,64}`, `uint{8,16,32,64}`, `float{32,64}`, `complex{64,128}`. Possible constraints are `min_value` and `max_value`, which specify the allowed range of values.

Example:

    {
      "type": "int32",
      "min_value": 3,
      "max_value": 8
    }

Example:

    {
      "schema_name": "username",
      "type": "string",
      "min_size": 3,
      "max_size": 16
    }

Example 1: one-dimensional array of real numbers:

    {
      "schema_name": "correlator",
      "type": "array",
      "rank": 1,
      "elements": {
        "type": "float64"
      }
    }

Example 2: large numerical array:

    {
      "schema_name": "propagator",
      "type": "array",
      "shape": [-1,-1,-1,-1, 4, 3],
      "hdf5":
      {
        "chunk_size": [4,4,4,4, 4,3],
        "fletcher32": true
      }
    }

Note that `-1` has a special meaning. In `shape`, it means this dimension can have arbitrary size (similar to specifying extents in the Eigen library, or C++'s `std::[md]span`). In `chunk_size`, it means the full dimension should occupy a single chunk (similar to the meaning of `-1` in NumPy's `.reshape` function).

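The two meanings of `-1` can be sketched in a few lines of C++. This is only an illustration under our own assumptions: the helper names `shape_matches` and `resolve_chunk` are made up here and are not part of any existing Scribe API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an existing Scribe function).
// Returns true if `actual` is compatible with the schema's `shape`,
// where -1 in the schema accepts any extent along that dimension.
bool shape_matches(std::vector<std::int64_t> const &schema_shape,
                   std::vector<std::int64_t> const &actual)
{
    if (schema_shape.size() != actual.size())
        return false;
    for (std::size_t i = 0; i < actual.size(); ++i)
        if (schema_shape[i] != -1 && schema_shape[i] != actual[i])
            return false;
    return true;
}

// Hypothetical helper: replaces every -1 in `chunk_size` by the full
// extent of the corresponding dimension of the actual array.
// Entries beyond the rank of `actual` are left untouched.
std::vector<std::int64_t> resolve_chunk(std::vector<std::int64_t> chunk_size,
                                        std::vector<std::int64_t> const &actual)
{
    for (std::size_t i = 0; i < chunk_size.size() && i < actual.size(); ++i)
        if (chunk_size[i] == -1)
            chunk_size[i] = actual[i];
    return chunk_size;
}
```

With the `propagator` schema from Example 2, `shape_matches({-1,-1,-1,-1,4,3}, {8,8,8,16,4,3})` would accept the data, while a last dimension of size 4 would be rejected.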
In addition to `"type": "dict"`, the only required field is `items`, which is a list of possible items. Example:

    {
      "schema_name": "MyDict",
      "type": "dict",
      "items": [
        {
          "key": "foo",
          "type": "int32"
        },
        {
          "key": "bar",
          "type": "float32",
          "optional": true
        }
      ]
    }

which in C++ would map to

    struct MyDict
    {
        int32_t foo;
        std::optional<float> bar;
    };

Instead of `key`, an item can have a `key_pattern`, which understands `*` as a universal placeholder. Example:

    {
      "type": "dict",
      "items": [
        { "key_pattern": "foo_*" }
      ]
    }

allows any key starting with `foo_`. Note that in case of overlapping patterns, order matters: the first matching item will always be taken, regardless of further constraints. Together with the `none` type, this allows some advanced usages. For example

    {
      "type": "dict",
      "items": [
        { "key": "foo_2" },
        { "key_pattern": "foo_*", "type": "none" },
        { "key_pattern": "*" }
      ]
    }

requires a key `foo_2`, disallows all other keys starting with `foo_`, but allows any key not starting with `foo_`.

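To make the "first match wins" rule concrete, here is a minimal sketch of how `key_pattern` matching could work. The function names `matches_pattern` and `first_match` are hypothetical, and the sketch assumes that `*` matches any (possibly empty) sequence of characters and that plain keys contain no literal `*`.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: matches `key` against `pattern`, where `*`
// stands for any (possibly empty) sequence of characters.
bool matches_pattern(std::string_view pattern, std::string_view key)
{
    if (pattern.empty())
        return key.empty();
    if (pattern.front() == '*')
    {
        // Let `*` consume 0, 1, 2, ... characters of the key.
        for (std::size_t skip = 0; skip <= key.size(); ++skip)
            if (matches_pattern(pattern.substr(1), key.substr(skip)))
                return true;
        return false;
    }
    return !key.empty() && pattern.front() == key.front() &&
           matches_pattern(pattern.substr(1), key.substr(1));
}

// Returns the index of the first item whose pattern matches `key`,
// or nothing if no item matches. A plain `key` entry is treated as a
// pattern without wildcards.
std::optional<std::size_t> first_match(std::vector<std::string> const &patterns,
                                       std::string_view key)
{
    for (std::size_t i = 0; i < patterns.size(); ++i)
        if (matches_pattern(patterns[i], key))
            return i;
    return std::nullopt;
}
```

For the item list `foo_2`, `foo_*`, `*` from the example above, the key `foo_3` resolves to the second item (the one of type `none`, so it is rejected), while `bar` resolves to the third.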
This is a schema that can be one of several sub-schemas.

    {
      "type": "variant",
      "variants": [
        {"type": "string"},
        {"type": "int32"}
      ]
    }

Restriction: For now, we require all sub-variants to be distinguished by their `type` field, in order to keep validation simple. (Sidenote: arbitrary variants together with recursive schemas could make validation undecidable, or at least exponentially hard. I think.)

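The restriction above is cheap to check when a schema is loaded. A minimal sketch, assuming the `type` fields of all variants have already been collected into a list (the function name `variants_distinguishable` is made up for this illustration):

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical check: given the `type` field of every entry in
// `variants`, returns true if no type occurs twice.
bool variants_distinguishable(std::vector<std::string> const &types)
{
    std::set<std::string> seen;
    for (auto const &t : types)
        if (!seen.insert(t).second) // insert fails if `t` was seen before
            return false;
    return true;
}
```

Note that this only compares the type names literally. Whether, say, `int32` and `int64` should count as distinguishable when reading JSON is a separate question that this sketch does not answer.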
| Scribe schema | C++ type | Json example |
|---|---|---|
| `"type": "none"` | - | - |
| `"type": "any"` | `Scribe::Tome` | `"foo"` |
| `"type": "int32"` | `int32_t` | `42` |
| `"type": "float64"` | `double` | `3.7` |
| `"type": "complex128"` | `std::complex<double>` | `[1.0,2.0]` |
| `"type": "string"` | `std::string` | `"Hello World"` |
| `"type": "array"`, `"rank": 1` | `std::vector<...>` | `[0.2,3.4,4.0]` |
| `"type": "array"` | `Scribe::ndarray<...>` | `[[1,2],[3,4]]` |
| `"type": "dict"`, `"items": [{"key":"foo"}, {"key":"bar", "optional":true}]` | `struct { Scribe::Tome foo; std::optional<Scribe::Tome> bar; }` | `{"foo":5, "bar":[1,2]}` |
| `"type": "variant"`, `"variants": [{"type":"string"}, {"type":"int32"}]` | `std::variant<std::string, int32_t>` | `"hello"` |

- Strictly speaking, JSON does not distinguish between integers and floating point numbers. In Scribe, we use this convention:
  - Numbers with a period `.` or a scientific notation `e` cannot be integers (even if it is `5.0`).
  - Numbers with only digits (and sign) are integers if and only if they are in the domain. E.g.: `1234` is not a valid `int8_t`, but it is an `int32_t`. Numbers above `2^64-1` cannot be stored currently.
  - Any number can be interpreted as `float32/64`, regardless of magnitude or number of digits. E.g.:
    - `42` is a valid `float32/64`, though it's also a valid integer.
    - `1e10000` as a `float32` is valid, though its value will be `inf`.
    - `3.141592653589793` as a `float32` is valid, though the value will be rounded.
- JSON does in principle allow duplicate keys, Scribe does not.
- JSON does leave it to the implementation whether the order of keys matters. In Scribe, the order of keys when reading a json file does not matter. The order of keys specified in a schema however does matter in some cases (e.g. overlapping `key_pattern`s). This is one reason why it is a List of keys, and not a Map (like in json-schema).

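The integer convention above can be sketched with `std::from_chars` from the standard library, which already reports out-of-range values. The helper names below are hypothetical, and the sketch works on the raw number literal as it appears in the JSON text, which is an assumption about how a reader would be structured.

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical helper: true if the JSON number literal contains a
// period or an exponent, which by convention rules out integer types.
bool has_fraction_or_exponent(std::string_view literal)
{
    return literal.find_first_of(".eE") != std::string_view::npos;
}

// Hypothetical helper: true if `literal` is a valid value of the
// integer type `Int`, i.e. it consists only of sign and digits and
// lies within the domain of `Int`.
template <typename Int>
bool is_valid_integer(std::string_view literal)
{
    if (literal.empty() || has_fraction_or_exponent(literal))
        return false;
    Int value{};
    auto const *first = literal.data();
    auto const *last = first + literal.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    // `ec` signals out-of-range or malformed input, and `ptr == last`
    // ensures that the whole literal was consumed.
    return ec == std::errc{} && ptr == last;
}
```

For example, `is_valid_integer<std::int8_t>("1234")` is false while `is_valid_integer<std::int32_t>("1234")` is true, and `"5.0"` is rejected for every integer type.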
- The top-level of an HDF5 file is always a "group" (in HDF5 lingo), which maps to a `dict` in Scribe. Therefore only schemas that are `dict`s on top-level can be read from / written to HDF5. (This is not strictly true, depending on how non-numerical arrays are mapped. To be clarified.)
- Additional storage hints can be specified in a field called `hdf5`. For the time being the usecases are:
  - specifying chunks and checksums for arrays:

        { "type": "array", "rank": 2, "hdf5": {"chunk_size": [16,-1], "fletcher32": true} }

  - Explicitly specifying what kind of hdf5 object a scribe object should map to

        { "type": "array", "shape": [3], "hdf5": {"category": "attribute"} }

    Without this field, the mapping is automatic:

    | Scribe type | HDF5 category |
    |---|---|
    | `string`, `bool`, single number | attribute |
    | `array` with numeric element type | dataset |
    | `array` with non-numeric element type | group (maybe multiple levels?) |
    | `dict` | group |

- In HDF5, attributes can be attached not only to groups, but also to datasets. This is not supported in Scribe. (Rationale: there is no clean way to map this to JSON)
- In HDF5, one can have an attribute and a dataset sharing a name (I think?). Scribe does not support this.
- HDF5 supports arbitrary user-defined types. In Scribe, we only use exactly two, namely for `complex64` and `complex128`.
- Future hdf5-specific hints might include:
  - more filters (for example lossless storage for small integers)
  - compression

Conceptually, a perambulator is 7-dimensional, but it is stored in multiple files, each containing a 6-dimensional dataset. Each file can be described as

    {
      "schema_name": "perambulator_slice",
      "type": "dict",
      "items": [
        {
          "key": "GridDimensions",
          "hdf5": {"category": "attribute"},
          "type": "array",
          "shape": [3],
          "elements": {"type": "uint64"}
        },
        {
          "key": "TensorDimensions",
          "hdf5": {"category": "attribute"},
          "type": "array",
          "shape": [6],
          "elements": {"type": "uint64"}
        },
        {
          "key": "_Grid_dataset_threshold",
          "hdf5": {"category": "attribute"},
          "type": "uint32"
        },
        {
          "key": "IndexNames",
          "type": "dict",
          "items": [
            {
              "key_pattern": "IndexNames_*",
              "hdf5": {"category": "attribute"},
              "type": "string"
            },
            {
              "key": "_Grid_vector_size",
              "type": "$GridVectorSize"
            }
          ]
        },
        {
          "key": "MetaData",
          "type": "dict",
          "items": [
            {
              "key": "Version",
              "hdf5": {"category": "attribute"},
              "type": "string"
            },
            {
              "key": "timeDilutionIndex",
              "hdf5": {"category": "attribute"},
              "type": "int32"
            },
            {
              "key": "noiseHashes",
              "type": "dict",
              "items": [
                {
                  "key": "_Grid_vector_size",
                  "type": "$GridVectorSize"
                },
                {
                  "key_pattern": "noiseHashes_*",
                  "hdf5": {"category": "attribute"},
                  "type": "string"
                }
              ]
            }
          ]
        },
        {
          "key": "Perambulator",
          "type": "array",
          "rank": 6,
          "elements": {"type": "complex128"}
        }
      ],
      "defs": [
        {
          "schema_name": "GridVectorSize",
          "hdf5": {"category": "attribute"},
          "type": "uint64"
        }
      ]
    }

This is a rough todo-list of what we need to make this project slightly useful to some. The goal is to have something presentable as soon as possible in order to look for potential users outside our immediate science community. Also, this list should be transformed to something like a kanban board.

- Write the `Schema` class, which itself can be read from (and written to?) a json file.
- Write a schema for the schemas, in the form of a json-schema. Then find a validation library that can check it for any given schema.
- Write the `Tome` class. This is similar to the `nlohmann::json` type. Also discuss whether we are okay calling it "Tome". Could go with something neutral like "Object" instead.
- Write a validation function for `json` files. As a first pass, a function like

      bool validate_json(nlohmann::json const&, Schema const&);

  is sufficient, though an implementation based on nlohmann's SAX parser could be more efficient in case of big datasets.
- Write a json reader

      Tome read_json(std::string_view filename, Schema const&);

  This should definitely be based on nlohmann's SAX parser, i.e. circumventing the `nlohmann::json` type itself.
- HDF5 validation.

      bool validate_hdf5(std::string_view filename, Schema const&);

  - Of course this should be done without reading any actual data from the hdf5 file, only the meta-data.
- HDF5 reader

      Tome read_hdf5(std::string_view filename, Schema const&);

- C++ code generation. This is a huge item, should be done after the dynamic `Tome` based readers are done, because at that point, code generation is effectively just an optimization.
- Tie most of it together with a command-line utility:

      scribe validate myschema.json               # check that myschema.json is valid itself
      scribe validate mydata.hdf5 myschema.json   # just calls the "validate_{hdf5,json}(...)" function
      scribe codegen myschema.json                # C++ code generation

Minor points that could be discussed in a future meeting (alternatively, actually writing the code might make it clearer what to do):

- How strict should our reader be?
  - A missing (but non-optional) key could be silently interpreted as an empty array/dict, provided this adheres to all other constraints
  - Is a json `null` value the same as a missing key?
  - Should validation fail if the only "error" is a wrong `chunk_size`?
  - Proposal: have a strict and non-strict mode. Validation defaults to strict, reading to non-strict.
- What exactly is the default mapping to hdf5 in regards to attribute vs group vs dataset? Proposal:
  - dict-values that are single numbers, booleans, strings are attributes (and not datasets)
  - arrays with non-numeric element types are stored in groups, one level per dimension.
- How are arrays mapped to C++ exactly? Proposal: depends on specified rank/shape:
  - `rank=1`, `shape=[N]`: `std::array<..., N>` if `N` is small, `std::vector` otherwise
  - `rank=1`, shape unspecified: `std::vector`
  - `rank` larger or unspecified: `Tome::Array` (which consists of a flattened `std::vector<T>` and a shape, always standard row-major layout)
- Actually, we should allow "C++" format specific hints, just like for hdf5. This way the schema could decide if an array should be mapped to `std::vector`, `std::array`, or maybe even `Eigen::Matrix` and friends.
- Should we support any "non-local" constraints? For example:
  - A Hadrons output file might specify a `GridDimension` attribute, and also a Correlator, the size of which must match with the value of the attribute.
  - Proposal: No. That's an endless rabbit hole of complexity.

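The proposed array type (a flattened `std::vector<T>` plus a shape, in row-major layout) could look roughly like the following. This is only a sketch to illustrate the index arithmetic; the name `NdArray` and its interface are assumptions and not the actual `Tome::Array`.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical sketch of an n-dimensional array: a flat buffer plus a
// shape, in standard row-major layout (last index varies fastest).
template <typename T>
struct NdArray
{
    std::vector<std::size_t> shape;
    std::vector<T> data;

    // Allocates one value-initialized element per entry of the array.
    explicit NdArray(std::vector<std::size_t> s)
        : shape(std::move(s)),
          data(std::accumulate(shape.begin(), shape.end(), std::size_t{1},
                               std::multiplies<>{}))
    {}

    // Converts a multi-index into an offset into `data`.
    // Assumes index.size() == shape.size() and index[i] < shape[i].
    std::size_t offset(std::vector<std::size_t> const &index) const
    {
        std::size_t result = 0;
        for (std::size_t i = 0; i < shape.size(); ++i)
            result = result * shape[i] + index[i];
        return result;
    }

    T &operator()(std::vector<std::size_t> const &index)
    {
        return data[offset(index)];
    }
};
```

Passing the multi-index as a `std::vector` keeps the sketch short. A real implementation would more likely use a variadic `operator()` or `std::mdspan` to avoid the allocation per access.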
No promises when these will be implemented (if ever), but we like to keep these ideas around to make sure our design stays compatible in case we need any of them later.
There are at least two places where regular expressions would come up naturally:

- As constraint for string data:

      { "type": "string", "regex": "[a-zA-Z]*" }

- As a generalization of `key_pattern` inside a `dict`. Example:

      { "type": "dict", "items": [ { "key_regex": "foo_[0-9]*" } ] }

Both of these should be straight-forward to implement, assuming we can find and decide on a good implementation of regular expressions. Ideally, such a regex-library would need to support C++ code generation. Right now, this does not seem to be required for our usecases.

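As a stand-in until a regex library is chosen, both constraints can be illustrated with `std::regex` from the standard library (ECMAScript grammar by default). The helper name `satisfies_regex` is hypothetical, and requiring the whole string to match is our assumption about the intended semantics.

```cpp
#include <regex>
#include <string>

// Hypothetical check for a `regex` constraint on string data, or for
// a `key_regex` on a dict key: the whole value has to match.
bool satisfies_regex(std::string const &value, std::string const &pattern)
{
    return std::regex_match(value, std::regex(pattern));
}
```

With the patterns from the examples above, `"Scribe"` satisfies `[a-zA-Z]*` and `"foo_42"` satisfies `foo_[0-9]*`, whereas `"foo_x"` does not.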
The `Scribe::read` function would acquire a third (optional) parameter

    Scribe::read(std::string_view filename, Scribe::Schema const& schema, std::string_view path)

- Note that the return value of this function no longer adheres to the given schema, but to a sub-schema thereof. So it would be natural to add a helper function

      Scribe::project_schema(schema, path)

  that determines the schema resulting from a partial read, without looking at any particular data.
- Some possible kinds of path are:
  - Simple paths like `"/foo"`, which (assuming the top-level schema is a dict) simply returns one element of the dict. Similarly `"/foo/bar"` for nested dicts.
  - Similar to "pathname expansion" in bash, there could be wildcards like `"/*/bar"`, `"/component_{a,b}"`, `"/foo_{1..100}"`
  - Similar to "fancy-indexing" in NumPy's `ndarray`s, arrays could be indexed as `/mydata[:,5,2:-1]`

The simplest case (only dict-keys, no wildcards) should be implemented before version 1.0 of Scribe. Anything more needs a lot more work defining all semantics precisely, so this was decided against for the time being.

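For the simplest case, a path only needs to be split into a sequence of dict keys, which can then be followed one by one through nested dicts. A minimal sketch (the name `split_path` is hypothetical, and skipping empty components is our own choice rather than a decided rule):

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits "/foo/bar" into {"foo", "bar"}.
// Empty components (leading, trailing or doubled slashes) are skipped.
std::vector<std::string> split_path(std::string_view path)
{
    std::vector<std::string> keys;
    std::size_t pos = 0;
    while (pos < path.size())
    {
        std::size_t next = path.find('/', pos);
        if (next == std::string_view::npos)
            next = path.size();
        if (next > pos)
            keys.emplace_back(path.substr(pos, next - pos));
        pos = next + 1;
    }
    return keys;
}
```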
- `XML`: Not a huge fan, but probably required for existing codebases. Should be easy enough to implement, provided we find a good open-source library, equivalent to `nlohmann::json`. The same might be the case for `YAML`.
- NumPy's `.npy` format is actually quite easy to implement (very little meta-data, followed by the binary data of a single `ndarray`). Of course this can only reasonably store schemas of the form

      { "type": "array", "elements": {"type": "float32/float64/..."} }

  and not anything nested. Still might be nice.
- `SQLite` databases as a file format have some serious pros (extremely stable, memory efficient). Can we map our schemas to a relational database? Need to read up on SQL database schemas, which have been a thing since forever I think.

We could think about three levels of python support:

1. Binding for the `Schema` class, which allows validation of data files.
2. Binding for the `Tome` class, which allows reading and writing of data files.
3. A proper "Python Codegen" that makes a Schema into a python class at runtime.

- (1) is completely reasonable.
- (2): Not clear what the advantage of this is compared to using existing libraries like `h5py`, which integrate with NumPy better than we could probably hope to.
- (3): This could be a cool project (and a great showcase of Python's capabilities), but would require maintaining a lot of actual python code. Therefore this is judged to be out of scope for the time being.

In the world of JSON schemas, it is common for schemas to be canonically named using a web URL, at which the full schema is publicly visible. To facilitate this, we can do two things:
- Add a `schema_url` metadata, which points to a (publicly visible) address
- Allow `"type": "https://.../some_schema.json"` to refer to sub-schemas. Effectively, the prefix `"https://"` distinguishes this from a builtin type or locally-defined sub-schema (which are prefixed with `"$"`).

Not very hard to do, but in the interest of keeping dependencies at a minimum, this will not be part of the first version of the library.
Quite often in lattice QCD, a "single" measurement is written in multiple files. Very roughly, we want to do something like

    Scribe::read("propagator_t*.hdf5", schema);

where `schema` describes the full dataset, and not a single file. Some points of note:

- The schema should still not depend on the file-format, so logically, the same schema should be usable regardless of whether the data is in one file or multiple files.
- A more sane approach than the general asterisk `*` might go something like this: Give each axis in a (multi-dimensional) array a name. In lattice QCD, these are typically `x`, `y`, `z`, `t`, `spin`, `color` and the like. Then the function call could be

      read("propagator_t{t}.hdf5", schema);

- Somewhat similar to the "fancy internal paths" from before, an individual file no longer adheres to the given schema, but to a projection thereof.

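The named-axis idea boils down to expanding a filename template once per index along that axis. A minimal sketch, assuming indices start at 0 and are written without zero-padding (both are our assumptions, as is the name `expand_filenames`):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replaces the placeholder "{axis}" in `pattern`
// by the indices 0, ..., extent-1. If the placeholder is missing, the
// pattern is returned unchanged for every index.
std::vector<std::string> expand_filenames(std::string const &pattern,
                                          std::string const &axis,
                                          std::size_t extent)
{
    std::string const placeholder = "{" + axis + "}";
    std::vector<std::string> result;
    for (std::size_t i = 0; i < extent; ++i)
    {
        std::string name = pattern;
        auto const pos = name.find(placeholder);
        if (pos != std::string::npos)
            name.replace(pos, placeholder.size(), std::to_string(i));
        result.push_back(name);
    }
    return result;
}
```

Each resulting file would then be read with the projection of the schema onto a fixed value of `t`, which is exactly the part that still lacks well-defined semantics.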
Personally, I enjoy this direction of thought. It seems like we could draw some cool category-theory-style diagrams between datasets, schemas and locations, each with their own project function. But practically speaking, making all of this usable might be a multi-year research project in itself, so we reasonably decided against it.
Open question: Is there a well-defined special case of multi-file datasets that could be reasonably implemented on its own?
Idea: not multi-file, but splitting one logical nd-array along one axis across multiple datasets/groups inside an HDF5 file could be done quite reasonably inside the format-specific traits.