Conversation
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
fuzz/src/array/mod.rs
Outdated
| // Currently, no support at all | ||
| DType::Variant => Default::default(), |
There was a problem hiding this comment.
should be unreachable?
There was a problem hiding this comment.
not sure what the contract here is, but sure.
There was a problem hiding this comment.
some docs for this or a ptr would be nice
| Canonical::Extension(ExtensionArray::new(ext_dtype.clone(), storage_self)) | ||
| } | ||
| DType::Variant => { | ||
| vortex_bail!("Variant don't currently support canonicalization"); |
There was a problem hiding this comment.
Are we defining the semantics of this now? If not I think this should be left as unimplemented with a clear TODO marked so we can easily search for this.
| .ok_or_else(|| vortex_err!("failed to parse variant from flatbuffer"))?; | ||
| Ok(Self::Variant) | ||
| } | ||
| _ => Err(vortex_err!("Unknown DType variant")), |
There was a problem hiding this comment.
This should probably keep the type guard
There was a problem hiding this comment.
What does it do? It didn't cause any build issues to me
There was a problem hiding this comment.
it just means that if someone else adds a new type here it will cause a compile error instead of falling through (this was helpful when I was new to the codebase)
| pub fn is_nullable(&self) -> bool { | ||
| match self { | ||
| Null => true, | ||
| Null | Variant => true, |
There was a problem hiding this comment.
I missed the nullability section in the RFC, do we really want to do this? Based on a tiny bit of research on other systems with variant it seems like they do support NOT NULL variant.
AFAICT this was not really discussed in the original RFC PR, and I feel like this is something worth reconsidering. Should I open a new PR on the RFC page?
There was a problem hiding this comment.
I think this is a question of how you define nullability, in my mind its "can I get a null value out of this array?", which is basically always true for any variant type I'm familiar with, whether its an explicit Variant::null, top-level null or a path that doesn't exist in a specific row
There was a problem hiding this comment.
see below for my comment on the null inside the variant scalar value, I guess my question is if a variant null is semantically different from nulls currently in Vortex? In my mind it does not seem any different so this decision is a bit confusing to me.
There was a problem hiding this comment.
If they're not semantically different, then Variant is always nullable?
connortsui20
left a comment
There was a problem hiding this comment.
Nice first steps! I am a bit worried about the format for variant scalars, and I think I would like to see more TODOs marked in places that we know will change, but other than that just a few nits.
| DType::Variant => match self.value() { | ||
| None => write!(f, "null"), | ||
| Some(value) => write!(f, "{value}"), | ||
| }, |
There was a problem hiding this comment.
This should be self.as_variant() (or if we want to do that in a future PR then it should just be unimplemented)
| pub fn as_variant(&self) -> &VariantValue { | ||
| self.value() | ||
| .vortex_expect("Failed to convert null scalar to variant value") | ||
| .as_variant() | ||
| } | ||
|
|
||
| /// Returns the semantic variant value if the scalar has `DType::Variant` and is non-null. | ||
| pub fn as_variant_opt(&self) -> Option<&VariantValue> { | ||
| self.dtype() | ||
| .is_variant() | ||
| .then(|| self.value()) | ||
| .flatten() | ||
| .map(ScalarValue::as_variant) | ||
| } |
There was a problem hiding this comment.
I think we might as well add the typed view of a variant scalar now? I believe it would just need to hold a scalar value. If it's not in the scope of this PR then I think it would be better to leave this out and keep it as unimplemented
| .as_ref() | ||
| .ok_or_else(|| vortex_err!(Serde: "Scalar value missing kind"))?; | ||
|
|
||
| if matches!(dtype, DType::Variant) { |
There was a problem hiding this comment.
Nit: I think if dtype == DType::Variant { is more readable
| Ok(ScalarValue::List(values)) | ||
| } | ||
|
|
||
| fn variant_to_proto(value: &VariantValue) -> pb::VariantValue { |
There was a problem hiding this comment.
could you move these function up near the other into proto functions? Either that or maybe proto deserves its own directory of modules at this point.
| // zero storage value and try to make an extension scalar from that. | ||
| Self::zero_value(ext_dtype.storage_dtype()) | ||
| } | ||
| DType::Variant => Self::Variant(VariantValue::Null), |
There was a problem hiding this comment.
I'm worried that this might break our abstractions (maybe its fine though), see below
| /// Semantic value for a `DType::Variant` scalar. | ||
| #[derive(Debug, Clone, PartialEq, Eq, Hash)] | ||
| pub enum VariantValue { | ||
| Null, | ||
| Bool(bool), | ||
| Primitive(PValue), | ||
| Decimal(DecimalValue), | ||
| Utf8(BufferString), | ||
| Binary(ByteBuffer), | ||
| List(Vec<VariantValue>), | ||
| Object(Vec<(BufferString, VariantValue)>), | ||
| } |
There was a problem hiding this comment.
I'm worried that this will break our abstraction. This is eerily similar to what ScalarValue (and the old InnerScalarValue) looks like. We also very recently made a change to extract out nulls from scalar value such that a nullable scalar has an Option<ScalarValue> (so ScalarValue is always non-nullable).
Is it fine that this is basically scalar value with a few extra variants (heh)? Couldn't we just extend the existing scalar value enum to support object and then everything falls out of that? Is there something here that requires we store these exact parameters?
There was a problem hiding this comment.
But there's no other Object type in our scalar system. This is also about differentiating between the various ways variants can be nulls, you can have a JSON that looks like {"value": null}, which is different than an array with a nullability mask that marks that one row as null
| Ok(ScalarValue::List(values)) | ||
| } | ||
|
|
||
| fn variant_to_proto(value: &VariantValue) -> pb::VariantValue { |
There was a problem hiding this comment.
I thought variants were arbitrary nested vortex types? So isn't the pb::ScalarValue::Variant just a pb::Scalar?
There was a problem hiding this comment.
not exactly, they also have an "object" variant and list where every item can be a variant
There was a problem hiding this comment.
We can just add an "object" variant to the ScalarValue enum though? It already has a List variant that we use for Struct scalars (we also use them for List scalars but we would like to add an Array variant too, see #6717).
There was a problem hiding this comment.
(lets keep this conversation in this thread) We can, but there's also the issue of differentiating between the various nulls. I don't know if its a thing in every possible variant encoding, but it is in parquet-variant, which is the main reference point we have.
There was a problem hiding this comment.
sorry is this a nullability concern or adding a variant that doesnt match the model concern?
/// A list of potentially null scalar values. (This is what we have now)
List(Vec<Option<ScalarValue>>),
/// We would add this as a variant (or maybe as a HashMap):
Object(Vec<(BufferString, Option<ScalarValue>)>),If we say that variant is any scalar value then I don't see why this is an issue? The null just gets outlined.
There was a problem hiding this comment.
I'm not really following...
I think your VariantNull == Variant(Scalar { dtype: DType::Null, ... })
There was a problem hiding this comment.
I'll go with that. I'm or tomorrow, but I'll pick it back up on Monday.
There was a problem hiding this comment.
I was bad at explaining the intuition here.
It's that the column's type is DType::Variant because each row can have its own type.
But scalar is a single row.
So while the top-level scalar must declare its DType as DType::Variant, we must store a row-specific DType inside it.
And if we're storing a row-specific DType, and the data that underpins it, then we're really just storing a scalar.
On the null front, a variant array may store a null buffer. But really I just see that as "shredding the validity". I could choose not to shred the validity I guess
There was a problem hiding this comment.
That makes perfect sense. But for variant types from the Parquet family (which includes Spark and Iceberg), that's not how it's defined.
They have an optional null buffer (like most arrow arrays), and they have a second possible null state, that's is valid in the SQL sense - that's "VariantNull", it's not about shredding or not shredding the validity.
There was a problem hiding this comment.
I think they're wrong 😀
But if that is the case, then we should add nullability to DType::Variant, and allow encodings to have a top-level nullability separately from a defined, but internally null, row scalar.
gatesn
left a comment
There was a problem hiding this comment.
I think we can simplify scalar values by just storing scalars
| repeated ScalarValue values = 1; | ||
| } | ||
|
|
||
| message VariantValue { |
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
This PR includes initial support for the Variant DType, as described in the Variant RFC. It includes most of the required boilerplate and initial structure for this new dtype.
It includes the following changes: