Skip to content

Variant DType#6912

Open
AdamGS wants to merge 3 commits intodevelopfrom
adamg/variant-dtype
Open

Variant DType#6912
AdamGS wants to merge 3 commits intodevelopfrom
adamg/variant-dtype

Conversation

@AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Mar 12, 2026

This PR includes initial support for the Variant DType, as described in the Variant RFC. It includes most of the required boilerplate and initial structure for this new dtype.

It includes the following changes:

  1. New dtype
  2. serialization for the dtype
  3. Scalar variant and new scalar value
  4. A lot of code paths that aren't supported yet and error accordingly.

AdamGS added 2 commits March 12, 2026 11:15
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS added the changelog/feature A new feature label Mar 12, 2026
@AdamGS AdamGS changed the title Adamg/variant dtype Variant DType Mar 12, 2026
@AdamGS AdamGS marked this pull request as ready for review March 12, 2026 11:52
Comment on lines +491 to +492
// Currently, no support at all
DType::Variant => Default::default(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be unreachable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what the contract here is, but sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some docs for this or a ptr would be nice

Canonical::Extension(ExtensionArray::new(ext_dtype.clone(), storage_self))
}
DType::Variant => {
vortex_bail!("Variant don't currently support canonicalization");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we defining the semantics of this now? If not I think this should be left as unimplemented with a clear TODO marked so we can easily search for this.

.ok_or_else(|| vortex_err!("failed to parse variant from flatbuffer"))?;
Ok(Self::Variant)
}
_ => Err(vortex_err!("Unknown DType variant")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably keep the type guard

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it do? It didn't cause any build issues to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just means that if someone else adds a new type here it will cause a compile error instead of falling through (this was helpful when I was new to the codebase)

pub fn is_nullable(&self) -> bool {
match self {
Null => true,
Null | Variant => true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed the nullability section in the RFC, do we really want to do this? Based on a tiny bit of research on other systems with variant it seems like they do support NOT NULL variant.

AFAICT this was not really discussed in the original RFC PR, and I feel like this is something worth reconsidering. Should I open a new PR on the RFC page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a question of how you define nullability, in my mind its "can I get a null value out of this array?", which is basically always true for any variant type I'm familiar with, whether its an explicit Variant::null, top-level null or a path that doesn't exist in a specific row

Copy link
Contributor

@connortsui20 connortsui20 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see below for my comment on the null inside the variant scalar value, I guess my question is if a variant null is semantically different from nulls currently in Vortex? In my mind it does not seem any different so this decision is a bit confusing to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they're not semantically different, then Variant is always nullable?

Copy link
Contributor

@connortsui20 connortsui20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice first steps! I am a bit worried about the format for variant scalars, and I think I would like to see more TODOs marked in places that we know will change, but other than that just a few nits.

Comment on lines +24 to +27
DType::Variant => match self.value() {
None => write!(f, "null"),
Some(value) => write!(f, "{value}"),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be self.as_variant() (or if we want to do that in a future PR then it should just be unimplemented)

Comment on lines +159 to +172
pub fn as_variant(&self) -> &VariantValue {
self.value()
.vortex_expect("Failed to convert null scalar to variant value")
.as_variant()
}

/// Returns the semantic variant value if the scalar has `DType::Variant` and is non-null.
pub fn as_variant_opt(&self) -> Option<&VariantValue> {
self.dtype()
.is_variant()
.then(|| self.value())
.flatten()
.map(ScalarValue::as_variant)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might as well add the typed view of a variant scalar now? I believe it would just need to hold a scalar value. If it's not in the scope of this PR then I think it would be better to leave this out and keep it as unimplemented

.as_ref()
.ok_or_else(|| vortex_err!(Serde: "Scalar value missing kind"))?;

if matches!(dtype, DType::Variant) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think if dtype == DType::Variant { is more readable

Ok(ScalarValue::List(values))
}

fn variant_to_proto(value: &VariantValue) -> pb::VariantValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move these function up near the other into proto functions? Either that or maybe proto deserves its own directory of modules at this point.

// zero storage value and try to make an extension scalar from that.
Self::zero_value(ext_dtype.storage_dtype())
}
DType::Variant => Self::Variant(VariantValue::Null),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that this might break our abstractions (maybe its fine though), see below

Comment on lines +14 to +25
/// Semantic value for a `DType::Variant` scalar.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum VariantValue {
Null,
Bool(bool),
Primitive(PValue),
Decimal(DecimalValue),
Utf8(BufferString),
Binary(ByteBuffer),
List(Vec<VariantValue>),
Object(Vec<(BufferString, VariantValue)>),
}
Copy link
Contributor

@connortsui20 connortsui20 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that this will break our abstraction. This is eerily similar to what ScalarValue (and the old InnerScalarValue) looks like. We also very recently made a change to extract out nulls from scalar value such that a nullable scalar has an Option<ScalarValue> (so ScalarValue is always non-nullable).

Is it fine that this is basically scalar value with a few extra variants (heh)? Couldn't we just extend the existing scalar value enum to support object and then everything falls out of that? Is there something here that requires we store these exact parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there's no other Object type in our scalar system. This is also about differentiating between the various ways variants can be nulls, you can have a JSON that looks like {"value": null}, which is different than an array with a nullability mask that marks that one row as null

Ok(ScalarValue::List(values))
}

fn variant_to_proto(value: &VariantValue) -> pb::VariantValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought variants were arbitrary nested vortex types? So isn't the pb::ScalarValue::Variant just a pb::Scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not exactly, they also have an "object" variant and list where every item can be a variant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just add an "object" variant to the ScalarValue enum though? It already has a List variant that we use for Struct scalars (we also use them for List scalars but we would like to add an Array variant too, see #6717).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(lets keep this conversation in this thread) We can, but there's also the issue of differentiating between the various nulls. I don't know if its a thing in every possible variant encoding, but it is in parquet-variant, which is the main reference point we have.

Copy link
Contributor

@connortsui20 connortsui20 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry is this a nullability concern or adding a variant that doesnt match the model concern?

    /// A list of potentially null scalar values. (This is what we have now)
    List(Vec<Option<ScalarValue>>),

    /// We would add this as a variant (or maybe as a HashMap):
    Object(Vec<(BufferString, Option<ScalarValue>)>),

If we say that variant is any scalar value then I don't see why this is an issue? The null just gets outlined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really following...

I think your VariantNull == Variant(Scalar { dtype: DType::Null, ... })

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go with that. I'm or tomorrow, but I'll pick it back up on Monday.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was bad at explaining the intuition here.

It's that the column's type is DType::Variant because each row can have its own type.

But scalar is a single row.

So while the top-level scalar must declare its DType as DType::Variant, we must store a row-specific DType inside it.

And if we're storing a row-specific DType, and the data that underpins it, then we're really just storing a scalar.

On the null front, a variant array may store a null buffer. But really I just see that as "shredding the validity". I could choose not to shred the validity I guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes perfect sense. But for variant types from the Parquet family (which includes Spark and Iceberg), that's not how it's defined.

They have an optional null buffer (like most arrow arrays), and they have a second possible null state, that's is valid in the SQL sense - that's "VariantNull", it's not about shredding or not shredding the validity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're wrong 😀

But if that is the case, then we should add nullability to DType::Variant, and allow encodings to have a top-level nullability separately from a defined, but internally null, row scalar.

Copy link
Contributor

@gatesn gatesn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify scalar values by just storing scalars

repeated ScalarValue values = 1;
}

message VariantValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this == Scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants