Skip to content

[FEATURE] Add Union command in PPL #5240

Open
srikanthpadakanti wants to merge 7 commits intoopensearch-project:mainfrom
srikanthpadakanti:feature/union-public
Open

[FEATURE] Add Union command in PPL #5240
srikanthpadakanti wants to merge 7 commits intoopensearch-project:mainfrom
srikanthpadakanti:feature/union-public

Conversation

@srikanthpadakanti
Copy link
Copy Markdown
Contributor

Description

Add union command to PPL that implements SQL-style UNION ALL semantics with Calcite-based type coercion. The command supports combining multiple datasets (indices, patterns, aliases, or subsearches) with automatic schema merging and missing fields are filled with NULL, compatible types are coerced to a common supertype (e.g., int+float --> float), and incompatible types fall back to string. Works both as a first command and mid-pipeline, where the upstream result set is implicitly included as the first dataset.

Related Issues

Resolves #5110
#5110

Check List

  • [ X ] New functionality includes testing.
  • [ X ] New functionality has been documented.
  • [ X ] New functionality has javadoc added.
  • [ X ] New functionality has a user manual doc added.
  • [ X ] New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • [ X ] Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 17, 2026

PR Reviewer Guide 🔍

(Review updated until commit 467a7d1)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: AST and Parser: Union command grammar and AST node

Relevant files:

  • core/src/main/java/org/opensearch/sql/ast/tree/Union.java
  • core/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java
  • core/src/main/java/org/opensearch/sql/analysis/Analyzer.java
  • ppl/src/main/antlr/OpenSearchPPLLexer.g4
  • ppl/src/main/antlr/OpenSearchPPLParser.g4
  • ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java
  • ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java

Sub-PR theme: Calcite execution: Union planning, schema unification, and anonymizer

Relevant files:

  • core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
  • ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLUnionTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java

Sub-PR theme: Integration tests and documentation for Union command

Relevant files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteUnionCommandIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/CalciteNoPushdownIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/security/CalciteCrossClusterSearchIT.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/NewAddedCommandsIT.java
  • integ-test/src/test/resources/expectedOutput/calcite/explain_union.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_union.yaml
  • docs/user/ppl/cmd/union.md
  • docs/user/ppl/index.md
  • docs/category.json

⚡ Recommended focus areas for review

Case-Sensitive Schema Merge

The buildUnifiedSchemaForUnion method uses field names as keys in seenFields map with a plain HashMap, which means field name matching is case-sensitive. If two datasets have the same field with different casing (e.g., Age vs age), they will be treated as distinct fields rather than merged, potentially producing unexpected NULL-filled columns instead of a unified column.

private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
  List<SchemaField> schema = new ArrayList<>();
  Map<String, RelDataType> seenFields = new HashMap<>();

  for (RelNode node : nodes) {
    for (RelDataTypeField field : node.getRowType().getFieldList()) {
      if (!seenFields.containsKey(field.getName())) {
        schema.add(new SchemaField(field.getName(), field.getType()));
        seenFields.put(field.getName(), field.getType());
      }
    }
  }
  return schema;
}
FLOAT/DECIMAL Rank Ordering

In getNumericTypeRankForUnion, DECIMAL (rank 5) is ranked lower than FLOAT (rank 7) and DOUBLE (rank 8). This means INTEGER+FLOAT→FLOAT and DECIMAL+FLOAT→FLOAT, but DECIMAL is a higher-precision exact type. Coercing DECIMAL to FLOAT/DOUBLE can silently lose precision. The ranking of DECIMAL vs floating-point types should be reconsidered.

private static int getNumericTypeRankForUnion(SqlTypeName typeName) {
  return switch (typeName) {
    case TINYINT -> 1;
    case SMALLINT -> 2;
    case INTEGER -> 3;
    case BIGINT -> 4;
    case DECIMAL -> 5;
    case REAL -> 6;
    case FLOAT -> 7;
    case DOUBLE -> 8;
    default -> 0;
  };
}
EmptySource Side Effect

In visitUnion, each dataset is first processed through EmptySourcePropagateVisitor and then visited with this (the main visitor). The prunedDataset.accept(this, context) call may have side effects on context (e.g., modifying shared state in context.relBuilder) between iterations. If a dataset visit fails partway through, the context may be left in a partially modified state, potentially corrupting subsequent dataset processing.

public RelNode visitUnion(Union node, CalcitePlanContext context) {
  List<RelNode> inputNodes = new ArrayList<>();

  for (UnresolvedPlan dataset : node.getDatasets()) {
    UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
    prunedDataset.accept(this, context);
    inputNodes.add(context.relBuilder.build());
  }

  if (inputNodes.size() < 2) {
    throw new IllegalArgumentException(
        "Union command requires at least two datasets. Provided: " + inputNodes.size());
  }

  List<RelNode> unifiedInputs =
      SchemaUnifier.buildUnifiedSchemaWithTypeCoercion(inputNodes, context);

  for (RelNode input : unifiedInputs) {
    context.relBuilder.push(input);
  }
  context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL

  if (node.getMaxout() != null) {
    context.relBuilder.push(
        LogicalSystemLimit.create(
            LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
            context.relBuilder.build(),
            context.relBuilder.literal(node.getMaxout())));
  }

  return context.relBuilder.peek();
}
Regex Keyword List Incomplete

The anonymizeSubsearchCommand method builds a keywords regex pattern that includes the commandName parameter but does not include all PPL command keywords (e.g., append, dedup, join, lookup, etc.). This could cause legitimate command names to be incorrectly replaced with identifier in the anonymized output.

String keywords =
    "source|fields|where|stats|head|tail|sort|eval|rename|"
        + commandName
        + "|search|table|identifier|\\*\\*\\*";
List<String> anonymizedSubsearches = new ArrayList<>();

for (UnresolvedPlan subsearch : subsearches) {
  String anonymizedSubsearch = anonymizeData(subsearch);
  anonymizedSubsearch = "search " + anonymizedSubsearch;
  anonymizedSubsearch =
      anonymizedSubsearch
          .replaceAll("\\bsource=\\w+", "source=table")
          .replaceAll("\\b(?!" + keywords + ")\\w+(?=\\s*[<>=!])", "identifier")
          .replaceAll("\\b(?!" + keywords + ")\\w+(?=\\s*,)", "identifier")
          .replaceAll("fields \\+\\s*\\b(?!" + keywords + ")\\w+", "fields + identifier")
          .replaceAll(
              "fields \\+\\s*identifier,\\s*\\b(?!" + keywords + ")\\w+",
              "fields + identifier,identifier");
  anonymizedSubsearches.add(StringUtils.format("[%s]", anonymizedSubsearch));
}

return StringUtils.format("| %s %s", commandName, String.join(" ", anonymizedSubsearches));
Mutable maxout Field

The Union class uses both @RequiredArgsConstructor and @AllArgsConstructor, with maxout as a non-final field. This means maxout is mutable after construction. Combined with @EqualsAndHashCode, two Union nodes with the same datasets but different maxout values will not be equal, but the field could be changed externally after creation, breaking equality contracts. Consider making maxout final.

public class Union extends UnresolvedPlan {
  private final List<UnresolvedPlan> datasets;

  private Integer maxout;

  @Override
  public UnresolvedPlan attach(UnresolvedPlan child) {
    List<UnresolvedPlan> newDatasets =
        ImmutableList.<UnresolvedPlan>builder().add(child).addAll(datasets).build();
    return new Union(newDatasets, maxout);
  }

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 17, 2026

PR Code Suggestions ✨

Latest suggestions up to 467a7d1

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix temporal type widening precedence order

The temporal type widening logic has an incorrect precedence:
TIMESTAMP_WITH_LOCAL_TIME_ZONE should be wider than TIMESTAMP, but the current code
returns TIMESTAMP when either type is TIMESTAMP, even if the other is
TIMESTAMP_WITH_LOCAL_TIME_ZONE. This means TIMESTAMP +
TIMESTAMP_WITH_LOCAL_TIME_ZONE incorrectly resolves to TIMESTAMP instead of
TIMESTAMP_WITH_LOCAL_TIME_ZONE, potentially losing timezone information.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [378-387]

 private static SqlTypeName getWiderTemporalTypeForUnion(SqlTypeName type1, SqlTypeName type2) {
-  if (type1 == SqlTypeName.TIMESTAMP || type2 == SqlTypeName.TIMESTAMP) {
-    return SqlTypeName.TIMESTAMP;
-  }
   if (type1 == SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE
       || type2 == SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE) {
     return SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE;
   }
+  if (type1 == SqlTypeName.TIMESTAMP || type2 == SqlTypeName.TIMESTAMP) {
+    return SqlTypeName.TIMESTAMP;
+  }
   return SqlTypeName.DATE;
 }
Suggestion importance[1-10]: 7

__

Why: This is a genuine bug: when combining TIMESTAMP and TIMESTAMP_WITH_LOCAL_TIME_ZONE, the current code incorrectly returns TIMESTAMP instead of the wider TIMESTAMP_WITH_LOCAL_TIME_ZONE, potentially losing timezone information. The fix correctly reorders the checks to prioritize TIMESTAMP_WITH_LOCAL_TIME_ZONE.

Medium
Fix potential builder stack state after maxout limit

When maxout is set, LogicalSystemLimit is pushed onto the builder but then
context.relBuilder.peek() is called, which returns the top of the stack without
consuming it. However, the LogicalSystemLimit node was pushed via
context.relBuilder.push(...) after context.relBuilder.build() consumed the union
node. The final return context.relBuilder.peek() is correct only if the limit node
is on the stack, but the code path when maxout is null also calls peek() after
union() — this is consistent. The real issue is that after
context.relBuilder.union(true, unifiedInputs.size()), the union result is on the
stack, and peek() returns it without removing it, which may leave stale state on the
builder for subsequent pipeline steps. Consider using build() consistently and
returning the result directly.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2629-2642]

 for (RelNode input : unifiedInputs) {
   context.relBuilder.push(input);
 }
 context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL
 
 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
+          unionNode,
           context.relBuilder.literal(node.getMaxout())));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 4

__

Why: The suggestion identifies a potential inconsistency in how the relBuilder stack is managed when maxout is set. The improved code explicitly calls build() before pushing the LogicalSystemLimit, which is cleaner and avoids potential stale state. However, the existing code already calls context.relBuilder.build() inside the LogicalSystemLimit.create() call, so the practical impact may be limited.

Low
General
Make maxout field final for equality correctness

Both @RequiredArgsConstructor and @AllArgsConstructor are present.
@RequiredArgsConstructor generates a constructor with only final fields (i.e.,
datasets), while @AllArgsConstructor generates one with all fields (datasets and
maxout). This means there are two constructors: Union(List) and Union(List,
Integer). However, maxout is not final, so it won't be included in
@EqualsAndHashCode, which may cause two Union nodes with different maxout values to
be considered equal. Consider making maxout final for consistency and correctness.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
 @RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
 
-  private Integer maxout;
+  private final Integer maxout;
Suggestion importance[1-10]: 5

__

Why: Making maxout final is a valid correctness concern since two Union nodes with different maxout values would incorrectly be considered equal by @EqualsAndHashCode. However, making maxout final would require removing @RequiredArgsConstructor or restructuring, since @RequiredArgsConstructor only includes final fields and @AllArgsConstructor would then cover both.

Low
Ensure unified schema uses coerced types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, the field types in subsequent nodes may have
been coerced to a common type. The schema should reflect the coerced (common) type,
not just the first-seen type. This can cause type mismatches when projecting fields
from nodes where the coerced type differs from the first-seen type, potentially
leading to incorrect CASTs or runtime errors.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // After coercion, all nodes sharing a field name should have the same type;
+      // no update needed since coerceUnionTypes already unified them.
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion claims a type mismatch issue, but after coerceUnionTypes runs, all nodes sharing a field name already have the same coerced type. The buildUnifiedSchemaForUnion method correctly picks the first-seen type, which after coercion is the common type. The improved code is functionally identical to the existing code with only a comment added, making this a low-impact suggestion.

Low

Previous suggestions

Suggestions up to commit 5952ff6
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix potential builder API ordering issue with maxout literal

After calling context.relBuilder.union(true, unifiedInputs.size()), the union node
is on the builder stack. Then if maxout is set, context.relBuilder.build() pops the
union node and wraps it in a LogicalSystemLimit, which is pushed back. However,
context.relBuilder.literal(node.getMaxout()) is called after
context.relBuilder.build() which already consumed the top of the stack — the literal
call should be evaluated before build() to avoid potential ordering issues with the
builder API.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2634-2642]

 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
+  RexNode maxoutLiteral = context.relBuilder.literal(node.getMaxout());
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
-          context.relBuilder.literal(node.getMaxout())));
+          unionNode,
+          maxoutLiteral));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that context.relBuilder.literal(node.getMaxout()) is called after context.relBuilder.build() pops the stack, which could cause ordering issues depending on the RelBuilder API implementation. Separating the build() call and the literal() call makes the intent clearer and avoids potential side effects.

Low
Ensure unified schema uses coerced types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, the actual field types in the coerced inputs
may differ from the original. The schema should be built from the coerced inputs to
ensure the projected types match the actual types in the nodes being projected.
Otherwise, the cast in buildProjectionForUnion may be applied incorrectly or skipped
when it shouldn't be.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // If already seen, the coerceUnionTypes step should have unified the type,
+      // so no update needed here.
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion claims the schema may use pre-coercion types, but buildUnifiedSchemaForUnion is called on coercedInputs (after coerceUnionTypes), so the types should already be unified. The improved code is functionally identical to the existing code, making this suggestion largely a no-op.

Low
General
Guard against unexpected builder stack state per dataset

Each dataset is visited using the shared context, which means the relBuilder state
is shared across iterations. If visiting a dataset modifies the relBuilder stack in
an unexpected way (e.g., leaves extra nodes), subsequent iterations or the final
union call may operate on incorrect state. Consider saving and restoring the builder
state or using a fresh context per dataset to isolate side effects.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2615-2619]

 for (UnresolvedPlan dataset : node.getDatasets()) {
   UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
+  int stackSizeBefore = context.relBuilder.size();
   prunedDataset.accept(this, context);
+  if (context.relBuilder.size() != stackSizeBefore + 1) {
+    throw new IllegalStateException(
+        "Expected exactly one RelNode pushed for union dataset, but stack changed by "
+            + (context.relBuilder.size() - stackSizeBefore));
+  }
   inputNodes.add(context.relBuilder.build());
 }
Suggestion importance[1-10]: 4

__

Why: Adding a stack size assertion is a reasonable defensive check, but RelBuilder.size() may not be a standard public API method, and the pattern of visiting and calling build() is consistent with how other commands are handled in the codebase. The improvement is minor and may not compile without verifying the API.

Low
Clarify single-input early return intent

The early return for a single input bypasses type coercion and schema unification
entirely. However, the visitUnion method already validates that at least 2 inputs
exist before calling this method, so the size() == 1 check is redundant. More
importantly, if this method is ever called with 1 input in a future code path, it
silently returns without unification, which could cause downstream issues. The
validation should be consistent with the caller's contract.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [164-166]

-if (inputs.isEmpty() || inputs.size() == 1) {
+if (inputs.isEmpty()) {
+  return inputs;
+}
+if (inputs.size() == 1) {
+  // Single input: no unification needed, but still return as-is
   return inputs;
 }
Suggestion importance[1-10]: 1

__

Why: The improved code is functionally identical to the existing code — it just splits the condition into two separate if blocks with a comment. This is purely cosmetic and adds no functional value.

Low
Suggestions up to commit 3096cd4
CategorySuggestion                                                                                                                                    Impact
Possible issue
Builder stack inconsistency in maxout branch

When maxout is null, context.relBuilder.build() is never called after the union, so
the union RelNode remains on the builder stack. The method returns
context.relBuilder.peek() which is correct for the non-maxout path, but the caller
in visitUnion also calls context.relBuilder.build() to pop the result. If the
builder stack is not properly managed, subsequent operations may consume a stale
node. Ensure the final node is consistently left on the stack (via peek) or
consistently consumed (via build) regardless of the maxout branch.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2629-2642]

 for (RelNode input : unifiedInputs) {
   context.relBuilder.push(input);
 }
 context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL
 
 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
+          unionNode,
           context.relBuilder.literal(node.getMaxout())));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 5

__

Why: The suggestion identifies a real potential issue: in the maxout branch, context.relBuilder.build() is called to pop the union node before passing it to LogicalSystemLimit.create, which is correct. The improved code makes this explicit by storing the union node in a variable first, improving clarity and correctness of the builder stack management.

Low
Unified schema uses pre-coercion types incorrectly

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, subsequent nodes may have had their field
types changed to the common supertype. The schema should use the coerced (target)
type rather than the original type from the first-seen node, otherwise the
projection in buildProjectionForUnion may generate unnecessary or incorrect CASTs.

Consider building the unified schema after coercion by reading the target type from
the coerced inputs, or by passing the targetTypeMap computed in coerceUnionTypes
directly to buildUnifiedSchemaForUnion.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        // After coercion, all nodes sharing this field have the same type; use it directly.
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion claims that buildUnifiedSchemaForUnion uses pre-coercion types, but it is called with coercedInputs (already coerced) in unifySchemasForUnion. Since the inputs passed to this method are already coerced, the first-seen type will be the coerced type. The improved_code is essentially identical to the existing_code with only a comment change, making this suggestion misleading.

Low
General
Mutable field causes inconsistent object state

Using both @RequiredArgsConstructor and @AllArgsConstructor together with a mix of
final and non-final fields can cause confusion: @RequiredArgsConstructor generates a
constructor with only datasets (the final field), while @AllArgsConstructor
generates one with both datasets and maxout. However, since maxout is not final, it
is mutable and can be set independently, which may lead to inconsistent state.
Consider making maxout final and removing @RequiredArgsConstructor, or using a
single explicit constructor.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies that mixing @RequiredArgsConstructor and @AllArgsConstructor with a non-final field can be confusing. Making maxout final and using a single @AllArgsConstructor with a convenience constructor improves immutability and clarity of the design.

Low
Regex injection risk from unescaped command name

When commandName is "union", the keywords regex alternation includes "union" as a
keyword to protect from replacement. However, if commandName is "multisearch", the
original code also protected "multisearch". The refactored code correctly handles
this, but the keywords string is built by concatenating commandName in the middle of
the alternation without escaping. If a future command name contains regex special
characters, this would break the pattern. Consider using Pattern.quote(commandName)
for safety.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [806-809]

 String keywords =
     "source|fields|where|stats|head|tail|sort|eval|rename|"
-        + commandName
+        + java.util.regex.Pattern.quote(commandName)
         + "|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 3

__

Why: While the suggestion is technically valid for future-proofing against regex special characters in command names, the current command names ("union" and "multisearch") contain no regex special characters, making this a low-impact defensive improvement rather than a critical fix.

Low
Suggestions up to commit e03a298
CategorySuggestion                                                                                                                                    Impact
General
Include all command names in anonymizer keyword list

The keywords regex negative lookahead is built by concatenating commandName directly
into the regex pattern. If commandName contains regex special characters, this could
break the pattern. Additionally, when commandName is "union", the keyword list will
include "union" but not "multisearch", and vice versa. This means field names that
happen to match the other command name could be incorrectly anonymized. Both command
names should always be included in the keywords list.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [805-809]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
+          + "multisearch|union"
           + "|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that when commandName is "union", the keyword "multisearch" is excluded from the negative lookahead, and vice versa. This could cause field names matching the other command name to be incorrectly anonymized. Including both command names always is a valid correctness fix.

Low
Simplify constructor annotations for clarity

Both @RequiredArgsConstructor and @AllArgsConstructor are present.
@RequiredArgsConstructor generates a constructor with only datasets (the final
field), while @AllArgsConstructor generates one with both datasets and maxout.
However, maxout is not final, so @RequiredArgsConstructor will only include
datasets. This means there are two constructors: Union(List) and Union(List,
Integer). The attach method uses new Union(newDatasets, maxout) which calls the
all-args constructor. This is fine, but having both annotations is redundant since
@AllArgsConstructor already covers the single-field case when maxout is null.
Consider removing @RequiredArgsConstructor to avoid confusion, or make maxout final
with a default.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 3

__

Why: Having both @RequiredArgsConstructor and @AllArgsConstructor is redundant but not incorrect since they generate different constructors. The suggestion to consolidate is a minor code quality improvement, but the current code works correctly as-is.

Low
Preserve insertion order in unified schema map

The unified schema is built using only the first occurrence of each field name,
ignoring type differences across inputs. After coerceUnionTypes, all inputs should
have the same type for shared fields, but the schema should reflect the coerced
(target) type rather than the first-seen type. If coercion didn't fully align types,
the projection in buildProjectionForUnion will cast, but the schema type used for
NULL literals for missing fields may be wrong. The schema should be derived from the
coerced inputs' field types to ensure correctness.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
-  Map<String, RelDataType> seenFields = new HashMap<>();
+  Map<String, RelDataType> seenFields = new LinkedHashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        seenFields.put(name, field.getType());
+        schema.add(new SchemaField(name, field.getType()));
       }
+      // After coercion, all nodes should agree on the type; no update needed
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 2

__

Why: The improved_code is functionally identical to the existing_code - it still uses HashMap in the signature description but the actual logic change (using LinkedHashMap) is minor and the comment "After coercion, all nodes should agree on the type; no update needed" doesn't add meaningful behavior change. The suggestion's concern about type correctness is addressed by the prior coerceUnionTypes step, making this a very low-impact change.

Low
Possible issue
Avoid nested build inside push argument

After pushing the LogicalSystemLimit node onto the builder,
context.relBuilder.peek() is called to return the top node. However,
LogicalSystemLimit was pushed directly via context.relBuilder.push(...) and then
context.relBuilder.build() was called inside the push argument, which pops the union
node. The LogicalSystemLimit is pushed but peek() should correctly return it.
However, if maxout is null, context.relBuilder.peek() is called after union(...)
which is correct. Verify that after the union(true, size) call, the builder's stack
has exactly one node before peek() is called, since multiple push calls were made
for each input.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2634-2642]

 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
+          unionNode,
           context.relBuilder.literal(node.getMaxout())));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a potential issue with calling context.relBuilder.build() inside the argument to context.relBuilder.push(), which could cause ordering/evaluation issues. Extracting build() to a separate variable before passing it to push() is cleaner and avoids any side-effect ordering concerns.

Low
Suggestions up to commit f64bacb
CategorySuggestion                                                                                                                                    Impact
General
Fix regex keyword list to include all command names

The keywords regex is built by concatenating commandName directly into the
alternation pattern. If commandName contains regex special characters, this could
break the pattern. Additionally, when commandName is "union", the keyword list won't
include "multisearch", and vice versa, potentially causing incorrect anonymization
of the other command name if it appears in a subsearch. Consider using a fixed,
comprehensive keyword list instead.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [805-809]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
-          + "|search|table|identifier|\\*\\*\\*";
+          + "multisearch|union|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that the dynamic keywords string won't include "multisearch" when processing a "union" command and vice versa, which could cause incorrect anonymization. Using a fixed comprehensive keyword list is a valid improvement for correctness and maintainability.

Low
Isolate context per union dataset branch

Each dataset is visited using context, which is a shared mutable object. If
EmptySourcePropagateVisitor or the visitor modifies shared state in context (such as
relBuilder stack or other fields), processing one dataset may corrupt the state for
subsequent datasets. Consider creating a fresh or isolated context for each dataset
to avoid cross-contamination between union branches.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2615-2619]

 for (UnresolvedPlan dataset : node.getDatasets()) {
   UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
-  prunedDataset.accept(this, context);
-  inputNodes.add(context.relBuilder.build());
+  CalcitePlanContext datasetContext = context.createSubContext();
+  prunedDataset.accept(this, datasetContext);
+  inputNodes.add(datasetContext.relBuilder.build());
 }
Suggestion importance[1-10]: 4

__

Why: The concern about shared mutable context is valid in principle, but the improved_code references a context.createSubContext() method that may not exist in the codebase. The existing code uses context.relBuilder.build() to pop the built node, which is the standard pattern used elsewhere in the visitor, so the risk may be lower than suggested.

Low
Prevent accidental null maxout in constructors

Both @RequiredArgsConstructor and @AllArgsConstructor are present.
@RequiredArgsConstructor generates a constructor with only final fields (i.e.,
datasets), while @AllArgsConstructor generates one with both datasets and maxout.
However, maxout is not final, so @RequiredArgsConstructor produces a constructor
that doesn't set maxout, leaving it always null when using that constructor. This is
intentional for the no-maxout case, but the attach method always uses the two-arg
constructor, which is correct. Verify that the single-arg constructor usage is
intentional and won't silently ignore a maxout value set elsewhere.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 3

__

Why: The suggestion to make maxout final and use a single canonical constructor is a reasonable design improvement, but the current dual-annotation approach (@RequiredArgsConstructor + @AllArgsConstructor) is intentional and functional. The attach method correctly uses the two-arg constructor, so there's no actual bug here.

Low
Possible issue
Ensure unified schema uses coerced types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, the field types in subsequent nodes may have
been coerced to a common type. The schema should reflect the coerced (common) type,
not just the first-seen type. This can cause type mismatches when projecting fields
from nodes where the coerced type differs from the first-seen type, leading to
incorrect CASTs or runtime errors.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // After coercion, all nodes sharing a field name should have the same type already.
+      // No update needed since coerceUnionTypes already aligned types.
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion claims a type mismatch issue, but after coerceUnionTypes runs, all nodes sharing a field name already have the same coerced type. The buildUnifiedSchemaForUnion method correctly picks the first-seen type, which after coercion is the common type. The 'improved_code' is functionally identical to the 'existing_code' with only a comment added, making this a low-impact suggestion.

Low
Suggestions up to commit 811fc25
CategorySuggestion                                                                                                                                    Impact
General
Use complete keyword set in anonymizer regex

The keywords regex is built by concatenating commandName directly into the
alternation pattern. If commandName contains regex special characters (unlikely for
current values like "union" or "multisearch", but possible for future commands),
this could break the regex. Additionally, when visitMultisearch calls this method,
the keyword list will include "multisearch" but not "union", and vice versa —
meaning the other command name could be incorrectly anonymized as an identifier if
it appears as a field name. Consider using a fixed set of all known command keywords
rather than dynamically inserting only the current command name.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [805-809]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
-          + "|search|table|identifier|\\*\\*\\*";
+          + "multisearch|union|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5

__

Why: The suggestion is valid — using a fixed complete set of command keywords prevents cross-command anonymization issues and avoids potential regex injection. The improved code correctly includes both multisearch and union in the keyword list regardless of which command is being processed.

Low
Make maxout field immutable for consistency

Using both @RequiredArgsConstructor and @AllArgsConstructor together with a mix of
final and non-final fields creates two constructors: one with only datasets
(required) and one with both datasets and maxout. However, maxout is non-final and
mutable, which can lead to inconsistent state if the field is modified after
construction. Consider making maxout final and using only @AllArgsConstructor, or
using a builder pattern to avoid ambiguity.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
-
Suggestion importance[1-10]: 4

__

Why: Making maxout final and removing @RequiredArgsConstructor in favor of only @AllArgsConstructor is a valid immutability improvement. The current design with both annotations and a non-final field is somewhat inconsistent, though it works functionally since attach() creates a new Union instance.

Low
Ensure unified schema uses coerced common types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, the field types in subsequent nodes may have
been coerced to a common supertype. The schema should reflect the coerced (common)
type, not just the first-seen type. This can cause type mismatches when projecting
fields in buildProjectionForUnion, leading to unnecessary or incorrect CASTs.

Consider building the unified schema by looking up the target type from the coercion
map (or by comparing all nodes' types for each field) rather than using the
first-seen type.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // After coercion, all nodes sharing a field name should have the same type;
+      // the first-seen type is the coerced common type.
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion claims a bug where the first-seen type is used instead of the coerced type, but after coerceUnionTypes runs, all nodes sharing a field name already have the same coerced type. So the first-seen type IS the coerced type. The improved_code is functionally identical to the existing_code, making this suggestion incorrect.

Low
Possible issue
Fix inconsistent relBuilder state after union visit

After calling context.relBuilder.build() inside the if (node.getMaxout() != null)
block, the built node is passed to LogicalSystemLimit.create but then pushed back
via context.relBuilder.push(...). However, context.relBuilder.peek() is called at
the end regardless of whether maxout was applied. If maxout is null,
context.relBuilder.peek() returns the union node, which is correct. But if maxout is
not null, the LogicalSystemLimit node is pushed but never built — peek() returns it
correctly only if push was called. This is fine, but the final return
context.relBuilder.peek() is inconsistent with the pattern used elsewhere (which
typically calls build()). More critically, callers of visitUnion in visitUnion use
context.relBuilder.build() immediately after, so the peek/build mismatch could cause
double-build issues.
Ensure the method consistently returns the top node without
leaving the builder in an ambiguous state, by returning context.relBuilder.build()
and having the caller re-push if needed, or by aligning with the existing visitor
pattern.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2629-2642]

 for (RelNode input : unifiedInputs) {
   context.relBuilder.push(input);
 }
 context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL
 
 if (node.getMaxout() != null) {
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
           context.relBuilder.build(),
           context.relBuilder.literal(node.getMaxout())));
 }
 
-return context.relBuilder.peek();
+return context.relBuilder.build();
Suggestion importance[1-10]: 5

__

Why: The suggestion to use context.relBuilder.build() instead of peek() is a valid concern about consistency with the visitor pattern. Looking at the caller in visitUnion which calls context.relBuilder.build() after prunedDataset.accept(this, context), using peek() vs build() could cause issues if the caller tries to build again. This is a moderate correctness concern.

Low

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 5fd5e1c

@srikanthpadakanti srikanthpadakanti force-pushed the feature/union-public branch 2 times, most recently from dcbc60f to 12b2eb5 Compare March 17, 2026 16:59
@ahkcs
Copy link
Copy Markdown
Collaborator

ahkcs commented Apr 1, 2026

Please also take a look at the merge conflicts

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit 23f420d

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit b6879f9

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

srikanthpadakanti commented Apr 2, 2026

Please also take a look at the merge conflicts

@ahkcs Done. The PR was opened earlier and the branch was behind by 24 commits. It has now been updated. Could you please review it before any new conflicts arise?

@srikanthpadakanti srikanthpadakanti requested a review from ahkcs April 2, 2026 03:16
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit 811fc25

Srikanth Padakanti added 3 commits April 1, 2026 22:37
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit f64bacb

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit e03a298

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 5, 2026

Persistent review updated to latest commit 3096cd4

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

Hello @ahkcs Can you please review this. Thanks.

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

@mengweieric @LantaoJin Please review this, so that it can be merged before I get conflicts. Thanks

@mengweieric mengweieric self-requested a review April 9, 2026 22:44
@mengweieric mengweieric self-requested a review April 9, 2026 22:47
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 5952ff6

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

@mengweieric After merging the upstream changes to my branch, lintchecker is failing and looks like this is missing Related files: scripts/comment-on-duplicates.sh

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 467a7d1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add Union command in PPL

4 participants