Skip to content

bug: df.count() returns 0 when Comet native scan is disabled and when Comet parquet conversion is enabled #4793

Description

@Dummk0pf

Describe the bug

The workaround for the issue #4747 is to disable spark.comet.scan.enabled and enable spark.comet.convert.parquet.enabled so that all the other transformations can go through comet. This workaround was also mentioned here at #4751

But using these settings if we read a parquet file from azure storage bucket and execute df.count() the result is zero, even though the parquet file contains records, which can be confirmed by running df.show().

Attached a screenshot of the query plan that shows that the number of rows scanned is 1178, but in the next stage where i assume the conversion happens the number of input rows is 0.

Image

Steps to reproduce

The steps to reproduce are the exact same as #4747, except conf spark.comet.scan.enabled is set to false and spark.comet.convert.parquet.enabled is set to true, But the steps have been repeated here to maintain consistency

Category Details
Spark 3.5.6
Comet 0.16.0 (https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.16.0/comet-spark-spark3.5_2.12-0.16.0.jar)
Cluster Manager Kubernetes (Azure Kubernetes Service)
Storage Azure cloud storage with necessary ABFS drivers in the spark driver and setted up to be authenticated using Workload Identity attached to the kubernetes service account
  • Deploy Spark cluster (3.5.6) with Comet 0.16.0 enabled on an Azure Kubernetes Service (AKS) cluster.
  • Ensure that all ABFS related connectors and comet jar are present in the classpath.
  • Configure Azure Workload Identity (which injects AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_FEDERATED_TOKEN_FILE env vars).
  • In the configurations for the spark-submit, attach the following
     --conf spark.comet.enabled=true
     --conf spark.comet.exec.enabled=true
     --conf spark.comet.scan.enabled=false
     --conf spark.comet.convert.parquet.enabled=true
 
     --conf spark.comet.expression.regexp.allowIncompatible=true
     --conf spark.comet.expression.RegExpReplace.allowIncompatible=true
     --conf spark.comet.expression.RegExpExtract.allowIncompatible=true
     --conf spark.comet.caseConversion.enabled=true
 
     --conf spark.comet.exec.shuffle.enabled=true
     --conf spark.comet.exec.shuffle.mode=auto
     --conf spark.comet.batchSize=8192
     --conf spark.comet.memoryOverhead=$COMET_MEMORY_OVERHEAD
 
     --conf spark.memory.offHeap.enabled=true
     --conf spark.memory.offHeap.size=$OFFHEAP_SIZE
     --conf spark.comet.explainFallback.enabled=true
     --conf spark.comet.logFallbackReasons.enabled=true
 
     ## Azure related configurations
     --conf spark.hadoop.fs.azure.account.oauth2.msi.tenant="$TENANT_ID"
     --conf spark.hadoop.fs.azure.account.auth.type."$STORAGE_ACCOUNT".dfs.core.windows.net=OAuth
     --conf spark.hadoop.fs.azure.account.oauth2.client.id"$STORAGE_ACCOUNT".dfs.core.windows.net="$CLIENT_ID"
     --conf spark.hadoop.fs.azure.impl=org.apache.hadoop.fs.azure.NativeAzureFileSystem
     --conf spark.hadoop.fs.azure.account.oauth2.token.file."$STORAGE_ACCOUNT".dfs.core.windows.net=/var/run/secrets/azure/tokens/azure-identity-token
     --conf spark.hadoop.fs.azure.account.oauth.provider.type."$STORAGE_ACCOUNT".dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.WorkloadIdentityTokenProvider

    ## There are other configurations related to kubernetes and spark which are skipped for the sake of brevity
  • Attempt to read a Parquet file stored in an Azure storage bucket as a spark dataframe using an abfss:// URI.
  • After reading the parquet file, execute df.count() to check if the result is 0.
  • Execute df.show() to confirm that the parquet file contains valid data.

Expected behavior

After reading the parquet file with the above df.count() should have returned the correct number of rows instead of 0.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingpriority:criticalData corruption, silent wrong results, security issuesrequires-triage

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions