Describe the bug
The workaround for the issue #4747 is to disable spark.comet.scan.enabled and enable spark.comet.convert.parquet.enabled so that all the other transformations can go through comet. This workaround was also mentioned here at #4751
But using these settings if we read a parquet file from azure storage bucket and execute df.count() the result is zero, even though the parquet file contains records, which can be confirmed by running df.show().
Attached a screenshot of the query plan that shows that the number of rows scanned is 1178, but in the next stage where i assume the conversion happens the number of input rows is 0.
Steps to reproduce
The steps to reproduce are the exact same as #4747, except conf spark.comet.scan.enabled is set to false and spark.comet.convert.parquet.enabled is set to true, But the steps have been repeated here to maintain consistency
- Deploy Spark cluster (3.5.6) with Comet 0.16.0 enabled on an Azure Kubernetes Service (AKS) cluster.
- Ensure that all ABFS related connectors and comet jar are present in the classpath.
- Configure Azure Workload Identity (which injects AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_FEDERATED_TOKEN_FILE env vars).
- In the configurations for the
spark-submit, attach the following
--conf spark.comet.enabled=true
--conf spark.comet.exec.enabled=true
--conf spark.comet.scan.enabled=false
--conf spark.comet.convert.parquet.enabled=true
--conf spark.comet.expression.regexp.allowIncompatible=true
--conf spark.comet.expression.RegExpReplace.allowIncompatible=true
--conf spark.comet.expression.RegExpExtract.allowIncompatible=true
--conf spark.comet.caseConversion.enabled=true
--conf spark.comet.exec.shuffle.enabled=true
--conf spark.comet.exec.shuffle.mode=auto
--conf spark.comet.batchSize=8192
--conf spark.comet.memoryOverhead=$COMET_MEMORY_OVERHEAD
--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=$OFFHEAP_SIZE
--conf spark.comet.explainFallback.enabled=true
--conf spark.comet.logFallbackReasons.enabled=true
## Azure related configurations
--conf spark.hadoop.fs.azure.account.oauth2.msi.tenant="$TENANT_ID"
--conf spark.hadoop.fs.azure.account.auth.type."$STORAGE_ACCOUNT".dfs.core.windows.net=OAuth
--conf spark.hadoop.fs.azure.account.oauth2.client.id"$STORAGE_ACCOUNT".dfs.core.windows.net="$CLIENT_ID"
--conf spark.hadoop.fs.azure.impl=org.apache.hadoop.fs.azure.NativeAzureFileSystem
--conf spark.hadoop.fs.azure.account.oauth2.token.file."$STORAGE_ACCOUNT".dfs.core.windows.net=/var/run/secrets/azure/tokens/azure-identity-token
--conf spark.hadoop.fs.azure.account.oauth.provider.type."$STORAGE_ACCOUNT".dfs.core.windows.net=org.apache.hadoop.fs.azurebfs.oauth2.WorkloadIdentityTokenProvider
## There are other configurations related to kubernetes and spark which are skipped for the sake of brevity
- Attempt to read a Parquet file stored in an Azure storage bucket as a spark dataframe using an
abfss:// URI.
- After reading the parquet file, execute
df.count() to check if the result is 0.
- Execute
df.show() to confirm that the parquet file contains valid data.
Expected behavior
After reading the parquet file with the above df.count() should have returned the correct number of rows instead of 0.
Additional context
No response
Describe the bug
The workaround for the issue #4747 is to disable
spark.comet.scan.enabledand enablespark.comet.convert.parquet.enabledso that all the other transformations can go through comet. This workaround was also mentioned here at #4751But using these settings if we read a parquet file from azure storage bucket and execute
df.count()the result is zero, even though the parquet file contains records, which can be confirmed by runningdf.show().Attached a screenshot of the query plan that shows that the number of rows scanned is 1178, but in the next stage where i assume the conversion happens the number of input rows is 0.
Steps to reproduce
The steps to reproduce are the exact same as #4747, except
conf spark.comet.scan.enabledis set tofalseandspark.comet.convert.parquet.enabledis set totrue, But the steps have been repeated here to maintain consistencyspark-submit, attach the followingabfss://URI.df.count()to check if the result is 0.df.show()to confirm that the parquet file contains valid data.Expected behavior
After reading the parquet file with the above
df.count()should have returned the correct number of rows instead of 0.Additional context
No response