
[Bug] Kyuubi returns query status (FINISHED_STATE) quickly, but the job is still executing #6302

Open
DA1OOO opened this issue Apr 12, 2024 · 13 comments
Labels: kind:bug, priority:major

Comments

DA1OOO commented Apr 12, 2024

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

When I submit a query to Spark through Kyuubi, Kyuubi returns the query status (FINISHED_STATE) quickly, but the Spark job is still executing.

Then I found a similar bug description in issue #5232 from @XorSum. That issue says the fix was merged into 1.8.0, but I tried 1.8.0, 1.8.1, and 1.9.0 and the problem still occurs. Is there any configuration that needs to be modified?

My SQL also has the form (empty table) left join (big table).

My SQL is as follows:

select *
from (
    select uid
    from db.table1
    where d_s = 20240408
      and abtest_id like 'sadjjfljlajf'
  ) t1
  left join (
    select vendor_id
    FROM db.table2
    LATERAL VIEW explode (json_object_keys (countdata)) t1 AS kv
    WHERE d_s = 20240408
  ) t2 on t2.vendor_id = t1.uid
limit 5;

In this SQL, the left sub-query result is empty (0 rows).
select count(1) from db.table1 where d_s = 20240408 and abtest_id like 'sadjjfljlajf'

The right sub-query result is relatively big (4,432,411,935 rows).
select count(1) FROM db.table2 LATERAL VIEW explode (json_object_keys (countdata)) t1 AS kv WHERE d_s = 20240408

Affects Version(s)

Kyuubi 1.8.0 / 1.8.1 / 1.9.0 | Spark 3.4.2

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
DA1OOO added the kind:bug and priority:major labels on Apr 12, 2024

Hello @DA1OOO,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.

pan3793 (Member) commented Apr 18, 2024

cc @cxzl25 and @ulysses-you

XorSum (Contributor) commented Apr 18, 2024

Hi @DA1OOO, can you find "Asked to cancel job group f35e7afd" in your Spark log?

  /**
   * Cancel all jobs in the given job group ID.
   * @param cancelFutureJobs if true, future submitted jobs in this job group will be cancelled
   */
  def cancelJobGroup(groupId: String, cancelFutureJobs: Boolean = false): Unit = {
    logInfo(s"Asked to cancel job group $groupId with cancelFutureJobs=$cancelFutureJobs")
    eventProcessLoop.post(JobGroupCancelled(groupId, cancelFutureJobs))
  }

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1109
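
For context, a minimal spark-shell sketch (not Kyuubi's actual code; the statement ID below is made up) of the mechanism in question: jobs started after setJobGroup carry the group ID, and calling cancelJobGroup on that ID is what emits the log line above.

// Tag subsequent jobs from this thread with the statement's job group.
val statementId = "00000000-0000-0000-0000-000000000000"  // hypothetical statement ID
spark.sparkContext.setJobGroup(statementId, "sketch statement", interruptOnCancel = true)
spark.range(0L, 10000000L).selectExpr("sum(id)").collect()  // runs under the job group
// Asking to cancel the group logs "Asked to cancel job group <id>" in the driver log,
// even if no job in the group is still running.
spark.sparkContext.cancelJobGroup(statementId)
spark.sparkContext.clearJobGroup()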

DA1OOO (Author) commented Apr 18, 2024

@XorSum
In the abnormal case above (the query was reported finished while the job kept running), the log line you mentioned does not appear.

$ grep "2ec5749d-9b85-454e-825f-b7fb13b868e9" _driver.log
2024-04-18 20:30:46 INFO OperationLog:64 - Creating operation log file /home/work/apache-kyuubi-1.9.0-bin/work/engine_operation_logs/8008f27d-567a-4548-991a-5d0ce22fb809/2ec5749d-9b85-454e-825f-b7fb13b868e9
2024-04-18 20:30:46 INFO ExecuteStatement:64 - Processing work's query[2ec5749d-9b85-454e-825f-b7fb13b868e9]: PENDING_STATE -> RUNNING_STATE, statement:
2024-04-18 20:30:48 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Job 3 started with 1 stages, 1 active jobs running
2024-04-18 20:30:48 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Stage 3.0 started with 420 tasks, 1 active stages running
2024-04-18 20:30:49 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Job 4 started with 1 stages, 2 active jobs running
2024-04-18 20:30:49 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Stage 4.0 started with 9 tasks, 2 active stages running
2024-04-18 20:31:13 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Job 4 succeeded, 1 active jobs running
2024-04-18 20:31:13 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Job 5 started with 1 stages, 2 active jobs running
2024-04-18 20:31:13 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Stage 5.0 started with 1 tasks, 2 active stages running
2024-04-18 20:31:14 INFO SQLOperationListener:64 - Query [2ec5749d-9b85-454e-825f-b7fb13b868e9]: Job 5 succeeded, 1 active jobs running
2024-04-18 20:31:14 INFO ExecuteStatement:64 - Processing work's query[2ec5749d-9b85-454e-825f-b7fb13b868e9]: RUNNING_STATE -> FINISHED_STATE, time taken: 27.892 seconds
2024-04-18 20:31:14 INFO ExecuteStatement:64 - statementId=2ec5749d-9b85-454e-825f-b7fb13b868e9, operationRunTime=18 s, operationCpuTime=16 s


But when I manually cancel, the log you mentioned does appear.

$ grep "c5c17ebd-9aaa-48d4-ae17-42203fb6fa60" _driver.log
2024-04-18 20:33:57 INFO OperationLog:64 - Creating operation log file /home/work/apache-kyuubi-1.9.0-bin/work/engine_operation_logs/8008f27d-567a-4548-991a-5d0ce22fb809/c5c17ebd-9aaa-48d4-ae17-42203fb6fa60
2024-04-18 20:33:57 INFO ExecuteStatement:64 - Processing work's query[c5c17ebd-9aaa-48d4-ae17-42203fb6fa60]: PENDING_STATE -> RUNNING_STATE, statement:
2024-04-18 20:33:57 INFO SQLOperationListener:64 - Query [c5c17ebd-9aaa-48d4-ae17-42203fb6fa60]: Job 6 started with 1 stages, 1 active jobs running
2024-04-18 20:33:57 INFO SQLOperationListener:64 - Query [c5c17ebd-9aaa-48d4-ae17-42203fb6fa60]: Stage 6.0 started with 420 tasks, 1 active stages running
2024-04-18 20:33:57 INFO SQLOperationListener:64 - Query [c5c17ebd-9aaa-48d4-ae17-42203fb6fa60]: Job 7 started with 1 stages, 2 active jobs running
2024-04-18 20:33:57 INFO SQLOperationListener:64 - Query [c5c17ebd-9aaa-48d4-ae17-42203fb6fa60]: Stage 7.0 started with 9 tasks, 2 active stages running
2024-04-18 20:34:12 INFO ExecuteStatement:64 - Processing work's query[c5c17ebd-9aaa-48d4-ae17-42203fb6fa60]: RUNNING_STATE -> CANCELED_STATE, time taken: 15.025 seconds
2024-04-18 20:34:12 INFO ExecuteStatement:64 - statementId=c5c17ebd-9aaa-48d4-ae17-42203fb6fa60, operationRunTime=0 ms, operationCpuTime=0 ms
2024-04-18 20:34:12 INFO DAGScheduler:60 - Asked to cancel job group c5c17ebd-9aaa-48d4-ae17-42203fb6fa60
2024-04-18 20:34:12 INFO DAGScheduler:60 - ShuffleMapStage 6 (sql at ExecuteStatement.scala:99) failed in 14.551 s due to Job 6 cancelled part of cancelled job group c5c17ebd-9aaa-48d4-ae17-42203fb6fa60
2024-04-18 20:34:12 INFO SparkSQLEngineListener:64 - Job end. Job 6 state is JobFailed(org.apache.spark.SparkException: Job 6 cancelled part of cancelled job group c5c17ebd-9aaa-48d4-ae17-42203fb6fa60)
2024-04-18 20:34:12 INFO DAGScheduler:60 - ShuffleMapStage 7 (sql at ExecuteStatement.scala:99) failed in 14.461 s due to Job 7 cancelled part of cancelled job group c5c17ebd-9aaa-48d4-ae17-42203fb6fa60
2024-04-18 20:34:12 INFO SparkSQLEngineListener:64 - Job end. Job 7 state is JobFailed(org.apache.spark.SparkException: Job 7 cancelled part of cancelled job group c5c17ebd-9aaa-48d4-ae17-42203fb6fa60)
2024-04-18 20:34:12 INFO DAGScheduler:60 - Asked to cancel job group c5c17ebd-9aaa-48d4-ae17-42203fb6fa60

XorSum (Contributor) commented Apr 18, 2024

Could you provide more information to help me reproduce this bug?
For example, the Spark version, the execution plan, and the SQL used to create db.table1 and db.table2.

DA1OOO (Author) commented Apr 19, 2024

@XorSum
Spark 3.4.2 | Kyuubi 1.9.0
SQL:

 CREATE EXTERNAL TABLE `ios_test_behavior`(                                                          
   `other_column` string,                                                              
   `statusdata` string,                                                                
   `vendor_id` string,                                                                  
   `dt` string,                                                                        
   `countdata` string,                                                                 
   `time` string,                                                                      
   `device` string,                                                                       
   `app_version` string,                                                                 
   `system_version` string,                                                             
   `is_kb` string,                                                              
   `userid` string,                                                                
   `web_id` string,                                                                                    
   `is_test` string)                                                           
 PARTITIONED BY (                                                                                      
   `d_s` string)                                                                                       
 ROW FORMAT SERDE                                                                                      
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'                                       
 STORED AS INPUTFORMAT                                                                                 
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'                                     
 OUTPUTFORMAT                                                                                          
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';




CREATE EXTERNAL TABLE `ods_test_abtest_uid_daily`(                                                          
   `other_column` string,                                                                                      
   `product` string,                                                                                           
   `abtest_id` string,                                                                                         
   `uid` string,                                                                                               
   `item_type` string)                                                                                         
 PARTITIONED BY (                                                                                              
   `d_s` string)                                                                                               
 ROW FORMAT SERDE                                                                                              
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'                                                        
 STORED AS INPUTFORMAT                                                                                         
   'org.apache.hadoop.mapred.TextInputFormat'                                                                  
 OUTPUTFORMAT                                                                                                  
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';


select *
from (
    select vendor_id
    FROM test.ios_test_behavior
      -- without explode there are only ~10 million rows and the issue cannot be reproduced
    LATERAL VIEW explode (json_object_keys (countdata)) t1 AS kv 
    WHERE  d_s = 20240408
  ) t1
  join (
    select uid
    from test.ods_test_abtest_uid_daily
    where d_s = 20240408
      and abtest_id like 'sadjjfljlajf'
  ) t2 on t1.vendor_id = t2.uid
limit 5;


In table ios_test_behavior, the countdata column holds JSON strings, and the abtest_id like 'sadjjfljlajf' predicate is used to make the right sub-query result empty.
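
Since countdata is a JSON object, explode(json_object_keys(countdata)) emits one row per top-level key, which is what inflates the row count (the comment in the SQL above notes only ~10 million rows without it). A small spark-shell illustration with made-up values:

// Each input row produces one output row per JSON key in countdata.
val df = Seq(("v1", """{"a":1,"b":2,"c":3}""")).toDF("vendor_id", "countdata")
df.selectExpr("vendor_id", "explode(json_object_keys(countdata)) AS kv").show()
// vendor_id=v1 appears three times, once for each of the keys a, b, c.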
Physical Plan:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 5
   +- SortMergeJoin [vendor_id#3676], [uid#3690], Inner
      :- Sort [vendor_id#3676 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(vendor_id#3676, 200), ENSURE_REQUIREMENTS, [plan_id=3927]
      :     +- Project [vendor_id#3676]
      :        +- Generate explode(json_object_keys(countdata#3678)), [vendor_id#3676], false, [kv#3693]
      :           +- Project [vendor_id#3676, countdata#3678]
      :              +- Filter isnotnull(vendor_id#3676)
      :                 +- FileScan parquet spark_catalog.test.ios_test_behavior[vendor_id#3676,countdata#3678,d_s#3686] Batched: true, DataFilters: [isnotnull(vendor_id#3676)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[hdfs://user-cn01/user/hive/warehouse/sim..., PartitionFilters: [isnotnull(d_s#3686), (cast(d_s#3686 as int) = 20240408)], PushedFilters: [IsNotNull(vendor_id)], ReadSchema: struct<vendor_id:string,countdata:string>
      +- Sort [uid#3690 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(uid#3690, 200), ENSURE_REQUIREMENTS, [plan_id=3928]
            +- Project [uid#3690]
               +- Filter ((isnotnull(abtest_id#3689) AND (abtest_id#3689 = sadjjfljlajf)) AND isnotnull(uid#3690))
                  +- Scan hive spark_catalog.test.ods_test_abtest_uid_daily [abtest_id#3689, uid#3690], HiveTableRelation [`spark_catalog`.`test`.`ods_test_abtest_uid_daily`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [other_column#3687, product#3688, abtest_id#3689, uid#3690, item_type#3691], Partition Cols: [d_s#3692], Pruned Partitions: [(d_s=20240408)]], [isnotnull(d_s#3692), (cast(d_s#3692 as int) = 20240408)]

DA1OOO (Author) commented May 10, 2024

@XorSum Is the above information enough to reproduce this case?

XorSum (Contributor) commented May 10, 2024

@DA1OOO I tried running the code below on Kyuubi 1.9 and Spark 3.4.2, but I couldn't reproduce your bug. In other words, I do see the "Asked to cancel job group" log.
Does the number of data rows affect reproducibility? If so, could you explain how to create a test dataset?

CREATE TABLE test.ios_test_behavior
(
    `vendor_id` string,
    `countdata` string
) PARTITIONED BY (`d_s` string);
CREATE TABLE test.ods_test_abtest_uid_daily
(
    `abtest_id` string,
    `uid` string
) PARTITIONED BY (`d_s` string);

insert into test.ods_test_abtest_uid_daily partition (d_s ='20240408')
values ('abcde', 1);
insert into test.ios_test_behavior partition (d_s ='20240408')
values (1, '{}');

select *
from (select vendor_id
      FROM test.ios_test_behavior LATERAL VIEW explode (json_object_keys (countdata)) t1 AS kv
      WHERE d_s = 20240408) t1
         join (select uid
               from test.ods_test_abtest_uid_daily
               where d_s = 20240408
                 and abtest_id like 'sadjjfljlajf') t2 on t1.vendor_id = t2.uid
limit 5;

DA1OOO (Author) commented May 10, 2024

I think it may be related to the number of data rows, because when I don't use explode in the sub-query, it cannot be reproduced. I don't have a test dataset; we hit this case in our production environment.

XorSum (Contributor) commented May 10, 2024

We don't need the real datasets from your production environment.
You just need to create a fake dataset with random data that is large enough to reproduce the bug.
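
For example, a spark-shell sketch (table and column names taken from the DDL above; the row count and JSON shape are illustrative) that fills the partition with enough synthetic data for the explode to fan out to hundreds of millions of rows:

import org.apache.spark.sql.functions._

// 100 million synthetic rows; every countdata value has five JSON keys, so the
// LATERAL VIEW explode(json_object_keys(countdata)) multiplies the row count by 5.
spark.range(0L, 100000000L)
  .select(
    (col("id") % 1000000L).cast("string").as("vendor_id"),
    lit("""{"k1":1,"k2":2,"k3":3,"k4":4,"k5":5}""").as("countdata"))
  .createOrReplaceTempView("fake_behavior")

// Static-partition insert into the table created above.
spark.sql(
  """INSERT OVERWRITE TABLE test.ios_test_behavior PARTITION (d_s = '20240408')
    |SELECT vendor_id, countdata FROM fake_behavior""".stripMargin)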

DA1OOO (Author) commented May 10, 2024

OK, I will try to reproduce this case using a test dataset later.

DA1OOO (Author) commented May 13, 2024

@XorSum
I have used hive-testbench to reproduce this case.

After building the project with ./tpcds-build.sh, I generated about 5 GB of data on my HDFS cluster with ./tpcds-setup.sh 5 /extwarehouse/tpcds, and used the store_sales table to reproduce it.

The following SQL can be used to reproduce this bug.

select *
from (
    select sum(ss_net_profit) as sum_net_pro,
           sum(ss_net_paid_inc_tax),
           sum(ss_net_paid),
           sum(ss_coupon_amt),
           sum(ss_ext_tax),
           sum(ss_ext_list_price),
           sum(ss_ext_wholesale_cost),
           avg(ss_ext_sales_price),
           sum(ss_ext_discount_amt),
           avg(ss_sales_price),
           sum(ss_list_price),
           avg(ss_wholesale_cost)
    from test.store_sales
    group by ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk
  ) t1
  join (
    select ss_sold_date_sk, ss_sold_time_sk, ss_item_sk
    from test.store_sales
    where ss_net_paid = 101930291
      and ss_ext_discount_amt = 1
  ) t2 on t1.sum_net_pro = t2.ss_item_sk

This is my Kyuubi version information and the query's start and finish times. [screenshot]

This is the actual Spark job information. [screenshot]

You can see that before Job 310 finished, the query's state had already been changed to FINISHED.

Spark plan: [screenshot]

Relevant log:

grep "932ff7cb-3b52-4abe-bd1e-5851bb77cfcc" _driver.log*
_driver.log:2024-05-13 14:50:17 INFO OperationLog:64 - Creating operation log file /home/work/linzhaosheng/apache-kyuubi-1.9.0-bin/work/engine_operation_logs/0e5d19a7-1f46-4601-9c77-108ca4016f1d/932ff7cb-3b52-4abe-bd1e-5851bb77cfcc
_driver.log:2024-05-13 14:50:17 INFO ExecuteStatement:64 - Processing work's query[932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: PENDING_STATE -> RUNNING_STATE, statement:
_driver.log:2024-05-13 14:50:17 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Job 310 started with 1 stages, 1 active jobs running
_driver.log:2024-05-13 14:50:17 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Stage 325.0 started with 10 tasks, 1 active stages running
_driver.log:2024-05-13 14:50:17 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Job 311 started with 1 stages, 2 active jobs running
_driver.log:2024-05-13 14:50:17 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Stage 326.0 started with 10 tasks, 2 active stages running
_driver.log:2024-05-13 14:50:24 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Job 311 succeeded, 1 active jobs running
_driver.log:2024-05-13 14:50:24 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Job 312 started with 1 stages, 2 active jobs running
_driver.log:2024-05-13 14:50:24 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Stage 327.0 started with 1 tasks, 2 active stages running
_driver.log:2024-05-13 14:50:24 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Job 312 succeeded, 1 active jobs running
_driver.log:2024-05-13 14:50:25 INFO ExecuteStatement:64 - Processing work's query[932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: RUNNING_STATE -> FINISHED_STATE, time taken: 7.885 seconds
_driver.log:2024-05-13 14:50:25 INFO ExecuteStatement:64 - statementId=932ff7cb-3b52-4abe-bd1e-5851bb77cfcc, operationRunTime=25 s, operationCpuTime=23 s

grep "Job 310" _driver.log*
_driver.log:2024-05-13 14:50:17 INFO SQLOperationListener:64 - Query [932ff7cb-3b52-4abe-bd1e-5851bb77cfcc]: Job 310 started with 1 stages, 1 active jobs running
_driver.log:2024-05-13 14:50:35 INFO SparkSQLEngineListener:64 - Job end. Job 310 state is JobSucceeded
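
As a cross-check, a sketch (assuming access to the engine's SparkSession; the group ID is the statement ID from the log above, which the DAGScheduler logs show is used as the Spark job group) for listing jobs in that group that are still running after Kyuubi has reported FINISHED_STATE:

import org.apache.spark.JobExecutionStatus

val groupId = "932ff7cb-3b52-4abe-bd1e-5851bb77cfcc"
val tracker = spark.sparkContext.statusTracker
// Jobs registered under the statement's job group that are still in RUNNING state.
val stillRunning = tracker.getJobIdsForGroup(groupId).filter { jobId =>
  tracker.getJobInfo(jobId).exists(_.status() == JobExecutionStatus.RUNNING)
}
println(s"Jobs still running in group $groupId: ${stillRunning.mkString(", ")}")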

XorSum (Contributor) commented May 15, 2024

Thank you for providing the reproduction code.
Unfortunately, I am still unable to reproduce this bug using your TPC-DS query.
