
Does Koalas support reading hive table by default? #2194

Open
amznero opened this issue Aug 26, 2021 · 0 comments

amznero commented Aug 26, 2021

Hi,

I'm trying to use Koalas to load a Hive table on a remote cluster. According to https://koalas.readthedocs.io/en/latest/reference/io.html#spark-metastore-table, I should be able to use the ks.read_table API to read a Spark metastore table, but it fails when I call ks.read_table on the table.

import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession

# Fails with the AnalysisException shown below.
koalas_df = ks.read_table("xxx.yyy")

Error log:

AnalysisException: "Table or view not found: `xxx`.`yyy`;;\n'UnresolvedRelation `xxx`.`yyy`\n"

However, I can load the table successfully by using PySpark + pandas + PyArrow directly.

Some snippets:

from pyspark.sql import SparkSession

# Building the session with Hive support makes the metastore tables visible.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark_df = spark.read.table("xxx")
pandas_df = spark_df.toPandas()
...
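As a side note, with this Hive-enabled session it should also be possible to end up with a Koalas DataFrame by converting the PySpark DataFrame. A minimal sketch (assuming the to_koalas() method that Koalas patches onto Spark DataFrames once it is imported):

import databricks.koalas as ks

# spark_df comes from the Hive-enabled session above.
koalas_df = spark_df.to_koalas()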

And I checked the source code of read_table:

def read_table(name: str, index_col: Optional[Union[str, List[str]]] = None) -> DataFrame:

It uses default_session (with no extra configuration) to load the table, and default_session does not set the enableHiveSupport option (see the workaround sketch after the snippet below):

def default_session(conf=None):
    if conf is None:
        conf = dict()
    should_use_legacy_ipc = False
    if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15") and LooseVersion(
        pyspark.__version__
    ) < LooseVersion("3.0"):
        conf["spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.mesos.driverEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.kubernetes.driverEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        should_use_legacy_ipc = True
    builder = spark.SparkSession.builder.appName("Koalas")
    for key, value in conf.items():
        builder = builder.config(key, value)
    # Currently, Koalas is dependent on such join due to 'compute.ops_on_diff_frames'
    # configuration. This is needed with Spark 3.0+.
    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
    if LooseVersion(pyspark.__version__) >= LooseVersion("3.0.1") and is_testing():
        builder.config("spark.executor.allowSparkContext", False)
    session = builder.getOrCreate()
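If I read this correctly, builder.getOrCreate() will reuse an already-active SparkSession, so one possible workaround is to create a Hive-enabled session before the first Koalas call. A minimal sketch of that assumption (not verified against every Koalas/Spark version):

from pyspark.sql import SparkSession
import databricks.koalas as ks

# Create the session first, with Hive support, so that Koalas'
# default_session() picks it up via getOrCreate() instead of building
# a new one without the Hive catalog.
spark = (
    SparkSession.builder
    .appName("Koalas")
    .enableHiveSupport()
    .getOrCreate()
)

# With the Hive-enabled session active, read_table should resolve the
# metastore table.
koalas_df = ks.read_table("xxx.yyy")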

So I'm a little confused about ks.read_table: where does it load tables from by default?
Maybe it only looks at the local Spark warehouse instead of the Hive metastore?
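For reference, a small diagnostic sketch (assuming the default session has already been created) to check which catalog and warehouse the session actually uses:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "hive" means the Hive metastore is used; "in-memory" means only the
# local Spark warehouse (spark.sql.warehouse.dir) is visible.
print(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))
print(spark.conf.get("spark.sql.warehouse.dir"))

# Databases visible to the current catalog.
print(spark.catalog.listDatabases())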

@itholic added and then removed the question (Further information is requested) label on Aug 26, 2021