
Does Koalas support reading hive table by default? #2194

Open
amznero opened this issue Aug 26, 2021 · 0 comments

amznero commented Aug 26, 2021

Hi,

I'm trying to use Koalas to load a Hive table on a remote cluster. According to https://koalas.readthedocs.io/en/latest/reference/io.html#spark-metastore-table, I should be able to use the ks.read_table API to read a Spark metastore table, but it fails when I call ks.read_table on the table.

import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession

# Fails with the AnalysisException shown below.
koalas_df = ks.read_table("xxx.yyy")

Error log:

AnalysisException: "Table or view not found: `xxx`.`yyy`;;\n'UnresolvedRelation `xxx`.`yyy`\n"

However, I can load the table successfully by using PySpark + pandas + PyArrow directly.

Some snippets:

from pyspark.sql import SparkSession

# Building the session with Hive support makes the metastore tables visible.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark_df = spark.read.table("xxx")
pandas_df = spark_df.toPandas()
...
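As a side note, with this Hive-enabled session it should also be possible to end up with a Koalas DataFrame by converting the PySpark DataFrame. A minimal sketch (assuming the to_koalas() method that Koalas patches onto Spark DataFrames once it is imported):

import databricks.koalas as ks

# spark_df comes from the Hive-enabled session above.
koalas_df = spark_df.to_koalas()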

And I checked the source code of read_table:

def read_table(name: str, index_col: Optional[Union[str, List[str]]] = None) -> DataFrame:

It uses default_session (with no extra configuration) to load the table, and default_session does not set the enableHiveSupport option (see the workaround sketch after the snippet below):

def default_session(conf=None):
    if conf is None:
        conf = dict()
    should_use_legacy_ipc = False
    if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15") and LooseVersion(
        pyspark.__version__
    ) < LooseVersion("3.0"):
        conf["spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.mesos.driverEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.kubernetes.driverEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        should_use_legacy_ipc = True
    builder = spark.SparkSession.builder.appName("Koalas")
    for key, value in conf.items():
        builder = builder.config(key, value)
    # Currently, Koalas is dependent on such join due to 'compute.ops_on_diff_frames'
    # configuration. This is needed with Spark 3.0+.
    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
    if LooseVersion(pyspark.__version__) >= LooseVersion("3.0.1") and is_testing():
        builder.config("spark.executor.allowSparkContext", False)
    session = builder.getOrCreate()
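If I read this correctly, builder.getOrCreate() will reuse an already-active SparkSession, so one possible workaround is to create a Hive-enabled session before the first Koalas call. A minimal sketch of that assumption (not verified against every Koalas/Spark version):

from pyspark.sql import SparkSession
import databricks.koalas as ks

# Create the session first, with Hive support, so that Koalas'
# default_session() picks it up via getOrCreate() instead of building
# a new one without the Hive catalog.
spark = (
    SparkSession.builder
    .appName("Koalas")
    .enableHiveSupport()
    .getOrCreate()
)

# With the Hive-enabled session active, read_table should resolve the
# metastore table.
koalas_df = ks.read_table("xxx.yyy")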

So I'm a little confused about ks.read_table: where does it load tables from by default?
Maybe it only looks at the local Spark warehouse instead of the Hive metastore?
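For reference, a small diagnostic sketch (assuming the default session has already been created) to check which catalog and warehouse the session actually uses:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "hive" means the Hive metastore is used; "in-memory" means only the
# local Spark warehouse (spark.sql.warehouse.dir) is visible.
print(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))
print(spark.conf.get("spark.sql.warehouse.dir"))

# Databases visible to the current catalog.
print(spark.catalog.listDatabases())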

@itholic added and then removed the question (Further information is requested) label on Aug 26, 2021