`.toPandas()`: `AttributeError: 'SQLContext' object has no attribute '_conf'` #168

ggorlen · 2023-05-06T01:43:18Z

Appending counts.toDF().toPandas() to the example from the README fails with the error in the title. Complete session:

PS > pip freeze | grep spar
pysparkling==0.6.2
PS > py
Python 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pysparkling import Context
>>> counts = (
...     Context()
...     .textFile('_vimrc')
...     .map(lambda line: ''.join(ch if ch.isalnum() else ' ' for ch in line))
...     .flatMap(lambda line: line.split(' '))
...     .map(lambda word: (word, 1))
...     .reduceByKey(lambda a, b: a + b)
... )
>>> counts.toDF().toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python310\lib\site-packages\pysparkling\sql\dataframe.py", line 1636, in toPandas
    sql_ctx_conf = self.sql_ctx._conf
AttributeError: 'SQLContext' object has no attribute '_conf'

I also tried on Debian GNU/Linux 10 (buster), same error, same pysparkling version, same code, but with Python 3.11 rather than 3.10.

For context, I'd like to use pd.testing.assert_frame_equal() to write a unit test. I have no experience with this library or pyspark, so I might be missing something obvious.


ggorlen · 2023-05-06T02:09:34Z

Here's a workaround that naively fills in the missing attributes:

import pandas as pd
from pysparkling import Context, RDD
import unittest


def make_counts(path: str) -> RDD:
    """Code from the pysparkling example"""
    return (
        Context()
        .textFile(path)
        .map(lambda line: ''.join(ch if ch.isalnum() else ' ' for ch in line))
        .flatMap(lambda line: line.split(' '))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )


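def to_df(rdd: RDD) -> pd.DataFrame:
    """Workaround https://github.com/svenkreiss/pysparkling/issues/168"""
    df = rdd.toDF()

    # toPandas() reads self.sql_ctx._conf, which SQLContext does not define,
    # so swap in a stub that exposes the attribute chain it expects.
    class _StubConf:
        @staticmethod
        def pandasRespectSessionTimeZone():
            return None  # same (falsy) result as the original lambda

    class _StubSqlContext:
        _conf = _StubConf

    df.sql_ctx = _StubSqlContext
    return df.toPandas()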


class TestWordCount(unittest.TestCase):
    def test_word_count(self):
        "test that we can count the words in a file"
        expected = pd.DataFrame({
          "_1": ["foo", "bar", "baz"],
          "_2": [2, 1, 1],
        })

        with open("test.txt", "w") as f:
            f.write("foo bar foo baz")

        actual = to_df(make_counts("test.txt"))
        pd.testing.assert_frame_equal(expected, actual)


if __name__ == "__main__":
    unittest.main(verbosity=2)
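
For anyone wondering why the stub in to_df is enough, here is a minimal sketch of the same trick in isolation, using only the standard library. FakeSQLContext, FakeFrame, to_pandas_like and with_stub_conf are hypothetical stand-ins I made up for illustration, not pysparkling names. The only part taken from the traceback is the `self.sql_ctx._conf` access, and how the flag is used afterwards is my assumption.

```python
from types import SimpleNamespace


class FakeSQLContext:
    """Hypothetical stand-in for a context object that has no `_conf` attribute."""


class FakeFrame:
    """Hypothetical stand-in for a DataFrame whose method reads `sql_ctx._conf`."""

    def __init__(self):
        self.sql_ctx = FakeSQLContext()

    def to_pandas_like(self):
        # Same attribute access as the failing line in the traceback.
        conf = self.sql_ctx._conf
        # Assumption: the flag only decides whether a session time zone is applied.
        return "session tz" if conf.pandasRespectSessionTimeZone() else "no tz"


def with_stub_conf(frame):
    """Replace sql_ctx with a namespace exposing the attribute chain the method needs."""
    conf = SimpleNamespace(pandasRespectSessionTimeZone=lambda: None)
    frame.sql_ctx = SimpleNamespace(_conf=conf)
    return frame


frame = FakeFrame()

# Without the stub, the attribute lookup fails just like in the issue.
try:
    frame.to_pandas_like()
    failed = False
except AttributeError:
    failed = True

# With the stub, the lookup succeeds and the falsy flag selects the "no tz" branch.
result = with_stub_conf(frame).to_pandas_like()
```

SimpleNamespace keeps the stub to two lines; the nested classes in to_df achieve the same thing. Either way this is a patch from the outside and could stop working if toPandas starts reading other attributes of the context.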
