.toPandas(): AttributeError: 'SQLContext' object has no attribute '_conf' #168

Open
ggorlen opened this issue May 6, 2023 · 1 comment

ggorlen commented May 6, 2023

Appending counts.toDF().toPandas() to the example from the README fails with the error in the title. Complete code:

PS > pip freeze | grep spar
pysparkling==0.6.2
PS > py
Python 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pysparkling import Context
>>> counts = (
...     Context()
...     .textFile('_vimrc')
...     .map(lambda line: ''.join(ch if ch.isalnum() else ' ' for ch in line))
...     .flatMap(lambda line: line.split(' '))
...     .map(lambda word: (word, 1))
...     .reduceByKey(lambda a, b: a + b)
... )
>>> counts.toDF().toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python310\lib\site-packages\pysparkling\sql\dataframe.py", line 1636, in toPandas
    sql_ctx_conf = self.sql_ctx._conf
AttributeError: 'SQLContext' object has no attribute '_conf'

I also ran the same code with the same pysparkling version on Debian GNU/Linux 10 (buster), using Python 3.11 rather than 3.10, and got the same error.

For context, I'd like to use pd.testing.assert_frame_equal() to write a unit test. I have no experience with this library or pyspark, so I might be missing something obvious.
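Since the error is the same on both platforms, it looks like a plain attribute lookup failure rather than anything OS-specific. Here is a minimal sketch of that failure mode using only the standard library. `FakeSQLContext`, `FakeDataFrame`, and `try_to_pandas` are hypothetical stand-ins invented for illustration, not pysparkling classes; only the `self.sql_ctx._conf` lookup is taken from the traceback.

```python
class FakeSQLContext:
    """Hypothetical stand-in for a context object that lacks `_conf`."""


class FakeDataFrame:
    """Hypothetical stand-in that mimics the failing attribute lookup."""

    def __init__(self, sql_ctx):
        self.sql_ctx = sql_ctx

    def toPandas(self):
        # Same lookup as the line shown in the traceback.
        sql_ctx_conf = self.sql_ctx._conf
        return sql_ctx_conf


def try_to_pandas(df):
    """Return the AttributeError message if the lookup fails, else None."""
    try:
        df.toPandas()
    except AttributeError as exc:
        return str(exc)
    return None


message = try_to_pandas(FakeDataFrame(FakeSQLContext()))
print(message)
```

If this picture is right, any object that exposes a `_conf` attribute would get past this particular line, though the real `toPandas()` may need more than that.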

ggorlen commented May 6, 2023

Here's a workaround that naively fills in the missing attributes:

import pandas as pd
from pysparkling import Context, RDD
import unittest


def make_counts(path: str) -> RDD:
    """Code from the pysparkling example"""
    return (
        Context()
        .textFile(path)
        .map(lambda line: ''.join(ch if ch.isalnum() else ' ' for ch in line))
        .flatMap(lambda line: line.split(' '))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )


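def to_df(rdd: RDD) -> pd.DataFrame:
    """Workaround https://github.com/svenkreiss/pysparkling/issues/168"""
    df = rdd.toDF()

    # toPandas() reads self.sql_ctx._conf, which SQLContext does not define.
    # Swap in stub classes that supply the missing attributes.
    class _Conf:
        pandasRespectSessionTimeZone = staticmethod(lambda: None)

    class _SqlCtx:
        _conf = _Conf

    df.sql_ctx = _SqlCtx
    return df.toPandas()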


class TestWordCount(unittest.TestCase):
    def test_word_count(self):
        "test that we can count the words in a file"
        expected = pd.DataFrame({
            "_1": ["foo", "bar", "baz"],
            "_2": [2, 1, 1],
        })

        with open("test.txt", "w") as f:
            f.write("foo bar foo baz")

        actual = to_df(make_counts("test.txt"))
        pd.testing.assert_frame_equal(expected, actual)


if __name__ == "__main__":
    unittest.main(verbosity=2)
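
Another option that might avoid toPandas() entirely is to build the pandas frame from the collected (word, count) tuples. This is only a sketch: `pairs_to_frame` is a hypothetical helper, and the `collected` list below is a hand-written stand-in for what `counts.collect()` would return, so nothing here exercises pysparkling itself. The sort makes the comparison independent of row order, which I'm not sure reduceByKey guarantees.

```python
import pandas as pd


def pairs_to_frame(pairs, columns=("_1", "_2")):
    """Build a pandas DataFrame from (key, value) tuples, sorted by key."""
    frame = pd.DataFrame(list(pairs), columns=list(columns))
    # Sort and reset the index so row order does not affect the comparison.
    return frame.sort_values(columns[0]).reset_index(drop=True)


# Hand-written stand-in for counts.collect() (assumption, not real output).
collected = [("foo", 2), ("bar", 1), ("baz", 1)]

expected = pd.DataFrame({"_1": ["bar", "baz", "foo"], "_2": [1, 1, 2]})
pd.testing.assert_frame_equal(expected, pairs_to_frame(collected))
```

In the test above, `to_df(make_counts("test.txt"))` could then become `pairs_to_frame(make_counts("test.txt").collect())`, with `expected` listed in sorted order.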
