Implement Koalas Missing APIs #1929

Open
12 of 18 tasks
xinrong-meng opened this issue Nov 24, 2020 · 18 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@xinrong-meng
Contributor

xinrong-meng commented Nov 24, 2020

Help wanted! A few popular pandas APIs are missing in Koalas. We are going to implement them!

Please use this thread to comment on which function you will be working on so we don't duplicate work. Please mention this issue in your PR so that the list below can be updated.

xinrong-meng added the enhancement and help wanted labels Nov 24, 2020
ueshin pinned this issue Nov 24, 2020
@AishwaryaKalloli

Hi, I would like to help. I was planning on picking up the combine_first functionality.
Can you let me know if I can work on it? Thanks!

@HyukjinKwon
Member

Please go ahead @AishwaryaKalloli!

@xinrong-meng
Contributor Author

Certainly, thank you @AishwaryaKalloli!

@AishwaryaKalloli

Just finished the setup on my local machine; hopefully I will have some updates soon!

@AishwaryaKalloli

I have committed the code; let me know if it is heading in the right direction. If it is, I'll add the test cases and docs.

ueshin pushed a commit that referenced this issue Dec 10, 2020
ref #1929

```
>>> df = ks.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> df
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():
...     print(row)
...
Koalas(Index='dog', num_legs=4, num_wings=0)
Koalas(Index='hawk', num_legs=2, num_wings=2)
```
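The rows printed above have the shape of standard named tuples, with the index label in the first field. A minimal plain-Python sketch of that shape follows. It is not the Koalas implementation, and the helper `iter_rows` and its `columns` layout are hypothetical names used only for illustration.

```python
from collections import namedtuple


def iter_rows(index, columns, name="Koalas"):
    """Yield one named tuple per row, with the index label first.

    `columns` maps column name -> list of values (hypothetical layout).
    """
    # rename=True replaces field names that are not valid identifiers.
    Row = namedtuple(name, ["Index", *columns], rename=True)
    for pos, label in enumerate(index):
        yield Row(label, *(values[pos] for values in columns.values()))


rows = list(iter_rows(["dog", "hawk"], {"num_legs": [4, 2], "num_wings": [0, 2]}))
for row in rows:
    print(row)
```

Fields can then be read by name (`row.num_legs`) or by position (`row[1]`).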
ueshin pushed a commit that referenced this issue Dec 11, 2020
This PR proposes `GroupBy.median()`.

Note: the result can differ slightly from pandas. Computing an exact median across a large dataset is extremely expensive, so this uses an approximate median based on approximate percentile computation.

```python
>>> kdf = ks.DataFrame({'a': [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.],
...                     'b': [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.],
...                     'c': [3., 5., 2., 5., 1., 2., 6., 4., 3., 6.]},
...                    columns=['a', 'b', 'c'],
...                    index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
>>> kdf
      a     b    c
7   1.0   2.0  3.0
2   1.0   3.0  5.0
4   1.0   1.0  2.0
1   1.0   4.0  5.0
3   2.0   6.0  1.0
4   2.0   9.0  2.0
9   2.0   8.0  6.0
10  3.0  10.0  4.0
5   3.0   7.0  3.0
6   3.0   5.0  6.0

>>> kdf.groupby('a').median().sort_index()  # doctest: +NORMALIZE_WHITESPACE
       b    c
a
1.0  2.0  3.0
2.0  8.0  2.0
3.0  7.0  4.0

>>> kdf.groupby('a')['b'].median().sort_index()
a
1.0    2.0
2.0    8.0
3.0    7.0
Name: b, dtype: float64
```

ref #1929
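One place the difference from pandas shows up is an even-sized group. In the output above, group `a = 1.0` has `b` values 1, 2, 3, 4 and reports 2.0, where an exact median would be 2.5. The sketch below reproduces that gap in plain Python. `median_low` is only a stand-in (an assumption) for a percentile-style computation that returns an element of the data; it is not the code path Koalas uses.

```python
from statistics import median, median_low

# Column 'b' from the example above, grouped by column 'a'.
groups = {
    1.0: [2.0, 3.0, 1.0, 4.0],
    2.0: [6.0, 9.0, 8.0],
    3.0: [10.0, 7.0, 5.0],
}

# Exact median: averages the two middle values when the group size is even.
exact = {key: median(values) for key, values in groups.items()}

# Stand-in for a percentile-style result (assumption): always a value that
# occurs in the data, here the lower of the two middle values.
element_based = {key: median_low(values) for key, values in groups.items()}

print(exact)
print(element_based)
```

Odd-sized groups agree under both definitions, so only group 1.0 differs here.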
@shril
Contributor

shril commented Dec 11, 2020

Hi @ueshin, @HyukjinKwon, can I proceed with the following DataFrame APIs?

  • cov
  • first
  • between_time
  • insert

I'll start development once you give me approval.

@ueshin
Collaborator

ueshin commented Dec 11, 2020

@shril sure, please go ahead! Thanks!

@shril
Contributor

shril commented Dec 12, 2020

@ueshin I was going through this blog post of yours - https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html, and it suggested using map_in_pandas as a better alternative workaround because it does not require moving data into a single client node and potentially causing out-of-memory errors.

Do you suggest proceeding with the map_in_pandas() API?

Edit:
I found out that map_in_pandas() is being deprecated.
Do you suggest continuing with the apply_batch() API instead?

@ueshin
Collaborator

ueshin commented Dec 12, 2020

@shril I don't have a strong opinion on it. If you can implement it without apply_batch(), you don't need to use it.

@shril
Contributor

shril commented Dec 12, 2020

Hi @ueshin, I am slightly confused. We don't have a DatetimeIndex like pandas does, so can we convert our index to pandas and use its indexer_between_time function?

import pandas as pd
import databricks.koalas as ks
i = pd.date_range('2018-04-09', periods=2000, freq='1D1min')
ts = ks.DataFrame({'A': ['timestamp'] * len(i)}, index=i)  # one value per timestamp
indexer = ts.index.to_pandas().indexer_between_time(start_time='0:15', end_time='0:45')
result = ts.copy().take(indexer)

This is the small implementation I tried. Do you think that to_pandas() might result in out-of-memory errors?

ueshin pushed a commit that referenced this issue Jan 13, 2021
ref #1929
```
>>> kser = ks.Series(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = kser.factorize()
>>> codes
0    1
1   -1
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c'], dtype='object')

>>> codes, uniques = kser.factorize(na_sentinel=None)
>>> codes
0    1
1    3
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c', None], dtype='object')

>>> codes, uniques = kser.factorize(na_sentinel=-2)
>>> codes
0    1
1   -2
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
```
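The three calls above differ only in how the missing value is encoded. A plain-Python sketch of those semantics follows, under two assumptions that are not confirmed by the commit: missing values are represented as `None`, and uniques are ordered by sorting (chosen because it matches the output shown). This is an illustration, not the Koalas implementation.

```python
def factorize(values, na_sentinel=-1):
    """Return (codes, uniques) for a list of hashable, orderable values.

    Missing values are represented by None. With na_sentinel=None the
    missing value gets its own code and is appended to uniques.
    """
    # Sorted order is an assumption made to match the example output.
    uniques = sorted({v for v in values if v is not None})
    code_of = {v: i for i, v in enumerate(uniques)}
    has_missing = any(v is None for v in values)

    if na_sentinel is None:
        # The missing value takes the next free code.
        missing_code = len(uniques)
        if has_missing:
            uniques = uniques + [None]
    else:
        missing_code = na_sentinel

    codes = [missing_code if v is None else code_of[v] for v in values]
    return codes, uniques


data = ['b', None, 'a', 'c', 'b']
print(factorize(data))
print(factorize(data, na_sentinel=None))
print(factorize(data, na_sentinel=-2))
```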
ueshin pushed a commit that referenced this issue Jan 20, 2021
ref #1929

Insert column into DataFrame at a specified location.

```
>>> kdf = ks.DataFrame([1, 2, 3])
>>> kdf.insert(0, 'x', 4)
>>> kdf.sort_index()
   x  0
0  4  1
1  4  2
2  4  3

>>> from databricks.koalas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)

>>> kdf.insert(1, 'y', [5, 6, 7])
>>> kdf.sort_index()
   x  y  0
0  4  5  1
1  4  6  2
2  4  7  3

>>> kdf.insert(2, 'z', ks.Series([8, 9, 10]))
>>> kdf.sort_index()
   x  y   z  0
0  4  5   8  1
1  4  6   9  2
2  4  7  10  3

>>> reset_option("compute.ops_on_diff_frames")
```
@chogg

chogg commented Jan 21, 2021

Is iterating through groups on the roadmap for API coverage? I would find that helpful.

@xinrong-meng
Contributor Author

@chogg Thanks for the suggestion! We'll look into this and keep you updated.

ueshin pushed a commit that referenced this issue Mar 20, 2021
ref #1929

Implement `DataFrame.between_time`

```py
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> kts = ks.from_pandas(ts)
>>> kts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4

>>> kts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3

# You get the times that are *not* between two times by setting
# ``start_time`` later than ``end_time``:

>>> kts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
```
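The wrap-around behaviour in the second call can be expressed as a small time-of-day predicate. The sketch below uses only the standard library and assumes both endpoints are inclusive. The helper name `in_window` is hypothetical, and this is not how the Koalas method is implemented.

```python
from datetime import datetime, time


def in_window(t, start, end):
    """True if time-of-day t lies in [start, end], wrapping past midnight."""
    if start <= end:
        return start <= t <= end
    # start later than end: the window wraps around midnight, so it covers
    # everything from start to midnight plus midnight to end.
    return t >= start or t <= end


stamps = [
    datetime(2018, 4, 9, 0, 0),
    datetime(2018, 4, 10, 0, 20),
    datetime(2018, 4, 11, 0, 40),
    datetime(2018, 4, 12, 1, 0),
]

inside = [s for s in stamps if in_window(s.time(), time(0, 15), time(0, 45))]
outside = [s for s in stamps if in_window(s.time(), time(0, 45), time(0, 15))]

print(inside)
print(outside)
```

With the four timestamps from the example, `inside` keeps the 00:20 and 00:40 rows and `outside` keeps the 00:00 and 01:00 rows, matching the two outputs above.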
@awdavidson
Contributor

Hi all, I was going to look at implementing the functionality for `last`. Is anyone looking at this?

@xinrong-meng
Contributor Author

@awdavidson Certainly, please feel free to do so! Your PR seems to have been closed.

@awdavidson
Contributor

@xinrong-databricks I'll reopen it; I'm still working on it. I opened the PR only to check the build etc., as I had issues running a few things locally, and I didn't want to clutter your PR tab. My local environment is now working, so I should be able to test it completely :)

@xinrong-meng
Contributor Author

@awdavidson Thanks! You might mark it as a draft PR until it's ready for review. Let us know if you have any questions :)

ueshin pushed a commit that referenced this issue Mar 30, 2021
Please see change to implement `DataFrame.last` and `Series.last` functionality similar to that available in pandas. Requirement raised in issue: #1929

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
>>> ks_series
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
dtype: int64

>>> ks_series.last('3D')
2018-04-13  3
2018-04-15  4
dtype: int64
```

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> pdf = pd.DataFrame({'A': [1, 2, 3, 4]}, index=index)
>>> kdf = ks.from_pandas(pdf)
>>> kdf
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

>>> kdf.last('3D')
            A
2018-04-13  3
2018-04-15  4
```
ueshin pushed a commit that referenced this issue Mar 31, 2021
Please see change to implement DataFrame.first and Series.first functionality similar to that available in pandas. Requirement raised in issue: #1929

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
>>> ks_series
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
dtype: int64

>>> ks_series.first('3D')
2018-04-09  1
2018-04-11  2
dtype: int64
```
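Both `first('3D')` and `last('3D')` reduce to a cutoff relative to one end of a sorted date index. A plain-Python sketch of that rule follows, assuming rows are already sorted by date and the offset is a fixed-length `timedelta`. The strict inequalities are an assumption chosen to match the outputs shown, and the `(date, value)` row layout is hypothetical.

```python
from datetime import date, timedelta


def first(rows, offset):
    """Rows dated strictly before (earliest date + offset)."""
    if not rows:
        return []
    cutoff = rows[0][0] + offset
    return [row for row in rows if row[0] < cutoff]


def last(rows, offset):
    """Rows dated strictly after (latest date - offset)."""
    if not rows:
        return []
    cutoff = rows[-1][0] - offset
    return [row for row in rows if row[0] > cutoff]


# (date, value) pairs matching the examples above.
series = [
    (date(2018, 4, 9), 1),
    (date(2018, 4, 11), 2),
    (date(2018, 4, 13), 3),
    (date(2018, 4, 15), 4),
]

print(first(series, timedelta(days=3)))
print(last(series, timedelta(days=3)))
```

For a 3-day offset both cutoffs land on 2018-04-12, so `first` keeps the rows for 04-09 and 04-11 and `last` keeps the rows for 04-13 and 04-15.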
@awdavidson
Contributor

awdavidson commented Apr 1, 2021

@ueshin @xinrong-databricks has there been any discussion around how to implement Index.map? I have a few ideas that may be useful; however, I do not want to step on anyone's toes! Any information or documentation you have would be useful! :)

Note: example implementation can be found here master...awdavidson:feature/impl-index_map

@ueshin
Collaborator

ueshin commented Apr 2, 2021

@awdavidson As we have not been working on it, you can go ahead.

One thing on the example implementation: using self._index.values is not a good idea because it collects all the data onto the driver node, which could cause OOM.

Thanks!

rising-star92 added a commit to rising-star92/databricks-koalas that referenced this issue Jan 27, 2023
7 participants