Implement Koalas Missing APIs #1929

xinrong-meng · 2020-11-24T21:49:44Z

AishwaryaKalloli · 2020-11-26T08:49:44Z

Hi, I would like to help. I was planning on picking up the combine_first functionality.
Can you let me know if I can work on it? Thanks!

HyukjinKwon · 2020-11-27T01:56:19Z

Please go ahead @AishwaryaKalloli!

xinrong-meng · 2020-11-28T22:03:33Z

Certainly, thank you @AishwaryaKalloli!

AishwaryaKalloli · 2020-12-01T18:41:41Z

Just finished setting up my local environment; hopefully I'll have some updates soon!

AishwaryaKalloli · 2020-12-03T14:08:22Z

I have committed the code; let me know if it is going in the right direction. If it is, I'll add the test cases and docs.

ref #1929

```
>>> df = ks.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> df
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():
...     print(row)
...
Koalas(Index='dog', num_legs=4, num_wings=0)
Koalas(Index='hawk', num_legs=2, num_wings=2)
```
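The row objects printed above behave like named tuples. A minimal standard-library sketch of that shape, assuming only that each row is a namedtuple called `Koalas` whose first field is `Index` (the helper `iter_rows` and the dict-of-lists layout are hypothetical, not the Koalas implementation):

```python
from collections import namedtuple


def iter_rows(index, columns, name="Koalas"):
    """Yield one namedtuple per row from a dict of equal-length column lists."""
    # Hypothetical helper: field names are "Index" followed by the column names.
    row_type = namedtuple(name, ["Index", *columns])
    for pos, label in enumerate(index):
        yield row_type(label, *(values[pos] for values in columns.values()))


rows = list(iter_rows(["dog", "hawk"], {"num_legs": [4, 2], "num_wings": [0, 2]}))
for row in rows:
    print(row)
```

Fields can then be read by name (`row.num_legs`) or by position (`row[1]`).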

This PR proposes `GroupBy.median()`.

Note: the result can be slightly different from pandas since we use an approximated median based upon approximate percentile computation because computing median across a large dataset is extremely expensive.

```python
>>> kdf = ks.DataFrame({'a': [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.],
...                     'b': [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.],
...                     'c': [3., 5., 2., 5., 1., 2., 6., 4., 3., 6.]},
...                    columns=['a', 'b', 'c'],
...                    index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
>>> kdf
      a     b    c
7   1.0   2.0  3.0
2   1.0   3.0  5.0
4   1.0   1.0  2.0
1   1.0   4.0  5.0
3   2.0   6.0  1.0
4   2.0   9.0  2.0
9   2.0   8.0  6.0
10  3.0  10.0  4.0
5   3.0   7.0  3.0
6   3.0   5.0  6.0
>>> kdf.groupby('a').median().sort_index()  # doctest: +NORMALIZE_WHITESPACE
       b    c
a
1.0  2.0  3.0
2.0  8.0  2.0
3.0  7.0  4.0
>>> kdf.groupby('a')['b'].median().sort_index()
a
1.0    2.0
2.0    8.0
3.0    7.0
Name: b, dtype: float64
```

ref #1929
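The group `a == 1.0` in the example has four values in column `b`, so an exact median would average the two middle values, yet the output shows 2.0. As a rough illustration of why a percentile-style median can differ from pandas, the sketch below uses `statistics.median_low` as a stand-in that always returns an element of the data. That stand-in is an assumption made for illustration; it is not how Koalas or Spark computes the value.

```python
from statistics import median, median_low

# Column 'b' values grouped by column 'a', copied from the example above.
groups = {
    1.0: [2.0, 3.0, 1.0, 4.0],
    2.0: [6.0, 9.0, 8.0],
    3.0: [10.0, 7.0, 5.0],
}

# Exact median: averages the two middle values when the group size is even.
exact = {key: median(vals) for key, vals in groups.items()}

# Stand-in for a percentile-based median: always an element of the data.
element_based = {key: median_low(vals) for key, vals in groups.items()}

print(exact)
print(element_based)
```

Only the even-sized group differs between the two results; the odd-sized groups agree.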

shril · 2020-12-11T22:30:25Z

Hi @ueshin, @HyukjinKwon, can I proceed with the following DataFrame APIs?

cov
first
between_time
insert

I'll start development once you give me approval.

ueshin · 2020-12-11T22:33:08Z

@shril sure, please go ahead! Thanks!

shril · 2020-12-12T00:41:19Z

@ueshin I was going through this blog post of yours - https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html, and it suggested using map_in_pandas as a better alternative workaround because it does not require moving data into a single client node and potentially causing out-of-memory errors.

Would you suggest proceeding with the map_in_pandas() API?

Edit:
I found out that map_in_pandas() is being deprecated.
Would you suggest continuing with the apply_batch() API instead?

ueshin · 2020-12-12T00:48:55Z

@shril I don't have a strong opinion on it. If you can implement it without apply_batch(), you don't need to use it.

shril · 2020-12-12T05:04:53Z

Hi @ueshin, I am slightly confused. We don't have a DatetimeIndex like pandas does, so can we convert our index to pandas and use its indexer_between_time function?

i = pd.date_range('2018-04-09', periods=2000, freq='1D1min')
ts = ks.DataFrame({'A': ['timestamp']}, index=i)
indexer = ts.index.to_pandas().indexer_between_time(start_time='0:15', end_time='0:45')
result = ts.copy().take(indexer)

This is the small implementation I tried. Do you think that to_pandas() might result in out-of-memory errors?

ref #1929

```
>>> kser = ks.Series(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = kser.factorize()
>>> codes
0    1
1   -1
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
>>> codes, uniques = kser.factorize(na_sentinel=None)
>>> codes
0    1
1    3
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c', None], dtype='object')
>>> codes, uniques = kser.factorize(na_sentinel=-2)
>>> codes
0    1
1   -2
2    0
3    2
4    1
dtype: int64
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
```
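To make the role of `na_sentinel` in the example concrete, here is a pure-Python sketch that reproduces the codes and uniques shown. The function below is hypothetical and only mirrors the printed output (sorted uniques, missing values handled via the sentinel); it is not the Koalas implementation.

```python
def factorize(values, na_sentinel=-1):
    """Return (codes, uniques), with uniques sorted as in the output above."""
    uniques = sorted({v for v in values if v is not None})
    positions = {value: code for code, value in enumerate(uniques)}
    has_missing = any(v is None for v in values)
    if na_sentinel is None and has_missing:
        # Missing values get their own code, appended after the real uniques.
        missing_code = len(uniques)
        uniques = uniques + [None]
    else:
        # Missing values are marked with the sentinel and left out of uniques.
        missing_code = na_sentinel
    codes = [missing_code if v is None else positions[v] for v in values]
    return codes, uniques


data = ['b', None, 'a', 'c', 'b']
print(factorize(data))
print(factorize(data, na_sentinel=None))
print(factorize(data, na_sentinel=-2))
```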

ref #1929

Insert column into DataFrame at a specified location.

```
>>> kdf = ks.DataFrame([1, 2, 3])
>>> kdf.insert(0, 'x', 4)
>>> kdf.sort_index()
   x  0
0  4  1
1  4  2
2  4  3
>>> from databricks.koalas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> kdf.insert(1, 'y', [5, 6, 7])
>>> kdf.sort_index()
   x  y  0
0  4  5  1
1  4  6  2
2  4  7  3
>>> kdf.insert(2, 'z', ks.Series([8, 9, 10]))
>>> kdf.sort_index()
   x  y   z  0
0  4  5   8  1
1  4  6   9  2
2  4  7  10  3
>>> reset_option("compute.ops_on_diff_frames")
```
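The positional behaviour in the example can be sketched with a plain dict of columns. The helper `insert_column` is hypothetical: unlike `DataFrame.insert`, it returns a new dict instead of modifying in place, and the duplicate-name check is an assumption modelled on pandas.

```python
def insert_column(columns, loc, name, values):
    """Return a new dict of columns with `name` placed at position `loc`."""
    if name in columns:
        raise ValueError(f"cannot insert {name}, already exists")
    if not 0 <= loc <= len(columns):
        raise IndexError("loc must satisfy 0 <= loc <= number of columns")
    n_rows = len(next(iter(columns.values()), []))
    if not isinstance(values, (list, tuple)):
        # Broadcast a scalar to every row, as with insert(0, 'x', 4) above.
        values = [values] * n_rows
    items = list(columns.items())
    items.insert(loc, (name, list(values)))
    return dict(items)


frame = {0: [1, 2, 3]}
frame = insert_column(frame, 0, 'x', 4)
frame = insert_column(frame, 1, 'y', [5, 6, 7])
print(list(frame))
print(frame)
```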

chogg · 2021-01-21T20:10:43Z

Is iterating through groups on the roadmap for API coverage? I would find that helpful.

xinrong-meng · 2021-01-22T17:14:49Z

@chogg Thanks for the suggestion! We'll look into this and keep you updated.

ref #1929

Implement `DataFrame.between_time`

```py
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> kts = ks.from_pandas(ts)
>>> kts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4

>>> kts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3

You get the times that are *not* between two times by setting
``start_time`` later than ``end_time``:

>>> kts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
```
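The wrap-around rule in the example (a start time later than the end time) can be sketched with the standard library alone. The helper `in_window` is hypothetical and assumes both endpoints are inclusive, which matches the output shown; it is not the Koalas implementation.

```python
from datetime import datetime, time


def in_window(moment, start, end):
    """Inclusive time-of-day test that wraps around midnight if start > end."""
    if start <= end:
        return start <= moment <= end
    # The window wraps past midnight, e.g. 00:45 through 00:15 the next day.
    return moment >= start or moment <= end


# Same timestamps as the example index above.
stamps = [
    datetime(2018, 4, 9, 0, 0),
    datetime(2018, 4, 10, 0, 20),
    datetime(2018, 4, 11, 0, 40),
    datetime(2018, 4, 12, 1, 0),
]

inside = [s for s in stamps if in_window(s.time(), time(0, 15), time(0, 45))]
outside = [s for s in stamps if in_window(s.time(), time(0, 45), time(0, 15))]
print(inside)
print(outside)
```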

awdavidson · 2021-03-25T21:00:59Z

Hi all, I was going to look at implementing the `last` functionality. Is anyone already looking at this?

xinrong-meng · 2021-03-26T17:37:07Z

@awdavidson Certainly, please feel free to do so! Your PR seemed to be closed.

awdavidson · 2021-03-26T19:39:32Z

@xinrong-databricks I'll reopen it; I'm still working on it. I opened the PR to check the build etc., as I had issues running a few things locally and didn't want to clutter your PR tab. My local environment is now working, so I should be able to test it completely :)

xinrong-meng · 2021-03-26T21:03:42Z

@awdavidson Thanks! You might mark it as a draft PR until it's ready for review. Let us know if you have any questions :)

Please see change to implement `DataFrame.last` and `Series.last` functionality similar to that available in pandas. Requirement raised in issue: #1929

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
>>> ks_series
2018-04-09    1
2018-04-11    2
2018-04-13    3
2018-04-15    4
dtype: int64
>>> ks_series.last('3D')
2018-04-13    3
2018-04-15    4
dtype: int64
```

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> pdf = pd.DataFrame({'A': [1, 2, 3, 4]}, index=index)
>>> kdf = ks.from_pandas(pdf)
>>> kdf
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
>>> kdf.last('3D')
            A
2018-04-13  3
2018-04-15  4
```

Please see change to implement `DataFrame.first` and `Series.first` functionality similar to that available in pandas. Requirement raised in issue: #1929

```python
>>> index = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ks_series = ks.Series([1, 2, 3, 4], index=index)
>>> ks_series
2018-04-09    1
2018-04-11    2
2018-04-13    3
2018-04-15    4
dtype: int64
>>> ks_series.first('3D')
2018-04-09    1
2018-04-11    2
dtype: int64
```
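The `first('3D')` and `last('3D')` results can be read as a cutoff relative to the earliest or latest date. A standard-library sketch of that reading follows; the helpers `first` and `last` are hypothetical, and the strict comparisons are an assumption chosen to match the outputs shown, not a statement of how Koalas implements them.

```python
from datetime import date, timedelta


def first(rows, offset):
    """Keep rows dated strictly before (earliest date + offset)."""
    cutoff = min(d for d, _ in rows) + offset
    return [(d, v) for d, v in rows if d < cutoff]


def last(rows, offset):
    """Keep rows dated strictly after (latest date - offset)."""
    cutoff = max(d for d, _ in rows) - offset
    return [(d, v) for d, v in rows if d > cutoff]


# Same data as the examples: 2018-04-09, -11, -13, -15 with values 1..4.
rows = [(date(2018, 4, 9) + timedelta(days=2 * i), i + 1) for i in range(4)]

print(first(rows, timedelta(days=3)))
print(last(rows, timedelta(days=3)))
```

Note that a three-day offset selects the rows within three calendar days of the boundary, not three rows.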

awdavidson · 2021-04-01T12:39:50Z

@ueshin @xinrong-databricks Has there been any discussion around how to implement Index.map? I have a few ideas that may be useful, but I do not want to step on anyone's toes! If you have any information or documentation, that would be useful! :)

Note: example implementation can be found here master...awdavidson:feature/impl-index_map

ueshin · 2021-04-02T18:24:58Z

@awdavidson As we have not been working on it, you can go ahead.

One note on the example implementation: using self._index.values is not a good idea because it collects all the data onto the driver node, which could cause OOM errors.

Thanks!


xinrong-meng added enhancement New feature or request help wanted Extra attention is needed labels Nov 24, 2020

xinrong-meng mentioned this issue Nov 24, 2020

Implement missing DataFrame/Series/Index APIs #362

Closed

ueshin pinned this issue Nov 24, 2020

AishwaryaKalloli mentioned this issue Dec 3, 2020

added combine first function #1950

Closed

This was referenced Dec 4, 2020

Implement DataFrame.swapaxes #1946

Merged

Implemented to_list to Index/MultiIndex #1948

Merged

Implement Series.swapaxes #1954

Merged

itholic mentioned this issue Dec 9, 2020

Implemented GroupBy.median() #1957

Merged

This was referenced Dec 9, 2020

Implemented GroupBy.tail #1949

Merged

Implement DataFrame.itertuples #1960

Merged

shril mentioned this issue Dec 12, 2020

between_time api for koalas dataframe #1968

Closed

This was referenced Dec 21, 2020

Implement Series.factorize() #1972

Merged

Implement DataFrame.insert #1983

Merged

LSturtew mentioned this issue Mar 19, 2021

Implemented dateframe.between_time #2111

Merged

awdavidson mentioned this issue Mar 26, 2021

Implement DataFrame.last and Series.last functionality #2121

Merged

awdavidson mentioned this issue Mar 31, 2021

Implement DataFrame.first and Series.first functionality #2128

Merged

awdavidson mentioned this issue Apr 5, 2021

Implement Index.map functionality #2136

Closed

LSturtew mentioned this issue Apr 8, 2021

implemented dataframe.cov #2142

Open
