
manual benchmark to crunch datatypes #11

Open · wants to merge 1 commit into master

Conversation

@encrypted-soul (Member) commented Jun 24, 2021

Brief Description

Contains a .ipynb with a function to crunch datatypes. It was able to crunch a dataframe from 565.28 MB down to 349.77 MB, around 61.875% of the original size. The compression is lossless.

Edit @erwanp: file visible here https://github.com/radis/radis-benchmark/blob/e512ecfcdd6d399f10d2fca19d9b0ab6ae475904/manual_benchmarks/test_datatypes_crunch.ipynb

Proposed solution for implementation in Radis

Pass the pandas data frame through this function before returning it.
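
For illustration, a minimal sketch of that integration point. The function name crunch_dtypes and the call site are hypothetical (the notebook's actual function is reduce_mem_usage), and this sketch uses pandas' built-in downcasting rather than the notebook's manual range checks:

import numpy as np
import pandas as pd

def crunch_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast each numeric column to a smaller dtype where pandas deems it safe (sketch)."""
    for col in df.select_dtypes(include=[np.number]).columns:
        # Columns holding only whole numbers (and no NaN) can become integers;
        # everything else is downcast within the float family.
        is_integral = (df[col] % 1 == 0).all()
        df[col] = pd.to_numeric(df[col], downcast="integer" if is_integral else "float")
    return df

# Hypothetical call site: crunch right before the dataframe is returned
# df = crunch_dtypes(df)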

@encrypted-soul changed the title from "manual benchmark to crunch dataype" to "manual benchmark to crunch dataypes" on Jun 24, 2021
@erwanp (Member) commented Jun 24, 2021

Very promising!!

I'm surprised by this one:

******************************
Column:  int
dtype before:  float64
dtype after:  uint8
******************************

int is the (unscaled) linestrength.
I think it was caught by your [min/max] condition:

if mn >= 0:
    if mx < 255:
        props[col] = props[col].astype(np.uint8)

Linestrength values are very low but definitely not integers.

I'm quite sure such a change breaks the spectrum.

We can also hardcode the type change based on the column name, given that the number of supported databases remains limited. This may eventually be more robust. Your function then becomes useful to get an idea of which type to use for each column.
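
For illustration, a hardcoded mapping could look roughly like this. Only the "int" entry comes from this thread; the other names and dtypes are placeholders for whatever the benchmark function suggests per database:

import numpy as np

# Hypothetical per-column dtypes, to be filled in per supported database
COLUMN_DTYPES = {
    "int": np.float32,  # linestrength: tiny floats, definitely not uint8
    # "some_index_column": np.uint8,       # placeholder
    # "some_position_column": np.float64,  # keep full precision where needed
}

def cast_known_columns(df):
    # Apply the hardcoded dtype wherever the column exists
    for col, dtype in COLUMN_DTYPES.items():
        if col in df.columns:
            df[col] = df[col].astype(dtype)
    return df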

Do you know how to check for accuracy? You may need to override the sf.df0 loaded after a SpectrumFactory.load_databank. Usually overwriting sf.df0 is not recommended (because metadata may be lost in the process), but you can try. Look into the transfer_metadata function, or simply copy/paste the .attrs dict?
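
A rough sketch of that check, assuming the notebook's reduce_mem_usage is in scope and that copying .attrs is enough metadata transfer (transfer_metadata may do more):

from pandas.testing import assert_frame_equal

# sf: a SpectrumFactory after SpectrumFactory.load_databank
df_before = sf.df0.copy()
df_after = reduce_mem_usage(sf.df0.copy())  # the notebook's crunch function

# Crude metadata transfer: copy the .attrs dict across
df_after.attrs = dict(df_before.attrs)

# Override df0 (normally discouraged), then check the values survived
sf.df0 = df_after
assert_frame_equal(df_after, df_before, check_dtype=False,
                   check_exact=False, rtol=1e-10, atol=0)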

@encrypted-soul (Member, Author) commented:
@erwanp "int" was getting converted to uint8 before because of these lines:

# test if column can be converted to an integer
asint = props[col].fillna(0).astype(np.int64)
result = (props[col] - asint)
result = result.sum()

# lines with the problem
if result > -0.01 and result < 0.01:
    IsInt = True

I just changed the bound on result to a much smaller value, and I got no assertion errors when checking with assert_frame_equal:

assert_frame_equal(df1, df0, check_exact=False, rtol=0.1e-50, atol=0.1e-50)

I would like to proceed with implementing this lossless crunching in the main codebase. The idea would be to crunch all the datatypes and also resolve the current issues we are facing with parsing quanta (like #280).

I tried to apply the reduce_mem_usage above within parse_global_quanta and parse_local_quanta, but I am facing errors in a few places.

These errors aside, how would you like me to implement this in the codebase? Would it be good to hardcode the datatypes and just set the NaN values to -1, avoiding dropping the lines themselves, as you suggested in the discussion of #280?
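
For context, a sketch of that NaN-sentinel idea. The column name is hypothetical; the -1 fill value matches the notebook excerpt further down:

import numpy as np

# df: the line database dataframe (assumed already loaded)
col = "some_quanta_column"  # hypothetical column with missing quanta

# Remember where data was missing, then fill with the -1 sentinel and
# downcast. Note the sentinel forces a signed integer dtype.
missing = df[col].isna()
df[col] = df[col].fillna(-1).astype(np.int16)

# Downstream, rows with df[col] == -1 (or the saved mask) are treated
# as "missing" instead of being dropped, as discussed in #280.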

@erwanp (Member) commented Jul 4, 2021

Hello! Unfortunately I'll have very limited time this week to provide any insightful answer; maybe @dcmvdbekerom can help?

@erwanp (Member) commented Jul 4, 2021

Do you have more information on the errors you're facing?

@encrypted-soul (Member, Author) commented Jul 5, 2021

@erwanp A few are just assertions that check the datatype of columns. I don't think this will be an issue.

Other than that, I could conclude from the errors that the problem is mainly due to NaN values generated during the calculation of the intensity. I am currently working on resolving this.

@dcmvdbekerom (Member) left a comment

Looking good!! Requested a few small changes.
The NaN issue should be solved when loading the database, but I'm not yet sure what the best way to do this is. Will let you know when I think of something!

manual_benchmarks/test_datatypes_crunch.ipynb
" if not np.isfinite(props[col]).all(): \n",
" NAlist.append(col)\n",
" props[col].fillna(-1,inplace=True) \n",
" \n",
@dcmvdbekerom:

I agree with Erwan that we should do this conversion at an earlier step, namely when loading the database. When that is implemented, at this point you can assume all data is indeed finite.

" elif mx < 4294967295:\n",
" props[col] = props[col].astype(np.uint32)\n",
" else:\n",
" props[col] = props[col].astype(np.uint64)\n",
@dcmvdbekerom:

For readability, I would write the limits as (1 << 8) - 1, (1 << 16) - 1, (1 << 32) - 1, etc.
I believe that the - 1 part is not even necessary: since the values are integers and you're using "less than" (without the "or equal to"), you could use (1 << 8), (1 << 16), and (1 << 32) as limits.

a << b means "bitshift integer a by b positions to the left", so 1 << b is shorthand for 2**b.
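
Applied to the notebook's cascade (reusing its props, col, mn, mx variables), the suggestion would read roughly:

if mn >= 0:
    if mx < (1 << 8):        # 256: every integer 0..255 fits in uint8
        props[col] = props[col].astype(np.uint8)
    elif mx < (1 << 16):
        props[col] = props[col].astype(np.uint16)
    elif mx < (1 << 32):
        props[col] = props[col].astype(np.uint32)
    else:
        props[col] = props[col].astype(np.uint64)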

" # Print new column type\n",
" print(\"dtype after: \",props[col].dtype)\n",
" print(\"******************************\")\n",
" \n",
@dcmvdbekerom:

As Hajime Kawahara recently pointed out, conversion to float32 may be too restrictive for the line position (v0): https://radis-radiation.slack.com/archives/C01G3117DJS/p1625716717008500?thread_ts=1625633254.001800&cid=C01G3117DJS. For all other decimals, float32 is probably fine.

Member reply:

BTW, cool function: np.finfo. It shows the relative accuracy; it could be used to check whether our dtype reductions are valid (as we know the number of significant digits used in the HITRAN database)?
https://stackoverflow.com/questions/39134956/accuracy-of-float32
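
For reference, a quick look at what np.finfo reports (standard IEEE-754 properties, printed as a sanity check):

import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # .precision: approximate number of significant decimal digits
    # .eps: spacing between 1.0 and the next representable float
    print(dtype.__name__, "digits:", info.precision, "eps:", info.eps)

# float32 keeps ~6 significant digits; if the database stores line
# positions (v0) with more digits than that, casting v0 to float32
# would lose information.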

@erwanp (Member) commented Aug 24, 2021

Related: see the excellent f32/f64 benchmark in Exojax: HajimeKawahara/exojax#106
