
manual benchmark to crunch datatypes #11

Open · wants to merge 1 commit into master

Conversation

@encrypted-soul (Member) commented Jun 24, 2021

Brief Description

Contains a .ipynb with a function to crunch datatypes. It was able to crunch a dataframe from 565.28 MB down to 349.77 MB, around 61.875% of the original size. The compression is lossless.

Edit @erwanp: file visible here https://github.com/radis/radis-benchmark/blob/e512ecfcdd6d399f10d2fca19d9b0ab6ae475904/manual_benchmarks/test_datatypes_crunch.ipynb

Proposed solution for implementation in Radis

Pass the pandas data frame through this function before returning it.
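
For illustration, a minimal sketch of that integration point. The function name crunch_dtypes and the call site are hypothetical (the notebook's actual function is reduce_mem_usage), and this sketch uses pandas' built-in downcasting rather than the notebook's manual range checks:

import numpy as np
import pandas as pd

def crunch_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast each numeric column to a smaller dtype where pandas deems it safe (sketch)."""
    for col in df.select_dtypes(include=[np.number]).columns:
        # Columns holding only whole numbers (and no NaN) can become integers;
        # everything else is downcast within the float family.
        is_integral = (df[col] % 1 == 0).all()
        df[col] = pd.to_numeric(df[col], downcast="integer" if is_integral else "float")
    return df

# Hypothetical call site: crunch right before the dataframe is returned
# df = crunch_dtypes(df)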

@encrypted-soul changed the title from "manual benchmark to crunch dataype" to "manual benchmark to crunch dataypes" on Jun 24, 2021
@erwanp (Member) commented Jun 24, 2021

Very promising!!

I'm surprised by this one:

******************************
Column:  int
dtype before:  float64
dtype after:  uint8
******************************

int is the (unscaled) linestrength.
I think it was caught by your [min/max] condition:

if mn >= 0:
    if mx < 255:
        props[col] = props[col].astype(np.uint8)

Linestrength values are very low but definitely not integers.

I'm quite sure such a change breaks the spectrum.

We can also hardcode the type change based on the column name, given that the number of supported databases remains limited. This may eventually be more robust. Your function then becomes useful to get an idea of which type to use for each column.
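
For illustration, a hardcoded mapping could look roughly like this. Only the "int" entry comes from this thread; the other names and dtypes are placeholders for whatever the benchmark function suggests per database:

import numpy as np

# Hypothetical per-column dtypes, to be filled in per supported database
COLUMN_DTYPES = {
    "int": np.float32,  # linestrength: tiny floats, definitely not uint8
    # "some_index_column": np.uint8,       # placeholder
    # "some_position_column": np.float64,  # keep full precision where needed
}

def cast_known_columns(df):
    # Apply the hardcoded dtype wherever the column exists
    for col, dtype in COLUMN_DTYPES.items():
        if col in df.columns:
            df[col] = df[col].astype(dtype)
    return df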

Do you know how to check for accuracy? You may need to override the sf.df0 loaded after a SpectrumFactory.load_databank. Usually overwriting sf.df0 is not recommended (because metadata may be lost in the process), but you can try. Look into the transfer_metadata function, or simply copy/paste the .attrs dict?
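
A rough sketch of that check, assuming the notebook's reduce_mem_usage is in scope and that copying .attrs is enough metadata transfer (transfer_metadata may do more):

from pandas.testing import assert_frame_equal

# sf: a SpectrumFactory after SpectrumFactory.load_databank
df_before = sf.df0.copy()
df_after = reduce_mem_usage(sf.df0.copy())  # the notebook's crunch function

# Crude metadata transfer: copy the .attrs dict across
df_after.attrs = dict(df_before.attrs)

# Override df0 (normally discouraged), then check the values survived
sf.df0 = df_after
assert_frame_equal(df_after, df_before, check_dtype=False,
                   check_exact=False, rtol=1e-10, atol=0)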

@encrypted-soul (Member, Author) commented:
@erwanp "int" was getting converted to uint8 before because of these lines:

# test if column can be converted to an integer
asint = props[col].fillna(0).astype(np.int64)
result = (props[col] - asint)
result = result.sum()

# lines with the problem
if result > -0.01 and result < 0.01:
    IsInt = True

I just changed the bound on result to a much smaller value, and I got no assertion errors when checking with assert_frame_equal:

assert_frame_equal(df1, df0, check_exact=False, rtol=0.1e-50, atol=0.1e-50)

I would like to proceed with implementing this lossless crunching in the main codebase. The idea would be to crunch all the datatypes and also resolve the current issues we are facing with parsing quanta (like #280).

I tried to apply the reduce_mem_usage above within parse_global_quanta and parse_local_quanta, but I am facing errors in a few places.

These errors aside, how would you like me to implement this in the codebase? Would it be good to hardcode the datatypes and just set the NaN values to -1, avoiding dropping the lines themselves, as you suggested in the discussion of #280?
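
For context, a sketch of that NaN-sentinel idea. The column name is hypothetical; the -1 fill value matches the notebook excerpt further down:

import numpy as np

# df: the line database dataframe (assumed already loaded)
col = "some_quanta_column"  # hypothetical column with missing quanta

# Remember where data was missing, then fill with the -1 sentinel and
# downcast. Note the sentinel forces a signed integer dtype.
missing = df[col].isna()
df[col] = df[col].fillna(-1).astype(np.int16)

# Downstream, rows with df[col] == -1 (or the saved mask) are treated
# as "missing" instead of being dropped, as discussed in #280.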

@erwanp (Member) commented Jul 4, 2021

Hello! Unfortunately I'll have very limited time this week to provide any insightful answer; maybe @dcmvdbekerom can help?

@erwanp (Member) commented Jul 4, 2021

Do you have more information on the errors you're facing?

@encrypted-soul (Member, Author) commented Jul 5, 2021

@erwanp A few are just assertions that check the datatype of columns. I don't think this will be an issue.

Other than that, I could conclude from the errors that the problem is mainly due to NaN values generated during the calculation of the intensity. I am currently working on resolving this.

@dcmvdbekerom (Member) left a comment

Looking good!! Requested a few small changes.
The NaN issue should be solved when loading the database, but I'm not yet sure what the best way to do this is. Will let you know when I think of something!

manual_benchmarks/test_datatypes_crunch.ipynb
" if not np.isfinite(props[col]).all(): \n",
" NAlist.append(col)\n",
" props[col].fillna(-1,inplace=True) \n",
" \n",
@dcmvdbekerom:

I agree with Erwan that we should do this conversion at an earlier step, namely when loading the database. When that is implemented, at this point you can assume all data is indeed finite.

" elif mx < 4294967295:\n",
" props[col] = props[col].astype(np.uint32)\n",
" else:\n",
" props[col] = props[col].astype(np.uint64)\n",
@dcmvdbekerom:

For readability, I would write the limits as (1 << 8) - 1, (1 << 16) - 1, (1 << 32) - 1, etc.
I believe that the - 1 part is not even necessary: since the values are integers and you're using "less than" (without the "or equal to"), you could use (1 << 8), (1 << 16), and (1 << 32) as limits.

a << b means "bitshift integer a by b positions to the left", so 1 << b is shorthand for 2**b.
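
Applied to the notebook's cascade (reusing its props, col, mn, mx variables), the suggestion would read roughly:

if mn >= 0:
    if mx < (1 << 8):        # 256: every integer 0..255 fits in uint8
        props[col] = props[col].astype(np.uint8)
    elif mx < (1 << 16):
        props[col] = props[col].astype(np.uint16)
    elif mx < (1 << 32):
        props[col] = props[col].astype(np.uint32)
    else:
        props[col] = props[col].astype(np.uint64)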

" # Print new column type\n",
" print(\"dtype after: \",props[col].dtype)\n",
" print(\"******************************\")\n",
" \n",
@dcmvdbekerom:

As Hajime Kawahara recently pointed out, conversion to float32 may be too restrictive for the line position (v0): https://radis-radiation.slack.com/archives/C01G3117DJS/p1625716717008500?thread_ts=1625633254.001800&cid=C01G3117DJS. For all other decimals, float32 is probably fine.

Member reply:

BTW, cool function: np.finfo. It shows the relative accuracy; it could be used to check whether our dtype reductions are valid (as we know the number of significant digits used in the HITRAN database)?
https://stackoverflow.com/questions/39134956/accuracy-of-float32
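
For reference, a quick look at what np.finfo reports (standard IEEE-754 properties, printed as a sanity check):

import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # .precision: approximate number of significant decimal digits
    # .eps: spacing between 1.0 and the next representable float
    print(dtype.__name__, "digits:", info.precision, "eps:", info.eps)

# float32 keeps ~6 significant digits; if the database stores line
# positions (v0) with more digits than that, casting v0 to float32
# would lose information.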

@erwanp (Member) commented Aug 24, 2021

Related: see the excellent f32/f64 benchmark in Exojax: HajimeKawahara/exojax#106
