Breaking LabelledArrays into a separate package #41

Open
00krishna opened this issue Apr 24, 2024 · 4 comments

@00krishna

Hello. I was wondering if there is any consideration about breaking LabelledArrays into its own package?

The reason is that LabelledArrays provides really nice functionality that could be used in something like DataFrames.jl. Say I have a dataframe with a categorical column, such as the month of the year. Here is an example.

julia> DataFrame(month = [1, 2, 3], sensor1 = [2.1, 2.4, 5.1])
3×2 DataFrame
 Row │ month  sensor1 
     │ Int64  Float64 
─────┼────────────────
   1 │     1      2.1
   2 │     2      2.4
   3 │     3      5.1

Using numerical indices for categorical variables like month makes the data harder for users to read. A more intuitive interface would swap the view of the month variable so that it looks like this:

 Row │ month     sensor1 
     │ String    Float64 
─────┼───────────────────
   1 │ january       2.1
   2 │ february      2.4
   3 │ march         5.1

We could potentially use LabelledArrays in a dataframe, but right now that array library is bundled with the full ReadStatTables.jl package. Breaking LabelledArrays out into its own library would allow the flexibility to use it in other places.

Please let me know if you can consider my request. Thank you.

@junyuan-chen
Owner

@00krishna Thank you for your interest.

I haven't carefully thought through the pros and cons of doing so yet. The primary reason LabeledArray lives in this package is to make sure its design accommodates whatever peculiar requirements readstat and writestat run into. There are alternatives, such as CategoricalArrays.jl, which is more feature-rich, and PooledArrays.jl, which is very lightweight. The three differ subtly in design philosophy and priorities. Hopefully, one can always find something that best fits the need.

@00krishna
Author

Thanks @junyuan-chen, this is helpful. Yeah, I was looking at CategoricalArrays.jl too, and I understand your view on keeping LabelledArray within ReadStatTables.jl. However, CategoricalArrays does not seem to have a mapping between an index value and a category name, such as 1 => "january". I read through the package docs as well as the DataFrames docs on categorical variables, but I did not see a way to preserve both the index values and the categorical/text values at the same time.

Now, I have not used CategoricalArrays before, so I am just going by what I read. It could be that the docs simply don't provide an example of the kind of key-value indexing that LabelledArrays offers. IndirectArrays seems like the closest match, but that package does not seem to have any docs, so I am just going off the README :). So I was wondering if you have seen a package that supports this kind of key-value structure for categorical data?
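
For reference, the kind of key-value structure described here can be sketched in plain Julia without any package. The names `codes` and `month_labels` are made up for illustration, and this is only a rough picture of the idea, not how LabeledArray is implemented:

```julia
# Hypothetical illustration: keep the integer codes as the stored data
# and look labels up through a separate dictionary.
codes = [1, 2, 3, 2]
month_labels = Dict(1 => "january", 2 => "february", 3 => "march")

# Derive the human-readable view without discarding the codes.
labels = [month_labels[c] for c in codes]

# Both representations stay available side by side.
for (c, l) in zip(codes, labels)
    println(c, " => ", l)
end
```

The codes here are whatever the user supplies, so they could just as well come from an external coding scheme rather than running from 1 to n.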

@junyuan-chen
Owner

junyuan-chen commented Apr 24, 2024

CategoricalArray decides by itself how the numerical values are assigned. The encoding process is treated as an internal implementation detail that users are not supposed to intervene in directly. This is actually one of the main reasons why LabeledArray was introduced.
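
To illustrate the point about encoding, here is a small sketch that assumes CategoricalArrays.jl is installed. With default settings, `categorical` sorts string levels, so the integer codes follow alphabetical order rather than anything the user supplies. The variable name `months` is only for illustration:

```julia
using CategoricalArrays

# Build a categorical vector from month names; no codes are passed in.
months = categorical(["january", "february", "march"])

# Levels are sorted by default, so "february" comes first.
month_levels = levels(months)

# levelcode gives each element's position in the level list.
month_codes = levelcode.(months)

println(month_levels)
println(month_codes)
```

Levels can be reordered with `levels!`, but the codes remain the positions 1 through n in that ordering, so arbitrary values from an external coding scheme cannot be attached this way.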

One possibility is to make LabeledArrays.jl a subpackage that lives inside this repo. This means that it would have its own UUID and be registered with the General registry. However, I still need to think about whether that's a good thing to do, and I will come back to this once I have figured it out.

@00krishna
Author

Excellent. Yeah, that is totally fair. I appreciate your consideration.

You are 100% correct that the issue seems to be that CategoricalArrays does not allow the user to specify the index values for each category. I am pulling the index values for my data samples from US census metadata, so they have their own elaborate system of categorizations.

But certainly, take your time to consider what you think is possible. Thanks again for your time.
