pygmt.x2sys_cross: Refactor to use virtualfiles for output tables #3182

Draft · wants to merge 27 commits into main from refactor/x2sys_cross
Conversation

@seisman seisman commented Apr 19, 2024

Description of proposed changes

Refactor pygmt.x2sys_cross to use virtualfile for output.

Partially addresses #3160.

Note that x2sys_cross still uses temporary files in the tempfile_from_dftrack function.
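
A hedged sketch (not the PR's exact code) of the "virtual file for output" pattern this refactor adopts: write the module's table output into a virtual file and convert it with virtualfile_to_dataset. The helper name and argument handling below are assumptions based on the snippets quoted later in this thread and may differ from the final implementation.

# Illustrative only: run a GMT module and read its table output via a virtual file.
from pygmt.clib import Session

def run_to_dataframe(module: str, args: str, output_type: str = "pandas"):
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            # Redirect the module's standard table output into the virtual file.
            lib.call_module(module=module, args=f"{args} ->{vouttbl}")
            return lib.virtualfile_to_dataset(vfname=vouttbl, output_type=output_type)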

@seisman seisman added the enhancement (Improving an existing feature) and needs review (This PR has higher priority and needs review.) labels on Apr 19, 2024
@seisman seisman added this to the 0.12.0 milestone Apr 19, 2024
@seisman seisman marked this pull request as ready for review April 19, 2024 15:51
@seisman seisman requested a review from weiji14 April 19, 2024 15:51
@seisman seisman force-pushed the refactor/x2sys_cross branch 3 times, most recently from bc341f6 to ff290da on April 20, 2024 02:35
@seisman seisman marked this pull request as draft April 20, 2024 03:03
@seisman seisman removed the needs review (This PR has higher priority and needs review.) label on Apr 20, 2024
@seisman seisman marked this pull request as ready for review April 20, 2024 03:36
Comment on lines 232 to 235
# Convert 3rd and 4th columns to datetimes.
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are datetimes and should be converted.
# "i_1"/"i_2" means they are dummy times (i.e., floating-point values).
seisman (Member Author) commented:

Am I understanding the output correctly?

seisman (Member Author) commented:

I've never used x2sys, but here is my understanding of the C code and the output (see the sketch after this list):

  1. The 3rd and 4th columns are datetimes. They can be either absolute datetimes (e.g., 2023-01-01T01:23:45.678) or dummy datetimes (i.e., double-precision numbers), depending on whether the input tracks contain datetimes.
  2. Internally, absolute datetimes are also represented as double-precision numbers in GMT, so absolute datetimes and dummy datetimes look the same internally.
  3. When writing to a file, GMT converts the double-precision numbers back into absolute datetimes, since GMT knows whether the column contains dummy datetimes or not.
  4. A GMT_DATASET container can only hold double-precision numbers and text strings, so when outputting to a virtual file, the 3rd and 4th columns always contain double-precision numbers. If the column names are t_1/t_2, we know they are absolute datetimes and should be converted; otherwise they are just dummy datetimes and should not be converted.

weiji14 (Member) commented:

I'm a little unsure if i_1/i_2 are actually dummy datetimes. This is a sample output from x2sys_cross:

# Tag: X2SYS4ivlhlo4
# Command: x2sys_cross @tut_ship.xyz -Qi -TX2SYS4ivlhlo4 ->/tmp/lala.txt
# x	y	i_1	i_2	dist_1	dist_2	head_1	head_2	vel_1	vel_2	z_X	z_M
> @tut_ship 0 @tut_ship 0 NaN/NaN/1357.17 NaN/NaN/1357.17
251.004840022	20.000079064	18053.5647431	13446.6562433	333.339586673	229.636557499	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
251.004840022	20.000079064	18053.5647431	71783.6562433	333.339586673	1148.20975878	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
250.534946327	20.0000526811	18053.3762934	66989.0210846	332.869692978	1022.68273972	269.996783034	269.360150109	NaN	NaN	-57.6485957585	-2686.4268008
250.532033147	20.0000525175	18053.3751251	66988.9936489	332.866779797	1022.67977813	269.996783034	22.0133296951	NaN	NaN	-64.5973890802	-2682.04812157
252.068705	20.000075	13447.5	71784.5	230.700422496	1149.27362378	269.995072235	269.995072235	NaN	NaN	0	-3206.5

It seems like the i_1/i_2 values vary between rows, but I can't quite remember what they represent... maybe an index of some sort? I might need to inspect the C code to see what's going on. Can you point me to where these i_1/i_2 columns are being output?

seisman (Member Author) commented:

Dummy times are just double-precision indexes from 0 to n (xref: https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys.c#L533).

The column name i_1 or t_1 is controlled by the variable t_or_i in the C code (https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L998). From https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L568, it's clear that, if got_time is True, then the column is absolute time (GMT_IS_ABSTIME), otherwise it's double-precision numbers (GMT_IS_FLOAT).

We can keep the dummy times as double-precision numbers, or treat them as seconds since the Unix epoch and convert them to absolute times.

weiji14 (Member) commented on Apr 22, 2024:

We can keep the dummy times as double-precision numbers, or treat them as seconds since the Unix epoch and convert them to absolute times.

Maybe convert the relative time to pandas.Timedelta or numpy.timedelta64? Xref #2848.
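
A hypothetical illustration of this suggestion (the values are taken from the sample output above): convert the relative i_1/i_2 values to timedeltas rather than to absolute datetimes.

import pandas as pd

dummy_times = pd.Series([18053.5647431, 13446.6562433], name="i_1")
relative = pd.to_timedelta(dummy_times, unit="s")  # dtype: timedelta64[ns]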

seisman (Member Author) commented:

Sounds good. Done in 9d12ae1.

weiji14 (Member) left a review comment:

There are 2 main changes happening in this PR:

  1. Adding the output_type="numpy" option
  2. Handling the different dtypes of the i_1/i_2 or t_1/t_2 columns

We can keep this as a single PR since it's hard to separate the two things, but we might need to discuss the implementation a bit more.

-def x2sys_cross(tracks=None, outfile=None, **kwargs):
+def x2sys_cross(
+    tracks=None,
+    output_type: Literal["pandas", "numpy", "file"] = "pandas",
weiji14 (Member) commented on Apr 21, 2024:

Honestly, I'm not sure if we should support the numpy output type for x2sys_cross, because all 'columns' will need to be the same dtype in a np.ndarray. If there are datetime values in the columns, they will get converted to floating point (?), which makes it more difficult to use later. Try adding a unit test for the numpy output_type and see if it makes sense.
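
A small illustration (not PR code) of the concern: a DataFrame with mixed float and datetime columns cannot keep per-column dtypes in a single np.ndarray, so to_numpy() falls back to a generic object dtype (or to raw numbers if the datetimes are cast), which is harder to work with downstream.

import pandas as pd

df = pd.DataFrame(
    {
        "x": [251.004840022, 250.534946327],
        "t_1": pd.to_datetime(["2023-01-01T01:23:45.678", "2023-01-01T02:00:00"]),
    }
)
print(df.to_numpy().dtype)  # object: the combined array has no single numeric/datetime dtype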

seisman (Member Author) commented:

If there are datetime values in the columns, they will get converted to floating point (?)

You're right. Datetimes are converted to floating-point values by df.to_numpy(). I'll remove the numpy output type.

@seisman seisman added the needs review (This PR has higher priority and needs review.) label on Apr 22, 2024
@seisman seisman removed this from the 0.12.0 milestone Apr 29, 2024
@weiji14 weiji14 marked this pull request as draft May 6, 2024 22:03
pygmt/src/x2sys_cross.py (review comment marked outdated and resolved)
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are absolute datetimes.
# "i_1"/"i_2" means they are dummy times relative to unix epoch.
if output_type == "pandas":
seisman (Member Author) commented:

More checks and tests are needed because the conversion relies on GMT settings such as TIME_EPOCH and TIME_UNIT.
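
A hedged sketch of the dependency being pointed out: the raw t_1/t_2 numbers are "TIME_UNIT units since TIME_EPOCH", so a hard-coded seconds-since-Unix-epoch conversion is only valid for GMT's defaults. The get_default calls and the direct reuse of TIME_UNIT as a pandas unit are assumptions, not PR code.

import pandas as pd
from pygmt.clib import Session

with Session() as lib:
    time_unit = lib.get_default("TIME_UNIT")    # "s" by default (assumed usable as a pandas unit)
    time_epoch = lib.get_default("TIME_EPOCH")  # "1970-01-01T00:00:00" by default

raw = pd.Series([0.0, 86400.0])  # placeholder values in TIME_UNIT since TIME_EPOCH
converted = pd.Timestamp(time_epoch) + pd.to_timedelta(raw, unit=time_unit)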

@seisman seisman changed the title from "pygmt.x2sys_cross: Add 'output_type' parameter for output in pandas/numpy/file formats" to "pygmt.x2sys_cross: Refactor to use virtualfiles for output tables" on May 7, 2024
@seisman seisman removed the needs review (This PR has higher priority and needs review.) label on May 7, 2024
os.environ["X2SYS_HOME"], kwargs["T"], f"{kwargs['T']}.tag"
)
# Last line is like "-Dxyz -Etsv -I1/1"
lastline = tagfile.read_text().splitlines()[-1]
weiji14 (Member) commented:

Is encoding="utf8" not needed anymore?

Suggested change:
-lastline = tagfile.read_text().splitlines()[-1]
+lastline = tagfile.read_text(encoding="utf8").splitlines()[-1]

Comment on lines -241 to 225
comment=">", # Skip the 3rd row with a ">"
parse_dates=[2, 3], # Datetimes on 3rd and 4th column
**date_format_kwarg, # Parse dates in ISO8601 format on pandas>=2
result = lib.virtualfile_to_dataset(
vfname=vouttbl, output_type=output_type, header=2
)
weiji14 (Member) commented:

Note that x2sys_cross can output multi-segment files (each segment is separated by IO_SEGMENT_MARKER which is > by default, see https://docs.generic-mapping-tools.org/6.5/reference/file-formats.html#optional-segment-header-records). If I'm not mistaken, the current virtualfile_to_dataset method does not implement multi-segment file handling yet? To be fair though, the current implementation in x2sys_cross simply merges all segments into one, since we skip rows starting with >, but we need to check that virtualfile_to_dataset will return all segments in a multi-segment file instead of just the first one.

seisman (Member Author) commented:

If I'm not mistaken, the current virtualfile_to_dataset method does not implement multi-segment file handling yet? To be fair though, the current implementation in x2sys_cross simply merges all segments into one, since we skip rows starting with >, but we need to check that virtualfile_to_dataset will return all segments in a multi-segment file instead of just the first one.

Yes. The main problem is that, as far as I know, there is no equivalent way to represent a multi-segment file in pandas. The multi-segment support was also mentioned in #2729 (comment).

If we can find a general way to represent multi-segment data in pandas, then it should be straightforward to output multi-segment data from _GMT_DATASET to the desired data structure.
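
One hypothetical way (not something decided in this PR) to keep segment identity when flattening a multi-segment table into pandas: concatenate the per-segment frames with a "segment" level in the index, since pandas has no native segment container.

import pandas as pd

segments = [  # stand-ins for the per-segment tables of a _GMT_DATASET
    pd.DataFrame({"x": [251.0, 250.5], "y": [20.0, 20.0]}),
    pd.DataFrame({"x": [252.1], "y": [20.0]}),
]
flat = pd.concat(segments, keys=range(len(segments)), names=["segment", "row"])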

seisman (Member Author) commented:

To be fair though, the current implementation in x2sys_cross simply merges all segments into one, since we skip rows starting with >, but we need to check that virtualfile_to_dataset will return all segments in a multi-segment file instead of just the first one.

It's already tested in the _GMT_DATASET docstrings

>>> with GMTTempFile(suffix=".txt") as tmpfile:
...

For x2sys_cross, we also test the shape of the output pd.DataFrame.

Labels: enhancement (Improving an existing feature)
Projects: none yet
Linked issues: none yet
2 participants