pygmt.x2sys_cross: Refactor to use virtualfiles for output tables #3182

Draft · wants to merge 27 commits into main from refactor/x2sys_cross
Conversation

@seisman seisman commented Apr 19, 2024

Description of proposed changes

Refactor pygmt.x2sys_cross to use virtualfile for output.

Partially addresses #3160.

Note that x2sys_cross still uses temporary files in the tempfile_from_dftrack function.
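
A hedged sketch (not the PR's exact code) of the "virtual file for output" pattern this refactor adopts: write the module's table output into a virtual file and convert it with virtualfile_to_dataset. The helper name and argument handling below are assumptions based on the snippets quoted later in this thread and may differ from the final implementation.

# Illustrative only: run a GMT module and read its table output via a virtual file.
from pygmt.clib import Session

def run_to_dataframe(module: str, args: str, output_type: str = "pandas"):
    with Session() as lib:
        with lib.virtualfile_out(kind="dataset") as vouttbl:
            # Redirect the module's standard table output into the virtual file.
            lib.call_module(module=module, args=f"{args} ->{vouttbl}")
            return lib.virtualfile_to_dataset(vfname=vouttbl, output_type=output_type)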

@seisman seisman added the enhancement (Improving an existing feature) and needs review (This PR has higher priority and needs review.) labels on Apr 19, 2024
@seisman seisman added this to the 0.12.0 milestone Apr 19, 2024
@seisman seisman marked this pull request as ready for review April 19, 2024 15:51
@seisman seisman requested a review from weiji14 April 19, 2024 15:51
@seisman seisman force-pushed the refactor/x2sys_cross branch 3 times, most recently from bc341f6 to ff290da on April 20, 2024 02:35
@seisman seisman marked this pull request as draft April 20, 2024 03:03
@seisman seisman removed the needs review (This PR has higher priority and needs review.) label on Apr 20, 2024
@seisman seisman marked this pull request as ready for review April 20, 2024 03:36
Comment on lines 232 to 235
# Convert 3rd and 4th columns to datetimes.
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are datetimes and should be converted.
# "i_1"/"i_2" means they are dummy times (i.e., floating-point values).
seisman (Member Author) commented:

Am I understanding the output correctly?

seisman (Member Author) commented:

I've never used x2sys, but here is my understanding of the C code and the output (see the sketch after this list):

  1. The 3rd and 4th columns are datetimes. They can be either absolute datetimes (e.g., 2023-01-01T01:23:45.678) or dummy datetimes (i.e., double-precision numbers), depending on whether the input tracks contain datetimes.
  2. Internally, absolute datetimes are also represented as double-precision numbers in GMT, so absolute datetimes and dummy datetimes look the same internally.
  3. When writing to a file, GMT converts the double-precision numbers back into absolute datetimes, since GMT knows whether the column contains dummy datetimes or not.
  4. A GMT_DATASET container can only hold double-precision numbers and text strings, so when outputting to a virtual file, the 3rd and 4th columns always contain double-precision numbers. If the column names are t_1/t_2, we know they are absolute datetimes and should be converted; otherwise they are just dummy datetimes and should not be converted.

weiji14 (Member) commented:

I'm a little unsure if i_1/i_2 are actually dummy datetimes. This is a sample output from x2sys_cross:

# Tag: X2SYS4ivlhlo4
# Command: x2sys_cross @tut_ship.xyz -Qi -TX2SYS4ivlhlo4 ->/tmp/lala.txt
# x	y	i_1	i_2	dist_1	dist_2	head_1	head_2	vel_1	vel_2	z_X	z_M
> @tut_ship 0 @tut_ship 0 NaN/NaN/1357.17 NaN/NaN/1357.17
251.004840022	20.000079064	18053.5647431	13446.6562433	333.339586673	229.636557499	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
251.004840022	20.000079064	18053.5647431	71783.6562433	333.339586673	1148.20975878	269.996783034	270.023614846	NaN	NaN	192.232797243	-2957.22757183
250.534946327	20.0000526811	18053.3762934	66989.0210846	332.869692978	1022.68273972	269.996783034	269.360150109	NaN	NaN	-57.6485957585	-2686.4268008
250.532033147	20.0000525175	18053.3751251	66988.9936489	332.866779797	1022.67977813	269.996783034	22.0133296951	NaN	NaN	-64.5973890802	-2682.04812157
252.068705	20.000075	13447.5	71784.5	230.700422496	1149.27362378	269.995072235	269.995072235	NaN	NaN	0	-3206.5

It seems like the i_1/i_2 values vary between rows, but I can't quite remember what they represent... maybe an index of some sort? I might need to inspect the C code to see what's going on. Can you point me to where these i_1/i_2 columns are being output?

seisman (Member Author) commented:

Dummy times are just double-precision indexes from 0 to n (xref: https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys.c#L533).

The column name i_1 or t_1 is controlled by the variable t_or_i in the C code (https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L998). From https://github.com/GenericMappingTools/gmt/blob/b56be20bee0b8de22a682fdcd458f9b9eeb76f64/src/x2sys/x2sys_cross.c#L568, it's clear that, if got_time is True, then the column is absolute time (GMT_IS_ABSTIME), otherwise it's double-precision numbers (GMT_IS_FLOAT).

We can keep the dummy times as double-precision numbers, or treat them as seconds since the Unix epoch and convert them to absolute times.

weiji14 (Member) commented on Apr 22, 2024:

We can keep the dummy times as double-precision numbers, or treat them as seconds since the Unix epoch and convert them to absolute times.

Maybe convert the relative time to pandas.Timedelta or numpy.timedelta64? Xref #2848.
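
A hypothetical illustration of this suggestion (the values are taken from the sample output above): convert the relative i_1/i_2 values to timedeltas rather than to absolute datetimes.

import pandas as pd

dummy_times = pd.Series([18053.5647431, 13446.6562433], name="i_1")
relative = pd.to_timedelta(dummy_times, unit="s")  # dtype: timedelta64[ns]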

seisman (Member Author) commented:

Sounds good. Done in 9d12ae1.

weiji14 (Member) left a review comment:

There are 2 main changes happening in this PR:

  1. Adding the output_type="numpy" option
  2. Handling the different dtypes of the i_1/i_2 or t_1/t_2 columns

We can keep this as a single PR since it's hard to separate the two things, but we might need to discuss the implementation a bit more.

-def x2sys_cross(tracks=None, outfile=None, **kwargs):
+def x2sys_cross(
+    tracks=None,
+    output_type: Literal["pandas", "numpy", "file"] = "pandas",
weiji14 (Member) commented on Apr 21, 2024:

Honestly, I'm not sure if we should support the numpy output type for x2sys_cross, because all 'columns' will need to be the same dtype in a np.ndarray. If there are datetime values in the columns, they will get converted to floating point (?), which makes it more difficult to use later. Try adding a unit test for the numpy output_type and see if it makes sense.
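
A small illustration (not PR code) of the concern: a DataFrame with mixed float and datetime columns cannot keep per-column dtypes in a single np.ndarray, so to_numpy() falls back to a generic object dtype (or to raw numbers if the datetimes are cast), which is harder to work with downstream.

import pandas as pd

df = pd.DataFrame(
    {
        "x": [251.004840022, 250.534946327],
        "t_1": pd.to_datetime(["2023-01-01T01:23:45.678", "2023-01-01T02:00:00"]),
    }
)
print(df.to_numpy().dtype)  # object: the combined array has no single numeric/datetime dtype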

seisman (Member Author) commented:

If there are datetime values in the columns, they will get converted to floating point (?)

You're right. Datetimes are converted to floating-point values by df.to_numpy(). I'll remove the numpy output type.

@seisman seisman added the needs review (This PR has higher priority and needs review.) label on Apr 22, 2024
@seisman seisman removed this from the 0.12.0 milestone Apr 29, 2024
@weiji14 weiji14 marked this pull request as draft May 6, 2024 22:03
pygmt/src/x2sys_cross.py (review comment marked outdated and resolved)
# These two columns have names "t_1"/"t_2" or "i_1"/"i_2".
# "t_1"/"t_2" means they are absolute datetimes.
# "i_1"/"i_2" means they are dummy times relative to unix epoch.
if output_type == "pandas":
seisman (Member Author) commented:

More checks and tests are needed because the conversion relies on GMT settings such as TIME_EPOCH and TIME_UNIT.
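
A hedged sketch of the dependency being pointed out: the raw t_1/t_2 numbers are "TIME_UNIT units since TIME_EPOCH", so a hard-coded seconds-since-Unix-epoch conversion is only valid for GMT's defaults. The get_default calls and the direct reuse of TIME_UNIT as a pandas unit are assumptions, not PR code.

import pandas as pd
from pygmt.clib import Session

with Session() as lib:
    time_unit = lib.get_default("TIME_UNIT")    # "s" by default (assumed usable as a pandas unit)
    time_epoch = lib.get_default("TIME_EPOCH")  # "1970-01-01T00:00:00" by default

raw = pd.Series([0.0, 86400.0])  # placeholder values in TIME_UNIT since TIME_EPOCH
converted = pd.Timestamp(time_epoch) + pd.to_timedelta(raw, unit=time_unit)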

@seisman seisman changed the title from "pygmt.x2sys_cross: Add 'output_type' parameter for output in pandas/numpy/file formats" to "pygmt.x2sys_cross: Refactor to use virtualfiles for output tables" on May 7, 2024
@seisman seisman removed the needs review (This PR has higher priority and needs review.) label on May 7, 2024
os.environ["X2SYS_HOME"], kwargs["T"], f"{kwargs['T']}.tag"
)
# Last line is like "-Dxyz -Etsv -I1/1"
lastline = tagfile.read_text().splitlines()[-1]
weiji14 (Member) commented:

Is encoding="utf8" not needed anymore?

Suggested change:
-lastline = tagfile.read_text().splitlines()[-1]
+lastline = tagfile.read_text(encoding="utf8").splitlines()[-1]

Comment on lines -241 to 225
comment=">", # Skip the 3rd row with a ">"
parse_dates=[2, 3], # Datetimes on 3rd and 4th column
**date_format_kwarg, # Parse dates in ISO8601 format on pandas>=2
result = lib.virtualfile_to_dataset(
vfname=vouttbl, output_type=output_type, header=2
)
weiji14 (Member) commented:

Note that x2sys_cross can output multi-segment files (each segment is separated by IO_SEGMENT_MARKER which is > by default, see https://docs.generic-mapping-tools.org/6.5/reference/file-formats.html#optional-segment-header-records). If I'm not mistaken, the current virtualfile_to_dataset method does not implement multi-segment file handling yet? To be fair though, the current implementation in x2sys_cross simply merges all segments into one, since we skip rows starting with >, but we need to check that virtualfile_to_dataset will return all segments in a multi-segment file instead of just the first one.

seisman (Member Author) commented:

If I'm not mistaken, the current virtualfile_to_dataset method does not implement multi-segment file handling yet? To be fair though, the current implementation in x2sys_cross simply merges all segments into one, since we skip rows starting with >, but we need to check that virtualfile_to_dataset will return all segments in a multi-segment file instead of just the first one.

Yes. The main problem is that, as far as I know, there is no equivalent way to represent a multi-segment file in pandas. The multi-segment support was also mentioned in #2729 (comment).

If we can find a general way to represent multi-segment data in pandas, then it should be straightforward to output multi-segment data from _GMT_DATASET to the desired data structure.
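
One hypothetical way (not something decided in this PR) to keep segment identity when flattening a multi-segment table into pandas: concatenate the per-segment frames with a "segment" level in the index, since pandas has no native segment container.

import pandas as pd

segments = [  # stand-ins for the per-segment tables of a _GMT_DATASET
    pd.DataFrame({"x": [251.0, 250.5], "y": [20.0, 20.0]}),
    pd.DataFrame({"x": [252.1], "y": [20.0]}),
]
flat = pd.concat(segments, keys=range(len(segments)), names=["segment", "row"])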

seisman (Member Author) commented:

To be fair though, the current implementation in x2sys_cross simply merges all segments into one, since we skip rows starting with >, but we need to check that virtualfile_to_dataset will return all segments in a multi-segment file instead of just the first one.

It's already tested in the _GMT_DATASET docstrings

>>> with GMTTempFile(suffix=".txt") as tmpfile:
...

For x2sys_cross, we also test the shape of the output pd.DataFrame.

Labels: enhancement (Improving an existing feature)
Projects: none yet
Linked issues: none yet
2 participants