Bump LLVM commit to 20ed5b1f4587 #2799

flemairen6 · 2024-04-15T14:06:05Z

Bump LLVM commit to 20ed5b1f4587 and Stablehlo to fd0c20a10

Bump LLVM commit to 1e6ce5e284f5c0e8d64eee21af727bb164eb3caf and stablehlo to 49dc86c0df9ac3a8a556208674204b0f68d8eb6d Signed-off-by: Ferdinand Lemaire <ferdinan@xilinx.com>

jenkins-droid · 2024-04-15T14:06:17Z

Can one of the admins verify this patch?

flemairen6 · 2024-04-16T14:36:37Z

Would someone know why the macOS build is failing? The error seems to come from stablehlo and to be self-contained in a file from there, so I would assume it's not the bump that made it fail.

chentong319

LGTM!

Build failed.

chentong319 · 2024-04-16T15:58:43Z

> Would someone know why the macOS build is failing? The error seems to come from stablehlo and to be self-contained in a file from there, so I would assume it's not the bump that made it fail.

The related error message:

/Users/runner/work/onnx-mlir/onnx-mlir/third_party/stablehlo/stablehlo/reference/Ops.cpp:252:16: note: candidate template ignored: deduced conflicting types for parameter 'T' ('long' vs. 'long long')

Is long vs. long long a special case on Mac?

gongsu832 · 2024-04-16T17:31:06Z

long and long long can be different or the same depending on the compiler, OS, and arch. So one really shouldn't make any assumption about whether they are the same or not.
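To illustrate the point, here is a minimal C++ sketch of how this kind of deduction conflict can arise. It is not the stablehlo code: clampToLowerBound and example are hypothetical names, and the sketch assumes that int64_t is an alias for long on typical 64-bit Linux toolchains and for long long on macOS. Both types are 64 bits wide on those platforms, yet they are always distinct types, so a template that deduces a single parameter T from two arguments rejects a mix of the two.

```cpp
#include <algorithm>
#include <cstdint>
#include <type_traits>

// long and long long are distinct types on every platform, even where they
// have the same width.
static_assert(!std::is_same<long, long long>::value,
              "long and long long are never the same type");

// Hypothetical helper (not from stablehlo): both arguments must deduce to
// the same T.
template <typename T>
T clampToLowerBound(T value, T lowerBound) {
  return std::max(value, lowerBound);
}

int64_t example(int64_t x) {
  // clampToLowerBound(x, 0L) would compile where int64_t is long, but where
  // int64_t is long long the compiler reports conflicting deduced types for T.
  // Naming T explicitly removes the deduction, and the long literal is then
  // converted to int64_t on both kinds of platform.
  return clampToLowerBound<int64_t>(x, 0L);
}
```

Spelling out the template argument (or casting both operands to one type) is one portable way to sidestep the conflict. Whether the upstream fix took this form is not stated in this thread.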

flemairen6 · 2024-04-17T06:59:09Z

> long and long long can be different or the same depending on the compiler, OS, and arch. So one really shouldn't make any assumption about whether they are the same or not.

Right, but since it seems to be a stablehlo-contained problem (the problematic function is defined and only used in onnx-mlir/third_party/stablehlo/stablehlo/reference/Ops.cpp), how do we go about this? Should we open an issue over there and wait for a new green commit?

gongsu832 · 2024-04-17T13:21:56Z

Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.

flemairen6 · 2024-04-17T13:37:15Z

> Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.

Just checked and it appears they fixed it already. It seems they bumped LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.

hamptonm1 · 2024-04-17T13:59:57Z

> > Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.
>
> Just checked and it appears they fixed it already. It seems they bumped LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.

Thanks @flemairen6 for your hard work! Are you looking at LLVM commit 20ed5b1f4587 and Stablehlo commit 4f6df1b or a newer Stablehlo commit?

flemairen6 · 2024-04-18T07:09:05Z

> > > Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.
> >
> > Just checked and it appears they fixed it already. It seems they bumped LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.
>
> Thanks @flemairen6 for your hard work! Are you looking at LLVM commit 20ed5b1f4587 and Stablehlo commit 4f6df1b or a newer Stablehlo commit?

Yes for the LLVM commit, but I was planning to use fd0c20a10 for the stablehlo commit since it's the one with the fix for the current issue. It's a few commits more recent than the LLVM bump, but the other solution would be to wait for another integration of LLVM in stablehlo, which probably isn't necessary. What do you think?

Revert the changes to properties from stablehlo and fix some references where variable names had changed. Signed-off-by: Ferdinand Lemaire <ferdinand.lemaire@amd.com>

jenkins-droid · 2024-04-18T08:56:35Z

Can one of the admins verify this patch?

flemairen6 · 2024-04-18T13:29:11Z

@chentong319 I see conversion/onnx_to_krnl/Sequence/Sequence_with_dealloc.mlir failing on macOS only. Would you have an idea what could cause it? I've seen you mentioned in discussions surrounding this test before, so I figured you might know.

hamptonm1 · 2024-04-18T15:34:25Z

> > > > Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.
> > >
> > > Just checked and it appears they fixed it already. It seems they bumped LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.
> >
> > Thanks @flemairen6 for your hard work! Are you looking at LLVM commit 20ed5b1f4587 and Stablehlo commit 4f6df1b or a newer Stablehlo commit?
>
> Yes for the LLVM commit, but I was planning to use fd0c20a10 for the stablehlo commit since it's the one with the fix for the current issue. It's a few commits more recent than the LLVM bump, but the other solution would be to wait for another integration of LLVM in stablehlo, which probably isn't necessary. What do you think?

Agreed! Makes sense to me :)

hamptonm1 · 2024-04-18T17:53:22Z

> @chentong319 I see conversion/onnx_to_krnl/Sequence/Sequence_with_dealloc.mlir failing on macOS only. Would you have an idea what could cause it? I've seen you mentioned in discussions surrounding this test before, so I figured you might know.

@flemairen6 You can remove or comment out the Sequence_with_dealloc.mlir test... I ran into issues with that test during the last LLVM upgrade. @chentong319 is planning on fixing the test. I just changed the test so that it can pass for the build.

Remove Sequence_with_dealloc.mlir from the ran tests because it's unstable Signed-off-by: Ferdinand Lemaire <ferdinand.lemaire@amd.com>

jenkins-droid · 2024-04-19T06:50:17Z

Can one of the admins verify this patch?

hamptonm1 · 2024-04-19T14:21:30Z

@jenkins-droid test this please

hamptonm1 · 2024-04-19T15:38:54Z

@flemairen6 New issue... it seems like the following backend tests are failing:

=========================== short test summary info ============================
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_not_sorted_without_axis_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_3d_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_cpu - ...
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_negative_axis_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_without_axis_cpu

hamptonm1 · 2024-04-20T19:15:33Z

@chentong319 I created an issue to track the lit test fix: #2803

flemairen6 · 2024-04-22T15:07:58Z

> @flemairen6 New issue... it seems like the following backend tests are failing:
>
> =========================== short test summary info ============================
>  Debug/test.py::OnnxBackendNodeModelTest::test_unique_not_sorted_without_axis_cpu
>  Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_3d_cpu
>  Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_cpu - ...
>  Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_negative_axis_cpu
>  Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_without_axis_cpu

I could reproduce the failing tests, but I have a hard time finding the root cause. I'm seeing a double free in the OMRunner. My knowledge of the backend is really limited; who would be a backend expert that could help me with this?

chentong319 · 2024-04-22T15:33:12Z

When a lit test fails only on Mac, it is usually caused by a different order of DAG operations. The solution is either to avoid the ambiguity in code gen or to modify the CHECK lines in the lit test.
I will come back to fix the lit test for sequence after onnx-mlir is moved to the new llvm.
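
One way such ambiguity can creep into code generation, assumed here purely for illustration and not confirmed as the cause for this test, is the unspecified evaluation order of function arguments in C++: different compilers may create the operand ops in a different order. A minimal sketch, with a hypothetical Recorder standing in for an IR builder:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an IR builder: records op names in creation order.
struct Recorder {
  std::vector<std::string> ops;

  // Records an op and returns its index in the list.
  int emit(const std::string &name) {
    ops.push_back(name);
    return static_cast<int>(ops.size()) - 1;
  }

  // Records an "add" that uses two previously emitted ops.
  int emitAdd(int lhs, int rhs) {
    (void)lhs;
    (void)rhs;
    return emit("add");
  }
};

// Ambiguous: the compiler may evaluate the two emit() calls in either order,
// so "lhs" and "rhs" can be recorded in a different order per toolchain.
int buildAmbiguous(Recorder &r) {
  return r.emitAdd(r.emit("lhs"), r.emit("rhs"));
}

// Deterministic: each operand is created in its own statement.
int buildOrdered(Recorder &r) {
  int lhs = r.emit("lhs");
  int rhs = r.emit("rhs");
  return r.emitAdd(lhs, rhs);
}
```

On the test side, the alternative is to relax the expected output, for example with FileCheck's CHECK-DAG directive, which matches a group of lines in any order.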

hamptonm1 · 2024-04-22T18:02:36Z

@flemairen6 Do you have the exact error message that you can post here? I see Segmentation fault in the stack trace

flemairen6 · 2024-04-23T10:26:43Z

> When a lit test fails only on Mac, it is usually caused by a different order of DAG operations. The solution is either to avoid the ambiguity in code gen or to modify the CHECK lines in the lit test. I will come back to fix the lit test for sequence after onnx-mlir is moved to the new llvm.

Looks like it's segfaulting so I don't think it's just a CHECK issue - the same bug seems to trigger on other architectures too

flemairen6 · 2024-04-23T10:30:53Z

> @flemairen6 Do you have the exact error message that you can post here? I see Segmentation fault in the stack trace

I don't have a lot to work with; this is the full trace. It looks like some JSON didn't get dumped, or was dumped in the wrong format. It's hard to tell.

chentong319 · 2024-04-29T20:57:26Z

I looked into the backend test failures. They are related to ONNXUniqueOp. The new buffer deallocation in llvm cannot handle unrealized_conversion_cast correctly. Now a deallocation is generated for an allocated memref used in ReturnOp. When I modify the code to avoid the unrealized_conversion_cast, the test case is fine.
While waiting for llvm to fix this bug, we can avoid generating these unrealized_conversion_cast ops. It will be better code anyway. @negiyas, when you lower UniqueOp to krnl, you may use the types of the output to avoid unrealized_conversion_cast. Could you check whether it is easy to do that?

Comment: I created #2820 to fix this issue. No need to worry about it if the PR is fine.

chentong319 · 2024-04-30T21:11:25Z

The PR will fail on some cast-related tests if the deallocation pass is commented out. I use test_cast_DOUBLE_to_FLOAT16_cpu as an example. The failure occurs after onnx-mlir has generated the llvm IR. In this test case, the only special op is arith.truncf. I did notice that the definition of TruncFOp has changed. But the output of --EmitLLVMIR is the same for the old and new versions of onnx-mlir. I suspect that the error is in the lowering of llvm.fptrunc. But I did not go further.

@hamptonm1 Do you have some bandwidth to track this problem down? To me, the first thing is to verify that the lowering of arith.truncf to llvm.fptrunc is correct in the new version of llvm. Then focus on the lowering of llvm.fptrunc.

hamptonm1 · 2024-05-01T21:57:53Z

> The PR will fail on some cast-related tests if the deallocation pass is commented out. I use test_cast_DOUBLE_to_FLOAT16_cpu as an example. The failure occurs after onnx-mlir has generated the llvm IR. In this test case, the only special op is arith.truncf. I did notice that the definition of TruncFOp has changed. But the output of --EmitLLVMIR is the same for the old and new versions of onnx-mlir. I suspect that the error is in the lowering of llvm.fptrunc. But I did not go further.
>
> @hamptonm1 Do you have some bandwidth to track this problem down? To me, the first thing is to verify that the lowering of arith.truncf to llvm.fptrunc is correct in the new version of llvm. Then focus on the lowering of llvm.fptrunc.

Okay I have a few things I need to take care of and then I can try to look at this myself. Thanks!

hamptonm1 · 2024-05-07T14:50:28Z

@flemairen6 Can you do me a favor and update your description for this PR? It seems like you checked out fd0c20a for StableHLO but you specify a different commit hash above. I am trying to test things now using your branch :)

hamptonm1 · 2024-05-07T19:43:35Z

@jenkins-droid test this please

hamptonm1 · 2024-05-13T19:22:37Z

@chentong319 I merged your recent update into the PR, and now it seems like only the layer_normalization tests are failing... hmmm. Maybe the issue that occurred with unique also applies to instance norm and layer norm. @AlexandreEichenberger Since you wrote a lot of the code for normalization, would you mind taking a look at the failed backend tests (and maybe refer to Tong's recent PR for the changes made to unique due to the deallocation pass)?

hamptonm1

LGTM!

jenkins-droid · 2024-05-29T16:59:18Z

Jenkins Linux ppc64le Build #13916 [push] Bump LLVM commit to 20ed... started at 13:09

jenkins-droid · 2024-05-29T16:59:18Z

Jenkins Linux amd64 Build #14886 [push] Bump LLVM commit to 20ed... started at 11:59

jenkins-droid · 2024-05-29T16:59:18Z

Jenkins Linux s390x Build #14891 [push] Bump LLVM commit to 20ed... started at 12:59

jenkins-droid · 2024-05-29T19:12:35Z

Jenkins Linux amd64 Build #14886 [push] Bump LLVM commit to 20ed... passed after 2 hr 13 min

jenkins-droid · 2024-05-29T19:13:29Z

Jenkins Linux s390x Build #14891 [push] Bump LLVM commit to 20ed... passed after 2 hr 14 min

jenkins-droid · 2024-05-29T20:07:10Z

Jenkins Linux ppc64le Build #13916 [push] Bump LLVM commit to 20ed... passed after 3 hr 7 min

Bump LLVM commit to 1e6ce5e284f5c0e8d64eee21af727bb164eb3caf and stablehlo to 49dc86c0df9ac3a8a556208674204b0f68d8eb6d

9011df7

Signed-off-by: Ferdinand Lemaire <ferdinan@xilinx.com>

flemairen6 requested a review from hamptonm1 April 16, 2024 06:59

chentong319 previously approved these changes Apr 16, 2024


Bump LLVM to 20ed5b1f4587 and stablehlo to fd0c20a10

1e0298c

Revert the changes to properties from stablehlo and fix some references where variable names had changed. Signed-off-by: Ferdinand Lemaire <ferdinand.lemaire@amd.com>

flemairen6 changed the title ~~Bump LLVM commit to 1e6ce5e284f5c0e8d64eee21af727bb164eb3caf~~ Bump LLVM commit to 20ed5b1f4587 Apr 18, 2024

Merge branch 'main' into ferdinand.update_llvm_april_2024

4a4c568

Remove Sequence_with_dealloc.mlir from the ran tests because it's unstable

7ef1b58

Signed-off-by: Ferdinand Lemaire <ferdinand.lemaire@amd.com>

hamptonm1 added 2 commits April 23, 2024 16:27

Merge branch 'main' into ferdinand.update_llvm_april_2024

e16c813

Merge branch 'main' into ferdinand.update_llvm_april_2024

a9708e1

Merge branch 'main' into ferdinand.update_llvm_april_2024

c38e5a8

chentong319 mentioned this pull request May 8, 2024

Use output shape when lowering Unique to krnl #2820

Merged

hamptonm1 added 2 commits May 10, 2024 11:20

Merge branch 'main' into ferdinand.update_llvm_april_2024

d95c187

Merge branch 'main' into ferdinand.update_llvm_april_2024

6d43ba7

hamptonm1 and others added 7 commits May 14, 2024 09:17

Merge branch 'main' into ferdinand.update_llvm_april_2024

dea4e10

Merge branch 'main' into ferdinand.update_llvm_april_2024

2dd1d31

Merge branch 'main' into ferdinand.update_llvm_april_2024

7bbf15a

Merge branch 'main' into ferdinand.update_llvm_april_2024

60000d4

Merge branch 'main' into ferdinand.update_llvm_april_2024

f58cad9

Merge branch 'main' into ferdinand.update_llvm_april_2024

25df99b

Merge branch 'main' into ferdinand.update_llvm_april_2024

458bc31

hamptonm1 approved these changes May 29, 2024


hamptonm1 merged commit f2bccef into onnx:main May 29, 2024
6 of 7 checks passed
