Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump LLVM commit to 20ed5b1f4587 #2799

Merged
merged 16 commits into from
May 29, 2024

Conversation

flemairen6
Copy link
Collaborator

@flemairen6 flemairen6 commented Apr 15, 2024

Bump LLVM commit to 20ed5b1f4587 and Stablehlo to fd0c20a10

stablehlo to 49dc86c0df9ac3a8a556208674204b0f68d8eb6d

Signed-off-by: Ferdinand Lemaire <ferdinan@xilinx.com>
@jenkins-droid
Copy link
Collaborator

Can one of the admins verify this patch?

@flemairen6
Copy link
Collaborator Author

Would someone know why the MacOS build is failing? The error seems to come from stablehlo and to be self-contained in a file from there so I would assume it's not the bump that made it fail

chentong319
chentong319 previously approved these changes Apr 16, 2024
Copy link
Collaborator

@chentong319 chentong319 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@chentong319 chentong319 dismissed their stale review April 16, 2024 15:44

Build failed.

@chentong319
Copy link
Collaborator

Would someone know why the MacOS build is failing? The error seems to come from stablehlo and to be self-contained in a file from there so I would assume it's not the bump that made it fail

The related error message:

/Users/runner/work/onnx-mlir/onnx-mlir/third_party/stablehlo/stablehlo/reference/Ops.cpp:252:16: note: candidate template ignored: deduced conflicting types for parameter 'T' ('long' vs. 'long long')

is long vs. long long a special case on Mac?

@gongsu832
Copy link
Collaborator

long and long long can be different or the same depending on the compiler, OS, and arch. So one really shouldn't make any assumption about whether they are the same or not.

@flemairen6
Copy link
Collaborator Author

long and long long can be different or the same depending on the compiler, OS, and arch. So one really shouldn't make any assumption about whether they are the same or not.

Right, but since it seems to be a stablehlo contained problem (the problematic function is defined and only used in onnx-mlir/third_party/stablehlo/stablehlo/reference/Ops.cpp ), how do we go about this? Should we open an issue over there and wait for a new green commit?

@gongsu832
Copy link
Collaborator

Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.

@flemairen6
Copy link
Collaborator Author

Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.

Just checked and it appears they fixed it already. It seems they bump LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.

@hamptonm1
Copy link
Collaborator

Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.

Just checked and it appears they fixed it already. It seems they bump LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.

Thanks @flemairen6 for your hard work! Are you looking at LLVM commit 20ed5b1f4587 and Stablehlo commit 4f6df1b or a newer Stablehlo commit?

@flemairen6
Copy link
Collaborator Author

Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.

Just checked and it appears they fixed it already. It seems they bump LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.

Thanks @flemairen6 for your hard work! Are you looking at LLVM commit 20ed5b1f4587 and Stablehlo commit 4f6df1b or a newer Stablehlo commit?

Yes for the LLVM commit, but I was planning to use fd0c20a10 for the stablehlo commit since its the one with the fix for the current issue. It's a few commits more recent than the LLVM bump, but the other solution would be to wait another integration of LLVM in stablehlo, which probably isn't necessary. What do you think?

Revert the changes to properties from stablehlo and fix some references
where variable names had changed.

Signed-off-by: Ferdinand Lemaire <ferdinand.lemaire@amd.com>
@jenkins-droid
Copy link
Collaborator

Can one of the admins verify this patch?

@flemairen6 flemairen6 changed the title Bump LLVM commit to 1e6ce5e284f5c0e8d64eee21af727bb164eb3caf Bump LLVM commit to 20ed5b1f4587 Apr 18, 2024
@flemairen6
Copy link
Collaborator Author

@chentong319 I see conversion/onnx_to_krnl/Sequence/Sequence_with_dealloc.mlir failing on MacOS only - would you have an idea what could cause it? I've seen you mentionned in discussions surrounding this test before so I figured you might know

@hamptonm1
Copy link
Collaborator

Unless they already knew about the problem, I think opening an issue at stablehlo is probably the way to go.

Just checked and it appears they fixed it already. It seems they bump LLVM to a newer commit, so I'll bump again to a new LLVM to match, and bump stablehlo to the fixed commit.

Thanks @flemairen6 for your hard work! Are you looking at LLVM commit 20ed5b1f4587 and Stablehlo commit 4f6df1b or a newer Stablehlo commit?

Yes for the LLVM commit, but I was planning to use fd0c20a10 for the stablehlo commit since its the one with the fix for the current issue. It's a few commits more recent than the LLVM bump, but the other solution would be to wait another integration of LLVM in stablehlo, which probably isn't necessary. What do you think?

Agreed! Makes sense to me :)

@hamptonm1
Copy link
Collaborator

@chentong319 I see conversion/onnx_to_krnl/Sequence/Sequence_with_dealloc.mlir failing on MacOS only - would you have an idea what could cause it? I've seen you mentionned in discussions surrounding this test before so I figured you might know

@flemairen6 You can remove or comment out the Sequence_with_dealloc.mli test... I ran into issues with that test for the last LLVM upgrade. @chentong319 is planning on fixing the test. I just changed the test so that it can pass for the build.

unstable

Signed-off-by: Ferdinand Lemaire <ferdinand.lemaire@amd.com>
@jenkins-droid
Copy link
Collaborator

Can one of the admins verify this patch?

@hamptonm1
Copy link
Collaborator

@jenkins-droid test this please

@hamptonm1
Copy link
Collaborator

@flemairen6 New issue... it seems like the following backend tests are failing:

=========================== short test summary info ============================
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_not_sorted_without_axis_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_3d_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_cpu - ...
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_negative_axis_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_without_axis_cpu

@hamptonm1
Copy link
Collaborator

@chentong319 I created an issue to track the lit test fix: #2803

@flemairen6
Copy link
Collaborator Author

flemairen6 commented Apr 22, 2024

@flemairen6 New issue... it seems like the following backend tests are failing:

=========================== short test summary info ============================
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_not_sorted_without_axis_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_3d_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_axis_cpu - ...
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_with_negative_axis_cpu
 Debug/test.py::OnnxBackendNodeModelTest::test_unique_sorted_without_axis_cpu

I could reproduce the failing tests, but I have a hard time finding out the root cause, I'm seeing a double free in the OMRunner. My knowledge on the backend is really limited, who would be a Backend expert that could help me with this?

@chentong319
Copy link
Collaborator

When a lit test failed only on Mac, it is usually caused by the different order of DAG operations. The solution is either to avoid ambiguity in code gen or modify the CHECK in lit test.
I will come back to fix the lit test for sequence after onnx-mlir is moved to the new llvm.

@hamptonm1
Copy link
Collaborator

@flemairen6 Do you have the exact error message that you can post here? I see Segmentation fault in the stack trace

@flemairen6
Copy link
Collaborator Author

flemairen6 commented Apr 23, 2024

When a lit test failed only on Mac, it is usually caused by the different order of DAG operations. The solution is either to avoid ambiguity in code gen or modify the CHECK in lit test. I will come back to fix the lit test for sequence after onnx-mlir is moved to the new llvm.

Looks like it's segfaulting so I don't think it's just a CHECK issue - the same bug seems to trigger on other architectures too

@flemairen6
Copy link
Collaborator Author

@flemairen6 Do you have the exact error message that you can post here? I see Segmentation fault in the stack trace

I don't have a lot to work with, this is the full trace. Looks like some JSON didn't get dumped, or did with a wrong format, it's hard to tell

@chentong319
Copy link
Collaborator

chentong319 commented Apr 29, 2024

I looked into the backend test failures. They are related to ONNXUniqueOp. The new buffer deallocation in llvm cannot handle unrealized_conversion_cast correctly. Now deallocation is generated for allocated memref used in ReturnOp. When I modify the code to avoid the unrealized_conversion_cast, the test case is fine.
While waiting for llvm to fix this bug, we can avoid generating these unrealized_conversion_cast. It will be a better code anyway. @negiyas , when you lower UniqueOp to krnl, you may use the types of the output to avoid unrealized_conversion_cast. Could you check whether it is easy to do that?

Comment: I create #2820 to fix this issue. No need to worry about it if the PR is fine.

@chentong319
Copy link
Collaborator

The PR will fail on some cast related test if the deallocation pass is commented. I use test_cast_DOUBLE_to_FLOAT16_cpu as an example. The failure is after onnx-mlir generated llvm IR. In this test case, the only special op is arith.truncf. I do noticed that the definition of TruncFOp is changed. But the output of --EmitLLVMLIR are the same for old and new version of onnx-mlir. I suspect that the error is in lowering llvm::fptrunc. But I did not go further.

@hamptonm1 Do you have some bandwidth to track this problem down? To me, the first thing is to verify the lowering arith.trucf to llvm.fptrunc is correct in the new version of llvm. Then focus on the lowering of llvm.fptrunc.

@hamptonm1
Copy link
Collaborator

The PR will fail on some cast related test if the deallocation pass is commented. I use test_cast_DOUBLE_to_FLOAT16_cpu as an example. The failure is after onnx-mlir generated llvm IR. In this test case, the only special op is arith.truncf. I do noticed that the definition of TruncFOp is changed. But the output of --EmitLLVMLIR are the same for old and new version of onnx-mlir. I suspect that the error is in lowering llvm::fptrunc. But I did not go further.

@hamptonm1 Do you have some bandwidth to track this problem down? To me, the first thing is to verify the lowering arith.trucf to llvm.fptrunc is correct in the new version of llvm. Then focus on the lowering of llvm.fptrunc.

Okay I have a few things I need to take care of and then I can try to look at this myself. Thanks!

@hamptonm1
Copy link
Collaborator

@flemairen6 Can you do me a favor and update your description for this PR? It seems like you checked out fd0c20a for StableHLO but you specify a different commit hash above. I am trying to test things now using your branch :)

@hamptonm1
Copy link
Collaborator

@jenkins-droid test this please

@hamptonm1
Copy link
Collaborator

hamptonm1 commented May 13, 2024

@chentong319 I merged your recent update into the PR and now it seems like all layer_normalization tests are failing only.... hmmm. Maybe the same applies with instance norm and layer norm that occurred with unique. @AlexandreEichenberger Since you wrote a lot of the code for normalization, you mind taking a look at the failed backed tests (and maybe refer to Tong's recent PR for the changes made to unique due to the deallocation pass)?

Copy link
Collaborator

@hamptonm1 hamptonm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@hamptonm1 hamptonm1 merged commit f2bccef into onnx:main May 29, 2024
6 of 7 checks passed
@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #13916 [push] Bump LLVM commit to 20ed... started at 13:09

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #14886 [push] Bump LLVM commit to 20ed... started at 11:59

@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #14891 [push] Bump LLVM commit to 20ed... started at 12:59

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #14886 [push] Bump LLVM commit to 20ed... passed after 2 hr 13 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #14891 [push] Bump LLVM commit to 20ed... passed after 2 hr 14 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #13916 [push] Bump LLVM commit to 20ed... passed after 3 hr 7 min

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants