[NNPA] Check return values of zdnn calls. (follow-up of PR#2267) #2338

imaihal · 2023-06-27T14:14:17Z

This PR is a follow-up to PR #2267.

The current main branch does not check the return values of zDNN functions: if a function returns an error, onnx-mlir continues execution without displaying any message. (zDNN reports its own errors, but they are not enough to identify the operation that generated them.) This PR adds code that checks the return value and, on error, displays the API name and the returned value. The message helps identify the failing operation and its cause. This PR also provides an option, --func-call-error-exit, to stop execution when a function returns an error.

Example error message in T5

onnx-mlir: Error in zDNN call(ZDNN_LOG): returned 0x20001
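
The check behind this message can be sketched in plain C++. This is only an illustration of the runtime behaviour, not the LLVM IR that the PR emits: `kStatusOk`, `formatZdnnError`, and `checkZdnnStatus` are hypothetical names, and the assumption that success is status 0 stands in for zDNN's `ZDNN_OK`. Only the message format is taken from the example above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: success is reported as 0 (stand-in for ZDNN_OK in zdnn.h).
constexpr uint32_t kStatusOk = 0;

// Build the diagnostic in the shape shown in the PR description:
// API name followed by the returned status in hex.
std::string formatZdnnError(const std::string &apiName, uint32_t status) {
  char hex[16];
  std::snprintf(hex, sizeof(hex), "0x%x", static_cast<unsigned>(status));
  return "onnx-mlir: Error in zDNN call(" + apiName + "): returned " + hex;
}

// Return true on success. On failure, print the diagnostic and return false
// so that the caller can decide whether to continue or stop.
bool checkZdnnStatus(const std::string &apiName, uint32_t status) {
  if (status == kStatusOk)
    return true;
  std::fprintf(stderr, "%s\n", formatZdnnError(apiName, status).c_str());
  return false;
}
```

A caller would wrap each accelerator call, for example `checkZdnnStatus("ZDNN_LOG", status)`, and stop only when the error-exit option is enabled.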

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>
Co-authored-by: Yasushi Negishi <negishi@jp.ibm.com>

tungld

How about having two options (feel free to change the option names):

nnpa-check-zdnn-return-value to emit code that checks zdnn return values
nnpa-stop-if-zdnn-failed to terminate if a zdnn function returns an error; this is valid only if nnpa-check-zdnn-return-value is on.
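
The dependency between the two proposed options can be sketched as below. This is a standalone illustration that does not use the `llvm::cl` machinery of the actual compiler; the struct `NnpaCheckOptions` and the function `validateOptions` are hypothetical, and the option names are the ones suggested above, which may change.

```cpp
#include <cassert>
#include <string>

// Hypothetical holder for the two proposed flags.
struct NnpaCheckOptions {
  bool checkZdnnReturnValue = false; // nnpa-check-zdnn-return-value
  bool stopIfZdnnFailed = false;     // nnpa-stop-if-zdnn-failed
};

// Return an empty string when the combination is valid, otherwise a
// description of the problem. Stopping on failure only makes sense when
// the return-value checks are emitted in the first place.
std::string validateOptions(const NnpaCheckOptions &opts) {
  if (opts.stopIfZdnnFailed && !opts.checkZdnnReturnValue)
    return "nnpa-stop-if-zdnn-failed requires nnpa-check-zdnn-return-value";
  return "";
}
```

An alternative design would be for the stop option to imply the check option instead of rejecting the combination; the comment above does not say which is intended.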

tungld · 2023-07-03T01:22:12Z

src/Accelerators/NNPA/Conversion/ZLowToLLVM/ZLowToLLVMCommon.cpp

+        // Exit
+        if (funcCallErrorExit) {
+          MLIRContext *context = rewriter.getContext();
+          Type int32Ty = IntegerType::get(context, 32);


Please use i64 since we have an issue with i32 on big endian machine. More info: #778
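
For readers unfamiliar with this class of problem, here is a sketch of one way a 32-bit/64-bit mismatch can misbehave on a big-endian machine. It is an assumption that this resembles the issue in #778, which is not reproduced here. The helpers `storeI32` and `loadI64` are hypothetical and simulate both byte orders explicitly, so the result does not depend on the host.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class ByteOrder { Little, Big };

// Write a 32-bit value into the first 4 bytes of a zeroed 8-byte slot,
// using the given byte order.
std::array<uint8_t, 8> storeI32(uint32_t v, ByteOrder order) {
  std::array<uint8_t, 8> slot{};
  for (int i = 0; i < 4; ++i) {
    int shift = (order == ByteOrder::Little) ? 8 * i : 8 * (3 - i);
    slot[i] = static_cast<uint8_t>(v >> shift);
  }
  return slot;
}

// Interpret all 8 bytes of the slot as one 64-bit value in the given order.
uint64_t loadI64(const std::array<uint8_t, 8> &slot, ByteOrder order) {
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i) {
    int shift = (order == ByteOrder::Little) ? 8 * i : 8 * (7 - i);
    v |= static_cast<uint64_t>(slot[i]) << shift;
  }
  return v;
}
```

With little-endian order, a stored 1 reads back as 1 even when the reader expects 64 bits, which hides the mismatch. With big-endian order the same read yields 1 << 32, so using one width (i64) on both sides avoids the problem.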

tungld · 2023-07-03T01:24:01Z

src/Accelerators/NNPA/Conversion/ZLowToLLVM/ZLowToLLVMCommon.cpp

+  // compare the return value with ref
+  std::string errorMsg =
+      "onnx-mlir: Error in zDNN call(" + apiIdStr(apiId) + "): returned ";
+  equalOrExit(module, rewriter, loc, ref, ret, errorMsg, funcCallErrorExit);


Since NNPA calls already check the return value, I recommend making this check optional, enabled only when needed through an onnx-mlir option.

tungld · 2023-07-03T01:26:06Z

src/Accelerators/NNPA/Conversion/ZLowToLLVM/ZLowToLLVMCommon.cpp

+          Type int32Ty = IntegerType::get(context, 32);
+          Value one = create.math.constant(int32Ty, 1);
+          FlatSymbolRefAttr exitRef = krnl::getOrInsertExit(rewriter, module);
+          create.llvm.call({}, exitRef, {one});


In onnx-mlir, we return null instead of calling exit. Could you change it to return null?
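
The difference between the two styles can be sketched as follows. This is an illustration only: `runSteps` is a hypothetical stand-in for a compiled entry point, the statuses are supplied by the caller instead of coming from real zDNN calls, and status 0 is assumed to mean success.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Assumption: success is reported as 0.
constexpr uint32_t kStatusOk = 0;

// Walk through the statuses of a sequence of accelerator calls. On the
// first failure, report it and return nullptr so the caller can recover;
// the process is not terminated. On success, return the output.
const std::vector<float> *runSteps(const std::vector<uint32_t> &statuses,
                                   const std::vector<float> &output) {
  for (uint32_t status : statuses) {
    if (status != kStatusOk) {
      std::fprintf(stderr, "zDNN call failed: 0x%x\n",
                   static_cast<unsigned>(status));
      return nullptr;
    }
  }
  return &output;
}
```

Returning null leaves the decision to the embedding application, which matters when the model runs inside a long-lived server process that should not be killed by a single failed inference.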

tungld · 2023-07-03T01:29:37Z

src/Compiler/CompilerOptions.cpp

+    llvm::cl::desc("Execution failed when external function call failed."
+                   " Currently only zDNN calls in NNPA are supported."),
+    llvm::cl::init(false), llvm::cl::cat(OnnxMlirOptions));
+


Move this inside the NNPA folder, since it is for NNPA only. Prefixing NNPA options with nnpa would probably be easier for users.

tungld · 2023-07-03T01:33:50Z

test/mlir/accelerators/nnpa/conversion/lower-all-to-llvm-error-exit.mlir

+  %0 = memref.alloc() : memref<10x10xf32>
+  %1 = memref.alloc() : memref<1x1x32x64xf32>
+  "zlow.stick"(%0, %1) : (memref<10x10xf32>, memref<1x1x32x64xf32>) -> ()
+  return


The following tests all share the same pattern. It looks redundant to test every operation.

Instead, you can have one test with two or three zlow ops, e.g.

y = zlow.stick(x)
z = zlow.relu(y)
out = zlow.unstick(z)
return out

tungld · 2023-07-03T01:36:22Z

test/mlir/accelerators/nnpa/conversion/lower-all-to-llvm-error-exit.mlir

+// CHECK:           llvm.call @printf([[VAR_73_1_]], [[VAR_69_1_]]) : (!llvm.ptr, i32) -> ()
+// CHECK:           [[VAR_74_1_:%.+]] = llvm.mlir.constant(1 : i32) : i32
+// CHECK:           llvm.call @exit([[VAR_74_1_]]) : (i32) -> ()
+// CHECK:           llvm.br ^bb2


Since you call exit, llvm.br ^bb2 looks like dead code?

AlexandreEichenberger · 2023-07-06T13:08:45Z

@imaihal @tungld High-level question: have we approached the zDNN folks to see if it makes more sense to have them generate an error log and/or abort?

imaihal · 2023-08-02T06:16:29Z

@imaihal @tungld High-level question: have we approached the zDNN folks to see if it makes more sense to have them generate an error log and/or abort?

@AlexandreEichenberger They have no plan to update error handling in zDNN at the moment. zDNN returns an error code, but onnx-mlir currently ignores it, so this PR adds a mechanism to handle it.

tungld · 2023-08-02T07:18:58Z

@AlexandreEichenberger They have no plan to update error handling in zDNN at the moment. zDNN returns an error code, but onnx-mlir currently ignores it, so this PR adds a mechanism to handle it.

@imaihal zDNN checks the error code and displays an error message for every zdnn function: https://github.com/IBM/zDNN/blob/main/zdnn/aiu_ops.c#L151

if (status == ZDNN_OK) {
    if (ef & EF_RANGE_VIOLATION_MASK) {
      status =
          ZDNN_STATUS(ZDNN_ELEMENT_RANGE_VIOLATION,
                      "Range violation on tensor data", NO_ARG); /*
                               AIU operation returned a RANGE VIOLATION, set as
                               a warning code and continue processing */
    } else if (ef & ~EF_RANGE_VIOLATION_MASK) {
      return status = ZDNN_STATUS(ZDNN_UNSUPPORTED_AIU_EXCEPTION,
                                  "Unsupported exception on ZDNN operation",
                                  NO_ARG); /* AIU operation returned an
                               unexpected exception, return as a failure */
    }

It would be very easy to add the function_code to the ZDNN_STATUS message. Do you know why they rejected updating the message with the function code?

imaihal · 2023-08-02T15:27:23Z

@tungld Sorry, to be precise, it has not been rejected. @negiyas created an issue about this (IBM/zDNN#17), but it has not been answered yet.
I remember there was a question about whether onnx-mlir checks the return values of zDNN. Currently it does not, so we implemented this mechanism.
Are you concerned about performance overhead?

AlexandreEichenberger · 2023-08-02T15:32:05Z

Are you concerned about performance overhead?

If you think about it, with 100 calls to zdnn matmul you will have 100 insertions of the code that does this check. If the check is in the library, where there is already some code checking for errors, it exists only once per operation (e.g. only one copy in the matmul code, regardless of how many call sites there are for zdnn matmul).
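
The argument can be illustrated with a small sketch in which the check lives in one function and is not repeated at each call site. Everything here is hypothetical: `fakeMatmul` stands in for a zDNN operation, `checkedMatmul` for a library-side entry point that contains the check, and `reportedFailures` exists only to make the effect observable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Assumption: success is reported as 0.
constexpr uint32_t kStatusOk = 0;

// Hypothetical accelerator operation; fails on request.
uint32_t fakeMatmul(bool fail) { return fail ? 0x20001u : kStatusOk; }

// Counts reported failures so the behaviour can be observed.
int reportedFailures = 0;

// The compare-and-report logic exists once, inside this function. Call
// sites only contain a call, however many of them there are (unless the
// compiler chooses to inline this function).
uint32_t checkedMatmul(bool fail) {
  uint32_t status = fakeMatmul(fail);
  if (status != kStatusOk) {
    ++reportedFailures;
    std::fprintf(stderr, "matmul failed: 0x%x\n",
                 static_cast<unsigned>(status));
  }
  return status;
}
```

The trade-off is that a centralized check cannot easily name the specific call site or model operation that failed, which is the information the per-call checks in this PR aim to provide.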

negiyas and others added 20 commits May 24, 2023 22:47

Check return value of zdnn_softmax. This patch still causes errors in…

950b66a

… krnl-to-llvm conversion. Signed-off-by: Yasushi Negishi <negishi@jp.ibm.com>

Merge branch 'main' into work_check_zdnn_api_return_value

31f6d20

Merge branch 'main' into work_check_zdnn_api_return_value

13b256e

Merge branch 'main' into pr2267

42b6522

Enable elementwise ops.

d81d8b9

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Add Exit func call.

99a6937

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Add ifThenElseTest for new ifThenElse func.

3e31211

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Merge branch 'main' into check_return_val_update_pr2267

4bf0c77

Merge branch 'main' into check_return_val_update_pr2267

1146c86

Update equalOrFailure()

83995ab

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Remove test code for IfThenElse.

3a8cc53

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Display return value as hex.

10d5da7

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Merge branch 'main' into check_return_val_update_pr2267

461b3eb

Add an option to enable exit when returning an error.

8eb3981

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Merge branch 'main' into check_return_val_update_pr2267

4003d5e

format

035beb8

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Apply to other tos

e1ac5be

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Add lit tests.

322defc

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Merge branch 'main' into check_return_val_update_pr2267

3f5e979

Change option name

6443b02

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

imaihal marked this pull request as ready for review June 29, 2023 16:00

imaihal added the Ready for Review label Jun 29, 2023

imaihal requested review from tungld, AlexandreEichenberger and negiyas June 30, 2023 03:12

Merge branch 'main' into check_return_val_update_pr2267

6b615ab

tungld reviewed Jul 3, 2023

Merge branch 'main' into check_return_val_update_pr2267

58c81d1

Merge branch 'main' into check_return_val_update_pr2267

1da04af

imaihal removed the Ready for Review label Aug 3, 2023

imaihal added 3 commits August 3, 2023 02:24

Merge branch 'main' into HEAD

7b58e35

Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

Merge branch 'check_return_val_update_pr2267' into check_return_val_u…

24ea3ac

…pdate_pr2267_bak

Fix commit id for third_party/mlir-hlo.

00c22a5

It seems I accidentally updated it before. Signed-off-by: Haruki Imai <imaihal@jp.ibm.com>

imaihal marked this pull request as draft August 28, 2023 06:42
