Error taxonomy: detail errors in "spack_error" #441

Open
alalazo opened this issue Mar 30, 2023 · 6 comments
Comments

@alalazo
Member

alalazo commented Mar 30, 2023

I started looking at failures in the last week, and I have some proposals for extending the taxonomy:

# Spack timed out while waiting for a lock
lock_timeout:
    grep_for:
    - "Timed out waiting for a write lock after"

# A file was not found during installation    
file_not_found:
    grep_for:
    - "FileNotFoundError"
    
# A command on the system exited non-zero
system_command_failure:
    grep_for:
    - "ProcessError: Command exited with status"
    
# Couldn't fetch an archive to build a spec
fetch_error:
    grep_for:
    - "FetchError:"

# Some method in a build_system base class failed
build_systems:
    grep_for:
    - "File \"/builds/spack/spack/lib/spack/spack/build_systems"

# Some attribute in a recipe was not found
attribute_error:
    grep_for:
    - "AttributeError:"

# Relocation refused to grow a string from buildcache
cannot_grow_string:
    grep_for:
    - "CannotGrowString"
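
To make the intent of the entries above concrete, here is a minimal sketch of how such a taxonomy could be applied to a raw job log. It is not the actual classifier: the `classify` function, the `TAXONOMY` dict, the first-match-wins ordering, the literal substring matching, and the `spack_error` fallback are all assumptions made for illustration. Only the category names and `grep_for` strings come from the proposal.

```python
# Hypothetical sketch: category names and patterns are from the proposal,
# everything else (names, ordering, matching rule) is an assumption.
TAXONOMY = {
    "lock_timeout": ["Timed out waiting for a write lock after"],
    "file_not_found": ["FileNotFoundError"],
    "system_command_failure": ["ProcessError: Command exited with status"],
    "fetch_error": ["FetchError:"],
    "build_systems": ['File "/builds/spack/spack/lib/spack/spack/build_systems'],
    "attribute_error": ["AttributeError:"],
    "cannot_grow_string": ["CannotGrowString"],
}


def classify(log_text: str, fallback: str = "spack_error") -> str:
    """Return the first category whose pattern occurs in the log.

    Patterns are treated as literal substrings here; a real tool might
    interpret them as regular expressions instead.
    """
    # Dicts preserve insertion order, so earlier entries take priority.
    for category, patterns in TAXONOMY.items():
        if any(pattern in log_text for pattern in patterns):
            return category
    return fallback


if __name__ == "__main__":
    sample = "==> Error: FetchError: All fetchers failed"
    print(classify(sample))
```

Because the first match wins in this sketch, a log containing both a lock timeout and a `FileNotFoundError` would be filed under `lock_timeout`, which is why the order of the entries matters.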

I also noticed a few socket.timeout errors, but I see that those are already taken care of in #439.

Notes

  • The lock timeout, so far, seems to happen only on POWER
  • I have seen a few cases of CMake returning -4. Does that have a particular meaning?
@kwryankrattiger
Collaborator

I will work on these. In the meantime, can you send the failed jobs that you found these in? That way I can make sure the classification order and regexes are working correctly.

@alalazo
Member Author

alalazo commented Mar 30, 2023

To get the categorization above, I checked the raw logs for all the "spack_error" failures of last week. Posting one link per refined category:

@tgamblin
Member

tgamblin commented Mar 30, 2023

@alalazo do you have numbers on the frequency of each of these?

@alalazo
Member Author

alalazo commented Mar 30, 2023

Number of logs in each category (out of ~500 logs downloaded):

  • lock_timeout: 210 (165 on POWER)
  • file_not_found: 18
  • system_command_failure: 171
  • fetch_error: 37
  • build_systems: 6
  • attribute_error: 12
  • cannot_grow_string: 29
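
For a quick sense of proportions, the counts above can be turned into shares. This is only a sketch: the variable names are made up for illustration, and the percentages are relative to the sum of the listed counts (483), not to the approximate total of ~500 downloaded logs. It also assumes each log was counted in exactly one category.

```python
# Counts copied from the list above; names are illustrative only.
counts = {
    "lock_timeout": 210,
    "file_not_found": 18,
    "system_command_failure": 171,
    "fetch_error": 37,
    "build_systems": 6,
    "attribute_error": 12,
    "cannot_grow_string": 29,
}
lock_timeout_on_power = 165

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print categories from most to least frequent.
for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24}{n:>4}  {100 * shares[name]:5.1f}%")

# Fraction of lock timeouts that happened on POWER.
power_fraction = lock_timeout_on_power / counts["lock_timeout"]
print(f"lock_timeout on POWER: {100 * power_fraction:.1f}%")
```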

Also, forgot to mention:

@alalazo
Member Author

alalazo commented Mar 31, 2023

So, looking at system command failures: the one I linked seems to be a case of scheduling to the wrong node, so (thanks @haampie for the suggestion) most of these cases might be illegal instructions.[1]

For https://gitlab.spack.io/spack/spack/-/jobs/6366035 specifically I wonder:

  • Why do we have a skylake_avx512 buildcache? Was there some transient error in Ci backwards compat spack#36045 that caused that?
  • Why are we mapping an AVX512 build to a zen2 (I think) node?

Footnotes

  1. Hopefully all of them, since the other hypothesis is errors in relocation.
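
One way to check the illegal-instruction hypothesis against the logs: if the status in "Command exited with status" comes from Python's `subprocess` on a POSIX system, a negative value -N means the child was terminated by signal N, and signal 4 is SIGILL (illegal instruction) on Linux. That would also explain the CMake -4 cases. The sketch below assumes that origin for the status and assumes the number directly follows the grep pattern in the log line; the function names are hypothetical.

```python
import re
import signal
from typing import Optional

# Assumption: the numeric status directly follows the taxonomy pattern.
STATUS_RE = re.compile(r"Command exited with status (-?\d+)")


def describe_exit_status(status: int) -> str:
    """Interpret a subprocess-style return code (negative = killed by signal)."""
    if status < 0:
        try:
            name = signal.Signals(-status).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-status}"
        return f"killed by {name}"
    if status == 0:
        return "success"
    return f"exited with code {status}"


def describe_log_line(line: str) -> Optional[str]:
    """Extract the status from a log line, if present, and describe it."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return describe_exit_status(int(match.group(1)))


if __name__ == "__main__":
    print(describe_log_line("ProcessError: Command exited with status -4"))
```

Counting how many "system_command_failure" logs resolve to SIGILL this way would show how much of that category is explained by jobs landing on the wrong node.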

@haampie
Member

haampie commented Mar 31, 2023

I fixed that here: spack/spack@dba57ff
