torch causes Fatal Python error: Floating point exception #1198

Open
rw57 opened this issue Apr 9, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@rw57

rw57 commented Apr 9, 2024

🐛 Bug Report

📝 Description of issue:

The log is filled with fatal Python error traces like the one below. I'm scanning tens of thousands of photos on a fresh Docker install.

00:31:21 [Q] CRITICAL reincarnated worker Process-e59e78ff6711490fb016575816db4f62 after death
00:31:21 [Q] INFO Process-5affe1a61cf44377ab85d669f69acbb0 ready for work at 11707
00:31:21 [Q] INFO Process-5affe1a61cf44377ab85d669f69acbb0 processing coffee-uniform-ack-papa 'api.directory_watcher.handle_new_image'
INFO:ownphotos:job f61d95b4-fbe3-4bda-a5e9-3e591c2aefed: calculate aspect ratio: /data/XXXXXXPATHTOMYPHOTOXXXXX.jpg, elapsed: 1.269778
Fatal Python error: Floating point exception

Current thread 0x00007fa671ffd040 (most recent call first):
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/conv.py", line 456 in _conv_forward
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/conv.py", line 460 in forward
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1520 in _call_impl
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1511 in _wrapped_call_impl
File "/code/api/places365/wideresnet.py", line 95 in forward
File "/code/api/places365/places365.py", line 140 in inference_places365
File "/code/api/models/photo.py", line 271 in _generate_captions
File "/code/api/directory_watcher.py", line 168 in handle_new_image
File "/usr/local/lib/python3.11/dist-packages/django_q/worker.py", line 97 in worker
File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
File "/usr/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 191 in spawn_process
File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 198 in spawn_worker
File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 227 in reincarnate
File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 306 in guard
File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 167 in start
File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 158 in __init__
File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
File "/usr/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 66 in start
File "/usr/local/lib/python3.11/dist-packages/django_q/management/commands/qcluster.py", line 37 in handle
File "/usr/local/lib/python3.11/dist-packages/django/core/management/base.py", line 458 in execute
File "/usr/local/lib/python3.11/dist-packages/django/core/management/base.py", line 412 in run_from_argv
File "/usr/local/lib/python3.11/dist-packages/django/core/management/__init__.py", line 436 in execute
File "/usr/local/lib/python3.11/dist-packages/django/core/management/__init__.py", line 442 in execute_from_command_line
File "/code/manage.py", line 31 in <module>

Extension modules: psutil._psutil_linux, psutil._psutil_posix, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, charset_normalizer.md, _cffi_backend, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imaging, PIL._imagingft, yaml._yaml, matplotlib._c_internal_utils, matplotlib._path, kiwisolver._cext, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, matplotlib._image, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, 
scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._cdflib, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, 
scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, scipy.cluster._vq, scipy.cluster._hierarchy, scipy.cluster._optimal_leaf_ordering, sklearn.__check_build._check_build, sklearn.utils._isfinite, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.neighbors._partition_nodes, sklearn.neighbors._ball_tree, sklearn.neighbors._kd_tree, sklearn.utils.arrayfuncs, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.linear_model._cd_fast, sklearn._loss._loss, sklearn.svm._liblinear, sklearn.svm._libsvm, 
sklearn.svm._libsvm_sparse, sklearn.utils._weight_vector, sklearn.linear_model._sgd_fast, sklearn.linear_model._sag_fast, sklearn.decomposition._online_lda_fast, sklearn.decomposition._cdnmf_fast, hdbscan.dist_metrics, hdbscan._hdbscan_linkage, hdbscan._hdbscan_tree, hdbscan._hdbscan_reachability, hdbscan._hdbscan_boruvka, sklearn._isotonic, sklearn.tree._utils, sklearn.tree._tree, sklearn.tree._splitter, sklearn.tree._criterion, sklearn.neighbors._quad_tree, sklearn.manifold._barnes_hut_tsne, sklearn.manifold._utils, hdbscan._prediction_utils, PIL._imagingmath, PIL._webp (total: 232)

🔁 How can we reproduce it:

Unsure. This happened on a fresh install, and I reproduced it by deleting all the librephotos and database folders and starting over. I'm running on Podman instead of Docker, but the web interface works well and I can see that it has found my photos. I don't think the torch library should be able to crash the librephotos job like this. Does it need some exception handling to fail more gracefully?
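On the exception-handling question: "Fatal Python error: Floating point exception" is a signal (SIGFPE) that terminates the interpreter, so a `try`/`except` around the torch call never gets to run. One way to contain such a crash is to do the risky work in a child process and inspect its exit status, which is roughly what django-q already does when it reincarnates the dead worker in the log above. The following is only a sketch of that idea: `run_isolated` and `CRASHING_CHILD` are hypothetical names, the child simulates the crash by sending itself SIGFPE, and nothing here is taken from the librephotos code. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child program that kills itself with SIGFPE, standing in for the crashing
# torch call. The try/except shows that Python-level handling never triggers.
CRASHING_CHILD = """
import os, signal
try:
    os.kill(os.getpid(), signal.SIGFPE)
except Exception:
    print("caught")
"""


def run_isolated(source, timeout=60):
    """Run Python source in a child interpreter and report how it ended."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode == 0:
        return True, result.stdout.strip()
    if result.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return False, "killed by " + signal.Signals(-result.returncode).name
    return False, "exited with status %d" % result.returncode


if __name__ == "__main__":
    print(run_isolated("print('fine')"))
    print(run_isolated(CRASHING_CHILD))
```

The parent survives and can log the failure or skip the photo, at the cost of starting a new interpreter for each isolated call.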

It's certainly possible this is an artifact of using podman. Here is the podman kube file I'm using with podman play kube (note that in podman Pods, all containers share an IP address and localhost):

# Save the output of this file and use kubectl create -f to import
# it into Kubernetes.
#
# Created with podman-4.9.3

# NOTE: If you generated this yaml from an unprivileged and rootless podman container on an SELinux
# enabled system, check the podman generate kube man page for steps to follow to ensure that your pod/container
# has the right permissions to access the volumes added.
---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-04-08T09:10:59Z"
  labels:
    app: librephotos
  name: librephotos
spec:
  containers:
  - args:
    - postgres
    - -c
    - fsync=off
    - -c
    - synchronous_commit=off
    - -c
    - full_page_writes=off
    - -c
    - random_page_cost=1.0
    env:
    - name: POSTGRES_USER
      value: docker
    - name: POSTGRES_PASSWORD
      value: MYPASSWORDHERE
    - name: POSTGRES_DB
      value: librephotos
    image: docker.io/library/postgres:13
    name: db
    volumeMounts:
    - mountPath: /var/lib/postgresql/data
      name: storage-storage-librephotos-data-db-host-0
  - args:
    - nginx
    - -g
    - daemon off;
    image: docker.io/reallibrephotos/librephotos-proxy:latest
    name: proxy
    ports:
    - containerPort: 80
      hostPort: 3000
    volumeMounts:
    - mountPath: /data
      name: storage-pictures-host-0
      readOnly: true
    - mountPath: /protected_media
      name: storage-storage-librephotos-data-protected_media-host-1
  - image: docker.io/reallibrephotos/librephotos-frontend:latest
    name: frontend
    securityContext: {}
  - env:
    - name: DB_PORT
      value: "5432"
    - name: BACKEND_HOST
      value: backend
    - name: DB_NAME
      value: librephotos
    - name: DB_BACKEND
      value: postgresql
    - name: DB_PASS
      value: MYPASSWORDHERE
    - name: DB_USER
      value: docker
    - name: DB_HOST
      value: localhost
    - name: DEBUG
      value: "0"
    - name: WEB_CONCURRENCY
      value: "1"
    - name: ALLOW_UPLOAD
      value: "false"
    image: docker.io/reallibrephotos/librephotos:latest
    name: backend
    volumeMounts:
    - mountPath: /root/.cache
      name: storage-storage-librephotos-data-cache-host-0
    - mountPath: /data
      name: storage-pictures-host-1
      readOnly: true
    - mountPath: /protected_media
      name: storage-storage-librephotos-data-protected_media-host-2
    - mountPath: /logs
      name: storage-storage-librephotos-data-logs-host-3
  volumes:
  - hostPath:
      path: /storage/librephotos/data/db
      type: Directory
    name: storage-storage-librephotos-data-db-host-0
  - hostPath:
      path: /pictures
      type: Directory
    name: storage-pictures-host-0
  - hostPath:
      path: /storage/librephotos/data/protected_media
      type: Directory
    name: storage-storage-librephotos-data-protected_media-host-1
  - hostPath:
      path: /storage/librephotos/data/cache
      type: Directory
    name: storage-storage-librephotos-data-cache-host-0
  - hostPath:
      path: /pictures
      type: Directory
    name: storage-pictures-host-1
  - hostPath:
      path: /storage/librephotos/data/protected_media
      type: Directory
    name: storage-storage-librephotos-data-protected_media-host-2
  - hostPath:
      path: /storage/librephotos/data/logs
      type: Directory
    name: storage-storage-librephotos-data-logs-host-3

Please provide additional information:

  • 💻 Operating system: Linux (Fedora CoreOS)
  • ⚙ Architecture (x86 or ARM): x86_64
  • 🔢 Librephotos version: 2024w14p1 (docker latest)
  • 📸 Librephotos installation method (Docker, Kubernetes, .deb, etc.): Docker (but using podman on Fedora CoreOS)
    • 🐋 If Docker or Kubernetes, provide docker-compose image tag: latest
  • 📁 How is your picture library mounted (Local file system (Type), NFS, SMB, etc.): Local file system
rw57 added the bug label Apr 9, 2024
@derneuere
Member

This seems to be related to PyTorch. Hard crashes in PyTorch usually come down to a problem with one of the CPU's instruction sets. Can you tell me what kind of CPU you are using, and whether any virtualization is involved?

@rw57
Author

rw57 commented Apr 9, 2024

It is an older computer but should have sufficient memory and storage. I'm running Fedora CoreOS on bare metal, so no virtualization is involved. I didn't see a particular CPU requirement in the PyTorch documentation. Any idea what it needs? How do I bypass or disable PyTorch?

uname -a
Linux hostname 6.7.7-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Mar 1 16:53:59 UTC 2024 x86_64 GNU/Linux

cat /proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
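To make the instruction-set question concrete, the flags line from the /proc/cpuinfo output above can be compared against a list of newer SIMD extensions. The sketch below does that. The choice of `CANDIDATE_FLAGS` is an assumption made for illustration: the thread does not establish which extensions the PyTorch build in the image relies on, so a missing flag is a lead to investigate, not a diagnosis. `missing_flags` and `read_cpu_flags` are hypothetical helper names.

```python
# Flags line copied from the /proc/cpuinfo output above (processor 0).
ATHLON_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm "
    "3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni "
    "monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a "
    "misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv "
    "svm_lock nrip_save"
)

# Assumption: extensions worth checking. The thread does not say which ones
# the PyTorch build actually requires.
CANDIDATE_FLAGS = ("ssse3", "sse4_1", "sse4_2", "avx", "avx2", "fma")


def missing_flags(flags_line, wanted=CANDIDATE_FLAGS):
    """Return the wanted flags that do not appear in a cpuinfo flags line."""
    present = set(flags_line.split())
    return [flag for flag in wanted if flag not in present]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' value from a cpuinfo-style file, or ''."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return line.split(":", 1)[1].strip()
    return ""


if __name__ == "__main__":
    print(missing_flags(ATHLON_FLAGS))
```

For the Athlon II X4 635 flags above, every entry in `CANDIDATE_FLAGS` is reported as missing; the CPU lists `sse4a` and `pni` (SSE3), which are separate flags from `sse4_1`, `sse4_2`, and `ssse3`.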

@derneuere
Member

We upgraded to PyTorch 2.3; maybe this got fixed in that release :)
