-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upload client sometimes misses either declaring the file replica or declaring the file to a dataset #6670
Comments
After looking through the code and the log above, I don't think the problem is exceptions being missed. It's true that the upload client doesn't explicitly catch exceptions from some of the server calls, but they should still be propagated to a higher level. In any case, it doesn't look as though any relevant exceptions are being thrown here. The server calls for the DID registration and replica registration are returning 200 or 201 (success). So the server is reporting to the client that the file was registered correctly, but it wasn't. Do you know if the replica registration is missing entirely or is it present but in the wrong state? (The replica is initially created before the file is uploaded, in state "C", then is set to state "A" if the upload succeeds, so this distinction could give us a clue as to what's going wrong). Is there any possibility of getting the server logs for while this was going on? I can try to reproduce this in a test environment but I'm not sure how easy it will be to reproduce a heavily loaded Rucio server. |
unfortunately the server logs rotated out already. The rucio replicas in question showed in state "U"when I did |
We will watch again during the upcoming dress rehearsal to see if this happens again, I expect it will. |
So one thing to look at is the following. DUNE has a privilege package which calls out to an external server (metacat) It could be that some of the external calls to this server are not properly catching exceptions and thus the server side is returning a OK status when things are not really OK and thus the fault is not in the client at all.. how would we investigate this? |
I don't think the symptoms match what would happen if the MetaCat call was failing. If the file being uploaded doesn't exist in MetaCat then that should prevent the upload (this may still not be working yet due to issue #6058), but it should kill the upload entirely rather than allow the file to be uploaded but not registered. Conversely, if the check returned success when it should have returned failure, Rucio will just go ahead and upload the file as normal. The MetaCat entry will be missing but this shouldn't have any effect on the replica registration in Rucio. |
What if the call to metacat times out and returns neither success nor failure what happens then?
In our latest run of 22000 jobs we saw only one failure where the replica was not created even though the file was there.. see the log at
***@***.***
We are trying to get the server logs for that instant but it may be tricky.
Steve
…________________________________
From: jamesp-epcc ***@***.***>
Sent: Monday, April 29, 2024 3:34 AM
To: rucio/rucio ***@***.***>
Cc: Steven C Timm ***@***.***>; Author ***@***.***>
Subject: Re: [rucio/rucio] upload client sometimes misses either declaring the file replica or declaring the file to a dataset (Issue #6670)
[EXTERNAL] – This message is from an external sender
I don't think the symptoms match what would happen if the MetaCat call was failing. If the file being uploaded doesn't exist in MetaCat then that should prevent the upload (this may still not be working yet due to issue #6058<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_rucio_rucio_issues_6058&d=DwMFaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=10BCTK25QMgkMYibLRbpYg&m=kiXYwvai2HxR8p6HPMvSdP-sUbXzebFsW8nwQ1mNFHDDrHMQISgFq1utpKWd2VUz&s=gI2x99_ZQfHxRJUJYh8qpRTtcGmJJZ20gGR7Bdaz-aI&e=>), but it should kill the upload entirely rather than allow the file to be uploaded but not registered. Conversely, if the check returned success when it should have returned failure, Rucio will just go ahead and upload the file as normal. The MetaCat entry will be missing but this shouldn't have any effect on the replica registration in Rucio.
—
Reply to this email directly, view it on GitHub<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_rucio_rucio_issues_6670-23issuecomment-2D2082157433&d=DwMFaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=10BCTK25QMgkMYibLRbpYg&m=kiXYwvai2HxR8p6HPMvSdP-sUbXzebFsW8nwQ1mNFHDDrHMQISgFq1utpKWd2VUz&s=xoY_v_s3ovn0_e3QJp7W3becJkxMTExjgTczrAP0VWw&e=>, or unsubscribe<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AGG4SOECUDGMRRVKASJXB4DY7YAX5AVCNFSM6AAAAABGCPFHKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBSGE2TONBTGM&d=DwMFaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=10BCTK25QMgkMYibLRbpYg&m=kiXYwvai2HxR8p6HPMvSdP-sUbXzebFsW8nwQ1mNFHDDrHMQISgFq1utpKWd2VUz&s=PbxQfEmhcP4Sme0zYfe24Rbf2ZQQ9wY2CGwCfSMpumQ&e=>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
I'm not certain, but most likely either an empty list would be returned causing the permission check to fail, or an exception would be thrown in the MetaCat API library which would be caught somewhere in the Rucio server code and propagated back to the client as an internal server error, or at the very least would appear in the server logs. |
I have a capture of a log where this happened and all server logs from that time but I am not seeing any fails This is from our job log Steven Timm OK now I have isolated the failure mode to the following: (apologies that this got cut into and out of mattermost) In any case I have 9 rucio server pods active right now, I have searched through all of the logs for this file name, anything with the "Failed" error message What should I be looking for? Lines 288-298 of the rucio upload client appear to be where the code must be failing but I can't find any error message like that either in the Rucio-server logs or the httpd-error-logs log What am I missing.. what should I be looking for? It appears that for most uploads we don't see the name of the file we are uploading in the rucio server logs Suggestions? What should we be looking for? These files will rotate off of the server pretty soon.. I have them available in Elasticsearch going farther back but don't know what to search for.. especially since the upload client is returning zero there may be no error messages to be found. |
I have a couple of questions.
|
(1) Yes--in this latest test we have four files missing replicas and if you do rucio list-file-replicas --missing they all show in state U. (2) So here is the stderr of the corresponding stdout I posted above: |
DEBUG:root:Num. of files that upload client is processing: 1 |
So yes here in this case it did indeed fail to update the replica state for the file. |
Looking more carefully at lines 277-304 it seems clear that if either the self.client.update_replicas_state call on line 294 fails (which in this case it did) or the self.client.attach call on line 302 failed (which it didn't in this case but has at other times) the client is still going to count that uploaded file as successful, and if all files in the upload are successful ( in this case only 1) will return 0 at line 337. This is a bug.. if the replica is not declared/updated this leads to dark data. We need this fixed. |
Do you still think there's anything to be found on the server side? I am not sure what I am looking for. the dids are often truncated in the Rucio-server logs. the server logs are going to rotate out pretty soon if they haven't already so time is of the essence. |
To be clear, it would really be nice to have a rucio server that doesn't timeout, but if it must timeout then we have to catch it in the client and have a way to fix the damage that's caused. |
Yes, I think in this case the client should return non-zero. Specifically, I'm a bit concerned that maybe there is something else going on as well, because the original log doesn't show the |
The earlier client log snippet that I pasted above was going to RAL-PP where we have a noRename implementation so the flow of things could have been slightly different. The one from 1 May was going to NIKHEF where the file does get renamed. Or so I think. |
Eventually we have to also get to the bigger problem of why is the server timing out? These 4 timeouts out of 27500 files processed happened with no more than 3000 simultaneous jobs running at any point on Wednesday 5/1. With 10K jobs running we see timeouts in the hundreds and that is the job level we have to get to eventually. |
From the code it looks as though the replica state should initially be set to C (copying), then changed to A (available) if the upload succeeds, so I am not sure where the U is coming from. However it also looks as if I will put in a PR to ensure that failures to update the replica state get flagged, and similarly for the call to attach the file to the dataset, which can also fail right now and still have the command return 0. |
Indeed--we have many documented instances where the call to attach the file to the dataset also has failed.. these are much more easy to recover afterwards. |
Backport 32.8.2 |
Description
On several occasions DUNE has observed that the uploadclient.py. of rucio v33.3 (as called in our utility justin-rucio-upload)
https://github.com/DUNE/dune-justin/blob/main/commands/justin-rucio-upload
either fails to register the file replica, or fails to attach the DID to the requested dataset.
This tends to happen when the rucio server is under high load.
6093.90.justin-prod-sched02.dune.hep.ac.uk.txt
The snippet of the debug output generated from the rucio upload client is below.
In short what you see is 5 attempts at rucio upload--3 to QMUL and 2 to RAL-PP the last of which
finally showed a success. the QMUL RSE was very busy but we were also getting 503 errors against the
server at the time.
But what actually happened under the covers is that the file in question was attached to the dataset
but the replica was not declared to rucio.
A layman's review of uploadclient.py seems to indicate that there are very few if any checks for exceptions in the several lower level tasks that this client does.. i.e. there is an attach to the DID and no attached method to handle the exception--so it's quite possible that the 2 tasks are not atomic--what needs to be done is either to make the client code retry if it fails, or return an error if one of those two fails.
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2ba369504ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:02:59,967 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:03:22,081 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:03:37,845 WARNING Waiting 0.5s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:03:53,947 WARNING Waiting 1.0s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b17ceb98ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:04:24,097 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:04:39,902 WARNING Waiting 0.5s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:04:55,837 WARNING Waiting 1.0s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b674216bca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:05:21,496 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 200 1192
DEBUG:root:Input validation done.
INFO:root:Preparing upload for file nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL/attr/ HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:06:07,145 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL/att
r/ HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:06:22,830 WARNING Waiting 0.5s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (4): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL/attr/ HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:06:38,838 WARNING Waiting 1.0s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
WARNING:root:Attributes of the RSE: QMUL not available.
DEBUG:root:wan domain is used for the upload
DEBUG:root:Registering file
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b674216bdf0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /accounts/dunepro/scopes/ HTTP/1.1" 200 67
DEBUG:root:Trying to create dataset: fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/fardet-vd/fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 HTTP/1.1" 409 104
INFO:root:Dataset fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 already exists - no rule will be created
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/fardet-vd/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root/meta?plugin=DID_COLUMN HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:07:43,747 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (5): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/fardet-vd/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root/meta?plugin=DID_COLUMN HTTP/1.1" 404 238
DEBUG:root:File DID does not exist
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b674344dca0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:08:09,206 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:08:24,971 WARNING Waiting 0.5s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:08:40,984 WARNING Waiting 1.0s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b77fa844ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:09:04,890 WARNING Waiting 0.25s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.
gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:09:20,654 WARNING Waiting 0.5s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:09:36,591 WARNING Waiting 1.0s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWr:
apper object at 0x2b9c7f6e6ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP HTTP/1.1" 200 1634
DEBUG:root:Input validation done.
INFO:root:Preparing upload for file nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP/attr/ HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:10:18,472 WARNING Waiting 0.25s due to reason: ser:
ver returned 503 ESC[0m
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP/attr/ HTTP/1.1" 503 299
ESC[33;1m2024-04-07 00:10:34,158 WARNING Waiting 0.5s due to reason: server returned 503 ESC[0m
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP/attr/ HTTP/1.1" 200 350
DEBUG:root:wan domain is used for the upload
DEBUG:root:Registering file
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b9c809c78b0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /accounts/dunepro/scopes/ HTTP/1.1" 200 678
DEBUG:root:Trying to create dataset: fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/fardet-vd/fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 HTTP/1.1" 409 104
INFO:root:Dataset fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 already exists - no rule will be created
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/fardet-vd/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root/meta?plugin=DID_COLUMN HTTP/1.1" 404 238
DEBUG:root:File DID does not exist
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 201 7
INFO:root:Successfully added replica in Rucio catalogue at RAL-PP
DEBUG:root:gfal.NoRename: connecting to storage
DEBUG:root:gfal.NoRename: checking if file exists None
DEBUG:root:Checking if root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root exists
DEBUG:root:gfal.NoRename: checking if file exists root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z
_reco2_ana.root
DEBUG:root:gfal.NoRename: closing protocol connection
DEBUG:root:[{'hostname': 'mover.pp.rl.ac.uk', 'scheme': 'root', 'port': 1094, 'prefix': '/pnfs/pp.rl.ac.uk/data/dune', 'impl': 'rucio.rse.protocols.gfal.NoRename', 'domains': {'lan': {'read': 1, 'write': 2, 'delete': 2}, 'wan': {'read': 1, 'write': 2, 'delete': 2, 'third_party_copy_read': 10, 'third_party_copy_write': 10}}, 'extended_attributes': None}, {'hostname': 'heplnx204.pp.rl.ac.uk', 'scheme': 'srm', 'port': 8443, 'prefix': '/pnfs/pp.rl.ac.uk/data/dune', 'impl': 'rucio.rse.protocols.gfal.Default', 'domains': {'lan': {'read': 0, 'write': 0, 'delete': 0}, 'wan': {'read': 3, 'write': 2, 'delete': 3, 'third_party_copy_read': 0, 'third_party_copy_write': 0}}, 'extended_attributes': {'space_token': 'DATADISK', 'web_service_path': '/srm/managerv2?SFN='}}, {'hostname': 'mover.pp.rl.ac.uk', 'scheme': 'davs', 'port': 2880, 'prefix': '/pnfs/pp.rl.ac.uk/data/dune', 'impl': 'rucio.rse.protocols.gfal.NoRename', 'domains': {'lan': {'read': 2, 'write': 1, 'delete': 1}, 'wan': {'read': 2, 'write': 1, 'delete': 1, 'third_party_copy_read': 1, 'third_party_copy_write': 1}}, 'extended_attributes': None}]
INFO:root:Trying upload with davs to RAL-PP
DEBUG:root:Processing upload with the domain: wan
DEBUG:root:gfal.NoRename: connecting to storage
DEBUG:root:The PFN created from the LFN: davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:gfal.NoRename: checking if file exists davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:gfal.NoRename: checking if file exists davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:put: Attempt 1
DEBUG:root:gfal.NoRename: uploading file from nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root to davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
INFO:root:Successful upload of temporary file. davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:skip_upload_stat=False
DEBUG:root:stat: pfn=davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:gfal.NoRename: getting stats of file davs://mover.pp.rl.ac.uk:2880/pn
fs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:Filesize: Expected=11222972 Found=11222972
DEBUG:root:Checksum: Expected=87c05d33 Found=87c05d33
DEBUG:root:gfal.NoRename: closing protocol connection
DEBUG:root:Upload done.
INFO:root:Successfully uploaded file nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v33_3_0/NULL/lib/python3.9/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'dune-rucio.fnal.gov'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /traces/ HTTP/1.1" 404 207
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b9c7f6e6df0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "PUT /replicas HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/fardet-vd/fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779/dids HTTP/1.1" 201 7
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
User: dunepro
Expires: Sun Apr 14 00:13:07 2024
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
User: dunepro
Expires: Sun Apr 14 00:13:23 2024
{
"checksums": {
"adler32": "2b80a75b"
},
"created_timestamp": null,
"creator": "dunepro",
Steps to reproduce
Rucio Version
client v33.3 server v33.3 and v34.
Additional Information
This behavior happens rucio v33_3 client python version 3.9 under both SL7 and alma9
The text was updated successfully, but these errors were encountered: