bytesread in metrics varies depending on file source and disagrees with pure uproot #717

Open
alexander-held opened this issue Aug 29, 2022 · 2 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

@alexander-held
Contributor

alexander-held commented Aug 29, 2022

Describe the bug
The bytesread metric differs depending on whether a file is processed locally or read over https. In both cases it also differs from what pure uproot reports.

To Reproduce

import urllib.request
from coffea import processor
import uproot

file_local = "data.root"
file_remote = "https://xrootd-local.unl.edu:1094//store/user/AGC/datasets/"\
    "RunIIFall15MiniAODv2/TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM/"\
    "PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1/00000/"\
    "00DF0A73-17C2-E511-B086-E41D2D08DE30.root"

# download file
urllib.request.urlretrieve(file_remote, file_local)

class TtbarAnalysis(processor.ProcessorABC):
    def process(self, events):
        # access three branches so that they are read; the results are discarded
        events["jet_pt"]**2
        events["jet_eta"]**2
        events["jet_phi"]**2
        return {}

    def postprocess(self, accumulator):
        return accumulator

# coffea with local file + https
for fileset, method in zip([{"ttbar": [file_local]}, {"ttbar": [file_remote]}], ["local", "https"]):
    executor = processor.IterativeExecutor()
    run = processor.Runner(executor=executor, savemetrics=True)
    _, metrics = run(fileset, "events", processor_instance=TtbarAnalysis())
    print(f"data read (coffea {method}): {metrics['bytesread']/1000**2} MB")

# uproot with local file + https
for filename, method in zip([file_local, file_remote], ["local", "https"]):
    f = uproot.open(filename)
    f['events'].arrays(["jet_pt", "jet_eta", "jet_phi"])
    print(f"data read (uproot {method}): {f.file.source.num_requested_bytes/1000**2} MB")

Expected behavior
All four numbers should match.

Output

Preprocessing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:00 < 0:00:00 | ? file/s ]
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:00 < 0:00:00 | ? chunk/s ]
data read (coffea local): 9.887013 MB
Preprocessing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:03 < 0:00:00 | ? file/s ]
Processing 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:11 < 0:00:00 | ? chunk/s ]
data read (coffea https): 1.704775 MB
data read (uproot local): 5.001088 MB
data read (uproot https): 5.001088 MB
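
To make the size of the disagreement concrete, the reported numbers can be expressed relative to the uproot value. This is only arithmetic on the four figures printed above; the dictionary name and keys are made up for illustration.

```python
# Numbers copied from the output above, in MB.
reported_mb = {
    "coffea local": 9.887013,
    "coffea https": 1.704775,
    "uproot local": 5.001088,
    "uproot https": 5.001088,
}

# Use the uproot local value as the reference point.
baseline = reported_mb["uproot local"]
ratios = {name: value / baseline for name, value in reported_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f} x uproot local")
```

The coffea local value is close to twice the uproot value, while the coffea https value is roughly a third of it.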

Desktop (please complete the following information):
coffea 0.7.16, uproot 4.3.3

Additional context
n/a

@nsmith-
Member

nsmith- commented Jul 31, 2023

Coffea's bytesread is taken directly from the same uproot counter:

metrics["bytesread"] = file.file.source.num_requested_bytes

so there must be some issue with how and when we are accessing this information. Maybe a shared source object?
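
A minimal sketch of why "how and when" matters for a running counter: if the counter accumulates over the lifetime of a source object, the number reported depends on when it is sampled and on whether several pieces of work share that object. FakeSource and bytes_requested below are hypothetical helpers and do not reflect uproot's or coffea's implementation; only the attribute name num_requested_bytes comes from this thread.

```python
from contextlib import contextmanager


class FakeSource:
    """Hypothetical stand-in for a file source; not uproot's implementation."""

    def __init__(self):
        self.num_requested_bytes = 0

    def request(self, nbytes):
        # Every request adds to one running total for the lifetime of the source.
        self.num_requested_bytes += nbytes


@contextmanager
def bytes_requested(source):
    """Record how many bytes were requested from `source` inside the block."""
    result = {"before": source.num_requested_bytes, "delta": None}
    try:
        yield result
    finally:
        result["delta"] = source.num_requested_bytes - result["before"]


source = FakeSource()
source.request(300)  # e.g. metadata read when the file is opened

with bytes_requested(source) as first:
    source.request(1000)  # first chunk of work

with bytes_requested(source) as second:
    source.request(1000)  # second chunk, same shared source

print("chunk deltas:", first["delta"], second["delta"])
print("running total:", source.num_requested_bytes)
```

Sampling the running total after the second chunk would attribute the open-time read and the first chunk to it as well, whereas the per-block delta does not. Since the context manager only relies on a num_requested_bytes attribute, it could in principle wrap work done on a real source object too, though that has not been checked here.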

lgray added the question and discussion labels, then removed the discussion label, on Dec 6, 2023
@lgray
Collaborator

lgray commented Jan 20, 2024

Do we still care about this, @alexander-held @nsmith-? I think we narrowed this down to retries / re-requesting data when reading over xrootd/http, so the amount of data read is not deterministic.
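
As a toy illustration of the retry explanation: if every attempt is counted, including attempts that fail and are repeated, the requested-bytes total depends on how many retries happened rather than only on the data delivered. RetryingSource and total_requested are hypothetical names, the failure pattern is scripted so the example is reproducible, and whether the real counter includes retried requests is an assumption here.

```python
class RetryingSource:
    """Hypothetical remote source that counts every request, including retries."""

    def __init__(self, failures_before_success):
        # Scripted number of failed attempts per read, to keep the demo deterministic.
        self._failures = list(failures_before_success)
        self.num_requested_bytes = 0

    def read(self, nbytes):
        failures = self._failures.pop(0) if self._failures else 0
        # Each failed attempt is counted, followed by the one successful attempt.
        for _ in range(failures + 1):
            self.num_requested_bytes += nbytes
        return nbytes


def total_requested(failure_script, chunk_sizes):
    """Return (bytes delivered, bytes requested) for one scripted run."""
    source = RetryingSource(failure_script)
    delivered = sum(source.read(size) for size in chunk_sizes)
    return delivered, source.num_requested_bytes


chunks = [1000, 2000, 3000]
clean = total_requested([0, 0, 0], chunks)
flaky = total_requested([1, 0, 2], chunks)
print("no retries:", clean)
print("with retries:", flaky)
```

Both runs deliver the same data, but the requested total differs with the retry pattern, so two runs over the network need not report the same number.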
