
parsing from_root can be slow #23

Open
lukasheinrich opened this issue Jun 11, 2019 · 9 comments

Comments

@lukasheinrich

lukasheinrich commented Jun 11, 2019

Hi,

this is an excellent library. I am incorporating it into my ATLAS analysis code, together with @jpivarski's uproot and awkward, as a drop-in replacement for TTree::Draw.

I noticed that some expressions can be slow to parse; e.g. this takes almost 2 seconds:

$> cat slow.py 
import formulate
formulate.from_root('((weight * (n_mu > 0)) * ((tt_cat==0 || tt_cat==3 || tt_cat==6)))')
$> time python slow.py 
real	0m1.803s
user	0m1.995s
sys	0m0.094s

is there any way this can be sped up?

Thanks,
Lukas

Note: it's not slow in an absolute sense, of course, but if you perform such an operation very often the cost accumulates quickly, and I was surprised that parsing such an expression would have a noticeable time cost.

@jpivarski
Member

At the risk of piling yet more on this, any thoughts about building it on a professional parser like Lark? It's a joy to use (see my tutorials on it) and its LALR parser is reportedly fast. Simple languages like NumExpr and TTree::Draw should be easy to express in the non-ambiguous rules that LALR requires.

I could get involved in this, particularly to support TTree::Draw's looping constructs with an awkward-array backend. I don't know when that would happen, but it's something I might want to help with someday. I've heard that TTree::Draw has about 100 unit tests in the ROOT testing suite, so there's a good place to start. At some point, I do have to integrate NumExpr with awkward-array better, and this could be part of that.

@lukasheinrich
Author

lukasheinrich commented Jun 11, 2019

@jpivarski yes I'm also happy to get involved here (i.e. in providing a TTree::Draw drop-in replacement). Do you have a pointer to the TTree::Draw test suite? One of the beauties of using formulate is that it can return a list of variables it needs, and thus branches that need to be read.

I'm currently using cachetools to pay the parsing cost only once:

import cachetools
import formulate

@cachetools.cached(cache={})
def expression_from_root(expr):
    return formulate.from_root(expr)

and then do something like

vars_a = expression_from_root(cut_weight).variables
vars_b = expression_from_root(var_name).variables
branches = vars_a.union(vars_b)

def stringy_observable(table,**kwargs):
    observable = kwargs.get('var_name')
    cut_weight = kwargs.get('cut_weight')

    observable = expression_from_root(observable).to_numexpr()
    cut_weight = expression_from_root(cut_weight).to_numexpr()

    observable = ne.evaluate(observable,table)
    cut_weight = ne.evaluate(cut_weight,table)
    return observable, cut_weight

def process_chunk(chunk):
    table = aatable.Table(chunk)
    observable, weights = stringy_observable(table, var_name = var_name, cut_weight = cut_weight)
    chunk_histo, chunk_edges = np.histogram(observable, weights = weights, bins = bin_edges)
    return chunk_histo, chunk_edges

# map
chunk_results = [
    process_chunk(chunk) for chunk in uproot.iterate(
        fname,
        branches = branches,
        treepath = treename,
        namedecode="utf-8",
)] # possible parallelization here

# reduce
histo, edges = None, None
for chunk_histo, chunk_edges in chunk_results:
    histo = chunk_histo if histo is None else histo + chunk_histo
    edges = chunk_edges

which is more or less equivalent to TTree::Draw('<observable>','<cuts>')

@jpivarski
Member

Until I saw that you're caching the parsed result, it didn't sink in what "slow" meant. 1.8 sec to parse a 65 character string is unbelievably slow. I ran it in timeit to get more statistics and the mean is 0.8 sec, still unbelievably slow. There's something seriously wrong.

I ran the profiler and saw a lot of pyparsing (in the parseImpl function). Sorry I suggested earlier a "professional parser like Lark" because I hadn't noticed that you're using a professional parser. PyParsing has some performance tips, but I don't know—this is really surprising!
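For reference, one of PyParsing's documented performance tips is packrat memoization, which must be enabled before the grammar is built. A generic sketch with an illustrative `infixNotation` grammar (not formulate's actual grammar):

```python
# Generic sketch of PyParsing's packrat memoization performance tip,
# enabled before the grammar is constructed.
import pyparsing as pp

pp.ParserElement.enablePackrat()

operand = pp.pyparsing_common.number | pp.pyparsing_common.identifier
expr = pp.infixNotation(
    operand,
    [
        (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
        (pp.oneOf("== != < >"), 2, pp.opAssoc.LEFT),
        (pp.oneOf("&& ||"), 2, pp.opAssoc.LEFT),
    ],
)
result = expr.parseString("(weight * (n_mu > 0)) && (tt_cat == 0 || tt_cat == 3)")
print(result)
```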

@jpivarski
Member

You know what? I bet it's recreating the parser every time it parses. I know there's a "cached" something in the profile output, but I'd find this time understandable if we're looking at the time to parse the grammar, create a state machine, and then parse the string, rather than starting with a state machine. I don't think it's necessary to save the state machine to avoid 1 second during the import as PLY does, but it should keep the state machine for the duration of the Python process.

@jonas-eschle
Contributor

Do we have any news on this, or ideas on how to proceed? Unfortunately, if the string to be parsed gets long, the parse time blows up; e.g.

(abs( TCKCat<0 ) * B_plus_Hlt1TrackMVADecision_TOS + abs( ~ ( TCKCat<0 ) ) * ((e_plus_Hlt1TrackMVADecision_TOS) & (log(e_plus_IPCHI2_OWNPV) > (1.0/((e_plus_TRACK_PT * 1000.0-1.) ** 2) + ((abs( TCKCat==1 ) * 2.3 + abs( ~ ( TCKCat==1 ) ) * (abs( TCKCat==0 | TCKCat==2 ) * 1.1 + abs( ~ ( TCKCat==0 | TCKCat==2 ) ) * -9999999.9))/25) * (25-e_plus_TRACK_PT * 1000.0) + log(7.4) ))) | ((e_minus_Hlt1TrackMVADecision_TOS) & (log(e_minus_IPCHI2_OWNPV) > (1.0/((e_minus_TRACK_PT * 1000.0-1.) ** 2) + ((abs( TCKCat==1 ) * 2.3 + abs( ~ ( TCKCat==1 ) ) * (abs( TCKCat==0 | TCKCat==2 ) * 1.1 + abs( ~ ( TCKCat==0 | TCKCat==2 ) ) * -9999999.9))/25) * (25-e_minus_TRACK_PT * 1000.0) + log(7.4) ))) | ((K_Kst_Hlt1TrackMVADecision_TOS) & (log(K_Kst_IPCHI2_OWNPV) > (1.0/((K_Kst_TRACK_PT * 1000.0-1.) ** 2) + ((abs( TCKCat==1 ) * 2.3 + abs( ~ ( TCKCat==1 ) ) * (abs( TCKCat==0 | TCKCat==2 ) * 1.1 + abs( ~ ( TCKCat==0 | TCKCat==2 ) ) * -9999999.9))/25) * (25-K_Kst_TRACK_PT * 1000.0) + log(7.4) ))))

is unparseable.

@jpivarski
Member

The fundamental problem is that this project doesn't have a maintainer. (Correct me if I'm wrong!) There's no reason why the above string should run a parser out of resources, whether that's memory or a reasonably small amount of time. I don't know PyParsing all that well, but ply is fast and Lark is fast and modern.

Last year, I gave a suite of tutorials on parsing, interpreting, and code transformations (using Lark):

https://github.com/jpivarski-talks/2019-05-06-adl-language-tools

These are fairly simple languages—basically calculator formulae—if anyone's willing to take it on, I can help you get started.

@eduardo-rodrigues
Member

Let's see what @chrisburr has to say on support and maintenance ...

FWIW, I bet Lark will be faster as a parser, yes. It does a very good job in DecayLanguage, parsing a master decay file with 11,000 lines of decay descriptions in about 1.5 seconds and storing all parsed info in a structured dictionary.

@jpivarski
Member

It also gives you the option between parsers with different time complexity. (One is faster but less flexible. For languages like the ones this project targets, the faster one would be just as easy to use.)

@eduardo-rodrigues
Member

Yes, totally agree.
