Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find recursively "pure input" cells #142

Open
nopria opened this issue May 5, 2024 · 2 comments
Open

Find recursively "pure input" cells #142

nopria opened this issue May 5, 2024 · 2 comments

Comments

@nopria
Copy link

nopria commented May 5, 2024

I'm trying to disclose the computing structure of an excel file in order to reproduce and perform those computation from code. Basically I'd like to have a function able to find all the dependencies of a cell:

  • if the cell does not contain a formula the function should return its value
  • if the cell contains a formula the function should recursively find all the dependencies down to the "pure input" cells, and possibly write a simple pseudo-code showing how to compute the cell from the values of just "pure input" cells

Once you know all the "pure input" cells, computing the value of a formula cell is easy with calculate method:

xl_model.calculate(
    inputs={"'[test.xlsx]WS'!M11":1.35, ... },
    outputs=["'[test.xlsx]WS'!M15"] # To define the outputs that you want to calculate.
)

The initilization is easy, formulas does a great job at creating a model and all information seem to be there:

import formulas
xl_model = formulas.ExcelModel().loads("test.xlsx").finish()
xl_model.calculate()
xl_model.to_dict()

When the inputs method of a cell returns None (as far as I understand) the cell does NOT contain a formula (like "pure input" cells), but as soon as I look at a cell containing a formula

c_M15 = xl_model.cells.get("'[test.xlsx]WS'!M15")
c_M15.inputs # OrderedDict([("'[test.xlsx]WS'!M10:N14", ["'[test.xlsx]WS'!M10:N14"])])

or

c_M10 = xl_model.cells.get("'[test.xlsx]WS'!M10")
c_M10.inputs # OrderedDict([("'[test.xlsx]WS'!H10", ["'[test.xlsx]WS'!H10"]), ("'[test.xlsx]WS'!K10", ["'[test.xlsx]WS'!K10"])])

I could not understand if and how I can use formulas methods to:

  • find the list of all cells used in a formula (so that I can recursively repeat the analysis)
  • find the mathematical relations between the cells of a formula, so I can convert it to some pseudo-code

Thanks in advance for your attention.

@dberardo-com
Copy link

i dont know if this could be related to any of those: #119 #131

@nopria
Copy link
Author

nopria commented May 5, 2024

The script in #119 seems to go in the right direction, although, as already noted, does not work with ranges. In my case, it correctly returns

>>> get_one_line_formula(xl_model, "'[test.xlsx]WS'!M15")
'SUM((0.22 / 0.6):N14)'

With few changes and an helper function

import re

def expand_range(range_str,xl_model):
    '''
    Convert a string representing an Excel range like "'[test.xlsx]WS'!B3:D8" into a list of strings representing all single Excel cells of the range, filtering out cells unused in Excel document model `xl_model`.
    '''
    match = re.match(r"'([^']+)'\!([A-Z]+)(\d+):([A-Z]+)(\d+)", range_str)
    if match:
        sheet_name, col_start, row_start, col_end, row_end = match.groups()
        col_start_num = ord(col_start) - 65
        col_end_num = ord(col_end) - 65
        num_rows = int(row_end) - int(row_start) + 1
        cell_list = ["'{}'!{}{}".format(sheet_name, chr(65 + col), row)
                        for row in range(int(row_start), int(row_start) + num_rows)
                        for col in range(col_start_num, col_end_num + 1)]
         # cells really used in Excel document are only those in xl_model.dsp.dmap.pred dictionary keys
        return [c for c in cell_list if c in xl_model.dsp.dmap.pred]
    else:
        return None

import functools
import schedula as sh
@functools.lru_cache()
def pure_input_cells(xl_model, node):
    '''
    Knowing the model of an Excel document, find recursively the formula expressed in terms of "pure input" cells of a given formula.
    '''
    rl = expand_range(node,xl_model) # list of cells of a range, or None if argument is not a range
    if rl is not None:
        # even if in Excel parameters are separated by ';', in `formulas` model the separation character is ','
        return ','.join(pure_input_cells(xl_model, rc) for rc in rl if rc in xl_model.dsp.dmap.pred)
    try:
        formula = next(iter(xl_model.dsp.dmap.pred[node]))
    except StopIteration:
        return node
    for k in sorted(xl_model.dsp.dmap.pred[formula], key=lambda x: (len(x), x), reverse=True):
        if not isinstance(k, sh.Token):
            formula = formula.replace(k, pure_input_cells(xl_model, k))
    return formula[1:]

I was able to make it returns

>>> pure_input_cells(xl_model, "'[test.xlsx]WS'!M15")
"SUM(('[test.xlsx]WS'!K10 / '[test.xlsx]WS'!H10),'[test.xlsx]WS'!M11,('[test.xlsx]WS'!H12 * '[test.xlsx]WS'!K12))"

which express the formula of the input cell in terms of "pure input" cells expanding the range.

I hope a similar functionality will be added to the library core.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants