Find recursively "pure input" cells #142

nopria · 2024-05-05T11:30:36Z

I'm trying to disclose the computing structure of an excel file in order to reproduce and perform those computation from code. Basically I'd like to have a function able to find all the dependencies of a cell:

if the cell does not contain a formula the function should return its value
if the cell contains a formula the function should recursively find all the dependencies down to the "pure input" cells, and possibly write a simple pseudo-code showing how to compute the cell from the values of just "pure input" cells

Once you know all the "pure input" cells, computing the value of a formula cell is easy with calculate method:

xl_model.calculate(
    inputs={"'[test.xlsx]WS'!M11":1.35, ... },
    outputs=["'[test.xlsx]WS'!M15"] # To define the outputs that you want to calculate.
)

The initilization is easy, formulas does a great job at creating a model and all information seem to be there:

import formulas
xl_model = formulas.ExcelModel().loads("test.xlsx").finish()
xl_model.calculate()
xl_model.to_dict()

When the inputs method of a cell returns None (as far as I understand) the cell does NOT contain a formula (like "pure input" cells), but as soon as I look at a cell containing a formula

c_M15 = xl_model.cells.get("'[test.xlsx]WS'!M15")
c_M15.inputs # OrderedDict([("'[test.xlsx]WS'!M10:N14", ["'[test.xlsx]WS'!M10:N14"])])

or

c_M10 = xl_model.cells.get("'[test.xlsx]WS'!M10")
c_M10.inputs # OrderedDict([("'[test.xlsx]WS'!H10", ["'[test.xlsx]WS'!H10"]), ("'[test.xlsx]WS'!K10", ["'[test.xlsx]WS'!K10"])])

I could not understand if and how I can use formulas methods to:

find the list of all cells used in a formula (so that I can recursively repeat the analysis)
find the mathematical relations between the cells of a formula, so I can convert it to some pseudo-code

Thanks in advance for your attention.

The text was updated successfully, but these errors were encountered:

dberardo-com · 2024-05-05T12:54:40Z

i dont know if this could be related to any of those: #119 #131

nopria · 2024-05-05T16:07:22Z

The script in #119 seems to go in the right direction, although, as already noted, does not work with ranges. In my case, it correctly returns

>>> get_one_line_formula(xl_model, "'[test.xlsx]WS'!M15")
'SUM((0.22 / 0.6):N14)'

With few changes and an helper function

import re

def expand_range(range_str,xl_model):
    '''
    Convert a string representing an Excel range like "'[test.xlsx]WS'!B3:D8" into a list of strings representing all single Excel cells of the range, filtering out cells unused in Excel document model `xl_model`.
    '''
    match = re.match(r"'([^']+)'\!([A-Z]+)(\d+):([A-Z]+)(\d+)", range_str)
    if match:
        sheet_name, col_start, row_start, col_end, row_end = match.groups()
        col_start_num = ord(col_start) - 65
        col_end_num = ord(col_end) - 65
        num_rows = int(row_end) - int(row_start) + 1
        cell_list = ["'{}'!{}{}".format(sheet_name, chr(65 + col), row)
                        for row in range(int(row_start), int(row_start) + num_rows)
                        for col in range(col_start_num, col_end_num + 1)]
         # cells really used in Excel document are only those in xl_model.dsp.dmap.pred dictionary keys
        return [c for c in cell_list if c in xl_model.dsp.dmap.pred]
    else:
        return None

import functools
import schedula as sh
@functools.lru_cache()
def pure_input_cells(xl_model, node):
    '''
    Knowing the model of an Excel document, find recursively the formula expressed in terms of "pure input" cells of a given formula.
    '''
    rl = expand_range(node,xl_model) # list of cells of a range, or None if argument is not a range
    if rl is not None:
        # even if in Excel parameters are separated by ';', in `formulas` model the separation character is ','
        return ','.join(pure_input_cells(xl_model, rc) for rc in rl if rc in xl_model.dsp.dmap.pred)
    try:
        formula = next(iter(xl_model.dsp.dmap.pred[node]))
    except StopIteration:
        return node
    for k in sorted(xl_model.dsp.dmap.pred[formula], key=lambda x: (len(x), x), reverse=True):
        if not isinstance(k, sh.Token):
            formula = formula.replace(k, pure_input_cells(xl_model, k))
    return formula[1:]

I was able to make it returns

>>> pure_input_cells(xl_model, "'[test.xlsx]WS'!M15")
"SUM(('[test.xlsx]WS'!K10 / '[test.xlsx]WS'!H10),'[test.xlsx]WS'!M11,('[test.xlsx]WS'!H12 * '[test.xlsx]WS'!K12))"

which express the formula of the input cell in terms of "pure input" cells expanding the range.

I hope a similar functionality will be added to the library core.

nopria mentioned this issue May 5, 2024

Get one liner formula from graph node #119

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Find recursively "pure input" cells #142

Find recursively "pure input" cells #142

nopria commented May 5, 2024 •

edited

dberardo-com commented May 5, 2024

nopria commented May 5, 2024 •

edited

Find recursively "pure input" cells #142

Find recursively "pure input" cells #142

Comments

nopria commented May 5, 2024 • edited

dberardo-com commented May 5, 2024

nopria commented May 5, 2024 • edited

nopria commented May 5, 2024 •

edited

nopria commented May 5, 2024 •

edited