pandas.eval() for massive datasets? #20

bbartling · 2024-02-03T18:14:11Z

Current method with apply in pandas:

def apply(self, df: pd.DataFrame) -> pd.DataFrame:
    # Existing checks
    df['static_check_'] = (
        df[self.duct_static_col] < df[self.duct_static_setpoint_col] - self.duct_static_inches_err_thres)
    df['fan_check_'] = (
        df[self.supply_vfd_speed_col] >= self.vfd_speed_percent_max - self.vfd_speed_percent_err_thres)

    # Combined condition check
    df["combined_check"] = df['static_check_'] & df['fan_check_']

    # Rolling sum to count consecutive trues
    rolling_sum = df["combined_check"].rolling(window=5).sum()
    # Set flag to 1 if rolling sum equals the window size (5)
    df["fc1_flag"] = (rolling_sum == 5).astype(int)

    return df
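
For reference, a small sketch of how the rolling-sum flag in the method above behaves. The boolean series here is made up and only stands in for `combined_check`; the window size of 5 matches the code above.

```python
import pandas as pd

# Hypothetical values standing in for df["combined_check"].
combined = pd.Series([True, True, True, True, True, True, False, True])

# Rows without a full 5-sample window get NaN from the rolling sum,
# and NaN == 5 is False, so the flag stays 0 there.
rolling_sum = combined.rolling(window=5).sum()
flag = (rolling_sum == 5).astype(int)

print(flag.tolist())  # [0, 0, 0, 0, 1, 1, 0, 0]
```

The flag is raised only on rows where the current sample and the four before it are all true, so a single false sample clears it for the next five rows.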

Use eval?

def apply_with_eval(self, df: pd.DataFrame) -> pd.DataFrame:
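    # Use eval for the simple comparisons. Column names are interpolated into the
    # expression (backticks allow any name); @ injects the value of a local
    # variable, so it is used only for the scalar thresholds.
    static_err = self.duct_static_inches_err_thres
    fan_min = self.vfd_speed_percent_max - self.vfd_speed_percent_err_thres
    df.eval(f'static_check_ = `{self.duct_static_col}` < (`{self.duct_static_setpoint_col}` - @static_err)', inplace=True)
    df.eval(f'fan_check_ = `{self.supply_vfd_speed_col}` >= @fan_min', inplace=True)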

    # Combined condition check (bitwise AND)
    df["combined_check"] = df['static_check_'] & df['fan_check_']

    # Rolling sum to count consecutive trues (This part remains the same)
    rolling_sum = df["combined_check"].rolling(window=5).sum()
    # Set flag to 1 if rolling sum equals the window size (5)
    df["fc1_flag"] = (rolling_sum == 5).astype(int)

    return df

Any insights appreciated. It's sort of interesting to see what ChatGPT says in a conversation about this:

Continue using your current approach with standard pandas operations, especially for the more complex parts like the rolling window operation.

If performance becomes an issue, consider using eval() for the simpler comparison operations, but benchmark to ensure it's actually faster for your specific case.

Always balance between readability/maintainability and performance, choosing the one that best fits your project's requirements.

Remember, while eval() can offer performance improvements in certain cases, it's always good to benchmark with your specific dataset to ensure it's actually faster and doesn't compromise readability or maintainability.
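
A minimal benchmark sketch along those lines, assuming synthetic data. The column names, threshold, and row count are placeholders rather than the project's real values, and the timings will depend on the machine and on whether `numexpr` is installed.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic frame; names and sizes are illustrative only.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "duct_static": rng.uniform(0.0, 2.0, n),
    "duct_static_setpoint": rng.uniform(0.5, 1.5, n),
})


def plain(frame: pd.DataFrame, err_thres: float) -> pd.Series:
    # Standard pandas comparison, as in the current approach.
    return frame["duct_static"] < frame["duct_static_setpoint"] - err_thres


def with_eval(frame: pd.DataFrame, err_thres: float) -> pd.Series:
    # Column names appear bare; @ pulls in the local scalar.
    return frame.eval("duct_static < duct_static_setpoint - @err_thres")


# Both paths should agree before the timings mean anything.
assert np.array_equal(plain(df, 0.1).to_numpy(), with_eval(df, 0.1).to_numpy())

for fn in (plain, with_eval):
    seconds = timeit.timeit(lambda: fn(df, 0.1), number=5)
    print(f"{fn.__name__}: {seconds / 5 * 1e3:.2f} ms per call")
```

Running this at the row counts you actually expect is the quickest way to see whether `eval()` helps; on small frames the expression-parsing overhead can outweigh any gain.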

bbartling added the enhancement and help wanted labels on Feb 3, 2024