chi_merge_vector has redundant calculation. #9

Open
cheesebear opened this issue Apr 24, 2022 · 1 comment

cheesebear commented Apr 24, 2022

The code below is recomputed on every iteration of the while loop, which takes too much time.

        intervals, unique_intervals = assign_interval_unique(x, unique_intervals[:, 1])
        pt_value, pt_column, pt_index = pivot_table_np(intervals[:, 1], y)

In my case, the original code takes about 10 minutes to process one feature. After optimization, it takes about 10 seconds.
In the first loop iteration, define df:

    df = pd.DataFrame(pt_value, columns=pt_column)
    df['pt_index'] = pt_index
    # np.nan instead of np.NaN: the np.NaN alias was removed in NumPy 2.0.
    df['chi2'] = np.append(chi2_array, [np.nan] * (m - 1))

In subsequent iterations, update df and the intermediate variables instead of rebuilding them. This fast path avoids the repeated computation:

    merge_index_start = index_adjacent_to_merge[0]
    # Collapse the m adjacent rows starting at merge_index_start into one row
    # by summing them; all other rows are kept unchanged.
    df = pd.concat(
        [
            df.loc[:merge_index_start - 1, :],
            df.loc[merge_index_start:merge_index_start + m - 1, :].sum(axis=0).to_frame().T,
            df.loc[merge_index_start + m:, :],
        ],
        ignore_index=True,
    )
    # The merged row takes its pt_index from the new interval.
    df.loc[merge_index_start, 'pt_index'] = new_interval[0][1]

    # Refresh the intermediate variables from the updated table.
    pt_value = df[pt_column].to_numpy()
    pt_index = df['pt_index'].to_numpy()
    # np.unique already returns a sorted array, so no extra sort is needed.
    boundaries_tmp = np.unique(
        np.concatenate(([-np.inf], pt_index, [np.inf])))
    unique_intervals = np.column_stack((boundaries_tmp[:-1], boundaries_tmp[1:]))
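
To illustrate why the incremental update is safe, here is a minimal NumPy-only sketch showing that summing adjacent rows of a class-count table gives the same result as re-binning the raw data with the merged boundaries. The helper names `count_table` and `merge_rows` and the toy data are hypothetical and not part of this project. The sketch assumes each interval is identified by its upper boundary and is closed on the right.

```python
import numpy as np


def count_table(x, y, upper_bounds, n_classes):
    """Full recomputation: bin x into (lo, hi] intervals and count classes."""
    # side="left" puts a value equal to a boundary into the interval it closes.
    rows = np.searchsorted(upper_bounds, x, side="left")
    table = np.zeros((len(upper_bounds), n_classes), dtype=np.int64)
    np.add.at(table, (rows, y), 1)
    return table


def merge_rows(counts, upper_bounds, start, m=2):
    """Incremental update: replace m adjacent rows by their sum."""
    stop = start + m
    merged = counts[start:stop].sum(axis=0, keepdims=True)
    new_counts = np.vstack((counts[:start], merged, counts[stop:]))
    # The merged interval keeps the upper boundary of its last member.
    new_bounds = np.concatenate((upper_bounds[:start], upper_bounds[stop - 1:]))
    return new_counts, new_bounds


# Toy data (hypothetical): one feature, binary target.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 0, 1, 1])
bounds = np.array([1.5, 2.5, 4.5, np.inf])

counts = count_table(x, y, bounds, n_classes=2)
fast_counts, fast_bounds = merge_rows(counts, bounds, start=0, m=2)
slow_counts = count_table(x, y, fast_bounds, n_classes=2)

print(np.array_equal(fast_counts, slow_counts))
```

After a merge like this, only the chi-square values for pairs that include the merged row should need recomputing, since the counts in every other row are unchanged.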

Mensyne commented Oct 11, 2022 via email
