Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimal time complexity greedy linkage #212

Open
nbgl opened this issue May 23, 2019 · 0 comments
Open

Optimal time complexity greedy linkage #212

nbgl opened this issue May 23, 2019 · 0 comments

Comments

@nbgl
Copy link
Contributor

nbgl commented May 23, 2019

Let D and D′ be two deduplicated datasets of size n and n′ respectively. Greedy linkage involves two steps: comparisons, which are trivially O(nn′), and solving.

Let p be the probability that a pair of records has similarity above the threshold. As n, n′ → ∞ (for a given source of data) p converges to the probability of a non-matching pair being above the threshold. (p > 0 unless there are never non-matching pairs above the threshold, in which case solving isn’t necessary.)

There are pnn′ pairs above the threshold. Because we sort them, our solving step is O(pnn′ log pnn′). This equals O(nn′ log nn′) since p converges to a positive constant. The entire linkage process is thus O(nn′ + nn′ log nn′) = O(nn′ log nn′).

However, solving can be done in O(pnn′ + m log m) time, where m is the number of found pairs. Observing that m ≤ min(n, n′) for a given source of data, O(pnn′ + m log m) = O(nn′). This is the same as the time complexity of the comparison step and it is the best we can do without ignoring input.

The algorithm is similar to Filter-Kruskal [1] (although I claim I thought of it before I found Filter-Kruskal!). It is a modification of QuickSort, and it filters out pairs that will provably not be accepted before they are sorted. Pseudo-Python:

def pick_pivot(pairs):
    raise NotImplementedError()


def partition(pairs, pivot):
    raise NotImplementedError()


def quick_greedy(candidate_pairs):
    matched_A = set()
    matched_B = set()
    accepted = set()

    def filter_provable_nonmatches(pairs):
        return [pair
                for pair in pairs
                if pair.index_A not in matched_A and pair.index_B not in matched_B]

    def quick_greedy_inner(pairs):
        if not candidate_pairs:
            return

        pivot = pick_pivot(pairs)
        left, right = partition(pairs, pivot)
        quick_greedy_inner(left)

        if pivot.index_A not in matched_A and pivot.index_B not in matched_B:
            matched_A.add(pivot.index_A)
            matched_B.add(pivot.index_B)
            accepted.add(pivot)

        right = filter_provable_nonmatches(right)
        quick_greedy_inner(right)

    quick_greedy_inner(candidate_pairs)
    return accepted

(Of course, in practice one would set a length limit for pairs, below which quick_greedy_inner calls the trivial greedy algorithm. This improves the speed since the trivial algorithm is faster for smaller problems.)

The time complexity of this algorithm can be proven by lightly modifying the proof of Theorem 3.1 in [1]. It relies on Lemma 1.1 in [2], which also needs adjusting.

[1] Osipov, Vitaly, Peter Sanders, and Johannes Singler. "The Filter-Kruskal Minimum Spanning Tree Algorithm." 2009 Proceedings of the Eleventh Workshop on Algorithm Engineering and Experiments (ALENEX). Society for Industrial and Applied Mathematics, 2009.
[2] Chan, Timothy M. "Backwards analysis of the Karger-Klein-Tarjan algorithm for minimum spanning trees." Information Processing Letters 67.6 (1998): 303-304.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant