Yue Zhao / pyod · Issues · #315

Closed

Issue created Jun 16, 2021 by Roel Bouman (Contributor)

COF calculates full distance matrix for all sample sizes

In the current implementation, COF calculates the full X-X distance matrix, leading to O(n^2) memory complexity. When analyzing data with a large number of samples, this leads to a prohibitively large distance matrix and often causes memory errors.
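To make the scale of the problem concrete, here is a back-of-the-envelope estimate of the dense matrix's footprint for float64 entries (simple arithmetic, not measured from pyod):

```python
def dist_matrix_bytes(n_samples: int, itemsize: int = 8) -> int:
    """Bytes needed to hold a dense n x n distance matrix of float64 (8-byte) entries."""
    return n_samples * n_samples * itemsize

print(dist_matrix_bytes(10_000) / 1e9)   # 0.8  -> ~0.8 GB for 10k samples
print(dist_matrix_bytes(100_000) / 1e9)  # 80.0 -> ~80 GB for 100k samples
```

So even a mid-sized dataset of 100k samples cannot hold the full matrix in memory on typical hardware.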

At the cost of linear overhead, this can easily be circumvented by recalculating the distances where needed, effectively leading to twice as many distance calculations.

I have tested a simple implementation of this memory optimization, leading to the following code:

    # Needs numpy as np, operator.itemgetter, and
    # scipy.spatial.minkowski_distance imported at module level.
    def _cof(self, X):
        """
        Connectivity-Based Outlier Factor (COF) Algorithm
        This function is called internally to calculate the
        Connectivity-Based Outlier Factor (COF) as an outlier
        score for observations.
        :return: numpy array containing COF scores for observations.
                 The greater the COF, the greater the outlierness.
        """
        # Original full-matrix computation, kept for reference:
        # dist_matrix = np.array(distance_matrix(X, X))
        sbn_path_index, ac_dist, cof_ = [], [], []
        for i in range(X.shape[0]):
            # Distances from sample i to all samples, recomputed on the fly
            # instead of read from the precomputed matrix.
            # sbn_path = np.argsort(dist_matrix[i])
            sbn_path = np.argsort(minkowski_distance(X[i, :], X, p=2))
            sbn_path_index.append(sbn_path[1: self.n_neighbors_ + 1])
            cost_desc = []
            for j in range(self.n_neighbors_):
                # cost_desc.append(
                #     np.min(dist_matrix[sbn_path[j + 1]][sbn_path][:j + 1]))
                cost_desc.append(
                    np.min(minkowski_distance(X[sbn_path[j + 1]], X, p=2)
                           [sbn_path][:j + 1]))
            acd = []
            for _h, cost_ in enumerate(cost_desc):
                neighbor_add1 = self.n_neighbors_ + 1
                acd.append(((2. * (neighbor_add1 - (_h + 1))) / (
                        neighbor_add1 * self.n_neighbors_)) * cost_)
            ac_dist.append(np.sum(acd))
        for _g in range(X.shape[0]):
            cof_.append((ac_dist[_g] * self.n_neighbors_) /
                        np.sum(itemgetter(*sbn_path_index[_g])(ac_dist)))
        return np.nan_to_num(cof_)
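As a sanity check (mine, not part of the proposal above), the row-wise Euclidean distances that `minkowski_distance(X[i], X, p=2)` computes are identical to the corresponding row of the full distance matrix, so the rewrite changes memory behavior but not results:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Full O(n^2) matrix, as in the current implementation.
full = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

# One row recomputed on demand, holding only O(n) floats at a time.
row = np.sqrt(((X - X[7]) ** 2).sum(axis=-1))

assert np.allclose(full[7], row)
```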

While I don't think this would be great to use in every case, due to the increased computational cost, it seems to me that offering this option to the user would be worthwhile. I can write a pull request which implements a switch between these use cases, but I'm not sure how it should best be implemented. We could, for example, have the user set a parameter to choose one method or the other, or have them set a limit for the distance matrix size (in MB, for example). I'd like to request some additional input on whether such a feature is desirable and how it should be implemented.
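One possible shape for the size-limit variant of the switch, sketched as a helper; the name `max_matrix_mb` and the 512 MB default are placeholders of mine, not pyod API:

```python
def choose_strategy(n_samples: int, max_matrix_mb: float = 512.0) -> str:
    """Pick 'full' (precompute the distance matrix) or 'rowwise'
    (recompute distances per row) based on the estimated size in MB
    of a dense n x n float64 distance matrix."""
    est_mb = n_samples * n_samples * 8 / 1e6  # 8 bytes per float64 entry
    return "full" if est_mb <= max_matrix_mb else "rowwise"
```

A threshold like this would keep the fast path for small data while falling back to the memory-efficient path automatically for large inputs.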
