Yue Zhao / pyod / Merge requests / !76

Generate data clusters

Merged Yahya requested to merge github/fork/John-Almardeny/Generate_Data_Clusters into development Apr 15, 2019

All Submissions Basics:

#65 (closed)

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Have you checked all Issues to tie the PR to a specific one?

All Submissions Cores:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?
  • Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?
  • Does your submission have appropriate code coverage? The cutoff threshold is 95% by Coveralls.

New Model Submissions:

  • Have you created a .py in ~/pyod/models/?
  • Have you created a _example.py in ~/examples/?
  • Have you created a test_.py in ~/pyod/test/?
  • Have you linted your code locally prior to submission?

Brief Description

This algorithm generates one or more clusters of data points whose sizes and densities can be the same or different, as chosen by the user through the parameters. It generates the required ratio of outliers, controlled by the contamination parameter, and distributes them across the clusters. It uses the make_blobs function provided by sklearn to create the clusters; the main part of the algorithm maintains and validates the consistency of the data splits among the different clusters. The code is well documented, and I believe the documentation (comments) makes it easy to follow.
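The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: the function name `sketch_clusters`, its parameters, and the way outliers are placed (on a shell 5 to 8 standard deviations from each cluster center) are all assumptions made for the example. Only `make_blobs` and the contamination ratio come from the description.

```python
import numpy as np
from sklearn.datasets import make_blobs


def sketch_clusters(sizes, stds, centers, contamination=0.1, random_state=0):
    """Hypothetical helper: inlier blobs plus outliers spread over the clusters."""
    rng = np.random.RandomState(random_state)
    centers = np.asarray(centers, dtype=float)

    # Split each cluster's size into outliers and inliers.
    n_out = [int(round(s * contamination)) for s in sizes]
    n_in = [s - o for s, o in zip(sizes, n_out)]

    # Inliers: one Gaussian blob per cluster, each with its own size and spread.
    X_in, _ = make_blobs(n_samples=n_in, centers=centers,
                         cluster_std=stds, random_state=rng)

    # Outliers (assumed placement): random directions, far from each center.
    X_out = []
    for center, std, k in zip(centers, stds, n_out):
        direction = rng.normal(size=(k, centers.shape[1]))
        direction /= np.linalg.norm(direction, axis=1, keepdims=True)
        radius = rng.uniform(5 * std, 8 * std, size=(k, 1))
        X_out.append(center + radius * direction)

    X = np.vstack([X_in] + X_out)
    # Label convention assumed here: 0 = inlier, 1 = outlier.
    y = np.concatenate([np.zeros(len(X_in), dtype=int),
                        np.ones(sum(n_out), dtype=int)])
    return X, y


X, y = sketch_clusters(sizes=[200, 50], stds=[0.3, 2.0],
                       centers=[[0, 0], [10, 10]], contamination=0.1)
```

Here the dense cluster contributes 20 outliers and the sparse one 5, so the overall outlier ratio matches the requested contamination.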


As mentioned in the related issue #66, having clusters of data with different sizes and densities makes outlier detection challenging, especially for algorithms based on k-nearest neighbors, such as LOF, LDOF, LoOP, HiCS, and SOD.
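One way to see why mixed densities are hard for neighbor-based scoring is to compare the k-th nearest-neighbor distance of ordinary points in a dense cluster and in a sparse one. The snippet below is only an illustrative sketch: the cluster sizes, spreads, centers, and `k = 10` are arbitrary assumptions, and it does not benchmark any of the algorithms listed above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Two inlier-only clusters: a large dense one and a small sparse one.
X, cluster = make_blobs(n_samples=[200, 50],
                        centers=[[0.0, 0.0], [10.0, 10.0]],
                        cluster_std=[0.3, 2.0],
                        random_state=0)

k = 10
# k + 1 neighbors because each point is returned as its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
kth = dist[:, -1]  # distance to the k-th true neighbor

dense_median = np.median(kth[cluster == 0])
sparse_median = np.median(kth[cluster == 1])
print(f"median {k}-NN distance, dense cluster:  {dense_median:.3f}")
print(f"median {k}-NN distance, sparse cluster: {sparse_median:.3f}")
```

Because normal points in the sparse cluster sit much farther from their neighbors than normal points in the dense cluster, a single global distance threshold tends either to flag sparse-cluster inliers or to miss outliers near the dense cluster.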

