Yue Zhao
pyod
Merge requests
!76

Generate data clusters


Merged: Yahya requested to merge github/fork/John-Almardeny/Generate_Data_Clusters into development on Apr 15, 2019

Overview 7
Commits 4
Pipelines 0
Changes 3

All Submissions Basics:

#65 (closed)

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?
Have you checked all Issues to tie the PR to a specific one?

All Submissions Cores:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully run tests with your changes locally?
Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?
Does your submission have appropriate code coverage? The cutoff threshold is 95%, as measured by Coveralls.

New Model Submissions:

Have you created a .py in ~/pyod/models/?
Have you created a _example.py in ~/examples/?
Have you created a test_.py in ~/pyod/test/?
Have you linted your code locally prior to submission?

Brief Description

This algorithm generates one or more clusters of data points whose sizes and densities can be the same or different, as chosen by the user through the parameters. It generates the required ratio of outliers, controlled by the contamination parameter, and distributes them among the clusters. It uses the make_blobs function provided by sklearn to create the clusters; the main part of the algorithm maintains and validates the consistency of the data splits among the different clusters. The code is well documented, and I believe reading the documentation (comments) will make it easy to follow.
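To illustrate the idea for reviewers, here is a minimal sketch of the approach described above. It is not the code in this merge request: the function name `sketch_clusters`, its parameters, and the way outliers are placed (a shell 4 to 6 standard deviations from each cluster center) are assumptions made only for this example. The only library calls relied on are `sklearn.datasets.make_blobs` and NumPy.

```python
import numpy as np
from sklearn.datasets import make_blobs


def sketch_clusters(sizes, stds, contamination=0.1, n_features=2, random_state=0):
    """Hypothetical helper: Gaussian inlier blobs plus outliers spread over the clusters.

    sizes: total number of points per cluster (inliers + outliers).
    stds: standard deviation (density) of each cluster.
    Returns X (points) and y (0 = inlier, 1 = outlier).
    """
    rng = np.random.RandomState(random_state)

    # Split each cluster's size into outliers and inliers so that every
    # cluster receives its share of the contamination ratio.
    n_out = [int(round(s * contamination)) for s in sizes]
    n_in = [s - o for s, o in zip(sizes, n_out)]

    # Draw the cluster centers ourselves so they can be reused for the outliers.
    centers = rng.uniform(-10.0, 10.0, size=(len(sizes), n_features))

    # Inliers: one blob per cluster, each with its own size and spread.
    X_in, _ = make_blobs(n_samples=n_in, centers=centers, cluster_std=stds,
                         shuffle=False, random_state=rng)

    X_parts = [X_in]
    y_parts = [np.zeros(len(X_in), dtype=int)]

    # Outliers (assumed placement): random directions, 4 to 6 std from the center.
    for center, std, k in zip(centers, stds, n_out):
        direction = rng.normal(size=(k, n_features))
        direction /= np.linalg.norm(direction, axis=1, keepdims=True)
        radius = rng.uniform(4 * std, 6 * std, size=(k, 1))
        X_parts.append(center + radius * direction)
        y_parts.append(np.ones(k, dtype=int))

    return np.vstack(X_parts), np.concatenate(y_parts)


# Example: a large sparse cluster and a small dense one, 10% outliers in each.
X, y = sketch_clusters(sizes=[200, 50], stds=[1.0, 0.3], contamination=0.1)
```

The consistency check mentioned above corresponds here to the per-cluster split of `sizes` into `n_in` and `n_out`, which keeps the totals and the outlier ratio aligned across clusters.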

As mentioned previously in the related issue #66, clusters of data with different sizes and densities make outlier detection challenging, especially for algorithms based on k-nearest neighbors, such as LOF, LDOF, LoOP, HiCS, SOD, and others.

Source branch: github/fork/John-Almardeny/Generate_Data_Clusters