Synthetic Data with Sklearn

Synthetic Data with Sklearn

Synthetic Data Generation

We are going to be using sklearn’s function datasets.make_classification() to create synthetic datasets. We can specifiy arguments to specify the number of informative, redundant, and repeated features in the dataset. Synthetic data can be great as we can control every aspect of our data including the number of classes, features (informative, redundant, repeating, noise), clusters within each class, and the separation of these clusters.

There are many opportunities to examine…

  • How an increasing number of noisy features affects classification performance
  • Performance of community detection and clustering algorithms with varying levels of class separability
  • How varying dimensionality alters runtime of new algorithms

For the sake of this demonstration we are going to focus on this bit; Performance of community detection and clustering algorithms with varying levels of class separability

Primary Goals

  • Two class labels with varying class separability
  • Four clusters per class
  • Varying levels of cluster separation

Class Separation

  • Let’s, examine how the class_sep affects the data
  • We will only engineer 3 features to better visualize the class separation

import numpy as np

# Modified sklearn function to 
	# To output the cluster label of every observation
from sklearn.datasets import sample_gen_gh as gh
import matplotlib.pyplot as plt

# Iterate through varying class_sep
for sep_val in np.arange(.75, 4.75, 1.0):

	# Create data
	X, Y, clust = make_classification(n_samples=1000,
		n_features=3,
		n_informative=3,
		n_redundant=0,
		n_repeated=0,
		n_classes=2,
		n_clusters_per_class=4,
		class_sep = sep_val,
		shuffle = False)

	# Custom function to write plot output
	write_3d_plot(X, Y)

Visualizing relationships within the data

It can be difficult to try and identify underlying structure within a dataset with many many features. However, there are a few methods that might help, first we will try hierarchical clustering, PCA, and finally TSNE.

We will further be dilineating how class separation alters the underlying structure as well.

Class sep 1.5

Class sep 2.5

Class sep 3.5