Getting Started with Data Augmentation¶
Before you start!¶
- This notebook assumes that the shapeworks conda environment has been activated using `conda activate shapeworks` on the terminal.
- See Setting Up ShapeWorks Environment to learn how to set up your environment to start using the shapeworks library. Please note, the prerequisite steps use the same code to set up the environment for this notebook and import `shapeworks`.
- See Getting Started with Segmentations to learn how to load and visualize binary segmentations.
In this notebook, you will learn:¶
- How to generate realistic synthetic data from an existing dataset using different parametric distributions.
- How to visualize the statistical distribution of the generated data compared to the original data.
Data Augmentation Overview¶
ShapeWorks includes a Python package, DataAugmentationUtils, that supports model-based data augmentation. This package is useful to increase the training sample size to train deep networks such as DeepSSM (see SSMs Directly from Images).
A preliminary requirement for data augmentation is a set of images and shape models from real data on which to base the augmentation. Once that is acquired, the process includes:
- Embedding the real data into a low-dimensional space using principal component analysis (PCA).
- Fitting a parametric distribution to the subspace for sampling.
- Sampling from the distribution to create new instances.
- Projecting the samples back into the high-dimensional space of the original data.
- Completing the sample generation by creating a corresponding synthetic image.
This notebook shows how the distribution of the original data can be visually compared to the distribution of the synthetic data to motivate the choice of parametric distribution in step 2.
For a full explanation of the data augmentation process and package please see: Data Augmentation for Deep Learning.
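The steps above can be sketched with plain NumPy. This is a toy illustration of the embed–sample–project pipeline, not the `DataAugmentationUtils` implementation; the data shapes and the choice of a Gaussian sampler here are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in for real correspondence data: 20 shapes x 300 flattened coordinates
real_data = rng.normal(size=(20, 300))

# 1. embed the real data in a low-dimensional space via PCA (SVD of centered data)
mean = real_data.mean(axis=0)
centered = real_data - mean
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 5                                    # number of PCA modes to keep
scores = centered @ Vt[:k].T             # (20, k) low-dimensional embedding

# 2-3. fit a parametric distribution (here: Gaussian) and sample new instances
cov = np.cov(scores, rowvar=False)
new_scores = rng.multivariate_normal(scores.mean(axis=0), cov, size=50)

# 4. project the samples back into the high-dimensional space of the original data
new_data = new_scores @ Vt[:k] + mean    # (50, 300) synthetic point sets
```

Step 5 (generating a corresponding synthetic image for each sample) is specific to the package and is not shown here.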
- Setting up the `shapeworks` environment. See Setting Up ShapeWorks Environment. To avoid code clutter, the `setup_shapeworks_env` function can be found in the `shapeworks` library.
- Helper functions for segmentations. See Getting Started with Segmentations and Getting Started with Exploring Segmentations.
- Helper functions for meshes. See Getting Started with Meshes.
- Helper functions for visualization. See Getting Started with Segmentations, Getting Started with Meshes, and Getting Started with Exploring Segmentations.
- Defining your dataset location. See Getting Started with Exploring Segmentations.
- Loading your dataset. See Getting Started with Exploring Segmentations.
- Defining parameters for the `pyvista` plotter. See Getting Started with Exploring Segmentations.
Note that shapeworks functions are in-place, i.e., `<swObject>.<function>()` applies that function to the `swObject` data. To keep the original data unchanged, you first have to copy it to another variable before applying the function.
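The copy-before-modify pattern can be illustrated with a toy Python class standing in for a shapeworks object (this is a hypothetical stand-in, not the shapeworks API):

```python
import copy

class ToyImage:
    """Toy stand-in for a shapeworks object whose methods modify it in place."""
    def __init__(self, values):
        self.values = list(values)

    def scale(self, factor):
        # in-place, like <swObject>.<function>()
        self.values = [v * factor for v in self.values]
        return self

original = ToyImage([1, 2, 3])
working = copy.deepcopy(original)   # copy first to keep the original unchanged
working.scale(2)                    # modifies only the copy
```

After this, `original.values` is still `[1, 2, 3]` while `working.values` is `[2, 4, 6]`.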
Notebook keyboard shortcuts¶
Esc + H: displays a complete list of keyboard shortcuts
Esc + A: insert new cell above the current cell
Esc + B: insert new cell below the current cell
Esc + D + D: delete current cell
Esc + Z: undo
Shift + Enter: run current cell and move to the next
- To show a function's argument list (i.e., its signature), use `shift-tab`; use `shift-tab-tab` to show more help for the function.
- To show all functions supported by an object, use `dot-tab` after the variable name.
Here, we will append both your `PYTHONPATH` and your system `PATH` to set up the shapeworks environment for this notebook. See Setting Up ShapeWorks Environment for more details.
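A minimal sketch of what an environment-setup helper like `setup_shapeworks_env` presumably does, prepending the binary directories to `PYTHONPATH` (via `sys.path`) and the system `PATH`. The directory names below are hypothetical; this is not the actual `setupenv.py` implementation.

```python
import os
import sys

def setup_env_sketch(shapeworks_bin_dir, dependencies_bin_dir):
    """Sketch: make shapeworks importable and its binaries findable."""
    # prepend to the Python module search path (PYTHONPATH equivalent)
    sys.path.insert(0, shapeworks_bin_dir)
    # prepend both bin directories to the system PATH
    for d in (shapeworks_bin_dir, dependencies_bin_dir):
        os.environ["PATH"] = d + os.pathsep + os.environ.get("PATH", "")

# hypothetical paths for illustration
setup_env_sketch("/tmp/shapeworks/bin", "/tmp/shapeworks-deps/bin")
```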
In this notebook, we assume the following:
- This notebook is located in its original place in the shapeworks source tree (the relative paths in the code below assume this).
- You have built shapeworks from source in a `build` directory within the shapeworks code directory.
- You have built the shapeworks dependencies (using `build_dependencies.sh`) in the same parent directory as the shapeworks code.
Note: If you run from a ShapeWorks installation, you don't need to set the dependencies path, and `shapeworks_bin_dir` would be set to the installation's `bin` directory.
```python
# import relevant libraries
import sys
import os

# add parent-parent directory (where setupenv.py is) to python path
sys.path.insert(0, '../..')

# importing setupenv from Examples/Python
import setupenv

# indicate the bin directories for shapeworks and its dependencies
shapeworks_bin_dir = "../../../../build/bin"
dependencies_bin_dir = "../../../../../shapeworks-dependencies/bin"

# set up shapeworks environment
setupenv.setup_shapeworks_env(shapeworks_bin_dir, dependencies_bin_dir, verbose=False)

import shapeworks as sw
```
Import Data Augmentation Package¶
```python
# let's import the data augmentation package to test whether it is now available
try:
    import DataAugmentationUtils
except ImportError:
    print('ERROR: DataAugmentationUtils failed to import')
else:
    print('SUCCESS: DataAugmentationUtils is successfully imported!!!')
```
1. Defining the original dataset¶
Defining dataset location¶
After you log in, click `Collections` on the left panel and then `use-case-data-v2`. Select the dataset you would like to download by clicking the checkbox to the left of the dataset name. See the video below.
This notebook assumes that you have downloaded the data to `Examples/Python/Data`. Feel free to use your own dataset.
```python
# dataset name is the folder name for your dataset
datasetName = 'femur-v0'

# path to the dataset where we can find shape data
# here we assume shape data are given as binary segmentations
data_dir = '../../Data/' + datasetName + '/'

print('Dataset Name: ' + datasetName)
print('Directory: ' + data_dir)
```
Get file lists¶
Now we need the `.particles` files and corresponding raw images of the original dataset.
```python
# Get image path list
img_dir = data_dir + "groomed/images/"
img_list = []
for file in os.listdir(img_dir):
    img_list.append(img_dir + file)
img_list = sorted(img_list)

# Get particles path list
model_dir = data_dir + "shape_models/femur/1024/"
local_particle_list = []
for file in os.listdir(model_dir):
    if "local" in file:
        local_particle_list.append(model_dir + file)
local_particle_list = sorted(local_particle_list)

print("Total shapes in original dataset: " + str(len(img_list)))
```
Run data augmentation using a Gaussian Distribution¶
Below is the command for running the complete data augmentation process:
```python
DataAugmentationUtils.runDataAugmentation(out_dir, img_list, local_point_list, num_samples,
                                          num_dim, percent_variability, sampler_type,
                                          mixture_num, world_point_list)
```
- `out_dir`: Path to the directory where the augmented data will be stored.
- `img_list`: List of paths to images of the original dataset.
- `local_point_list`: List of paths to local `.particles` files of the original dataset. Note, this list should be ordered in correspondence with the `img_list`.
- `num_samples`: The number of synthetic samples to generate.
- `num_dim`: The number of dimensions to reduce to in the PCA embedding. If zero or not specified, the `percent_variability` option is used to select the number of dimensions.
- `percent_variability`: The proportion of variability in the data to be preserved in the embedding. Used if `num_dim` is zero or not specified. The default value is 0.95, which preserves 95% of the variability in the data.
- `sampler_type`: The type of parametric distribution to fit and sample from. Options: `gaussian`, `mixture`, or `kde`.
- `mixture_num`: Only necessary if `sampler_type` is `mixture`. The number of clusters (i.e., mixture components) to be used in fitting a mixture model. If zero or not specified, the optimal number of clusters will be automatically determined using the elbow method.
- `world_point_list`: List of paths to world `.particles` files of the original dataset. This is optional and should be provided in cases where Procrustes was used for the original optimization, resulting in a difference between world and local particle files. Note, this list should be ordered in correspondence with the `img_list`.
In this notebook, we will keep most arguments the same and explore the effect of changing the `sampler_type`. First, we will try a Gaussian distribution. For further explanation of each distribution, see Data Augmentation for Deep Learning.
```python
# Augmentation variables to keep constant
num_samples = 50
num_dim = 0
percent_variability = 0.95
```
```python
output_directory = '../Output/GaussianAugmentation/'
sampler_type = "gaussian"
embedded_dim = DataAugmentationUtils.runDataAugmentation(output_directory, img_list,
                                                         local_particle_list, num_samples,
                                                         num_dim, percent_variability,
                                                         sampler_type)
aug_data_csv = output_directory + "/TotalData.csv"
```
Visualize distribution of real and augmented data¶
The visualization command for comparing the original and augmented data takes the following arguments:
- `data_csv`: The path to the CSV file created by running the data augmentation process.
- `viz_type`: The type of visualization to display. Options: `splom` (default) or `violin`. If set to `splom`, a scatterplot matrix of pairwise PCA comparisons will open in the default browser. If set to `violin`, a violin plot (i.e., a rotated kernel density plot) will be displayed.
We will use a violin plot to visualize the difference between the real and augmented distributions.
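To illustrate what a violin comparison conveys, here is a toy matplotlib sketch comparing two synthetic sets of PCA scores. The data, labels, and output filename are all hypothetical; the actual visualization is produced by the `DataAugmentationUtils` package from the augmentation CSV.

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=40)       # toy PCA scores of real shapes
augmented = rng.normal(0.0, 1.1, size=50)  # toy PCA scores of augmented shapes

# side-by-side violins: similar shapes suggest the sampler matches the data
fig, ax = plt.subplots()
ax.violinplot([real, augmented], showmeans=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["real", "augmented"])
ax.set_ylabel("PCA mode 1 score")
fig.savefig("violin_comparison.png")
written = os.path.exists("violin_comparison.png")
```

If the two violins have similar shape and spread, the chosen parametric distribution captures the real data well.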
Run data augmentation using a Mixture of Gaussians Distribution¶
```python
output_directory = '../Output/MixtureAugmentation/'
sampler_type = "mixture"
embedded_dim = DataAugmentationUtils.runDataAugmentation(output_directory, img_list,
                                                         local_particle_list, num_samples,
                                                         num_dim, percent_variability,
                                                         sampler_type)
aug_data_csv = output_directory + "/TotalData.csv"
```
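Conceptually, mixture sampling first picks a cluster according to the mixture weights and then draws from that cluster's Gaussian. Here is a toy NumPy sketch of that idea (two hypothetical 2-D components; not the package's mixture-model fitting, which also determines the number of clusters via the elbow method):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy mixture: two 2-D Gaussian components with equal weights (hypothetical values)
means = np.array([[0.0, 0.0], [3.0, 3.0]])
weights = np.array([0.5, 0.5])

def sample_mixture(n):
    # pick a component per sample, then draw from that component's Gaussian
    comps = rng.choice(len(weights), size=n, p=weights)
    return means[comps] + rng.normal(size=(n, 2))

samples = sample_mixture(50)  # (50, 2) synthetic embedded points
```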
Run data augmentation using Kernel Density Estimation¶
```python
output_directory = '../Output/KDEAugmentation/'
sampler_type = "kde"
embedded_dim = DataAugmentationUtils.runDataAugmentation(output_directory, img_list,
                                                         local_particle_list, num_samples,
                                                         num_dim, percent_variability,
                                                         sampler_type)
aug_data_csv = output_directory + "/TotalData.csv"
```
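Kernel density estimation places a small Gaussian kernel on each real embedded point and samples from the resulting density. A minimal sketch using SciPy's `gaussian_kde` (toy data; the package's KDE sampler may differ in bandwidth selection and details):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# toy PCA scores of real shapes: gaussian_kde expects shape (num_dims, num_shapes)
embedded_real = rng.normal(size=(2, 30))

kde = gaussian_kde(embedded_real)  # fit a KDE to the embedded real data
new_samples = kde.resample(50)     # (2, 50) synthetic embedded points
```

Because KDE follows the data closely, it tends to produce realistic samples even when the real distribution is not well described by a single Gaussian.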