Datasets
Module for generating tf.data.Dataset
BaseImageDatagen
A base for image data generators
Attributes:

| Name | Type | Description |
|---|---|---|
| shape | List[int] | shape of the images to generate |
| name | str | name of the dataset to use in references/logs |
Methods:

- read_image (staticmethod): read a file from a given filepath and decode it as a JPEG image
- parse_sample: given a path, read an image and preprocess it for further usage
- create_ds (abstractmethod): create the tf.data.Dataset
Source code in conftrainer/datasets/datagen.py
read_image(path)
staticmethod
Read and decode an image from disk
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | path to the image | required |
Returns:

| Name | Type | Description |
|---|---|---|
| out | tf.Tensor | decoded image |
Source code in conftrainer/datasets/datagen.py
parse_sample(path)
Read and resize an image
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str | path to the image | required |
Returns:

| Name | Type | Description |
|---|---|---|
| out | (tf.Tensor, Any) | (image, label) pair |
Source code in conftrainer/datasets/datagen.py
create_ds(batch_size=32, shape=None, training=False)
abstractmethod
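To illustrate the contract this base class defines, here is a minimal stand-in sketch, not the actual implementation: a subclass must implement `create_ds`, while `name` and `shape` are plain attributes. The `ListDatagen` subclass is hypothetical and returns plain Python lists instead of a `tf.data.Dataset`, purely to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseImageDatagen(ABC):
    """Simplified stand-in mirroring the documented attributes and contract."""

    def __init__(self, name: str = "current", shape: Optional[List[int]] = None):
        self.name = name
        self.shape = shape

    @abstractmethod
    def create_ds(self, batch_size: int = 32, shape=None, training: bool = False):
        """Subclasses must build and return the dataset here."""


class ListDatagen(BaseImageDatagen):
    """Toy subclass: yields fixed-size batches from an in-memory list."""

    def __init__(self, samples, **kwargs):
        super().__init__(**kwargs)
        self.samples = samples

    def create_ds(self, batch_size: int = 32, shape=None, training: bool = False):
        # Slice the sample list into consecutive batches
        return [self.samples[i:i + batch_size]
                for i in range(0, len(self.samples), batch_size)]


gen = ListDatagen(list(range(10)), name="toy", shape=[224, 224, 3])
batches = gen.create_ds(batch_size=4)
```

Because `create_ds` is abstract, instantiating `BaseImageDatagen` directly raises a `TypeError`.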
ImageDatagen
Bases: BaseImageDatagen
Generator class for creating image datasets via the tf.data.Dataset API.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepaths | List[str] | names of the filepaths to read samples from | required |
| labels | Optional[np.ndarray] | labels | None |
| name | str | name of the object to use in references | 'current' |
| shape | List[int] | shape of the images; might be overwritten in the create_ds method | None |
Source code in conftrainer/datasets/datagen.py
create_ds(batch_size=32, shape=None, training=False)
Create a dataset via the tf.data.Dataset API and assign it to the .dataset attribute of the datagen
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int | size of each batch | 32 |
| training | bool | whether the dataset will be used for training. If so, the dataset will be shuffled on each call | False |
| shape | List[int] | shape of the images to output. If not provided, defaults will be used instead | None |
Returns:

| Name | Type | Description |
|---|---|---|
| out | tf.data.Dataset | batched and prefetched dataset ready to pass to a network |
Source code in conftrainer/datasets/datagen.py
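The kind of pipeline create_ds produces can be sketched as follows. This is a minimal illustration, not the library's code: it uses synthetic in-memory arrays instead of image files, and the `build_pipeline` helper is hypothetical, but the shuffle-on-training, batch, and prefetch steps match the behaviour described above.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for decoded images and their labels
images = np.random.rand(10, 8, 8, 3).astype("float32")
labels = np.arange(10)


def build_pipeline(images, labels, batch_size=4, training=False):
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    if training:
        # reshuffle_each_iteration=True re-shuffles the data on every pass
        ds = ds.shuffle(buffer_size=len(images), reshuffle_each_iteration=True)
    # Batch, then prefetch so the next batch is prepared while the network trains
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)


ds = build_pipeline(images, labels, batch_size=4, training=True)
first_images, first_labels = next(iter(ds))
```

The resulting object can be passed directly to `model.fit`.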
create_labeled_ds(training, batch_size)
Create a labeled tf.data.Dataset
Source code in conftrainer/datasets/datagen.py
create_unlabeled_ds(training, batch_size)
Create an unlabeled tf.data.Dataset
Source code in conftrainer/datasets/datagen.py
MultiOutputDatagen
Bases: BaseImageDatagen
A dataset with multiple sets of labels per image, for multibranch training
Source code in conftrainer/datasets/datagen.py
proba_postprocessing_functions: List[Callable]
property
Get postprocessing functions for each task's labels. If a task is multilabel, its postprocessing function will be np.round; otherwise, np.argmax
Returns:

| Name | Type | Description |
|---|---|---|
| functions | List[Callable] | functions to use when postprocessing labels |
unpack_per_task_info(per_task_data)
staticmethod
Given a list of info for each task, create lists of per-task labels, class names and task names
Source code in conftrainer/datasets/datagen.py
probs_to_labels(probas)
Postprocess a list of per task probabilities to get labels
Source code in conftrainer/datasets/datagen.py
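The round-vs-argmax postprocessing described above can be sketched with plain numpy. The task setup here is hypothetical: task 0 is multilabel (sigmoid scores thresholded with np.round), task 1 is multiclass (softmax scores collapsed with np.argmax).

```python
import numpy as np

# Hypothetical per-task postprocessing: multilabel -> round, multiclass -> argmax
postprocessing_fns = [np.round, lambda p: np.argmax(p, axis=-1)]

# Per-task probability outputs for a batch of two samples
per_task_probas = [
    np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]]),  # multilabel sigmoid scores
    np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]),  # multiclass softmax scores
]

# Apply each task's function to that task's probabilities
labels = [fn(p) for fn, p in zip(postprocessing_fns, per_task_probas)]
```

For the multilabel task this yields a binary matrix; for the multiclass task, a vector of class indices.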
create_ds(batch_size=32, shape=None, training=False)
Create the tf.data.Dataset object to pass to the network
Source code in conftrainer/datasets/datagen.py
Utils for loading data from given csv files
load_all_datagens(csvs, data_dir, name_col, shape, trainable_classes, clean_dataset)
Create Datagen objects for training, validation and testing
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| csvs | CSVsConfig | path(s) to read the data from | required |
| data_dir | str | path to the directory where the images are stored | required |
| name_col | str | name of the column with filenames | required |
| trainable_classes | List[str] | names of label columns | required |
| shape | List[int] | shape of the images | required |
| clean_dataset | bool | whether to clean the dataset before using it | required |
Returns:

| Name | Type | Description |
|---|---|---|
| out | Dict[str, ImageDatagen] | names as keys and Datagens as values |
Source code in conftrainer/datasets/loader.py
create_datasets(datagens, **kwargs)
Update the .dataset attributes of given ImageDatagens by calling .create_ds method
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| datagens | Dict[str, ImageDatagen] | names of the datagens as keys and Datagen objects as values | required |
| **kwargs | | keyword arguments for the .create_ds method of ImageDatagen; see its documentation for more details | {} |
Source code in conftrainer/datasets/loader.py
create_multioutput_datagens(read_config, shape)
Given csv files and classes for each task, create multi-output data generators for training, validation (optional) and test (optional) datasets
Source code in conftrainer/datasets/loader.py
Helper functions related to data
filter_unlabeled_samples(dataframe, class_names)
Given a dataframe and a list of columns, filter out rows whose values in the given columns are all zero
Source code in conftrainer/datasets/utils.py
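A minimal pandas sketch of this filtering, assuming one-hot label columns (this mirrors the documented behaviour but is not the library's code):

```python
import pandas as pd


def filter_unlabeled_samples(dataframe: pd.DataFrame, class_names: list) -> pd.DataFrame:
    """Keep only rows with at least one nonzero value in the label columns."""
    mask = (dataframe[class_names] != 0).any(axis=1)
    return dataframe[mask]


df = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg"],
    "cat": [1, 0, 0],
    "dog": [0, 0, 1],
})
# b.jpg has all-zero labels, so it is dropped
labeled = filter_unlabeled_samples(df, ["cat", "dog"])
```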
stratify_split(dataframe, class_names, test_size=0.2, random_state=None)
Split given dataframe into train and test datasets, stratifying by given columns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataframe | pandas.DataFrame | data to split | required |
| class_names | List[str] | names of label columns to stratify by | required |
| test_size | float | ratio of the test dataset | 0.2 |
| random_state | int | random state for reproducible results | None |
Returns:

| Name | Type | Description |
|---|---|---|
| out | (pandas.DataFrame, pandas.DataFrame) | train and test dataframes |
Source code in conftrainer/datasets/utils.py
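One way to implement such a stratified split in plain pandas is sketched below; this is a simplified illustration under the assumption of one-hot label columns, not necessarily how the library does it internally.

```python
import pandas as pd


def stratify_split(dataframe, class_names, test_size=0.2, random_state=None):
    """Sample test_size of each label combination into the test set."""
    # Collapse the one-hot label columns into a single stratification key
    strat_key = dataframe[class_names].astype(str).agg("-".join, axis=1)
    # Sample the same fraction from every label group
    test = dataframe.groupby(strat_key).sample(frac=test_size, random_state=random_state)
    train = dataframe.drop(test.index)
    return train, test


df = pd.DataFrame({
    "filename": [f"{i}.jpg" for i in range(10)],
    "cat": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "dog": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
train, test = stratify_split(df, ["cat", "dog"], test_size=0.2, random_state=0)
```

With five samples per class and test_size=0.2, each class contributes one sample to the test set.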
clean_dataframe(dataframe, data_dir, col, clean_save_path=None)
Remove rows with broken/missing images from a dataframe. If a save path is provided, the cleaned dataframe will be saved to disk.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataframe | pd.DataFrame | dataframe to clean | required |
| col | str | name of the column containing filenames | required |
| data_dir | str | path to the directory containing the images | required |
| clean_save_path | Optional[str] | a csv path to save the cleaned dataframe | None |
Returns:

| Name | Type | Description |
|---|---|---|
| out | pandas.DataFrame | cleaned dataframe |
Source code in conftrainer/datasets/utils.py
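A simplified sketch of this cleaning step, checking only file existence (the real implementation may also verify that images decode correctly):

```python
import os
import tempfile
import pandas as pd


def clean_dataframe(dataframe, data_dir, col, clean_save_path=None):
    """Drop rows whose image file does not exist on disk (simplified check)."""
    exists = dataframe[col].apply(lambda f: os.path.isfile(os.path.join(data_dir, f)))
    cleaned = dataframe[exists].reset_index(drop=True)
    if clean_save_path is not None:
        cleaned.to_csv(clean_save_path, index=False)
    return cleaned


# Create a temp directory containing only one of the two referenced files
data_dir = tempfile.mkdtemp()
open(os.path.join(data_dir, "a.jpg"), "wb").close()
df = pd.DataFrame({"filename": ["a.jpg", "missing.jpg"], "cat": [1, 0]})
cleaned = clean_dataframe(df, data_dir, "filename")
```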
read_preprocess_dataframe(csv_path, name_col, classes=None, clean_dataset=False, data_dir='./')
Read and preprocess a dataframe containing filepaths and their labels
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| csv_path | str | path of the csv containing names & labels of samples | required |
| name_col | str | name of the column with filenames | required |
| classes | Optional[List[str]] | names of the columns containing class labels | None |
| data_dir | str | root directory to read the images from | './' |
| clean_dataset | bool | whether to clean the dataframe of non-existing/invalid images before proceeding | False |
Returns:

| Name | Type | Description |
|---|---|---|
| filenames | List[str] | filenames of the samples |
| labels | np.ndarray | values of label columns |
Source code in conftrainer/datasets/utils.py
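The filenames/labels split this function performs can be sketched as follows; the cleaning step is omitted here, and an in-memory csv is used instead of a file path to keep the example self-contained.

```python
import io
import pandas as pd


def read_preprocess_dataframe(csv_source, name_col, classes=None):
    """Read a csv and split it into a filename list and a label matrix."""
    df = pd.read_csv(csv_source)
    filenames = df[name_col].tolist()
    labels = df[classes].to_numpy() if classes else None
    return filenames, labels


csv_text = "filename,cat,dog\na.jpg,1,0\nb.jpg,0,1\n"
filenames, labels = read_preprocess_dataframe(
    io.StringIO(csv_text), "filename", ["cat", "dog"]
)
```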
validate_column_sum(dataframe, columns=List[str], name='')
Check that every provided column of a one-hot encoded dataframe has at least one sample
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataframe | pd.DataFrame | data to check | required |
| columns | List[str] | names of columns to check | List[str] |
| name | str | name of the dataframe to display in the error message | '' |
Raises:

| Type | Description |
|---|---|
| ValueError | if there's a column with no samples |
Source code in conftrainer/datasets/utils.py
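For a one-hot encoded dataframe, "a column with no samples" means a label column that sums to zero. A minimal sketch of such a check (illustrative, not the library's code):

```python
import pandas as pd


def validate_column_sum(dataframe, columns, name=""):
    """Raise if any one-hot label column sums to zero, i.e. has no positive samples."""
    sums = dataframe[columns].sum()
    empty = sums[sums == 0].index.tolist()
    if empty:
        raise ValueError(f"Columns {empty} of dataframe '{name}' have no samples")


df = pd.DataFrame({"cat": [1, 0], "dog": [0, 0]})
validate_column_sum(df, ["cat"], name="train")  # passes: 'cat' has a sample
try:
    validate_column_sum(df, ["cat", "dog"], name="train")  # 'dog' is empty
    raised = False
except ValueError:
    raised = True
```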
read_multioutput_dataframe(csv_path, per_task_data, name, data_dir, name_col, clean_dataset=False)
Read a dataframe with possibly multiple sets of labels and generate a configuration to create a Datagen
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| csv_path | str | path of the csv file to read | required |
| per_task_data | List[SingleTaskDataConfig] | configuration for each task; includes class names, and will be filled with actual labels | required |
| name | str | name of the datagen config | required |
| data_dir | str | root directory to read the images from | required |
| name_col | str | name of the column containing filenames | required |
| clean_dataset | bool | whether to check if there are broken images in the data | False |
Returns:

| Name | Type | Description |
|---|---|---|
| out | MultiOutputDatagenConfig | a configuration to create an image data generator with given filepaths and a separate set of labels per task |