The data visualization plot gives a comprehensive picture of the data, letting you look at the entire dataset at once, which is otherwise not possible for multidimensional data.
The scatter plot assigns each sample alternative two-dimensional coordinates in place of its original multidimensional vector. These coordinates are calculated so as to preserve as much of the information in the original space as possible.
Each marker in the scatter plot represents a single sample. Markers close to it are samples that were likely more similar to it in the original multidimensional space, while markers further away are usually less similar.
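To make "nearby markers tend to be similar samples" concrete, here is a minimal sketch that measures how many of each sample's nearest neighbors are shared between the original space and the two-dimensional plot. The function names, the Euclidean distance, and the toy data are assumptions made for illustration; this is not necessarily how the tool itself computes or evaluates its coordinates.

```python
import math


def k_nearest(points, index, k):
    """Return the indices of the k points closest to points[index] (Euclidean)."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return {j for _, j in distances[:k]}


def neighbor_overlap(original, projected, k):
    """Average fraction of each sample's k nearest neighbors shared by both spaces."""
    total = 0.0
    for i in range(len(original)):
        shared = k_nearest(original, i, k) & k_nearest(projected, i, k)
        total += len(shared) / k
    return total / len(original)


# Hypothetical toy data: four samples in 3-D and their 2-D plot coordinates.
original = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
projected = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.1, 3.0)]

score = neighbor_overlap(original, projected, k=1)
print(score)
```

A score near 1 suggests that close markers in the plot are also close in the original space; a score near 0 suggests the plot has scrambled the neighborhoods.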
The coloring options for the markers let you split the data into various categories and search for interesting patterns. The options include:
Labels (ground truth) - the original labeling of each sample.
Predictions - the model's prediction for each sample. The values and the coloring are continuous in the range [0, 1].
FPFN (confusion matrix) - colors the samples according to the groups FP (false positive), FN (false negative), TP (true positive) and TN (true negative). Changing the working point changes which group each sample falls into. For more details, refer to the entry on the confusion matrix.
Outliers - the samples that were identified as outliers in the original multidimensional space are marked.
Outliers score - the scores ranking the outliers in the multidimensional space are binned and used for the coloring. The higher the score, the more of an outlier the sample is.
Neighborhood - each color represents a neighborhood in the multidimensional space. This helps you see how well the neighborhoods were preserved in going from the original space to the reduced two-dimensional space.
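The FPFN option in the list above can be sketched as follows. This assumes binary labels (0 or 1), predictions in [0, 1], and that the working point is a threshold on the prediction, with ties counted as positive. The function name and the sample values are hypothetical, and the tool's exact rule may differ.

```python
def fpfn_group(label, prediction, working_point):
    """Assign one sample to TP, FP, TN, or FN at the given working point."""
    # Assumption: a prediction equal to the working point counts as positive.
    predicted_positive = prediction >= working_point
    if predicted_positive:
        return "TP" if label == 1 else "FP"
    return "FN" if label == 1 else "TN"


# Hypothetical ground-truth labels and model predictions for four samples.
labels = [1, 0, 1, 0]
predictions = [0.9, 0.7, 0.4, 0.1]

groups_at_half = [fpfn_group(l, p, 0.5) for l, p in zip(labels, predictions)]
groups_at_high = [fpfn_group(l, p, 0.8) for l, p in zip(labels, predictions)]

print(groups_at_half)  # ['TP', 'FP', 'FN', 'TN']
print(groups_at_high)  # ['TP', 'TN', 'FN', 'TN']
```

Raising the working point from 0.5 to 0.8 moves the second sample from FP to TN, which is why the marker colors change as the working point is adjusted.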