Unveiling Multidimensional Insights: Radviz Projection and Feature Importance in Regression
Radviz projection maps multidimensional data onto a 2D plane. In this note, we work through the computation of Radviz projections and demonstrate their use in uncovering important features in multivariate regression analysis.
Radviz Projection
Consider an $M$-dimensional data point represented as $\mathbf{x} = [x_1, x_2, \ldots, x_M]$, projected onto a 2D plane as a vector $\mathbf{v}$. The Radviz projection uses a mass-spring model: a mass connected by $M$ springs to anchor points on a circle. These anchor points, $A_1$, $A_2$, $\ldots$, $A_M$, evenly distributed on the circle, correspond to the dimensions of the data. The strength $x_i$ of the $i$-th spring, connecting the mass to the anchor point $A_i$, determines where the mass settles, namely the location $\mathbf{v}$. Balancing the forces across all springs yields the 2D projection (see the illustration in Figure 1 for a four-dimensional data point).
Mathematical Formulation
Let $\mathbf{A}_i$ denote the location of the anchor point $A_i$ on the circle. At equilibrium, the total force on the mass at the location $\mathbf{v}$ is zero:
\[\sum_{i=1}^{M} (\mathbf{A}_i -\mathbf{v}) x_i = 0. \notag\]This leads to the projection’s location formula:
\[\mathbf{v} = \frac{\sum_{i=1}^M \mathbf{A}_i x_i}{\sum_{i=1}^M x_i}.\label{eqn:radviz}\]The projected location is thus a weighted average of the anchor points’ positions. Before the calculation, each variable $x_i$ is scaled to the range $[0, 1]$; the weights are then nonnegative, so every projected point is a convex combination of the anchors and stays inside the circle.
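As a concrete reference, here is a minimal NumPy sketch of Equation ($\ref{eqn:radviz}$): anchors evenly spaced on the unit circle, columns min-max scaled to $[0, 1]$, and each row mapped to the weighted average of the anchor positions.

```python
import numpy as np

def radviz_project(X):
    """Map rows of X (shape n_samples x M) to 2D Radviz coordinates."""
    X = np.asarray(X, dtype=float)
    M = X.shape[1]
    # Anchor points A_1, ..., A_M evenly distributed on the unit circle.
    theta = 2 * np.pi * np.arange(M) / M
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])  # shape (M, 2)
    # Min-max scale each column to [0, 1] so the weights are nonnegative.
    Xs = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # v = (sum_i A_i x_i) / (sum_i x_i), evaluated row by row.
    # (A row whose scaled values are all zero would need special handling.)
    return (Xs @ anchors) / Xs.sum(axis=1, keepdims=True)
```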
An Example
Data Overview
To demonstrate the application of Radviz, we consider a dataset with six columns $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, and $y$. Figure 2 displays the scatter matrix plot of these variables.
Radviz and Parallel Coordinates Plots
Before constructing a regression model of $y$ on the $x_i$, we apply the Radviz projection to $[x_1, x_2, x_3, x_4, x_5]$ using Equation ($\ref{eqn:radviz}$). The resulting Radviz plot in Figure 3 reveals that the gradient of the $y$ value aligns with the line through the anchor point $x_3$ and the circle’s center, indicating $x_3$ as a strong influencer. Anchor points $x_1$ and $x_5$, at the opposite end of this line, also appear influential. Anchor points $x_2$ and $x_4$ appear less important because they are positioned roughly orthogonal to the gradient of $y$.
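A plot in the spirit of Figure 3 can be produced with matplotlib, reusing `radviz_project` from the sketch above; the DataFrame `df` and the column names `x1`–`x5`, `y` are assumptions for illustration, not the original code.

```python
import numpy as np
import matplotlib.pyplot as plt

features = ["x1", "x2", "x3", "x4", "x5"]    # assumed column names
V = radviz_project(df[features].to_numpy())  # df is an assumed DataFrame

fig, ax = plt.subplots(figsize=(6, 6))
ax.add_patch(plt.Circle((0, 0), 1.0, fill=False))  # the anchor circle
sc = ax.scatter(V[:, 0], V[:, 1], c=df["y"], cmap="viridis", s=10)
# Label the anchor points on the circle.
theta = 2 * np.pi * np.arange(len(features)) / len(features)
for name, tx, ty in zip(features, np.cos(theta), np.sin(theta)):
    ax.annotate(name, (1.08 * tx, 1.08 * ty), ha="center", va="center")
fig.colorbar(sc, label="y")
ax.set_aspect("equal")
plt.show()
```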
For reference, Figure 4 shows the same data in a parallel coordinates plot, which reveals trends but offers little information on the relative importance of the variables $x_i$.
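The parallel coordinates view can be drawn with pandas. Since `parallel_coordinates` colors lines by a class column, one option (an assumption here, not necessarily how Figure 4 was made) is to bin $y$ into quartiles first.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

plot_df = df[features].copy()  # df and features as assumed above
plot_df["y (quartile)"] = pd.qcut(df["y"], q=4,
                                  labels=["Q1", "Q2", "Q3", "Q4"])
parallel_coordinates(plot_df, class_column="y (quartile)",
                     colormap="viridis", alpha=0.4)
plt.show()
```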
Feature Importance in Regression Model
To assess feature importance, linear regression models of $y$ are fitted on different combinations of the predictors $x_i$, as sketched below. The results align with the Radviz findings.
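A minimal sketch of this sweep, assuming statsmodels as the fitting library: fit OLS on every predictor subset and collect the MSE, AIC, BIC, and $R^2$ reported in the tables below.

```python
from itertools import combinations
import pandas as pd
import statsmodels.api as sm

rows = []
for k in (1, 2, 3):                           # subset sizes considered here
    for subset in combinations(features, k):  # df, features as assumed above
        X = sm.add_constant(df[list(subset)])
        fit = sm.OLS(df["y"], X).fit()
        rows.append({"predictors": ", ".join(subset),
                     "MSE": fit.mse_resid,    # residual mean squared error
                     "AIC": fit.aic,
                     "BIC": fit.bic,
                     "R2": fit.rsquared})
results = pd.DataFrame(rows).sort_values("AIC")
print(results.to_string(index=False))
```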
With one predictor, the table below shows that $x_3$ performs best, i.e., it has the lowest MSE (mean squared error), AIC, and BIC, and the highest $R^2$; $x_1$ is a close second.
One Predictor
Predictor | MSE | AIC | BIC | $R^2$ |
---|---|---|---|---|
$x_3$ | 0.69 | 3607 | 3617 | 0.31 |
$x_1$ | 0.73 | 3689 | 3700 | 0.27 |
$x_2$ | 0.91 | 4001 | 4011 | 0.095 |
$x_4$ | 0.91 | 4002 | 4012 | 0.094 |
$x_5$ | 0.99 | 4140 | 4151 | 0.0039 |
With two predictors, $(x_1, x_5)$ and $(x_1, x_3)$ are the top two combinations.
Two Predictors
Predictors | MSE | AIC | BIC | $R^2$ |
---|---|---|---|---|
$x_1$, $x_5$ | 0.63 | 3462 | 3478 | 0.37 |
$x_1$, $x_3$ | 0.65 | 3509 | 3525 | 0.35 |
$x_3$, $x_4$ | 0.65 | 3523 | 3539 | 0.35 |
$x_3$, $x_5$ | 0.68 | 3573 | 3589 | 0.33 |
$x_2$, $x_3$ | 0.68 | 3583 | 3599 | 0.32 |
$x_1$, $x_2$ | 0.69 | 3599 | 3615 | 0.31 |
$x_1$, $x_4$ | 0.70 | 3634 | 3650 | 0.30 |
$x_2$, $x_4$ | 0.72 | 3670 | 3686 | 0.28 |
$x_4$, $x_5$ | 0.90 | 3998 | 4014 | 0.097 |
$x_2$, $x_5$ | 0.91 | 4003 | 4019 | 0.094 |
Three Predictors
With three predictors, the combination of $x_1$, $x_3$, and $x_5$ performs best.
The regression results thus affirm the insights gained from the Radviz projection.
Conclusion
Combined with regression, the Radviz projection gives a clear way to make sense of complex data: it highlights candidate features visually, and regression models then validate those findings quantitatively.