Wikipedia:Reference desk/Archives/Mathematics/2012 July 11

Mathematics desk
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


July 11

Reducing number of input variables in data modeling

I have a data analysis problem where I would like to model a response variable on the basis of a few input variables, taken from a larger set (dozens) of potential input variables. The complication is that the potential input variables are very noisy, don't necessarily have any true relation to the response variable, and many have complex correlations with each other (reflecting the fact that they are likely measuring, to different degrees, the same underlying factors). What I want to do is distill the large number of potential inputs down to a few that can best model the response variable, both by identifying those inputs which have no relation to the output and by trimming redundant inputs that represent the same underlying metric.

The philosophy behind factor analysis matches well with my conception of the problem, but I'm a little confused as to how to carry out the post-analysis reduction phase. One complication is how best to deal with the response variable. Typical factor analysis treats all variables as equals, so I'm unsure how to weave the potentially-irrelevant-input/definitely-important-output aspect into it. (So I guess I'm looking for something like a factor analysis/multiple regression hybrid, if there is such a thing.) Also, I'm not looking for the factors per se, which aren't directly measurable, but for the small set of input variables which would best represent the factors, and I'm not sure of the procedure for identifying those.

(As a final wrinkle, while I would expect the relationships between the input and response variables to be monotonic, I'm not confident that they are linear. Do techniques like factor analysis work with rank-order correlations, or is there some fundamental assumption that requires a covariance-based Pearson correlation? This is probably a minor point, as the relationships are likely not too far from linear.) -- 71.217.5.199 (talk) 18:22, 11 July 2012 (UTC)[reply]

You don't need to do factor analysis before running multiple regression -- in principle it will pick out the most predictive factors for you automatically. Also our article on exploratory data analysis gives pointers to some other tools that might be useful. But my experience -- and I have a lot of experience with problems like this -- is that no high-level tool should be trusted. The best way to start is by making a shitload of scatterplots, plotting your output against each of your inputs, and your inputs against each other, and then study them carefully to see what sort of structure you can spot, and design an approach that is appropriate for the structure you see. Looie496 (talk) 23:58, 11 July 2012 (UTC)[reply]
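For illustration, a minimal sketch of the scatterplot-everything approach using pandas; the file name and column contents are placeholders, not part of the original data, and the correlation matrices summarize the same structure numerically (including the rank-correlation variant asked about above):
```python
# Minimal sketch of "plot the output against each input, and the inputs
# against each other". The CSV file and its columns are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # placeholder: all inputs plus the response column

# Scatterplot matrix: every variable against every other.
pd.plotting.scatter_matrix(df, alpha=0.3, figsize=(12, 12), diagonal="hist")
plt.show()

# Pairwise correlations as a numeric summary of the same structure.
print(df.corr(method="pearson"))
print(df.corr(method="spearman"))  # rank correlation, robust to monotone nonlinearity
```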
The reason I'm hesitant to do straight multiple regression is that many of the input variables are highly correlated, which will typically result not in a high weight on a single input, but low weights spread across the multiple correlated inputs. That's the thought behind using factor analysis - to separate out the commonalities and identify the correlation once, rather than spreading it out across multiple input variables. As I hinted, the goal is not so much the regression, as it is input variable dimensionality reduction. - I've already done the "shitload of scatterplots" on a portion of the data, and have identified weak correlations to the output with lots of noise, and correlations (with lots of noise) of the same or greater scale between input variables. What I'm attempting now is the "design an approach" stage. I'm a little hesitant to use approaches that take variables one or a few at a time, because I'm anticipating multiple related and overlapping effects, and am concerned about Simpson's paradox-like situations. Hence the hope that something like factor analysis could separate out the commonalities and differences between the input variables, and allow identification of those input variables which best represent those underlying factors. -- 71.217.5.199 (talk) 04:44, 12 July 2012 (UTC)[reply]
The basic problem with factor analysis, for this purpose, is that unless the output is a nonlinear function of the inputs, it does not help you. It identifies a set of linear combinations of your inputs, but there is no way you can do better with those than by simply writing the output as a linear combination of the inputs, which is what multiple regression does. Looie496 (talk) 06:54, 12 July 2012 (UTC)[reply]
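A minimal sketch of that baseline, fitting the output as a linear combination of all inputs by ordinary least squares; X and y below are synthetic placeholders standing in for the real inputs and response:
```python
# Multiple-regression baseline: the output as a linear combination of all inputs.
# X (n_samples x n_inputs) and y are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                      # stand-in for dozens of noisy inputs
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=2.0, size=200)

X1 = np.column_stack([np.ones(len(X)), X])          # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)       # ordinary least squares
print(coef[:5])                                     # intercept and first few input weights
```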
I agree with Looie that there's no one-size-fits-all solution that is guaranteed to work in every situation. But, if you have a way to measure the effectiveness of a particular analysis, you can try a few generic tools and see if they give you better results than what you can get in other ways (which would happen if their underlying assumptions match your problem).
If it can be assumed that the input datapoints constitute approximately a manifold in  , and that your response variable is a smooth nonlinear function on that manifold, what you need to do is find a low-dimensional representation of the manifold with Nonlinear dimensionality reduction, and then try to express the response variable as a function of those coordinates. I'm a fan of Maximum variance unfolding, I think there's a very strong intuition behind it, but it's expensive. -- Meni Rosenfeld (talk) 08:57, 12 July 2012 (UTC)[reply]
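A sketch of that two-step idea (reduce the inputs to low-dimensional manifold coordinates, then regress the response on them) using scikit-learn; Isomap is used here as a readily available stand-in for maximum variance unfolding, and the data are synthetic placeholders:
```python
# Reduce to manifold coordinates, then regress on them. Isomap stands in for
# maximum variance unfolding; X and y are synthetic placeholders (a noisy 1-D
# curve embedded in 3-D, with a smooth response along the curve).
import numpy as np
from sklearn.manifold import Isomap
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
t = rng.uniform(0, 3, size=300)
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * rng.normal(size=300)])
y = t + 0.1 * rng.normal(size=300)

Z = Isomap(n_neighbors=10, n_components=1).fit_transform(X)  # manifold coordinate
model = LinearRegression().fit(Z, y)
print(model.score(Z, y))  # how well the manifold coordinate explains the response
```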
The description looks to me more like a diffuse cloud in R^n than a low-dimensional manifold. Looie496 (talk) 15:28, 12 July 2012 (UTC)[reply]
There are a lot of feature selection and dimension reduction methods. As you expect linearity to be a good approximation, you could try linear regression with lasso (L1) or elastic net (L1+L2) regularization. L1 regularization encourages sparsity in the coefficients (many zeroes, i.e. variables not being used). However, when you have multiple variables that measure the same thing (i.e. are highly correlated with each other), then L1 tends to choose just one of them, which is not such a good idea (even though you say you desire it): consider the extreme case where all variables measure the same thing with i.i.d. Gaussian noise added in. Obviously, you'd want to average all the variables to reduce noise, instead of picking just one (and you did say the variables are noisy). Using L2 regularization does this, but it is inefficient when there are many irrelevant variables. Elastic net regularization attempts to gain the benefits of both at the same time - roughly trying to choose a few groups of highly correlated variables which are then averaged (see this talk).
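A minimal scikit-learn sketch of the lasso vs. elastic-net behaviour on synthetic placeholder data, with the regularization strength chosen by cross-validation:
```python
# Lasso vs. elastic net on data with a near-duplicate pair of inputs.
# X and y are synthetic placeholders for the real noisy inputs and response.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)      # two inputs measuring the same thing
y = X[:, 0] + X[:, 1] + rng.normal(scale=2.0, size=300)

lasso = LassoCV(cv=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 0.95], cv=5).fit(X, y)

# Lasso tends to keep only one of the duplicated inputs; elastic net tends to
# spread weight across both while still zeroing the irrelevant inputs.
print("lasso coefficients on the duplicated pair:", lasso.coef_[:2])
print("elastic net coefficients on the duplicated pair:", enet.coef_[:2])
print("nonzero coefficients:", np.sum(lasso.coef_ != 0), np.sum(enet.coef_ != 0))
```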
Using factor analysis or principal component analysis (of which there are a zillion different versions, e.g. probabilistic PCA, sparse PCA and kernel PCA) or independent component analysis is good when you just want to analyze a bunch of data without considering a dependent variable, because they don't. PCA, for example, finds the directions where your (independent) variables have the most variance. Those directions may not be at all good at predicting the dependent variable. If you want to predict a person's height and you have one variable which is that height (lucky!) and a thousand others that record the history of the price of tea in China, then your principal components will probably tell you all about tea, but little about the height you wanted to predict (the method captures a lot of variance, but it's mostly noise from your POV). That said, it's probably not going to be that bad a starting point if your variables are reasonable, but you should be aware that there are no guarantees beyond "if it breaks, you get to keep both pieces". In contrast, things like (generalized) linear regression and linear discriminant analysis attempt to predict the dependent variable well, and with suitable assumptions, do work well. -- Coffee2theorems (talk) 17:31, 12 July 2012 (UTC)[reply]
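A small sketch, on synthetic placeholder data, of the principal components capturing lots of variance that is useless for predicting the response:
```python
# PCA finds high-variance directions, which need not be predictive directions.
# One low-variance input drives y; many high-variance inputs are pure noise.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(scale=0.5, size=500)              # the input that matters
noise = rng.normal(scale=5.0, size=(500, 20))         # high-variance irrelevant inputs
X = np.column_stack([signal, noise])
y = signal + 0.1 * rng.normal(size=500)

pca = PCA(n_components=3).fit(X)
# The leading components load almost entirely on the noisy columns, because
# that is where the variance is, not where the predictive signal is.
print(pca.explained_variance_ratio_)
print(np.abs(pca.components_[:, 0]))  # each component's weight on the informative column
```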

You could try to apply the method described here and here. Count Iblis (talk) 02:11, 13 July 2012 (UTC)[reply]

Formula to compute Euler angles messed up

Rotation formalisms in three dimensions#Conversion_formulae_between_formalisms describes how to compute the Euler angles of a rotation from the rotation matrix. However, the formulas in that section use different conventions for the Euler angles within the same section: the article mentions rotating around zxz for the first formula, and around xyz for the second. You can see that the two formulas can't be consistent, because the first one claims that θ = arccos(A33), whereas the last one claims that A33 = cos(φ)cos(θ) and A31 = sin(θ).

Could you figure out the correct formulas for a single convention and fix the article? Thanks in advance.

b_jonas 22:47, 11 July 2012 (UTC)[reply]
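For reference, one self-consistent extraction can be obtained by fixing a single convention. The sketch below assumes the z-x-z convention R = Rz(φ) Rx(θ) Rz(ψ), which may differ from the convention the article intends, and checks the round trip numerically:
```python
# One self-consistent z-x-z Euler angle extraction, assuming the convention
# R = Rz(phi) @ Rx(theta) @ Rz(psi). Illustrative only; the article may use a
# different convention.
import numpy as np

def rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def euler_zxz(R):
    # Under R = Rz(phi) Rx(theta) Rz(psi):
    #   R[2,2] = cos(theta), R[2,0] = sin(theta) sin(psi), R[2,1] = sin(theta) cos(psi),
    #   R[0,2] = sin(phi) sin(theta), R[1,2] = -cos(phi) sin(theta)
    theta = np.arccos(np.clip(R[2, 2], -1.0, 1.0))
    phi = np.arctan2(R[0, 2], -R[1, 2])
    psi = np.arctan2(R[2, 0], R[2, 1])
    return phi, theta, psi

phi, theta, psi = 0.3, 1.1, -0.7
R = rz(phi) @ rx(theta) @ rz(psi)
print(euler_zxz(R))  # recovers (0.3, 1.1, -0.7) away from the theta = 0 or pi singularity
```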