Talk:Correlation/Archive 1


General Structure

It is possible that, in the paragraph headed "The Sample Correlation" in the section labelled "Pearson's product-moment coefficient", the author has inadvertently left an expression for the convenient calculation of r in the position it occupied during a draft (it occurs on the first line of a set of equations). The definition of r appears on the next line and the first expression (which may be derived from it) appears for a second time below this. From my reading, I suspect that the author intended to move the first expression in two stages using copy and paste followed by deletion of the original (but the deletion was forgotten). I did not wish to delete the relevant line without providing an opportunity for the original author to check. Xenoglossophobe (talk) 14:01, 29 June 2008 (UTC)

Correlation formula seems wrong for number of trials

The "n" term doesn't make any sense to me. If i=1,2,...n for both X and Y, wouldn't the total number of observations be 2n? In fact, usually for X, i=1,2,...n and for Y, j=1,2...m, making the total n+m. So should the bottom part of that formula be n+m? Akshayaj 15:27, 7 August 2007 (UTC)

Sorry, I was thinking about a two-sample case, both with explanatory and response variables.

However, the n-1 term from before seems correct, as opposed to the n term that's there now http://stattrek.com/AP-Statistics-1/Correlation.aspx?Tutorial=AP Akshayaj 16:08, 7 August 2007 (UTC)
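A short derivation may help with both questions in this thread. Here n counts the pairs (x_i, y_i), not the individual numbers, so it is n rather than 2n. The choice between n and n-1 does not change r, provided the same divisor is used in the covariance and in both standard deviations. The symbols S_xy, S_xx and S_yy are shorthand introduced for this note, not notation taken from the article:

```latex
\[
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2
\]
\[
r = \frac{S_{xy}/(n-1)}{\sqrt{S_{xx}/(n-1)}\,\sqrt{S_{yy}/(n-1)}}
  = \frac{S_{xy}/n}{\sqrt{S_{xx}/n}\,\sqrt{S_{yy}/n}}
  = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
\]
```

The two conventions only disagree if they are mixed, for example a covariance divided by n combined with standard deviations divided by n-1.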


Algorithms

I think including the Python code is too much, since Wikipedia is not really a code repository. I have removed it. Please discuss if you have a problem with that. Brianboonstra (talk) 13:49, 18 January 2008 (UTC)

To be more precise, the expression of the algorithm in Python (which actually happens to be a wonderful language) is less clear, due to the zero offsets. For example, the range(1,N) expression has nonobvious effects to someone who does not know the language. Brianboonstra (talk) 13:56, 18 January 2008 (UTC)

I recently checked the algorithm and it computes the correlation with slightly different formula:  , is this on purpose (in that case some note in the text would be needed) or an error? --Tomas.hruz 09:53, 6 October 2006 (UTC)

I believe the calculation is fine. Note that the std devs used in the denominator are population std devs. Brianboonstra (talk) 13:49, 18 January 2008 (UTC)

I just yesterday inserted a disclaimer about using the formula supplied as the basis for a one-pass algorithm, and included pseudocode for a stable single-pass algorithm in a separate section. For standard deviation, there is a separate page instead, at Algorithms_for_calculating_variance, but it seems to me that an analogous separate page should contain this algorithm only if a similar explication of the problems of numerical instability is included.--Brianboonstra 16:00, 3 March 2006 (UTC)
The last_x and last_y variables are unused in the pseudocode. They should probably be removed, no ? -- 29 March 2006
Agreed, and done. Brianboonstra 18:31, 11 April 2006 (UTC)

The algorithm does not take into account the case when either pop_sd_x or pop_sd_y is zero, causing a divide by zero on the last line. holopoj 17:06, 5 August 2006 (UTC)

That's arguably correct since correlation would not be defined in this case -- the equation which we are calculating would also have a divide by zero. --Richard Clegg 19:48, 5 August 2006 (UTC)
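One way to make the undefined case explicit, rather than letting the division by zero happen, is to have the routine report "no value". The following is only a sketch under stated assumptions: the function name guarded_correlation is hypothetical, C++17 is assumed for std::optional, and the loop follows the single-pass update discussed on this page with zero-based indexing.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not from the article): single-pass Pearson correlation
// that returns std::nullopt where the coefficient is undefined.
std::optional<double> guarded_correlation(const std::vector<double>& x,
                                          const std::vector<double>& y) {
    // Need equally long series with at least two points each.
    if (x.size() != y.size() || x.size() < 2) return std::nullopt;

    double mean_x = x[0], mean_y = y[0];
    double sum_sq_x = 0.0, sum_sq_y = 0.0, sum_coproduct = 0.0;

    for (std::size_t i = 1; i < x.size(); ++i) {
        // i elements have been seen so far; x[i] is element number i + 1.
        const double count = static_cast<double>(i + 1);
        const double sweep = static_cast<double>(i) / count;
        const double delta_x = x[i] - mean_x;
        const double delta_y = y[i] - mean_y;
        sum_sq_x += delta_x * delta_x * sweep;
        sum_sq_y += delta_y * delta_y * sweep;
        sum_coproduct += delta_x * delta_y * sweep;
        mean_x += delta_x / count;
        mean_y += delta_y / count;
    }

    // A constant series has zero variance, so the coefficient is undefined.
    if (sum_sq_x == 0.0 || sum_sq_y == 0.0) return std::nullopt;

    // The 1/n factors in the covariance and standard deviations cancel.
    return sum_coproduct / std::sqrt(sum_sq_x * sum_sq_y);
}
```

Whether "undefined" should be an empty optional, a NaN, or an error is a design choice; the point is only that the caller gets a clear signal instead of a division by zero.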

It seems like the algorithm is calculating something wrong (or maybe it is just that I coded it wrong!), but I wrote it in C++, and it does not calculate the correct covariance for the proposed example, which should be -841.667 as calculated in Excel and R. I used a straight algorithm in C++ (without any optimization) and it gave the right answer. Could somebody tell me what was my mistake in coding it? Thanks in advance. Here is the code:

 double cov(double* x,double* y,int tamano,int tipo) {
   int i;
   double sumCuadX = 0.0, sumCuadY = 0.0, sumCoprod = 0.0,
          mediaX = x[0], mediaY = y[0],
          barre, deltaX, deltaY, pobSDX, pobSDY, covcor;
   for (i = 1; i < tamano; i++) {
       barre = ((double)i - 1.0)/(double)i;
       deltaX = x[i] - mediaX;
       deltaY = y[i] - mediaY;
       sumCuadX += deltaX*deltaX*barre;
       sumCuadY += deltaY*deltaY*barre;
       sumCoprod += deltaX*deltaY*barre;
       mediaX += deltaX/(double)i;
       mediaY += deltaY/(double)i;
   }
   pobSDX = sqrt(sumCuadX/(double)tamano);
   pobSDY = sqrt(sumCuadY/(double)tamano);
   covcor = sumCoprod/(double)tamano;
   if (tipo == CORRELACION)
      covcor /= pobSDX*pobSDY;
   return covcor;
 }

Paulrc 25 19:39, 27 December 2006 (UTC)

The problem lies in your incomplete translation of the one-based index code to your zero-based index version. There appear to be three changes, all within the loop and all dealing with adjustments to i:

 double cov(double* x,double* y,int tamano,int tipo) {
   int i;
   double sumCuadX = 0.0, sumCuadY = 0.0, sumCoprod = 0.0,
          mediaX = x[0], mediaY = y[0],
          barre, deltaX, deltaY, pobSDX, pobSDY, covcor;
   for (i = 1; i < tamano; i++) {
       barre = i/(1.0 + i);
       deltaX = x[i] - mediaX;
       deltaY = y[i] - mediaY;
       sumCuadX += deltaX*deltaX*barre;
       sumCuadY += deltaY*deltaY*barre;
       sumCoprod += deltaX*deltaY*barre;
       mediaX += deltaX/(1 + i);
       mediaY += deltaY/(1 + i);
   }
   pobSDX = sqrt(sumCuadX/(double)tamano);
   pobSDY = sqrt(sumCuadY/(double)tamano);
   covcor = sumCoprod/(double)tamano;
   if (tipo == CORRELACION)
      covcor /= pobSDX*pobSDY;
   return covcor;
 }

I do not believe Tomas [see above] is correct. Note that the std devs used in the denominator are population std devs.Brianboonstra 21:40, 24 January 2007 (UTC)
Your algorithm does not process the last element. Brianboonstra 21:41, 24 January 2007 (UTC)
In the lines after the loop, the 'n's should cancel out and don't need to be in those last terms. kiyowm, 11:45, 1 August, 2007 (AST)
True, but the pseudocode is clearer with them left in Brianboonstra (talk) 13:49, 18 January 2008 (UTC)

Ratio

Could we see some account of this concept of "correlation ratio"? All I can find is on Eric Weisstein's site, and it looks like what in conventional nomenclature is called an F-statistic. Michael Hardy 21:02 Mar 19, 2003 (UTC)

it goes somewhat like:
correlation_ratio(Y|X) = 1 - E(var(Y|X))/var(Y)
I don't know the conventional nomenclature, but in the literature on similarity measures for image registration it is called just this...
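Written out in standard notation, the formula above reads as follows. The symbol \eta^2 is the usual one for the (squared) correlation ratio; the second equality is not from the comment above but follows from the law of total variance, Var(Y) = E[Var(Y|X)] + Var(E[Y|X]):

```latex
\[
\eta^2(Y \mid X)
  = 1 - \frac{\mathrm{E}\left[\operatorname{Var}(Y \mid X)\right]}{\operatorname{Var}(Y)}
  = \frac{\operatorname{Var}\left(\mathrm{E}[Y \mid X]\right)}{\operatorname{Var}(Y)}
\]
```

Read this way, it is the fraction of the variance of Y that is explained by the conditional mean of Y given X.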

The relation between them is already there on the autocorrelation page. "...the autocorrelation is simply the correlation of the process against a time shifted version of itself." You can see this trivially by considering the equation for correlation if the series Yt = Xt-k. --Richard Clegg 20:43, 7 Feb 2005 (UTC)

This page currently tells only the mathematical aspects of correlation. While it is, obviously, a mathematical concept, it is used in many areas of research such as Psychology (my own field; sort of) in ways that would be better defined by purpose than mathematical properties. What I mean is, I'm not sure how to add information about what correlation is used for into this article - I wanted to put in the "vicars and tarts" demonstration of "correlation doesn't prove causality", for instance. But that would require a rather different definition of correlation, in terms of "the relationship between two variables" or something. Any ideas on how to rewrite would be welcome - if not, of course, I'll do it myself at some point...

Oh, and I can't decide what to do about that ext. link - as is, it's rather useless, taking you to the homepage of a particular reference site (I suspect it of being "Wikispam"); but if you find the right page and break out of their frameset, there is actually some interesting info at http://www.statsoft.com/textbook/stbasic.html#Correlations. Ah well, maybe I'll come back to this after I've sorted out some of the memory-related pages... IMSoP 17:43, 20 May 2004 (UTC)

I have now partially addressed the concerns above by putting in a link to spurious relationship, which treats the "correlation does not imply causation" cliche. Michael Hardy 21:33, 20 May 2004 (UTC)

I thought that the deleted stuff about the sample for correlation was useful. Not enough stats people pay attention to the difference between a statistic and an estimator for that statistic. The Pearson product-moment correlation coefficient page does cover this but it would be nice to see the treatment for the standard correlation too (IMHO at least). --Richard Clegg 20:21, 10 Feb 2005 (UTC)

No -- you're wrong on two counts: There is no such thing as an estimator for a statistic; probably you mean an estimator of a population parameter; and statisticians pay a great deal of attention to such matters; it is non-statisticians who contribute to statistics articles on wikipedia and in journal articles who lose sight of the distinction. Michael Hardy 23:34, 10 Feb 2005 (UTC)
... but I agree with you that the material on sample correlation should be here. Michael Hardy 23:34, 10 Feb 2005 (UTC)
Apologies, I was writing in haste. You are correct here that my comments refer to a population parameter rather than a "statistic" in the formal sense of a function on the data. My comments were intended to refer to contributors to wikipedia articles on statistics (in the wider sense). --Richard Clegg 11:37, 11 Feb 2005 (UTC)

I think so too, but I was rushed. I will put the section back soon, but I will combine it with the Pearsons section. Paul Reiser 21:09, 10 Feb 2005 (UTC)

Thanks. I, for one, think it would help clarify this page.

--Richard Clegg 22:45, 10 Feb 2005 (UTC)

Cross-correlation in signal processing

what about the signal processing version of correlation? kind of the opposite of convolution, with one function not reversed. also autocorrelation. does it have an article under a different name? if so, there should be a link. after reading this article over again, i believe the two are related. i will research some and see, (and add them to my to do list) but please add a bit if you know the connection... Omegatron 20:10, Feb 13, 2004 (UTC)

This has been created under a separate article called cross-correlation, although they are clearly related. Merge? Or link to each other? - Omegatron 04:37, Mar 20, 2005 (UTC)
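To make the connection concrete, here is a minimal sketch of the discrete, real-valued cross-correlation described above: the same sum of products as convolution, but without reversing one sequence. The function name and the zero-padding convention are assumptions made for illustration, and sign conventions for the lag differ between texts.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sum over n of f[n] * g[n + lag] for finite real sequences,
// treating samples outside either sequence as zero.
// (Convolution would use g[lag - n] instead, i.e. one sequence reversed.)
double cross_correlation_at_lag(const std::vector<double>& f,
                                const std::vector<double>& g,
                                long lag) {
    double sum = 0.0;
    for (std::size_t n = 0; n < f.size(); ++n) {
        const long m = static_cast<long>(n) + lag;
        if (m < 0 || m >= static_cast<long>(g.size())) continue;  // outside g
        sum += f[n] * g[static_cast<std::size_t>(m)];
    }
    return sum;
}
```

Autocorrelation is the special case where both arguments are the same sequence. Subtracting the means and dividing by the standard deviations turns this sum into the correlation coefficient of the statistics article, evaluated between one series and a shifted copy of the other.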

Correlation matrix search redirects to this page but I can't find here what a correlation matrix is. I have some idea from http://www.vias.org/tmdatanaleng/cc_covarmat.html , but don't feel confident enough to write an entry, and I am not sure where to add it.

Covariance_matrix exists.

Scatter_matrix does not.

--Dax5 19:16, 7 May 2005 (UTC)
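For reference, the correlation matrix can be stated in terms of the covariance matrix that already has an article. With \Sigma the covariance matrix of the variables X_1, ..., X_p and D its diagonal part (the notation R and D is chosen for this note):

```latex
\[
R_{ij} = \operatorname{corr}(X_i, X_j)
       = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}},
\qquad
R = D^{-1/2}\,\Sigma\,D^{-1/2},
\quad D = \operatorname{diag}(\Sigma_{11}, \dots, \Sigma_{pp})
\]
```

So R is the covariance matrix of the standardized variables: it has ones on its diagonal and the pairwise correlation coefficients elsewhere.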

Correlation function in spreadsheets

The "Correlation function in spreadsheets" section looks very useless to me, and the information included is probably wrong since the correlation of two real numbers does not make sense. I will delete it, if you put it back can you tell me why?

Muzzle 12:44, 6 September 2006 (UTC)

I agree with your edit. Thanks. Chris53516 13:17, 6 September 2006 (UTC)

Random Variables

I was the one that put the disclaimer on "random" variables. If anybody would like to discuss, I'm all ears, so to speak. The preceding unsigned comment was added by Phili (talk • contribs) .

I reverted that note. You wrote:
Several places in this article refer to "random" variables. By definition a random variable has no correlation with anything else (if it does have a correlation the variable is either 1) not random, or 2) the correlation is a coincidence likely due to a small sample size). It is more accurate to think of these not as random variables, but simply as variables that have an undetermined relationship.
By definition, a random variable is just a measurable function on some probability space. And yes, two random variables can be very much correlated. :) Oleg Alexandrov (talk) 01:56, 30 November 2005 (UTC)
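A small worked example of two correlated random variables, under assumptions chosen only for illustration: let X and Z be independent random variables, each with variance 1, and define Y = X + Z. Then:

```latex
\[
\operatorname{Cov}(X, Y) = \operatorname{Cov}(X, X) + \operatorname{Cov}(X, Z) = 1 + 0 = 1,
\qquad
\operatorname{Var}(Y) = \operatorname{Var}(X) + \operatorname{Var}(Z) = 2
\]
\[
\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}
           = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

Both X and Y are random variables in the technical sense, and their correlation is a property of their joint distribution, not a small-sample coincidence.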

The "unsigned" person wrote utter nonsense. This is a crackpot. Michael Hardy 02:38, 30 November 2005 (UTC)

That might be a bit harsh. User:Phli has exactly three edits so far. Let's assume he just isn't familiar with the technical notion of a random variable, until proved otherwise. --Trovatore 21:13, 5 December 2005 (UTC)

Diagram

Perhaps I'm being thick, but after a minute or two of scrutinising it I couldn't work out how to read the diagram on this article. Which scatter plot corresponds to which coefficient, and why are they arranged in that way? It is not clear. Ben Finn 22:12, 18 January 2006 (UTC)

You're right -- I'd never looked for that. I'll leave a note to the author of the illustration. Michael Hardy 00:33, 19 January 2006 (UTC)
Wait -- it's stated in the caption. You have to read it carefully. Michael Hardy 00:36, 19 January 2006 (UTC)
The figure is not very intuitive... --128.135.82.179 06:13, 6 February 2006 (UTC)

I actually thought the figure is awesome, but now that I consider it, I wonder if it is intuitive and informative only for those who understand correlation well enough not to really need the figure. Also, I think it would be instructive to show a high-correlation scatterplot where the variances of the two underlying series are in a ratio of, say, 1:6 rather than 1:1 in the plots shown. --Brianboonstra 15:53, 3 March 2006 (UTC)

I have no clue how that figure works, and I'm in a PhD program. --Alex Storer 22:50, 17 April 2006 (UTC)


I have added this sentence to the caption to try to clarify it, in case anyone is still confused:

Each square in the upper right corresponds to its mirror-image square in the lower left, the "mirror" being the diagonal of the whole array.

Michael Hardy 22:14, 22 April 2006 (UTC)


I understand the figure, but I think it's WAY too complicated, especially for someone who doesn't already know what it is. It's slightly neat to see that you have four different data sets generated, and you're looking at all pairs... but I think for most people it would be MUCH MUCH clearer if you just showed four examples in a row, with labels directly above: R2 in {0, .5, .75, 1 } or something. 24.7.106.155 09:27, 7 May 2006 (UTC)


I have to agree that the diagram is overcomplicated. It also doesn't show negative correlations. Would it be better to have a table with two rows? Each column could have a correlation coefficient as a number in the first row, and a scatter plot in the second row. The coefficients could range between -1 and 1. I think that this would also emphasise that a negative correlation is still a strong correlation. 80.176.151.208 07:49, 31 May 2006 (UTC)

Intercorrelation

Sometimes one sees the term "intercorrelation". What exactly does this signify? I associate "intercorrelation" with the correlation between two different variables - but that is what standard "correlation" is. It seems to me that "inter" is redundant... And the opposite of autocorrelation is not intercorrelation but cross-correlation... -fnielsen 15:43, 10 February 2006 (UTC)

I guess "intercorrelation" has some utility in multivariate analysis(?), see, e.g., supermatrix. - fnielsen 15:49, 10 February 2006 (UTC)

Table

The table at the beginning of the article is flawed in almost every regard. First, it is poorly designed. Suppose a reader wants to know what a low correlation is. He or she looks at the row, sees "low," and sees that the cell below it says "> -0.9." At first glance, this makes it sound as though ANY correlation that is greater than -0.9 is low, including 0, 0.9, etc. Then the next column says "low: < -0.4." It takes a moment to figure out that the author was actually intending to convey "low: -0.9 < r < -0.4." Something like this would be better:

Correlation coefficient
High correlation | Low correlation  | No correlation (random) | Low correlation  | High correlation
−1 < r < −0.9    | −0.9 < r < −0.4  | −0.4 < r < +0.4         | +0.4 < r < +0.9  | +0.9 < r < +1

though some letter other than r might be better, and less-than-or-equal-to signs belong in there somewhere. That brings up the second problem, though: Where on earth did these numbers come from? Cohen, for example, defines a "small" correlation as 0.10 <= |r| < 0.3, a "medium" correlation as 0.3 <= |r| < 0.5, and a "large" correlation as 0.5 <= |r| <=1. I know of no one who thinks that a correlation between -0.4 and 0.4 signifies no correlation.

Then there's the argument--made by Cohen himself, among others--that any such distinctions are potentially flawed and misleading, and that the "importance" of correlations depends on the context. No such disclaimer appears in the article, and the reader might take these values as dogma.

I suggest that the table be removed entirely. Failing that, it should at the very least be revised for clarity as described above, and a disclaimer should be added. The values in the table should be changed to Cohen's values, or else the source of these values should be mentioned somewhere.

I'd be happy to make all of the changes that I can, but as I'm new to Wikipedia I thought I'd defer to more experienced authors.

--Trilateral chairman 22:51, 22 March 2006 (UTC)
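For anyone who wants the thresholds quoted above in executable form, here is a sketch. It encodes only the three ranges attributed to Cohen in this comment; the function name cohen_label and the label used for |r| < 0.10 are assumptions of this note, since the comment gives no name for that range. As the comment itself stresses, such labels depend on context and should not be read as dogma.

```cpp
#include <cmath>
#include <string>

// Hypothetical helper: maps a correlation coefficient to the size labels
// quoted above (small: 0.10 <= |r| < 0.3, medium: 0.3 <= |r| < 0.5,
// large: 0.5 <= |r| <= 1). The sign of r is ignored.
std::string cohen_label(double r) {
    const double magnitude = std::fabs(r);
    if (magnitude >= 0.5) return "large";
    if (magnitude >= 0.3) return "medium";
    if (magnitude >= 0.1) return "small";
    return "below small";  // our own placeholder; no label is given for |r| < 0.10
}
```

Because only the magnitude is used, a correlation of -0.6 and one of +0.6 receive the same label, which is the point the earlier table obscured.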

I agree with everything you say. My preference would be to have a statement to the effect that various descriptions of ranges of correlations have been offered. One such description could perhaps be included with a reference. The point should be made that whether or not a correlation is of consequence or sufficient depends on the context and purposes at hand. If you're confirming a physical law with pretty good equipment, 0.9 might be a poor correlation. If you're ascertaining whether a selection measurement correlates with later performance measurements, 0.9 would be excellent (you'd be lucky to do better!). The table with scatterplots corresponding with each (linear) correlation coefficient is excellent. The table to which you refer is not referenced and it is hardly a commonly agreed classification. On this basis alone it should not appear in the article. Be bold! Holon 01:15, 23 March 2006 (UTC)

Okay. I've removed the old table, added Cohen's table with a citation, and added the disclaimer with the explanation you suggested. Here is the old table if anyone wants it:

Correlation coefficient
High correlation | High   | Low    | Low    | No     | No correlation (random) | No     | Low    | Low    | High   | High correlation
−1               | < −0.9 | > −0.9 | < −0.4 | > −0.4 | 0                       | < +0.4 | > +0.4 | < +0.9 | > +0.9 | +1

--Trilateral chairman 01:18, 24 March 2006 (UTC)

Formulas: What is E

What is the E in the first equations? Why isn't the E replaced by capital sigma indicating the sum of?

I have seen E before in statistics texts. If it is some standard notation, it should be explained.

Gary 16:24, 28 March 2006 (UTC)

I think it is the expected value. It indeed is the expected value. Often calculated as (k/N) where k is the total number of objects and N is the number of intervals.

The article says explicitly that it is expected value, and gives a link to that article. Michael Hardy 00:12, 30 March 2006 (UTC)
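For readers who arrive with the same question, a brief statement of what E stands for and why it is not simply a capital sigma. E is a probability-weighted average over the distribution of the random variable, whereas a sum over observed data gives the sample analogue. The notation below is standard, though not copied from the article:

```latex
\[
\mathrm{E}[X] = \sum_{k} x_k \, P(X = x_k) \quad \text{(discrete case)},
\qquad
\mathrm{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx \quad \text{(continuous case, density } f\text{)}
\]
\[
\rho_{X,Y} = \frac{\mathrm{E}\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y},
\qquad \mu_X = \mathrm{E}[X], \quad \mu_Y = \mathrm{E}[Y]
\]
```

Replacing each expectation by an average over n observed pairs gives the sample version, which is where the capital sigma appears.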

Clarification for non math-people

I'd really appreciate if someone could expand the first couple paragraphs a bit to better explain correlation. While I'm sure that the rest of the article is correct, for me, as someone without a math background, it doesn't make much sense. I understand that by the very nature of the topic it is complicated, but I'd still like to have some sort of understanding of the text. Thank you! cbustapeck 17:16, 13 October 2006 (UTC)

What I find confusing is that it first defines correlation and then moves to Pearson's correlation coefficient without really explaining the relation between the two. The first section on correlation is quite clear. But the Sample correlation section is entirely confusing and there is not a single mention of any relationship to what was said in the previous section. And the formula is different from the one in the first section. Computing the expected value would mean dividing by n but the formula in 'Sample Correlation' divides by n-1. —The preceding unsigned comment was added by 67.93.205.78 (talk)

Maybe this may sound completely ignorant, but I have a minimal background in math, and what I'm interested in is the implementation of this type of concept. What's the general purpose of this formula? What does it accomplish? How is it implemented? Should I just go back to school? I read layman physics books and the concepts are explained fully in plain language. Not just the barebones formulas, but the implications as well. Can the implications of this type of math be explained?70.66.9.70 15:45, 31 March 2007 (UTC)

Restore Cohen et al's book as a Reference

Recently an editor removed a whole string of on-line publications by Herve Abdi, which did seem somewhat self-promotional to include here. However the textbook by Cohen et al. is the only major textbook (that people might use in a course) that was listed, and it also was removed. Does anyone object to restoring the Cohen book to the Reference list, or under the heading 'Further Reading' if you prefer? EdJohnston 19:07, 13 December 2006 (UTC)

Did I remove that? Sorry. Sure, go ahead and restore it. If you don't know where to find it, I can do it. — Chris53516 (Talk) 20:33, 13 December 2006 (UTC)
I am all for the book being placed there, so long as we vote - not a literal vote - but so long as there is an understanding among the majority of the population that the book is there for its merits, its widespread use etc. and NOT for commercial purpose or with intent to benefit Cohen et al.--ToyotaPanasonic 13:36, 24 December 2006 (UTC)
I originally added the citation to the book (at least I think it was me). I included it only because it is a common reference in the behavioral sciences...and besides that, it was the textbook for my graduate stats course. :) I have never met the Cohens, have never corresponded with them, and have no financial interest whatsoever in their book. Heck, I'm not even interested in selling my copy. I support replacing the reference.Trilateral chairman 03:25, 7 February 2007 (UTC)

Ambiguity

Isn't there a difference between correlation and coefficient of correlation? The coefficient lying between -1 and 1, while the general term 'correlation', can have any numerical value attached to it?

If so, you will find the introduction somewhat misleading: "In probability theory and statistics, correlation, also called correlation coefficient" - correlation and correlation coefficient are not quite the same thing, but are very very similar.

Would someone who has a fresher statistics/econometrics background please confirm this and edit the main page accordingly? Cheers all, --ToyotaPanasonic 13:31, 24 December 2006 (UTC)

What would just a "correlation" be then? No, unfortunately, correlation cannot have any random number "attached" to it. The term "correlation" can be used in non-numerical senses, but when you're talking about its number form, it's always the correlation coefficient. Otherwise, what would the number be? How would you interpret it? (Rhetorical questions.) — Chris53516 (Talk) 22:16, 24 December 2006 (UTC)
Yes, quite brilliant. I have sourced my undergrad econometrics textbook from my basement - it turned out I was confusing Covariance (any number, which has a unit attached to it) and correlation which is standardised to lie between -1 and +1. Yes, quite brilliant. --ToyotaPanasonic 04:27, 26 December 2006 (UTC) [Feel free to delete this above topic - I have no further use for it, and neither would anyone else]
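The distinction reached at the end of this thread can be checked numerically: covariance carries the units of the data and changes when a variable is rescaled, while the correlation coefficient does not. The sketch below uses population (divide-by-n) formulas and hypothetical function names; it assumes both vectors have the same nonzero length and nonzero variance.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of a non-empty series.
double mean(const std::vector<double>& v) {
    double sum = 0.0;
    for (double value : v) sum += value;
    return sum / static_cast<double>(v.size());
}

// Population covariance (divides by n); has the units of x times the units of y.
double covariance(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double mx = mean(x), my = mean(y);
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += (x[i] - mx) * (y[i] - my);
    return sum / static_cast<double>(x.size());
}

// Correlation coefficient: covariance standardized by both standard deviations,
// so it is unit-free and lies between -1 and +1.
double correlation(const std::vector<double>& x, const std::vector<double>& y) {
    return covariance(x, y) / std::sqrt(covariance(x, x) * covariance(y, y));
}
```

For instance, converting x from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.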

Correlation and Causation

Removed entry.

[A comment left here by User:Jjoffe was removed by EdJohnston. See my further note below. I left intact the response by User:Chris53516 who was responding to Joffe. -- EdJohnston 14:32, 8 January 2007 (UTC)]

This is an excellent example of original research. Please do not add this to the article. Furthermore, it is NOT a good idea to post your email address. — Chris53516 (Talk) 14:09, 8 January 2007 (UTC)
Per WP:REFACTOR, an editor may remove content from a talk page that is 'entirely and unmistakably irrelevant'. I did so with a recent posting by User:Jjoffe. You can still see the removed material here in the edit history. -- EdJohnston 14:32, 8 January 2007 (UTC)
I don't think I agree with your changes. What was written was not "entirely and unmistakably irrelevant." Please explain yourself. — Chris53516 (Talk) 14:47, 8 January 2007 (UTC)
Talk pages are for discussing article changes. Joffe's contribution looked like a literal reprint of material that had been (or was intended to be) published elsewhere. It was not clear that he was proposing anything specific for this article, though he is welcome to do so if he thinks it can be improved. The WP:REFACTOR strategy has been used elsewhere, for a submitter who spammed Talk:IEEE_754r. If you believe Joffe's comments are relevant, you are welcome to restore them, but please explain how you would want to change the article as a result. -- EdJohnston 15:44, 8 January 2007 (UTC)
I see what you mean. It appears that if he wants to make a response to the other article somewhere, he should find his own webspace to do so. — Chris53516 (Talk) 16:13, 8 January 2007 (UTC)

Misconceptions

I'm not crazy about this section under common misconceptions: An appropriately expanded expression may be "correlation is not causation, but it sure is a hint." I don't think this is an illuminating rephrasing, in part because the rationale behind neither dictum (correlation is not causation nor the one quoted above), is explained sufficiently. I'd be more happy with:

The conventional dictum that "correlation does not imply causation" is a commonly-used admonition to using correlation to support a direct causal relationship among the variables. However, this admonition should not be taken to mean that correlations are acausal, merely that the causes underlying the correlation may be indirect and unknown. A correlation between age and height is fairly causally transparent, but a correlation between mood and health might be less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? In other words, a correlation can be taken as evidence for a causal relationship, but cannot indicate precisely what the causal relationship might be.

Comments? SJS1971 13:07, 24 January 2007 (UTC)

I like your rewrite. It sounds good. — Chris53516 (Talk) 14:41, 24 January 2007 (UTC)
Okay, with that positive feedback I made the edit. SJS1971 15:30, 24 January 2007 (UTC)
I changed the wording. I don't understand the meaning of "this admonition should not be taken to mean that correlations are acausal". I assume you are referring to whether the actual relation between two quantitative attributes is a causal relation (e.g. height and weight). Literally, correlation could be taken as a kind of relation, but the article defines it in purely algebraic terms. So it could be confusing. I'm also not sure whether admonition is the best term -- seems a little emotive. Is there a more neutral term? The point is the nature of the logical argument, I would think. Holon 10:06, 4 March 2007 (UTC)

Currency correlation

A "{{prod}}" template has been added to the article Currency correlation, suggesting that it be deleted according to the proposed deletion process. All contributions are appreciated, but the article may not satisfy Wikipedia's criteria for inclusion, and the deletion notice explains why (see also "What Wikipedia is not" and Wikipedia's deletion policy). You may contest the proposed deletion by removing the {{dated prod}} notice, but please explain why you disagree with the proposed deletion in your edit summary or on its talk page. Also, please consider improving the article to address the issues raised. Even though removing the deletion notice will prevent deletion through the proposed deletion process, the article may still be deleted if it matches any of the speedy deletion criteria or it can be sent to Articles for Deletion, where it may be deleted if consensus to delete is reached. John 10:30, 14 June 2007 (UTC)

Topically, see the previous talk entry as well. It looks like the currency correlation article may have been added by a link spammer. I'm sorry that I don't currently have time to investigate further instances, perhaps someone else can ? – John 10:30, 14 June 2007 (UTC)
Article has now been removed – John 19:19, 21 June 2007 (UTC)

‘Association’ vs. ‘Correlation’

It appears to me that the article Association (statistics) is really discussing correlation. Is there a distinction between the terms ‘association’ and ‘correlation’? If so, could someone please edit the article association (statistics) to state what that distinction is. If there is no distinction, then should the articles be merged? --Mathew5000 00:26, 4 July 2007 (UTC)


Suspected Excessive Promotion of Herve Abdi

Another reference to Herve Abdi, inserted by an anonymous user with ip address 129.110.8.39 which seems to belong to the University of Texas at Dallas. Apparently the only editing activity so far has been to insert excessive references to publications by Herve Abdi, of the University of Texas at Dallas. The effect is that many Wikipedia articles on serious scientific topics currently are citing numerous rather obscure publications by Abdi et al, while ignoring much more influential original publications by others. I think this constitutes an abuse of Wikipedia. A while ago, as a matter of decency, I suggested to 129.110.8.39 to remove all the inappropriate references in the numerous articles edited by 129.110.8.39, before others do it. For several months nothing has happened. I think it is time to delete the obscure reference. Truecobb 21:32, 15 July 2007 (UTC)

"Causation does not imply Correlation" <--- Correction: "Correlation does not imply causation".

Is this true? If so, can someone give me a (preferably nonmathematical) example of this? The X vs. sin(X) example does not seem to work, as I would think that a functional relationship between the two means they are correlated.

Thanks

Ice cream sales and riot frequency are correlated. But neither one causes the other. In this case warm weather causes both independently. Debivort 03:06, 20 July 2007 (UTC)
You misread my comment. I'm looking for an event that IS CAUSED by another event, but that is NOT CORRELATED with that event.

Thanks Akshayaj 20:38, 20 July 2007 (UTC)

Oh yeah! I'm a total space cadet! Debivort 01:12, 21 July 2007 (UTC)
I'll answer my own question (I think)

The effect of taking Tylenol on reducing pain. Taking Tylenol does cause a patient's perceived level of pain to go down, but this may be due to the placebo effect. Here, the Tylenol does cause the reduction in pain, but the drug itself is not correlated with the reduction in pain.

Does this work? Thanks Akshayaj 20:58, 20 July 2007 (UTC)

I'll rebut my own answer. I've read pretty consistently online that causation implies correlation, and that makes sense. However, somebody has said that "economists were the first to show that 'causation does not imply correlation'". Is that somebody wrong? Akshayaj 21:20, 20 July 2007 (UTC)
Simplistically speaking, perhaps, but remember that Correlation is a linear measure. So the reason that x and sin(x) are not correlated is that sin is non-linear. Real examples are 1) the vertical position of a dot on a tire as caused by the displacement of the pistons. This will basically follow vertical position = sin(displacement) so here's an example. Another is the "butterfly effect" minor displacements in chaotic systems lead to wildly non-linearly (even non-monotonically) related differences in the outputs of the system. Debivort 01:12, 21 July 2007 (UTC)
This is backwards. The usual statement is "Correlation does not imply causation". I've never heard it the other way around. The Tylenol case is an example. Just because there is a CORRELATION between taking Tylenol and pain reduction DOES NOT IMPLY that Tylenol was the CAUSE of the pain reduction. PAR 00:04, 21 July 2007 (UTC)
That's the mistake I made above. Aksh is looking for a real life example of the x not correlated to sin(x) example. Debivort 01:12, 21 July 2007 (UTC)

Here is a simple example: hot weather may cause both crime and ice-cream purchases. Therefore crime is correlated with ice-cream purchases. But crime does not cause ice-cream purchases and ice-cream purchases do not cause crime. Michael Hardy 01:41, 21 July 2007 (UTC)

It seems there is some correlation between your example and the one given earlier by Debivort at 03:06, 20 July 2007 (UTC). But is your comment caused by his comment? Lpele (talk) 12:41, 29 July 2008 (UTC)
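For the question asked in this thread (an effect that is caused by something yet uncorrelated with it), a minimal Python sketch may help. It assumes a hypothetical cause taking values symmetric about zero and an effect equal to its square, so the effect is fully determined by the cause. The helper name `pearson_r` and the data are illustrative only and are not taken from the article.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the effect is completely determined by the cause (y = x^2),
# but the relationship is symmetric about zero and has no linear component.
cause = [-3, -2, -1, 0, 1, 2, 3]
effect = [c * c for c in cause]

r = pearson_r(cause, effect)
print(r)
```

The positive and negative halves cancel in the covariance sum, so the coefficient comes out as zero even though knowing the cause fixes the effect exactly.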

Interpretation of the size of a correlation

The section on this subject, to my mind, misses out on a very important element in the interpretation of correlation coefficients. The point that I want to make is that, in addition to calculating the correlation coefficient, a researcher will often be well advised to carry out a significance test. A significance test will guide a researcher as to how much importance should be attached to a correlation coefficient.

The easiest way to carry out the significance test is to compare the 'test value' with a 'table value'. The test value is simply equal to 'mod r' (|r|), i.e. the value of the correlation coefficient, ignoring a negative sign if applicable. Table values can be acquired from appropriate published tables (sorry I can't quote a reference here 'off the top of my head' but I am certain that another reader will be able to fill in the gap). The table value to be used depends on the level of significance required (e.g. 5% or 1% etc.) and on the number of 'degrees of freedom', which in this case is equal to n-2 where n is the sample size. If the test value exceeds the table value then the correlation is significant at the percentage selected.

An instructive example would be a piece of research where the sample size was, say, 32 so that n-2 was 30. In such a case, the table value for 5% significance is 0.35 and for 1% significance it is 0.45. It follows that in such a case, correlation coefficients as low as 0.35 would be significant at 5% and as low as 0.45 would be significant at 1%. Thus, although a correlation in this situation of 0.45 is only 'medium' according to Cohen's table, it is nevertheless significant at 1%, i.e. if there were no real relationship, a correlation at least this large would arise 'due to chance' in no more than 1% of samples.

A converse example would be where the sample size was just 7 (n-2=5). In this case, consultation of the tables will tell us that we need a correlation coefficient of at least 0.75 to be significant at 5% and at least 0.87 for significance at 1%! Agibbs100 21:48, 9 August 2007 (UTC)
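The table values quoted above can be reproduced from Student's t distribution, since under the null hypothesis of no correlation (and normally distributed data) t = r·sqrt(n−2)/sqrt(1−r²) has n−2 degrees of freedom, which rearranges to r = t/sqrt(t² + n−2). The following Python sketch does that arithmetic. The two-tailed critical t values are typed in from a standard t-table rather than computed, and the names `T_CRIT` and `critical_r` are illustrative.

```python
import math

# Two-tailed critical values of Student's t, copied from a standard t-table.
# Keys are (degrees of freedom, significance level).
T_CRIT = {
    (30, 0.05): 2.042,
    (30, 0.01): 2.750,
    (5, 0.05): 2.571,
    (5, 0.01): 4.032,
}

def critical_r(df, alpha):
    """Smallest |r| that is significant at level alpha with df = n - 2."""
    t = T_CRIT[(df, alpha)]
    return t / math.sqrt(t * t + df)

for (df, alpha) in sorted(T_CRIT):
    print(df, alpha, round(critical_r(df, alpha), 3))
```

A sample correlation is then declared significant at the chosen level when its absolute value exceeds the value returned for the matching degrees of freedom.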

Correlation coefficient is in [-1,1]

The following refers to the following remark added by User:Hongguanglishibahao: 'However, Guang Wu does show how to deliberately make the correlation coefficient be larger than unity.' (see [1]).

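Let $X$ and $Y$ be two random variables that are not degenerate and have finite first two moments. These assumptions are required for the correlation coefficient to be properly defined; in particular they imply that the variances of $X$ and $Y$ are nonzero and finite. Then

$$0 \le \operatorname{E}\left[\left(\frac{X - \operatorname{E}X}{\sigma_X} + \frac{Y - \operatorname{E}Y}{\sigma_Y}\right)^2\right]$$
$$= \frac{\operatorname{Var}(X)}{\sigma_X^2} + \frac{\operatorname{Var}(Y)}{\sigma_Y^2} + \frac{2\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}$$
$$= 2 + 2\rho_{X,Y}.$$

Similarly

$$0 \le \operatorname{E}\left[\left(\frac{X - \operatorname{E}X}{\sigma_X} - \frac{Y - \operatorname{E}Y}{\sigma_Y}\right)^2\right] = 2 - 2\rho_{X,Y}.$$

From these two inequalities it follows that $-1 \le \rho_{X,Y} \le 1$.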

Unfortunately I cannot access the article 'Wu, G. (2003). An extremely strange observation on the equations for calculation of correlation coefficient. European Journal of Drug Metabolism and Pharmacokinetics, 28, 85-92.', so I cannot explain why it claims that the correlation coefficient can deliberately be made larger than unity. Given the proof above --- which can be found in just about any basic text on statistics (that provides proofs) --- it seems Mr. Wu is either making a mistake or using different definitions. Note in particular that $\sigma_X$ and $\sigma_Y$ need to be nonzero, otherwise the correlation coefficient is not defined (or defined to be zero by some texts). A proof such as the one given above simply falsifies the claimed statement. The fact that the article is not published in a mathematics journal might explain some things here. (no pun intended) --Kuifware 10:11, 16 August 2007 (UTC)

I cannot access the article either; however, the abstract does not give a good feeling about the paper: in addition to the poor grammar, the last sentence does not really sound like what you'd expect in a scientific journal:

Various equations are used to calculate the correlation coefficient, these equations are presumed equally. However we find the extraordinary results when using r = square root of ((sigma(ŷi - y)2) / (sigma(yi - y)2)) and r2 = (sigma(yi - y)2 - sigma(yi - ŷi)2) / (sigma(yi - y)2) to calculate the correlation coefficient, for example, a line within 95% confidence band of a regressed line. The results are so extraordinary that we do not know whether or not we can still call the results as correlation coefficient, however we are sure that these results need to be presented.

Schutz 14:02, 17 August 2007 (UTC)
I agree the abstract is not giving much hope for the contents. Also, it looks like the article refers to the sample correlation coefficient (instead of the correlation coefficient), which can suffer from numerical instability. BTW the same journal published an article called 'An extemely strange observation?' from H. Schütz in 2004, which is a comment on Wu's article. The question mark in the title suggests enough I presume. --Kuifware 13:02, 18 August 2007 (UTC)
Yeah, I have seen the commentary, but did not manage to get a PDF. However, I did manage to get a PDF of the Wu paper, which I will look at, but I don't plan to spend too much time on this, and we should definitely not lose sleep or put this page on hold while we wait for the mysteries contained inside the paper... Schutz 20:38, 20 August 2007 (UTC)

Hi, Debivort. Thanks for continuing to delete the addition, 'However, Guang Wu does show how to deliberately make the correlation coefficient be larger than unity.' However, this remark is fully referenced to an international peer-reviewed journal, which, like all international peer-reviewed journals, has a strong and strict reviewing process.

I really wonder why you do not read the reference before deleting. The paper can be obtained by emailing postmaster@dreamscitech.com; as you know, the paper is copyrighted, so its contents cannot be put here.

Still, what I present here is the verified fact, please be unbiased to treat real facts. —Preceding unsigned comment added by Hongguanglishibahao (talkcontribs) 02:29, August 26, 2007 (UTC)

Regards —Preceding unsigned comment added by Hongguanglishibahao (talkcontribs) 26 August, 2007.

By the way, the comment by H. Schütz in European Journal of Drug Metabolism and Pharmacokinetics should be in the form of letter to Editor, and G. Wu has answered this comment. Mr Schütz has no more responses for the answer. —Preceding unsigned comment added by Hongguanglishibahao (talkcontribs) 26 August, 2007.

Since there is a possibility of a hoax, and since editor Hongguanglishibahao (talk · contribs) is unable to furnish us with a derivation showing that this is possible, or even an extended quote from the article that he says proves this point, I think we should keep the statement out of the article.
This is a hard-to-find journal that is making an extraordinary claim about a mathematical topic, and it's not even a math journal. Extraordinary claims require extraordinary evidence. It contradicts statements you can find in widely-used statistical textbooks. For instance Kendall and Stuart, Advanced Theory of Statistics, 3rd edition, volume 2, page 300, which asserts that the square of the correlation coefficient lies between 0 and 1, due to the Cauchy-Schwarz inequality. Do you think that Guang Wu has disproved the Cauchy-Schwarz inequality? It is more likely he has made a mistake.
If Guang Wu's claim is true, it should be easy to provide us a list of pairs of numbers (x, y) such that the correlation of x and y exceeds 1. I invite anyone who thinks the claim is true to provide such a list. That same list will disprove the Cauchy-Schwarz inequality. EdJohnston 03:34, 26 August 2007 (UTC)

At the strong request of Debivort, let us discuss how to deliberately make the correlation coefficient be larger than unity. Please note DELIBERATELY. Besides, the study was done almost 10 years ago; it is only recently that we have had full access to Wiki inside China, thus I decided to bring this buried result to light.

Let us assume that we have a dataset, x = 0, 5, 10, 15, 20, 25, 30 and 35; and y = 0.1, 5.5, 9.7, 14.3, 21.7, 24.3, 32, 34.4, which resulted in y = 0.0917 + 1.0091x, with r = 0.9963, 10 years ago. If we put x = 0, 5, 10, 15, 20, 25, 30 and 35 into y = 0.0917 + 1.0091x, we get 0.0917, 5.1372, 10.1827, 15.2282, 20.2737, 25.3192, 30.3647, and 35.4102.

Put these data into $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2} = \sqrt{1069.1970/1077.0800} = 0.9963$.

When using pharmacokinetic software to fit this dataset, let us assume that we stop fitting before reaching the global minimum, although the global minimum can be easily found analytically for this dataset. Say, we stopped very near to y = 0.0917 + 1.0091x, at a line which is located within the 95% confidence interval for the slope (0.9219 to 1.0962) and the 95% confidence band for y = 0.0917 + 1.0091x.

The stopped line, for example, is (worse y) = 0.0917 + 1.02x. Then we put x = 0, 5, 10, 15, 20, 25, 30 and 35 into (worse y) = 0.0917 + 1.02x, which then results in 0.0917, 5.1917, 10.2917, 15.3917, 20.4917, 25.5917, 30.6917 and 35.7917.

Let us put them into the equation, $r = \sqrt{\sum((\text{worse } \hat{y}_i) - \bar{y})^2 / \sum(y_i - \bar{y})^2} = \sqrt{1092.7139/1077.0800} = 1.0072$.

So here, r > 1.

Actually, the process is very simple: you have any dataset, then you regress it and get a regressed equation, no matter whether linear or nonlinear. Then you slightly move this regressed line within the 95% confidence band, put x into this slightly moved line's equation to calculate ŷ, then put ŷ into $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$. In general, you get r > 1.

Please note that this process relates only to our assumption that we stop fitting a curve before reaching the global minimum.

By the way, someone, who knows well how to write mathematical formulae, please edit the equations in this text. Many thanks! —Preceding unsigned comment added by Hongguanglishibahao (talkcontribs) 04:18, August 26, 2007 (UTC)
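The arithmetic described above can be reproduced with a short Python sketch. It uses the dataset and the two lines quoted in this thread; the function names `pearson_r` and `ratio_r` are illustrative. For comparison it also computes the usual product-moment coefficient directly from x and y, which does not depend on any fitted line.

```python
import math

x = [0, 5, 10, 15, 20, 25, 30, 35]
y = [0.1, 5.5, 9.7, 14.3, 21.7, 24.3, 32, 34.4]

def pearson_r(xs, ys):
    """Product-moment correlation computed directly from the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def ratio_r(xs, ys, intercept, slope):
    """sqrt( sum((yhat_i - ybar)^2) / sum((y_i - ybar)^2) ) for a given line."""
    mean_y = sum(ys) / len(ys)
    explained = sum((intercept + slope * a - mean_y) ** 2 for a in xs)
    total = sum((b - mean_y) ** 2 for b in ys)
    return math.sqrt(explained / total)

r_direct = pearson_r(x, y)                    # about 0.9963
r_fitted = ratio_r(x, y, 0.0917, 1.0091)      # about 0.9963
r_perturbed = ratio_r(x, y, 0.0917, 1.02)     # about 1.0072

print(r_direct, r_fitted, r_perturbed)
```

The ratio agrees with the product-moment coefficient only when ŷ comes from the least-squares line, because only then does Σ(y − ȳ)² split into Σ(ŷ − ȳ)² + Σ(y − ŷ)². For any other line the ratio is a different quantity and is not bounded by 1.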

In fact, this study was inspired by seeing the figures in papers and books that show how the sum of squared residuals reduces during the fitting process. The idea then was to see how the correlation coefficient would change during this fitting process. And the only equation suited for calculation of this process seemed to be $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$. However, when writing a program for fitting with this equation, the resulting correlation coefficient can be larger than unity.

One may say why we have not met the case that the correlation coefficient is larger than 1 during a fitting process because a fitting using the least squares method often stops before reaching the global minimum sum of squared residuals, i.e. the fitted line is near to the “real” regressed line as our example. This is because the correlation coefficient is calculated based on different graphic presentations, i.e. to use regressed and measured yi as axes to construct a plot.

To put this process in plainer words, we have a dataset, x and y, and we get the regressed linear equation, y = ax + b (the same holds for a more complicated dataset with multi-linear as well as nonlinear regressed equations). Then you move this line to ŷ = (a + delta)x + b or ŷ = ax + (b + delta), put x into this moved line, and you get ŷi for each x. At this moment, the only equation that can calculate the correlation coefficient is $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$, by which we can get r>1.

I argue that until now we have no restriction on how to use this equation, and we have no other equations to calculate the correlation coefficient during the fitting, but if we calculate the correlation coefficient in such a way, that is, a line near to the global minimum within 95% confidence interval, then this only equation, which can be used in such circumstance, will result in r>1.

Best wishes

Guang Wu Hongguanglishibahao 05:45, 26 August 2007 (UTC)

Moreover, I had no business to do with Cauchy-Schwarz inequality 10 years ago, what I was interested in was how the correlation coefficient behaved during the fitting and how to calculate the value of correlation coefficient when the fitting did not reach the global minimum.

Guang Wu Hongguanglishibahao 06:38, 26 August 2007 (UTC)

Another important issue is that if we have, for example, x = 1, 2, 3, 4, 5 and y = 10, 20, 30, 40, 50. And then we have ŷ = 10.1, 20.1, 30.1, 40.1, 50.1.

We absolutely need to write a program according to $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$ to calculate r; you cannot use either x vs ŷ or y vs ŷ in any regression program for calculation of r. Even in the latter case, y vs ŷ, the program will result in r<=1, because the regression program uses y vs ŷ as x vs y to make the calculation. This is particularly important when the dataset involves multiple linear or nonlinear regressions. Hongguanglishibahao 11:04, 26 August 2007 (UTC)

Perhaps due to the language barrier, I find that I can't follow your reasoning. You don't seem to be disputing the Cauchy-Schwarz inequality, and if that holds, r must remain in [-1,1]. You seem to be arguing that during a particular calculation procedure having to do with pharmacokinetics, a person might get the impression that some 'adjusted' value of the correlation coefficient goes outside of the expected range. This does not appear convincing. The type of argument you are making is one that you should get accepted in a math or statistics journal before bringing it to Wikipedia. EdJohnston 14:31, 26 August 2007 (UTC)

Thanks, EdJohnston

However, please do not show so much reverence for mathematicians and their mathematics journals. The correlation coefficient is an extremely old topic; no modern mathematicians and statisticians have more working knowledge of this topic than you and me. What they are interested in is the current problems in mathematics and statistics rather than such an old topic. Besides, the pioneer generation of statisticians, who worked on this topic more than 100 years ago, could not have imagined how the correlation coefficient would behave when the regressed line is approaching the global minimum.

With the advance of technology and the open Wikipedia, I think that everyone should control his own fate rather than dictated by others. We should respect the fact, even disputed. The fact is simple that r can be larger than unity, even much more in fitting process. By the way, Eur J Drug Metab Pharmacokinet is also a statistical journal in the field I once worked.

Besides, I did not want to spend time to dispute this topic with anyone that was why I did not provide you the data earlier. However, the spirit of Wikipedia is the presentation of referenced facts, not cycled referenced. Still, I stressed several times, deliberately. Please do not make Wikipedia an old style media, which is giving its dominated role step-by-step.

Regards Hongguanglishibahao 18:17, 26 August 2007 (UTC)

I'm sorry, but no, it does not appear that r can be greater than one. What you have described is that under certain circumstances, numerical approximations of r, using very specific software, can be more than one, if you tweak the simulation conditions. This is neither evidence nor a proof that r can be greater than one.
Your argument is equivalent to saying that   because the first three terms only add up to  . Unless there is some misunderstanding due to language difficulties here, this is not valid. Debivort 18:42, 26 August 2007 (UTC)

I did not use any specific software to numerical approximations. I told you already that I was interested in how the correlation coefficient behaved during the fitting and how to calculate the value of correlation coefficient when the fitting did not reach the global minimum.

Can you tell me how to calculate the correlation coefficient during the fitting? We should face the new problems raised in current situation rather than avoid them with various excuses. Hongguanglishibahao 18:55, 26 August 2007 (UTC)

Dear Mr. Wu. Thank you for providing us with the details of your computation. I repeated your calculations and found the same numerical results. I am not doubting your calculations, but the interpretation that you give to them.
First of all, you added the statement "However, Guang Wu does show how to deliberately make the correlation coefficient be larger than unity" in the section on the correlation coefficient $\rho_{X,Y}$ of two random variables $X$ and $Y$. That is the wrong place when you are in fact talking about the sample correlation. But the remark should also not be added to the section on sample correlation. The formula you use for computing the sample correlation is valid when $\hat{y}_i$ is the regressed value of $y_i$ with respect to $x_i$. As you indicate, in your computation you replace $\hat{y}$ by a perturbed linear model $\hat{y} = (a + \delta)x + b$ or $\hat{y} = ax + (b + \delta)$ for some $\delta$. Unfortunately, this introduces an error in the calculated correlation coefficient, making it even larger than 1 in absolute value. This does not prove that the sample correlation coefficient can be outside the range $[-1, 1]$, but indicates a problem in the numerical procedure that you used for calculating it.
If you want to calculate the sample correlation, I recommend you use another procedure. For example, calculate the correct value of the regression value, or use another formula for the sample correlation. (An example is on the Wikipedia page.)
If you want to calculate a confidence interval for the correlation coefficient, then you need another approach than what you did here. So you type in 'correlation coefficient confidence interval' in your favorite search engine and find that if the data has a normal distribution, then Fisher's r-to-Z transformation gives approximate confidence intervals. If the data is not normal, you could use some bootstrapping procedure. By the way, see Help:Formula for help on editing math formulas. --Kuifware 15:46, 27 August 2007 (UTC)
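As a rough illustration of the Fisher r-to-Z approach mentioned above, here is a minimal Python sketch. It assumes bivariate normal data, uses the standard approximation that z = atanh(r) has standard error 1/sqrt(n − 3), and takes 1.96 as the two-sided 95% normal quantile. The function name `fisher_confidence_interval` is illustrative, and the example values r = 0.9963, n = 8 are those of the dataset discussed in this thread.

```python
import math

def fisher_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's z.

    r: sample correlation, strictly between -1 and 1
    n: sample size (must exceed 3)
    z_crit: normal quantile for the desired level (1.96 gives about 95%)
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                      # r-to-z transformation
    half_width = z_crit / math.sqrt(n - 3)
    # Transform the interval end points back to the correlation scale.
    return math.tanh(z - half_width), math.tanh(z + half_width)

low, high = fisher_confidence_interval(0.9963, 8)
print(low, high)
```

Because the end points are mapped back with tanh, the interval always stays inside (−1, 1), however close the sample correlation is to 1.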

Dear All

Since I posted the discussion here, I could not enter Wiki until today. Perhaps, the connection will be blocked soon. I only would like to say that I will discuss this issue when I have full access again.

Guang Wu —Preceding unsigned comment added by Hongguanglishibahao (talkcontribs) 14:09, 1 April 2008 (UTC)

By the way, I am not happy with the comments made by EdJohnston “The type of argument you are making is one that you should get accepted in a math or statistics journal before bringing it to Wikipedia”.

I wonder who gives you the power to decide which sentence can be added? Do you want to make Wiki another UN, where you can veto everything you do not like even though you do not pay the UN fee? —Preceding unsigned comment added by Hongguanglishibahao (talkcontribs) 14:32, 1 April 2008 (UTC)

Dear Mr Wu, what you compute is not a correlation coefficient, so it has no consequence for this article. We don't discuss the flaws of a particular piece of software here! Lpele (talk) 13:05, 29 July 2008 (UTC)

single pass correlation

The pseudo-algorithm described on the page will not work when pop_sd_x or pop_sd_y is 0; you'll get a NaN value. —Preceding unsigned comment added by 195.167.237.98 (talk) 10:13, August 24, 2007 (UTC)

See earlier in the article: "The correlation is defined only if both of the standard deviations are finite and both of them are nonzero" --mcld (talk) 09:13, 20 May 2008 (UTC)
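To make the point about zero standard deviations concrete, here is a hedged Python sketch of a single-pass correlation that reports the undefined case explicitly instead of dividing by zero. It is not the article's pseudocode: it is a running-mean (Welford-style) update written for illustration, and the function name `single_pass_correlation` and the choice to return `None` are assumptions.

```python
import math

def single_pass_correlation(xs, ys):
    """Pearson correlation in one pass over the data.

    Returns None when the correlation is undefined, i.e. when there are
    fewer than two points or either variable has zero variance.
    """
    n = 0
    mean_x = mean_y = 0.0
    m2_x = m2_y = co_moment = 0.0  # running sums of squared/cross deviations
    for x, y in zip(xs, ys):
        n += 1
        dx = x - mean_x            # deviation from the old mean of x
        dy = y - mean_y            # deviation from the old mean of y
        mean_x += dx / n
        mean_y += dy / n
        m2_x += dx * (x - mean_x)
        m2_y += dy * (y - mean_y)
        co_moment += dx * (y - mean_y)
    if n < 2 or m2_x == 0.0 or m2_y == 0.0:
        return None                # undefined: avoid 0/0 producing NaN
    return co_moment / math.sqrt(m2_x * m2_y)

print(single_pass_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
print(single_pass_correlation([1, 2, 3], [5, 5, 5]))
```

Returning a sentinel for the degenerate case matches the definition quoted above, under which the correlation simply does not exist when a standard deviation is zero.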

Disambiguation of geometry

Isn't correlation also a term in projective geometry? When PG(V) and PG(W) are projective spaces, a correlation $\alpha$ is a bijection from the subspaces of V to the subspaces of W, such that $U_1 \subseteq U_2$ is equivalent with $\alpha(U_2) \subseteq \alpha(U_1)$.

11:12, 12 October 2008 (UTC)

Correlation and inverse / direct relationship

What's the difference between those?

11:12, 12 October 2008 (UTC)

The first image

 
Positive linear correlations between 1000 pairs of numbers. The data are graphed on the lower left and their correlation coefficients listed on the upper right. Each square in the upper right corresponds to its mirror-image square in the lower left, the "mirror" being the diagonal of the whole array. Each set of points correlates maximally with itself, as shown on the diagonal (all correlations = +1).

The first image (shown to the right of here) is really quite difficult to understand. Could we find an easier image to introduce correlation with? --Apoc2400 (talk) 20:45, 6 December 2007 (UTC)

IMHO, the previous correlation example was much more interesting. Certainly a bit harder to read at first sight, but it is quite a common and efficient representation for correlation between several events. I regret that it has disappeared from this page. --152.81.13.31 (talk) 08:12, 21 January 2008 (UTC)
 
Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
Wow! I was just impressed by the clarity of the current first image--shown left of here. It is so clear that I immediately started looking for counterexamples that would violate the symmetry type implied in the third row. I can't find any flaws, though the back of my mind keeps looking for one--and that combination is my favorite learning. So my congratulations to, I guess it was, Imagecreator who contributed this wonderfully clear and thought-provoking image. Thanks! Rednblu (talk) 14:59, 14 September 2008 (UTC)

General Comments

This article needs a link to explain the concept of a linear relationship. Ianhowlett 18:47, 6 July 2007 (UTC)

I also have some difficulty with understanding why a "linear relationship" is included. Even if the relationship is not linear, there could still be a non-zero correlation. Unfortunately, the figure does not include any such examples. If x is restricted to be in the range -pi/2 to pi/2, then x and sin(x) will be strongly correlated, since knowing one will tell you the other. Comfortably Paranoid (talk) 03:49, 10 December 2007 (UTC)
You can't say two variables will be strongly correlated just because knowing one tells you the other. For example, here's a simple nonlinear function that when y=f(x) and x is between 0 and 1, gives zero correlation between x and y, even though knowing one will tell you the other:
f(x) = x if x < 0.211325
f(x) = x-1 otherwise
In this case x and y are completely uncorrelated, even though they are completely dependent. "Dependence" measures whether they affect each other. "Correlation" measures to what degree they affect each other linearly. Those are very different concepts.
The reason sin(x) between -Pi/2 and Pi/2 has a strong correlation is because it looks a lot like a straight line (with a slight curve to it). You can draw a certain straight line with positive slope, and all the (x,sin(x)) data points will on average be fairly close to that line. The relationship is close to linear, so the correlation is strong. In my example there isn't any line that fits the data well, so the correlation is zero.
Notice that sin(x) for x between 0 and Pi also has correlation of zero. That's because it doesn't look much like a line. And the best fit line is horizontal. The relationship isn't very close to linear, so it has a small correlation. You could think of correlation as being a measurement of the degree to which a line with nonzero slope fits the data better than a line with zero slope. A positive (or negative) correlation means the best-fit line has positive (or negative) slope.
Or, think of it this way. Every data set can be transformed to have zero correlation with just a simple linear transform (technically, an affine transform). This can be done no matter how nonlinear the relationship is. So if the correlation can always be removed with a linear operation, it must be measuring something linear about the data. It isn't really measuring the nonlinear aspects of the relationship. A nonlinear relationship can have a nonzero correlation, but only because there's some linear relationship mixed in. And that linear component can be removed with just a linear transform. —Preceding unsigned comment added by Imagecreator (talkcontribs) 01:36, 17 December 2007 (UTC)
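The three cases discussed above can be checked numerically. The following Python sketch approximates a uniformly distributed x by a fine midpoint grid (an assumption made for simplicity; random sampling would give similar values) and computes the correlation for the piecewise function, for sin(x) on [−π/2, π/2], and for sin(x) on [0, π]. The helper names are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def midpoint_grid(lo, hi, n):
    """n evenly spaced midpoints on [lo, hi], standing in for uniform x."""
    step = (hi - lo) / n
    return [lo + (i + 0.5) * step for i in range(n)]

def piecewise(x):
    """The piecewise function from the comment above."""
    return x if x < 0.211325 else x - 1

unit = midpoint_grid(0.0, 1.0, 100_000)
r_piecewise = pearson_r(unit, [piecewise(x) for x in unit])

monotone = midpoint_grid(-math.pi / 2, math.pi / 2, 100_000)
r_sin_monotone = pearson_r(monotone, [math.sin(x) for x in monotone])

arch = midpoint_grid(0.0, math.pi, 100_000)
r_sin_arch = pearson_r(arch, [math.sin(x) for x in arch])

print(r_piecewise, r_sin_monotone, r_sin_arch)
```

The break point 0.211325 is a rounded value of (3 − √3)/6, which is where the covariance of x and f(x) vanishes for uniform x on [0, 1]. The first and third coefficients come out essentially zero, while the second is above 0.99.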

11:12, 12 October 2008 (UTC)

Removing Correlation may be missing a factor and is missing a citation

I have implemented this formula in R. Could it be that the correct assertion is that "The covariance matrix of T will be 1/(m-1) times the identity matrix" rather than "The covariance matrix of T will be the identity matrix"? Is there a textbook or a paper that can be cited regarding this formula, preferably containing a derivation? I would like to check whether it is Wikipedia or my implementation that is incorrect.

Thanks, Leo. —Preceding unsigned comment added by 201.9.189.100 (talk) 22:38, 12 June 2008 (UTC)
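The formula under discussion is not quoted in this thread, so the following Python sketch rests on an assumption: that it has the form T = D (DᵀD)^(−1/2), where D is the m × n data matrix with column means removed. Under that assumption TᵀT is the identity, so the sample covariance of T (with divisor m − 1) would be the identity divided by m − 1. To stay self-contained the sketch handles only two columns, using the closed form for the inverse square root of a 2 × 2 symmetric positive-definite matrix; the function names and the data are hypothetical.

```python
import math

def whiten_two_columns(rows):
    """Apply T = D (D^T D)^(-1/2) to m rows of two variables (assumed formula)."""
    m = len(rows)
    mean0 = sum(r[0] for r in rows) / m
    mean1 = sum(r[1] for r in rows) / m
    centered = [(r[0] - mean0, r[1] - mean1) for r in rows]   # this is D
    # Entries of the symmetric 2x2 matrix G = D^T D = [[a, b], [b, d]].
    a = sum(u * u for u, _ in centered)
    b = sum(u * v for u, v in centered)
    d = sum(v * v for _, v in centered)
    s = math.sqrt(a * d - b * b)          # sqrt(det G)
    t = math.sqrt(a + d + 2.0 * s)        # sqrt(trace G + 2 sqrt(det G))
    k = 1.0 / (s * t)
    # G^(-1/2) = [[d + s, -b], [-b, a + s]] / (s * t)
    w00, w01, w11 = (d + s) * k, -b * k, (a + s) * k
    return [(u * w00 + v * w01, u * w01 + v * w11) for u, v in centered]

def sample_covariance(rows):
    """Return (var0, cov01, var1) using the divisor m - 1."""
    m = len(rows)
    mean0 = sum(r[0] for r in rows) / m
    mean1 = sum(r[1] for r in rows) / m
    var0 = sum((r[0] - mean0) ** 2 for r in rows) / (m - 1)
    var1 = sum((r[1] - mean1) ** 2 for r in rows) / (m - 1)
    cov01 = sum((r[0] - mean0) * (r[1] - mean1) for r in rows) / (m - 1)
    return var0, cov01, var1

data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 6.0)]  # m = 5
transformed = whiten_two_columns(data)
print(sample_covariance(transformed))
```

If the assumed formula is the one meant, multiplying T by sqrt(m − 1) would give a sample covariance equal to the identity matrix.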

11:12, 12 October 2008 (UTC)