Talk:Iris flower data set


Dataset

"Python tutorial"

The dataset section is written as a Python tutorial, which I think is inappropriate: this page should really be about the history of Fisher's Iris dataset. Wiki is not code.org. 136.168.148.56 (talk) 23:30, 30 October 2019 (UTC)

Dialog for showing/hiding data set

The table containing the data set can be expanded or hidden. If it is expanded, it has a neatly typeset heading. If it is hidden, however, the short heading runs over three lines (at least for me). Is this intended behaviour? I think it isn't, but I don't know how to change it. It would be nice if someone could change it. NerdOnTour (talk) 08:36, 22 October 2021 (UTC)

"Are separable" .. really? without over-fitting?

This article makes the claim that the dataset is separable, but this is hardly obvious. It's separable if you don't mind over-fitting the data, but it's not clear (to me) that any classifier can do this under, say, a 3-fold or a 5-fold cross-validation. I'd like to see a citation for this. 99.153.64.179 (talk) 17:39, 2 August 2013 (UTC)

BTW, the citation given (A.N. Gorban, N.R. Sumner, and A.Y. Zinovyev, "Topological grammars for data approximation", Applied Mathematics Letters, Volume 20, Issue 4 (2007), 382-386) does NOT perform a k-fold cross-validation. They appear to simply over-train on the entire dataset, which is not all that hard to do. 99.153.64.179 (talk) 17:47, 2 August 2013 (UTC)
In the cited article, the entire dataset is projected onto the first principal tree. This tree is built without any hints about the classes (a completely unsupervised task). There is no fitting for a classification problem in this work at all. It happens that the classes are separated in the projection onto this tree. In the projection onto the first classical principal component these classes are not separated, of course. They are also not linearly separable. I hope that this comment answers the question about fitting and over-fitting. Agor153 (talk) 17:58, 10 October 2013 (UTC)
I figure the term "is separable" is vague, and in many cases, people will use it even when some classification error remains. As in: "you can build a model using linear separations that gets 95% right". Language is imprecise. Even mathematical language will contain some ambiguity, unless you include all preliminaries and notations, and then it will be too verbose for an encyclopedia... --Chire (talk) 17:49, 15 October 2013 (UTC)
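
For readers following this thread, the distinction between "over-training on the entire dataset" and a k-fold cross-validation can be sketched in a few lines. This is only an illustration under stated assumptions: the data are hypothetical overlapping 1-D points (not the Iris measurements), the classifier is a plain 1-nearest-neighbour rule chosen for brevity, and the helper names `k_fold_indices`, `resubstitution_accuracy` and `cross_val_accuracy` are made up for this sketch. It says nothing about the method of the cited paper.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_neighbour_predict(train_X, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def resubstitution_accuracy(X, y):
    """Accuracy when the classifier is tested on its own training data."""
    hits = sum(nearest_neighbour_predict(X, y, x) == c for x, c in zip(X, y))
    return hits / len(X)

def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean accuracy on held-out folds; each fold is predicted from the other k-1."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [c for i, c in enumerate(y) if i not in held_out]
        hits = sum(nearest_neighbour_predict(train_X, train_y, X[i]) == y[i]
                   for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

# Hypothetical toy data: two overlapping 1-D classes (NOT the Iris data).
rng = random.Random(42)
X = [(rng.gauss(0.0, 1.0),) for _ in range(50)] + \
    [(rng.gauss(1.0, 1.0),) for _ in range(50)]
y = [0] * 50 + [1] * 50

print("resubstitution accuracy:", resubstitution_accuracy(X, y))
print("5-fold CV accuracy:     ", cross_val_accuracy(X, y, k=5))
```

With 1-nearest-neighbour, every training point is its own nearest neighbour, so the resubstitution figure is perfect even though the classes overlap; only the held-out figure says anything about whether the classes can really be told apart.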

k-means image is horrible

The k-means image illustrates a very poor execution of the k-means algorithm. It is unclear what the reader is supposed to get out of a comparison where the I. setosa cluster is inappropriately split and the I. versicolor and I. virginica populations are inappropriately merged. Should this be recreated with a better k-means implementation? --Physicsmichael (talk) 01:00, 5 February 2014 (UTC)

The text of the article explains that the data set does not cluster well, and is therefore not a good choice for evaluating clustering algorithms. It's not a matter of the k-means implementation; the result shown may actually be the global minimum (unverified, but plausible). The image shouldn't be read as a standalone thing IMHO. --Chire (talk) 16:24, 10 February 2014 (UTC)
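
One way to check whether a split-and-merged result is a poor local minimum or the actual optimum is to compare the k-means objective (within-cluster sum of squares) across several starting points. The sketch below is a minimal illustration under stated assumptions: it uses hypothetical 1-D toy points rather than the Iris data, a hand-written Lloyd iteration named `lloyd_1d` (not any particular library's implementation), and two hand-picked initialisations. It does not settle the question for the image in the article.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Plain Lloyd iterations on 1-D data from the given starting centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Objective: within-cluster sum of squared distances to the centroids.
    sse = sum((p - centroids[j]) ** 2
              for j, c in enumerate(clusters) for p in c)
    return centroids, clusters, sse

# Hypothetical toy data: one loose group near 0..1.2 and two tight groups.
points = [0.0, 0.2, 1.0, 1.2, 10.0, 10.2, 13.0, 13.2]

# Start A splits the loose group and leaves one centroid for both tight groups.
_, clusters_a, sse_a = lloyd_1d(points, [0.0, 1.0, 11.0])
# Start B places one centroid near each group.
_, clusters_b, sse_b = lloyd_1d(points, [0.6, 10.0, 13.0])

print("start A:", clusters_a, "SSE =", round(sse_a, 2))
print("start B:", clusters_b, "SSE =", round(sse_b, 2))
```

Both runs converge, but to different partitions with different objective values, so keeping the lowest-SSE run over several restarts is the usual safeguard. If the split-and-merged partition still has the lowest SSE after many restarts, that supports the point that the objective itself, not the implementation, produces it.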

Highest Accuracy Achieved?

What is the highest accuracy that has been achieved on this dataset? — Preceding unsigned comment added by 129.59.79.147 (talk) 18:11, 23 October 2014 (UTC)

With careless overfitting and sloppy evaluation: 100%. --Chire (talk) 08:47, 24 October 2014 (UTC)

Sources regarding controversy?

I removed the following sentence from the intro as it was unsourced: "Fisher's paper was published in the journal, the Annals of Eugenics, creating controversy about the continued use of the Iris dataset for teaching statistical techniques today." I did try to find sources before removing it (searching Google and Google Scholar for terms like "iris dataset controversy", "iris dataset eugenics", "iris dataset racism"), but all I could find were tweets, a Reddit post, and a couple of posts from personal blogs ([1], [2]). I don't think these are sufficient to support the claim, but if anyone can find better sources, feel free to restore it. Colin M (talk) 18:35, 15 January 2021 (UTC)