Many EDA techniques have been adopted into data mining. They are also being taught to young students as a way to introduce them to statistical thinking. The objectives of EDA are to:

- enable unexpected discoveries in the data;
- suggest hypotheses about the causes of observed phenomena;
- assess assumptions on which statistical inference will be based;
- support the selection of appropriate statistical tools and techniques; and
- provide a basis for further data collection through surveys or experiments.

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques. Typical graphical techniques used in EDA include nonlinear dimensionality reduction (NLDR); projection methods such as the grand tour, guided tour, and manual tour; and glyph-based visualization methods such as PhenoPlot and Chernoff faces.

Many EDA ideas can be traced back to earlier authors. Francis Galton, for example, emphasized order statistics and quantiles, and Arthur Lyon Bowley used precursors of the stemplot and the five-number summary. Bowley actually used a "seven-figure summary", including the extremes, deciles, and quartiles along with the median; in his Elementary Manual of Statistics (3rd edn., 1920), p. 62, he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions".
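Bowley's seven-figure summary is just a set of order statistics, so it is straightforward to sketch in code. The following Python function is illustrative only; the `quantile` helper and the choice of linear (type-7) interpolation are my assumptions, not anything Bowley specified:

```python
def quantile(sorted_x, p):
    # Linear-interpolation quantile of pre-sorted data (type-7 convention,
    # an assumption made here for concreteness).
    h = p * (len(sorted_x) - 1)
    lo = int(h)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (h - lo) * (sorted_x[hi] - sorted_x[lo])

def seven_figure_summary(data):
    # Bowley's "seven positions": extremes, two deciles, quartiles, median.
    x = sorted(float(v) for v in data)
    labels = ["minimum", "1st decile", "lower quartile", "median",
              "upper quartile", "9th decile", "maximum"]
    probs = [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
    return {lab: quantile(x, p) for lab, p in zip(labels, probs)}

print(seven_figure_summary(range(1, 101)))
```

For the integers 1 through 100, this reports 1 and 100 as the extremes and 50.5 as the median, with the deciles and quartiles interpolated between adjacent order statistics.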
Tukey wrote the book Exploratory Data Analysis in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.

Tukey promoted the use of the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles. These summaries, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than those traditional summaries.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
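The jackknife and bootstrap mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the two resampling ideas, not the routines shipped in S or R; the function names and the choice of the standard error as the estimated quantity are assumptions made here:

```python
import random
import statistics

def jackknife_se(data, stat):
    # Leave-one-out jackknife estimate of the standard error of `stat`;
    # `data` is a list, `stat` maps a list of numbers to a number.
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    var = (n - 1) / n * sum((v - mean_loo) ** 2 for v in loo)
    return var ** 0.5

def bootstrap_se(data, stat, reps=2000, seed=0):
    # Bootstrap estimate: resample with replacement, recompute `stat`,
    # and take the standard deviation of the replicates.
    rng = random.Random(seed)
    n = len(data)
    reps_vals = [stat([rng.choice(data) for _ in range(n)]) for _ in range(reps)]
    return statistics.stdev(reps_vals)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = lambda xs: sum(xs) / len(xs)
print(jackknife_se(data, mean), bootstrap_se(data, mean))
```

For the sample mean, the jackknife standard error coincides exactly with the classical s/sqrt(n); for statistics without a simple formula (a median, a trimmed mean), the same resampling machinery still applies, which is what made these methods attractive for robust and nonparametric work.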
Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data." Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends, and patterns in data that merited further study.