Abstract: Visualizing very large matrices involves many significant problems. Various popular solutions to these problems involve sampling, clustering, projection, or feature selection to reduce the size and complexity of the original task. An important aspect of these methods is how to preserve relative distances between points in the higher-dimensional space after reducing rows and columns to fit in a lower dimensional space. This aspect is important because conclusions based on faulty visual reasoning can be harmful. Judging dissimilar points as similar or similar points as dissimilar on the basis of a visualization can lead to false conclusions. To ameliorate this bias and to make visualizations of very large datasets feasible, we introduce a new algorithm that selects a subset of rows and columns of a rectangular matrix. This selection is designed to preserve relative distances as closely as possible. We compare our Matrix Sketch to more traditional alternatives on a variety of artificial and real datasets.
Bio: Leland Wilkinson is Chief Scientist at H2O and Adjunct Professor of Computer Science at the University of Illinois Chicago. He received an A.B. degree from Harvard in 1966, an S.T.B. degree from Harvard Divinity School in 1969, and a Ph.D. from Yale in 1975. Wilkinson wrote the SYSTAT statistical package and founded SYSTAT Inc. in 1984. After the company grew to 50 employees, he sold SYSTAT to SPSS in 1994 and worked there for ten years on research and development of visualization systems. Wilkinson subsequently worked at Skytree and Tableau before joining H2O.
Wilkinson is a Fellow of the American Statistical Association, an elected member of the International Statistical Institute, and a Fellow of the American Association for the Advancement of Science. He has won best speaker award at the National Computer Graphics Association and the Youden prize for best expository paper in the statistics journal Technometrics. He has served on the Committee on Applied and Theoretical Statistics of the National Research Council and is a member of the Boards of the National Institute of Statistical Sciences (NISS) and the Institute for Pure and Applied Mathematics (IPAM). In addition to authoring journal articles, the original SYSTAT computer program and manuals, and patents in visualization and distributed analytic computing, Wilkinson is the author (with Grant Blank and Chris Gruber) of Desktop Data Analysis with SYSTAT. He is also the author of The Grammar of Graphics, the foundation for several commercial and open-source visualization systems (IBMRAVE, Tableau, R-ggplot2, and Python-Bokeh).