Big Data and High Dimensional Data Analysis
by Prof. B. L. S. Prakasa Rao
Abstract: Over the last ten to fifteen years, more and more corporations have adopted a data-driven approach in order to target their services, reduce risk and improve performance. They are implementing specialized data analytics programs to collect, store, manage and analyze large data sets, now called BIG DATA. Analyzing large volumes of economic and financial data is challenging. BIG DATA has unique features that are not shared by traditional data sets: it is characterized by massive sample size and high dimensionality. A massive sample size allows one to unravel hidden patterns associated with small subpopulations, but modeling the intrinsic heterogeneity of BIG DATA requires better statistical methods. Several phenomena are associated with high dimensionality, such as noise accumulation, spurious correlation and incidental endogeneity, and traditional statistical methods are inappropriate for tackling such problems. There are also many types of event for which a potentially large number of measurable parameters or covariates quantify the event but only relatively few instances of the event are observed. For example, there may be few patients with a given genetic disease but a large number of genes that might cause it. In statistical terms, the number of parameters p is large compared with the number of observations n. Data of this type are termed HIGH-DIMENSIONAL DATA. The basic methodology of classical statistics is not applicable to the analysis of such data. We will discuss some problems arising in the analysis of BIG DATA and HIGH-DIMENSIONAL DATA.
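The spurious correlation mentioned in the abstract can be illustrated with a small simulation. The sketch below is not taken from the talk. It assumes a response y and p covariates that are all independent standard normal noise, so every true correlation is zero, and it reports the largest absolute sample correlation found among the p covariates. The function names pearson and max_spurious_correlation, the sample size n = 20 and the values of p are hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n, p, seed=0):
    """Largest |correlation| between a noise response and p noise covariates.

    Assumption for illustration: the response and all covariates are
    independent N(0, 1) draws, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0, 1) for _ in range(n)]
        best = max(best, abs(pearson(x, y)))
    return best


if __name__ == "__main__":
    n = 20  # few observations
    for p in (5, 50, 500, 5000):  # increasingly many covariates
        print(p, round(max_spurious_correlation(n, p), 3))
```

With n held fixed, the reported maximum cannot decrease as p grows, because each larger run with the same seed re-examines the same covariates and then adds more. This is one way to see why a variable selected for its high sample correlation may be pure noise when p is large compared with n.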
Downloads: (a) presentation, (b) notes