Workshop on
Big Data Analytics

August 20-21, 2015

Co-organized by ACM Student Chapter, ISI Kolkata and Indian Statistical Institute

The goal of this workshop is to provide a forum for exchanging ideas and information on current research, challenges, system developments, and practical experiences in the emerging field of Big Data analysis. The workshop covers topics related to distributed machine learning, high-dimensional data analysis, the MapReduce framework, the Hadoop system, and the future of Big Data in India.

Abstracts and Lecture Slides

Big Data and High Dimensional Data Analysis

by Prof. B. L. S. Prakasa Rao
Abstract: Over the last ten to fifteen years, more and more corporations have adopted a data-driven approach in order to offer targeted services, reduce risks, and improve performance. They are implementing specialized data analytics programs to collect, store, manage, and analyze large data sets, now called BIG DATA. Analyzing large volumes of economic and financial data is challenging. BIG DATA has unique features that are not shared by traditional data sets. BIG DATA sets are characterized by massive sample size and high dimensionality. Massive sample size allows one to unravel hidden patterns associated with small subpopulations. Modeling the intrinsic heterogeneity of BIG DATA requires better statistical methods. Several phenomena are associated with high dimensionality, such as noise accumulation, spurious correlation, and incidental endogeneity. Traditional statistical methods are inappropriate for tackling such problems. There are also many types of events for which we have a potentially large number of measurable parameters or covariates quantifying the event but relatively few instances of that event. For example, there may be few patients with a given genetic disease but a large number of genes that might cause it. In statistical terms, the number of parameters p is large compared to the number of observations n. This type of data is termed HIGH-DIMENSIONAL DATA. The basic methodology used in classical statistical methods is not applicable for analyzing such data. We will discuss some problems arising in the analysis of BIG DATA and HIGH-DIMENSIONAL DATA.

Downloads: (a) presentation, (b) notes
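
The spurious correlation mentioned in the abstract above can be illustrated with a small simulation. This is only a sketch, not material from the talk: the function names, the sample size n = 20, and the covariate counts are assumptions chosen for illustration. The response and every covariate are independent noise, so any correlation found is spurious by construction.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n, p, seed=0):
    """Largest |correlation| between a pure-noise response and p
    independent pure-noise covariates, each observed n times."""
    rng = random.Random(seed)
    response = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        covariate = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(pearson(covariate, response)))
    return best


# Same n, same seed; only the number of covariates p changes.
low_dim = max_spurious_correlation(n=20, p=5)
high_dim = max_spurious_correlation(n=20, p=2000)
print(f"max |r| with p=5:    {low_dim:.2f}")
print(f"max |r| with p=2000: {high_dim:.2f}")
```

With p much larger than n, some noise covariate will typically appear strongly correlated with the response even though no real relationship exists, which is one reason variable selection by marginal correlation can mislead in the p >> n setting.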

Distributed Machine Learning and Big Data

by Prof. Sourangshu Bhattacharya
Abstract: Learning from Big Data has now become a ubiquitous problem, driving the implementation of learning algorithms on Big Data platforms. These algorithms need to be distributed and communication-efficient. Many problems in Machine Learning, e.g. SVM, Logistic Regression, etc., can be cast as optimisation problems, which need to be solved in a distributed manner. In this talk, we discuss a gamut of algorithms which are used to solve different variants of these problems. In particular, we focus on a popular set of techniques, called the alternating direction method of multipliers (ADMM), which offers high flexibility and performance.

Downloads: (a) presentation
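
To make the ADMM idea from the abstract above concrete, here is a minimal sketch of consensus ADMM on a toy problem. It is an assumption-laden illustration, not the speaker's algorithm: the data are split into hypothetical shards, each worker i holds the local objective f_i(x) = 0.5 * sum over its points a of (x - a)^2, and all workers must agree on a single scalar x. The exact minimizer is the global mean, which makes the result easy to check.

```python
def consensus_admm(shards, rho=1.0, iterations=200):
    """Consensus ADMM (scaled dual form) for a scalar least-squares problem
    whose data points are partitioned across the given shards."""
    k = len(shards)
    x = [0.0] * k  # local copy of the variable held by each worker
    u = [0.0] * k  # scaled dual variable held by each worker
    z = 0.0        # global consensus variable

    for _ in range(iterations):
        # Local step (parallelizable): each worker minimizes
        # f_i(x) + (rho / 2) * (x - z + u_i)^2, which has a closed form here.
        for i, shard in enumerate(shards):
            x[i] = (sum(shard) + rho * (z - u[i])) / (len(shard) + rho)

        # Global step: the only communication is averaging one number per worker.
        z = sum(x[i] + u[i] for i in range(k)) / k

        # Dual step: each worker accumulates its disagreement with the consensus.
        for i in range(k):
            u[i] += x[i] - z

    return z


shards = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
estimate = consensus_admm(shards)
print(estimate)  # approaches the global mean of all six points
```

Each iteration exchanges only one scalar per worker, which is the communication-efficiency property the abstract refers to. For SVM or logistic regression the local step would itself need an iterative solver instead of a closed form.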

From Big Text to Big Knowledge

by Prof. Partha Pratim Talukdar
Abstract: Knowledge harvesting from Web-scale text datasets has emerged as an important and active research area over the last few years, resulting in the automatic construction of large knowledge bases (KBs) consisting of millions of entities and relationships among them. This has the potential to revolutionize Artificial Intelligence and intelligent decision making by removing the knowledge bottleneck which has plagued systems in these areas all along. In this talk, I shall provide an overview of my research in this exciting and emerging area.

Downloads: (a) presentation

Practising Data Science in Wild

by Dr. Arijit Laha
Abstract: Advances in data science technologies and techniques promise big paybacks in the real world. However, away from the headlines of success stories is the great mass of organizations and enterprises that are still struggling to understand and leverage this new computational milieu. Working effectively with such organizations constitutes the greatest challenge for a data science practitioner.

Downloads: (a) presentation

Analysis of high-velocity data streams

by Prof. Saumyadipta Pyne
Abstract: Finding patterns in data streams has emerged as an important research area in computational statistics over the last couple of decades. Most algorithms for modeling data streams encounter two important challenges: (1) one-pass constraint: an algorithm must perform its computations in a single pass over the data without iterations and with limited storage, and (2) concept drift: as the data generating processes may change over time, it is necessary to update the model in an incremental manner, so that it remains relevant when applied to the current test instances. In this presentation, we shall discuss some of the key algorithmic concepts and methods for supervised and unsupervised learning of patterns in stream data, based on ideas from basic probability, sampling, and parametric and non-parametric modeling techniques.

Downloads: (a) presentation
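
The two challenges named in the abstract above, the one-pass constraint and concept drift, can be sketched with two small estimators. This is an illustrative example under stated assumptions, not a method from the talk: `RunningStats` uses Welford's one-pass update, `ForgettingMean` uses exponential weighting with a hypothetical rate `alpha`, and the stream is synthetic, with a level shift halfway through.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's update), O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


class ForgettingMean:
    """Exponentially weighted mean: old observations are gradually
    down-weighted, so the estimate follows a drifting process."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value += self.alpha * (x - self.value)


# Synthetic stream whose level jumps from 0 to 10 (a simple concept drift).
stream = [0.0] * 200 + [10.0] * 200

stats = RunningStats()
tracker = ForgettingMean(alpha=0.1)
for value in stream:  # single pass; the stream is never stored or revisited
    stats.update(value)
    tracker.update(value)

print(stats.mean, stats.variance, tracker.value)
```

The all-history mean settles between the two regimes, while the forgetting estimator ends close to the current level. Choosing `alpha` trades responsiveness to drift against noise in the estimate.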

Distributed Deep Learning Implementation over Spark and Applications

by Dr. Vijay Srinivas A
Abstract: This talk starts from simple artificial neural networks (ANNs) and explains how deep learning evolved from ANNs. It goes on to detail the different kinds of deep learning networks and discusses the possible applications. The last part of the talk covers the motivation for a distributed deep learning system and how we built one such system. It ends with a short discussion on an audio sentiment analysis use case we solved with the deep learning network.

Downloads: (a) presentation

Introduction to the Map-Reduce framework and the Hadoop EcoSystem

by Himanshu Gupta
Abstract: This talk will provide an introductory tutorial on the Map-Reduce framework and the Hadoop ecosystem. The talk will consist of three parts. The first part will discuss how we can develop map-reduce algorithms, with some examples (Aggregation, Equi-Join, Inequality Join, etc.). The second part will provide an overview of map-reduce research. The third part will briefly discuss various components of the Hadoop ecosystem (e.g., Hive, HBase, Oozie, Scalding, etc.). Time permitting, the talk will also briefly discuss Spark, how it differs from the Map-Reduce framework, and why it has become a popular paradigm for processing big data.

Downloads: (a) file 1, (b) file 2, (c) file 3, (d) file 4, (e) file 5, (f) book
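
Two of the algorithm examples named in the abstract above, aggregation and equi-join, can be sketched with an in-memory simulation of the map, shuffle, and reduce phases. This is a single-process illustration, not Hadoop code and not taken from the talk's files: the `map_reduce` helper, the table contents, and all function names are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Tiny single-process simulation of map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, one call per key
        output.extend(reducer(key, groups[key]))
    return output


# Aggregation: total sales per region.
sales = [("east", 10), ("west", 5), ("east", 7)]


def sales_mapper(record):
    region, amount = record
    yield region, amount


def sum_reducer(region, amounts):
    yield region, sum(amounts)


totals = map_reduce(sales, sales_mapper, sum_reducer)

# Equi-join (reduce-side): tag each record with its source table,
# key it by the join attribute, and pair the two sides in the reducer.
users = [(1, "asha"), (2, "ravi")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
tagged = [("U", row) for row in users] + [("O", row) for row in orders]


def join_mapper(record):
    tag, (user_id, payload) = record
    yield user_id, (tag, payload)


def join_reducer(user_id, values):
    names = [payload for tag, payload in values if tag == "U"]
    items = [payload for tag, payload in values if tag == "O"]
    for name in names:
        for item in items:
            yield user_id, name, item


joined = map_reduce(tagged, join_mapper, join_reducer)

print(totals)
print(joined)
```

Keys that appear in only one table (user 2 and order key 3 here) produce no output, which matches inner-join semantics. In a real cluster the shuffle moves data between machines, so the choice of key largely determines the communication cost.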

Indian Statistical Institute

203 B. T. Road, Kolkata - 700108, India

Lecture Hall

Platinum Jubilee Building Auditorium

