Workshop on
Big Data Analytics

August 20-21, 2015

Co-organized by ACM Student Chapter, ISI Kolkata and Indian Statistical Institute

The goal of this workshop is to provide a forum for exchanging ideas and information on current research, challenges, system developments, and practical experiences in the emerging field of Big Data analysis. The workshop covers topics related to distributed machine learning, high-dimensional data analysis, the MapReduce framework, the Hadoop system, and the future of Big Data in India.

Abstracts and Lecture Slides

Big Data and High Dimensional Data Analysis

by Prof. B. L. S. Prakasa Rao
Abstract: Over the last ten to fifteen years, more and more corporations have been adopting a data-driven approach to deliver targeted services, reduce risks and improve performance. They are implementing specialized data analytic programs to collect, store, manage and analyze large data sets, or what is now called BIG DATA. Analyzing large volumes of economic and financial data is challenging. BIG DATA has unique features that are not shared by traditional data sets. BIG DATA sets are characterized by massive sample size and high dimensionality. Massive sample size allows one to unravel hidden patterns associated with small subpopulations. Modeling the intrinsic heterogeneity of BIG DATA requires better statistical methods. There are several phenomena associated with high dimensionality, such as noise accumulation, spurious correlation and incidental endogeneity. Traditional statistical methods are inappropriate for tackling such problems. There are also many types of event for which we have a potentially large number of measurable parameters or covariates quantifying the event but relatively few instances of that event. For example, there may be few patients with a given genetic disease but a large number of genes which might cause it. In statistical terms, the number of parameters p is large compared to the number of observations n. This type of data is termed HIGH-DIMENSIONAL DATA. The basic methodology used in classical statistical methods is not applicable for analysing such data. We will discuss some problems arising in the analysis of BIG DATA and HIGH-DIMENSIONAL DATA.

Downloads: (a) presentation, (b) notes

Distributed Machine Learning and Big Data

by Prof. Sourangshu Bhattacharya
Abstract: Learning from Big Data has now become a ubiquitous problem, driving the implementation of learning algorithms on Big Data platforms. These algorithms need to be distributed and communication-efficient. Many problems in Machine Learning, e.g. SVM, Logistic Regression, etc., can be cast as optimisation problems, which need to be solved in a distributed manner. In this talk, we discuss a gamut of algorithms which are used to solve different variants of these problems. In particular, we focus on a popular set of techniques, called the alternating direction method of multipliers (ADMM), which offers high flexibility and performance.

Downloads: (a) presentation

From Big Text to Big Knowledge

by Prof. Partha Pratim Talukdar
Abstract: Knowledge harvesting from Web-scale text datasets has emerged as an important and active research area over the last few years, resulting in the automatic construction of large knowledge bases (KBs) consisting of millions of entities and relationships among them. This has the potential to revolutionize Artificial Intelligence and intelligent decision making by removing the knowledge bottleneck which has plagued systems in these areas all along. In this talk, I shall provide an overview of my research in this exciting and emerging area.

Downloads: (a) presentation

Practising Data Science in Wild

by Dr. Arijit Laha
Abstract: The advancements in technologies and techniques in data science promise big paybacks in the real world. However, away from the headlines of success stories is the great mass of organizations and enterprises which are still struggling to understand and leverage this new computational milieu. Working effectively with such organizations constitutes the greatest challenge for a data science practitioner.

Downloads: (a) presentation

Analysis of high-velocity data streams

by Prof. Saumyadipta Pyne
Abstract: Finding patterns in data streams has emerged as an important research area in computational statistics over the last couple of decades. Most algorithms for modeling data streams encounter two important challenges: (1) one-pass constraint: an algorithm must perform its computations in a single pass over the data without iterations and with limited storage, and (2) concept drift: as the data generating processes may change over time, it is necessary to update the model in an incremental manner, so that it remains relevant when applied to the current test instances. In this presentation, we shall discuss some of the key algorithmic concepts and methods for supervised and unsupervised learning of patterns in stream data, based on ideas from basic probability, sampling, and parametric and non-parametric modeling techniques.

Downloads: (a) presentation
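
The one-pass constraint described in the abstract above can be illustrated with reservoir sampling, a classical sampling technique for streams. This is only a minimal sketch and is not taken from the talk or its slides; the function name `reservoir_sample` and its parameters are hypothetical, chosen for illustration. It keeps a uniform random sample of k items while reading each item exactly once and storing at most k items.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Illustrative sketch (hypothetical helper, not from the talk):
    the stream is read once and only k items are ever held in memory.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i + 1 replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # A generator stands in for a stream that cannot be revisited.
    readings = (x * x for x in range(100_000))
    print(reservoir_sample(readings, 5, random.Random(42)))
```

Because every item ends up in the sample with equal probability, the sample remains usable at any point in the stream. Handling concept drift would need a further modification, such as favouring recent items, which this sketch does not attempt.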

Distributed Deep Learning Implementation over Spark and Applications

by Dr. Vijay Srinivas A
Abstract: This talk starts from simple artificial neural networks (ANNs) and explains how deep learning evolved from ANNs. It goes on to detail the different kinds of deep learning networks and discusses the possible applications. The last part of the talk covers the motivation for a distributed deep learning system and how we built one such system. It ends with a short discussion on an audio sentiment analysis use case we solved with the deep learning network.

Downloads: (a) presentation

Introduction to the Map-Reduce framework and the Hadoop EcoSystem

by Himanshu Gupta
Abstract: This talk will provide an introductory tutorial on the Map-Reduce framework and the Hadoop ecosystem. The talk will consist of three parts. The first part will discuss how we can develop map-reduce algorithms, with some examples (Aggregation, Equi Join, Inequality Join, etc.). The second part will provide an overview of map-reduce research. The third part will briefly discuss various components of the Hadoop ecosystem (e.g., Hive, HBase, Oozie, Scalding, etc.). Time permitting, the talk will also briefly discuss Spark, how it differs from the Map-Reduce framework, and why it has become a popular paradigm for processing big data.

Downloads: (a) file 1, (b) file 2, (c) file 3, (d) file 4, (e) file 5, (f) book
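
The aggregation example mentioned in the abstract above can be sketched in plain Python. This is an in-memory, single-process illustration of the map, shuffle and reduce phases, not Hadoop code and not taken from the talk material; all function names and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # (region, sale amount); hypothetical sample schema


def sales_mapper(record: Record) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (key, value) pair per input record."""
    region, amount = record
    yield region, amount


def sum_reducer(key: str, values: List[int]) -> int:
    """Reduce step: aggregate all values that share a key."""
    return sum(values)


def run_job(
    records: Iterable[Record],
    mapper: Callable[[Record], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially in one process."""
    # Map phase: apply the mapper to every record.
    pairs = (pair for record in records for pair in mapper(record))
    # Shuffle phase: group the emitted values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    sales = [("east", 10), ("west", 5), ("east", 7)]
    print(run_job(sales, sales_mapper, sum_reducer))
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; the sketch only shows how a computation is split into the two user-supplied functions.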

Schedule

Registration

Instructions

Early registration is now closed. For spot registration please see the instructions below.

Early Registration (Closed) Fee.

  • Students: Rs. 1,000/- (Rs. 800/- for active ACM student members)
  • Academicians: Rs. 2,000/- (Rs. 1,500/- for active ACM members)
  • Industry participants: Rs. 5,000/- (Rs. 4,000/- for active ACM members)

Spot Registration Fee.

  • Students: Rs. 1,500/- (Rs. 1,300/- for active ACM student members)
  • Academicians: Rs. 3,000/- (Rs. 2,500/- for active ACM members)
  • Industry participants: Rs. 7,000/- (Rs. 6,000/- for active ACM members)

Payment mode for spot registration

Payment is accepted only by cash or by Demand Draft drawn in favour of INDIAN STATISTICAL INSTITUTE, payable at KOLKATA. Please write your NAME and CURRENT AFFILIATION on the back of the draft.

Contact: Mr. Suman Kundu (email: suman@sumankundu.info, phone: +91 89022 47884), Center for Soft Computing Research (CSCR), 1st Floor, R. A. Fisher Bhawan, Indian Statistical Institute (ISI), 203 B. T. Road, Kolkata - 700108, India. The spot registration form will be available at the registration desk, or you may download it below.

Spot Registration Form

(doc) (pdf) (odt)

Accommodation

The organizers do not provide accommodation. You may contact the hotels and hostels listed below.

Youth Hostel
Near Sinthi More, Kolkata, (P) +91 33 2556 9394.

Mandir Palace Guest House
6/1A, T.N. Biswas Road, Kolkata - 700035 (Beside Dakshineswar Kali Temple)
E-mail: mandirpalace@yahoo.co.in (P) +91 33 1544 0034, (M) +91 80133 59769.

Holinest Guest House
(Near Dakshineswar Kali Temple) (P) +91 33 2578 7436, +91 33 2544 3121/22, (M) +91 98369 50128, 9831407874.

Debalay Guest House
(Near Dakshineswar Kali Temple) 1, T.N. Mukherjee Road, Kolkata - 700035. (P) +91 33 2544 4333

Blue Star Guest House
(Near Dakshineswar Kali Temple) 25/B A. C. Sarkar Road Kolkata - 700 076. Manager: +91 91431 42814

Moonlight Hotel and Restaurant
Belgharia Express Way, Purba Talbagan, Sukanta Pally, Mathkal, Kolkata - 700065. (P) +91 33 6555 0655

Venue

Indian Statistical Institute

203 B. T. Road, Kolkata - 700108, India

Lecture Hall

Platinum Jubilee Building Auditorium

How to reach

Sealdah to ISI, Kolkata
Option 1 (recommended): Taxis are readily available outside the station. Ask the driver to go to the Indian Statistical Institute (before the Dunlop bridge crossing).
Option 2: By train: go to Dumdum junction (just 2 stations from Sealdah by local train). Outside Dumdum station, auto rickshaws are available to Sinthee more. From Sinthee more, you can find a bus or an auto towards ISI.
Option 3: By bus: Outside Sealdah station, you can find many buses towards ISI. Among them, 230 and 234 are very frequent.

Howrah to ISI, Kolkata
Option 1 (recommended): Prepaid taxis are readily available outside the station. Ask the driver to go to the Indian Statistical Institute (near the Dunlop bridge crossing).
Option 2: By train: go to Bali junction (just 3 stations from Howrah by local train). Outside Bali station, walk a short distance and take the stairs up to the overbridge. There, auto rickshaws are available to Dunlop more. From Dunlop more, you can either walk for 5 minutes or find a bus or an auto towards ISI. Alternatively, you can take a bus from the overbridge to ISI (Belurmath-Garia, etc.).
Option 3: By bus: Outside Howrah station, you can find many buses towards ISI. There are some mini-buses (Belgharia-Howrah, etc.) and some buses on the Barrackpore-Howrah route which can take you to ISI.

Contact

Abhisek Chakrabarty (+91 94770 46387). Please send your queries to: acmsc@isical.ac.in