Project Guidelines¶

News ¶

NIST Fingerprint Example (03/09/2016)
HBase is now supported (03/02/2016)
Examples are under development (03/02/2016)
Projects, datasets, and technologies from the past are available (02/26/2016)

Important Dates ¶

Project Proposal: March 18th
Oral Presentation: Week 12 - April 1st, 2nd (Tentative)
Progress Checkup: Week 14 - April 15th, 16th (Tentative)
Final Submission: April 29th

Note

If a time conflict prevents you from presenting, schedule a meeting with the Course Team.

Submission ¶

IU GitHub: https://github.iu.edu/bdossp-sp16

Team Coordination ¶

Teams of up to 3 members are recommended, but individual projects are allowed.

Project Expectation (Grade)¶

The final project counts for 60% of the semester grade, and assignments count for the remaining 40%.

60% Final project
- 10% Proposal
- 10% Presentation
- 30% Source code
- 10% Report

You do not need a strong background or programming skills in HPC or Hadoop to complete a final project. We have noticed, however, that some students have difficulty learning Linux systems, shells, or scripting, and improving their programming skills with parallelization in general. You have two options, Basic and Bonus, for starting your project, depending on your skill level in these areas.

Basic Project ¶

A basic project starts from an existing project and extends its scope with minimal code development. For example, take existing Hadoop benchmark tools and run them on Hadoop clusters with different system configurations to compare the results. Try increasing the number of data nodes or master nodes, or adding ZooKeeper with different settings, and measure the differences. Comparing performance across software versions, settings, or configurations tells you where to focus in order to optimize or improve the throughput of Hadoop. Choose a basic project if you are not competent in programming languages such as Java or Python. Note that starting from existing projects doesn’t mean that you can simply search for popular projects on the internet, download them, and execute them. You need to present new findings and cite the original sources of the projects you referenced in your final project and reports.

Minimal code writing
Start from existing projects
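
As a rough illustration of the "measure differences" step in a basic project, the sketch below summarizes repeated benchmark timings per cluster configuration and reports each one relative to a baseline. The function name `summarize_runs`, the configuration labels, and the numbers in the usage example are hypothetical placeholders, not measurements from any real cluster.

```python
from statistics import median


def summarize_runs(runs_by_config, baseline):
    """Return {config: (median_seconds, speedup_vs_baseline)}.

    runs_by_config maps a configuration label to a list of wall-clock
    times in seconds; a speedup above 1 means faster than the baseline.
    """
    if baseline not in runs_by_config:
        raise KeyError(f"baseline {baseline!r} has no measurements")
    medians = {name: median(times) for name, times in runs_by_config.items()}
    base = medians[baseline]
    return {name: (m, base / m) for name, m in medians.items()}


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real measurements.
    example = {
        "3-datanodes": [120.0, 118.0, 125.0],
        "6-datanodes": [70.0, 66.0, 68.0],
    }
    for name, (secs, speedup) in summarize_runs(example, "3-datanodes").items():
        print(f"{name}: median {secs:.1f} s, speedup {speedup:.2f}x")
```

Using the median of several runs rather than a single timing is one simple way to reduce the effect of outliers; other summary statistics may suit your benchmark better.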

Bonus Project ¶

If you are working on a bonus project, you are required to write code or scripts to implement your idea in the final project. Installation and configuration should be done with Ansible Playbooks. For example, take the NIST Facial Recognition software and run it on Hadoop clusters, changing serial calculations so that they execute in parallel. Writing map and reduce functions in Java, Python, or Scala may be necessary. Write Ansible Playbooks that install and configure your software packages within a few commands. If data analytics is the area you are interested in, you may try to develop new techniques to improve performance or implement parallel algorithms for complex face detection. Developing parallel programs will be involved in most cases. There are other possibilities as well. For instance, take hadoop-ansible-stacks, which consists of the basic components of Hadoop, and add new software tools by writing new playbooks in roles and addons. You could add Hive or update Spark to the latest release using parameters or definitions in YAML. If you focus on managing systems and software deployments, think about how to manage traffic by adding or removing nodes, or how to apply new patches to particular nodes. Bonus points are given for exceptional project results.

Ansible is required
Extensive code and script writing is welcome
Using GitHub Issues is mandatory to communicate with AIs for your projects
Bonus points

Project Choice ¶

Deployment
Benchmark (Performance Test)
Parallelization
Analytics
Created Own (upon approval)

Deployment ¶

A deployment project focuses on automated software deployment on multiple nodes using automation and configuration management tools such as Ansible, Chef, Puppet, Salt, or Juju. For example, you can work on deploying Hadoop clusters with 10 medium virtual instances, sharded MongoDB clusters, or filesystems such as NFS or Gluster. Ansible is recommended and supported in the class.

Examples:

Deploying Hadoop clusters
Deploying cluster managers (e.g. Mesos)

Benchmark ¶

A benchmark project focuses on testing a system’s performance by putting stress on different components. For a hardware benchmark, filesystems, CPUs, or memory can be tested and measured. For a software benchmark, APIs, messaging queues, load balancers, or any applications can be tested and measured. HiBench, Big Data Benchmark, or built-in tools such as TeraSort are available for Hadoop benchmarks.

Examples:

HiBench
Storm Benchmark
Big Data Benchmark for Big Bench

Parallelization ¶

A parallelization project focuses on building efficient parallel software stacks, including MPI and Hadoop clusters. For example, you may find that writing map and reduce functions (e.g. WordCount) is relatively easy, but applying them in practice to large datasets is not that simple. Think about how to load your dataset into Hadoop file systems or databases and run your jobs in a distributed fashion.

Examples:

Pig
Spark
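
The WordCount case mentioned in this section can be sketched in plain Python to show the map, shuffle, and reduce steps in isolation. This is a single-process illustration only, not Hadoop or Spark code; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and on a real cluster the framework performs the shuffle and distributes the work across nodes.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
```

The hard part of a project is usually not these functions but everything around them: getting a large dataset into distributed storage, choosing a partitioning, and running the job across many nodes.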

Analytics ¶

An analytics project focuses on developing algorithms for different problems based on the datasets and topics that you choose for your project. You will be required to develop algorithms that improve parallelism or performance, rather than, for example, a new algorithm for face recognition.

Examples:

Faunus Graph analytics
Ibis

Created Own ¶

You can develop your own project idea and make it a class project upon approval. Describe your ideas, tools, and topics, and make a clear statement of the problems you identified in your project proposal.

Project Requirement ¶

Installation/Configuration by Ansible playbook or relevant tools (Ansible Roles)
Reproducibility - runnable on Linux distribution
Sample Dataset - up to 480GB per team
Up to 12 m1.medium VM instances are provided to each team
Software Stacks similar to Software Layers

Project Copyright ¶

Your project deliverables may be introduced in future classes or shared with others online after the end of the semester.

Project Proposal ¶

Please submit your project proposal by the due date. The proposal.rst RST file is provided in the project template repository. Fork this repository and write your proposal in the file under the ‘docs’ directory. The RST Quick Reference and the Online RST Editor may help. A project proposal is typically 1-2 pages long and should contain the following in the description section:

the nature of the project and its context
the technologies used
any proprietary issues
specific aims you intend to complete
and a list of intended deliverables (artifacts produced)

Oral Presentation ¶

You are required to demonstrate your project during the presentation week. A clear statement of the problem is necessary, along with the schedule, plan, roles of team members, and resources to use.

A student will use Adobe Connect to give a presentation which will be recorded.
3-5 minutes per team.
The presentation can be replaced with a written report upon approval. 1-2 page progress report(s) must be included.

Presentation Guideline ¶

Demonstrate the following criteria:
- team members (roles)
- problem definition
- list of technologies
- list of development tools, languages
- list of datasets and their availability
- schedule
- resources to use
All presentations will be recorded.

Progress Checkup ¶

The following activities will be evaluated:

Code development in a project repository
Participation of team members
Software installation
Dataset preparation

List of Possible Projects ¶

Note

We are currently working on this; any software and/or details are subject to change without notice. This is for reference only.

Big Data Analytics Stack
- Deployment project using Ansible Playbooks (Ansible Roles)

Software Layers¶
Layer	Supported	In Progress	Optional
Scheduling Layer	YARN		Mesos
Database Layer	HBase	MongoDB, MySQL	CouchDB, PostgreSQL, Memcached, Redis
Analytics Layer	Java	MLlib, Python	BLAS, LAPACK, Mahout, MLbase, R
Data Processing Layer	Hadoop MapReduce, Spark, Pig	Storm, Flink	Tez, Hama, Hive

You may consider working on the Big Data Analytics Stack using Ansible Playbooks. The default configuration of the stack is YARN + HDFS + Java + Hadoop MapReduce, Spark, and Pig. You can develop a new addon for one of the optional software packages and attach it to your stack. Find more details in big-data-stack and Ansible Roles.

Projects from Software Deployments ¶

Projects related to the hadoop stack consist of either extending the functionality or using the current features. This repository is intended to define a simple, easily deployable, customizable, data analytics stack built on hadoop. Currently, deployment is done to a virtual cluster running on OpenStack Kilo on FutureSystems.

big-data-stack¶
Title	Category	Data Sets	Technologies
big-data-stack	Software Deployments	n/a	Ansible

Projects Derived from Benchmarking Sets ¶

There are many benchmark sets such as BigDataBench, HiBench, Graph 500, BigBench, LinkBench, MineBench, BG Benchmark, Berkeley Big Data Benchmark, TPCx-HS, and CloudSuite. See http://dsc.soic.indiana.edu/publications/OgreFacetsv9.pdf

BigDataBench, ICT, Chinese Academy of Sciences¶
Title	Category	Data Sets	Technologies
Amazon Movie Reviews	Batch Data Analytics	8 million reviews	Hadoop Spark MPI
Google web graph	Batch Data Analytics	Webgraph from Google, 2002	Hadoop Spark MPI
Facebook Social Network	Batch Data Analytics	Facebook data	Hadoop Spark MPI
Genome sequence data	Batch Data Analytics	`.cfa` sample data (unstructured text file)	Work Queue (master/worker framework)

Wang, Lei, et al. “Bigdatabench: A big data benchmark suite from internet services.” High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 2014.

Storm, Hadoop, Hive, Mahout from Intel and Yahoo¶
Title	Category	Data Sets	Technologies
Storm Benchmark	Batch Data Analytics	https://github.com/intel-hadoop/storm-benchmark	Storm
Big Data Benchmark for Big Bench	Batch Data Analytics	https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench	Hadoop, Hive, Mahout

HiBench¶
Title	Category	Data Sets	Technologies
Micro Benchmarks Sort WordCount TeraSort EnhancedDFSIO	Batch Data Analytics	https://github.com/intel-hadoop/HiBench	Hadoop
Web Search Nutch Indexing Page Rank	Batch Data Analytics	https://github.com/intel-hadoop/HiBench	Mahout
Machine Learning Bayesian Classification K-means Clustering	Batch Data Analytics	https://github.com/intel-hadoop/HiBench	Mahout
OLAP Analytical Query Hive Join Hive Aggregation	Batch Data Analytics	https://github.com/intel-hadoop/HiBench	Hive

Other Benchmarking Sets¶
Title	Category	Data Sets	Technologies
Graph 500	Batch Data Analytics	main site	MPI
BigBench	Batch Data Analytics	main site	MapReduce Hadoop
LinkBench	Batch Data Analytics	main repo	Java MySQL
BG Benchmark	Batch Data Analytics	main site	MongoDB HBase VoltDB
Berkeley Big Data Benchmark	Data Systems	main site	Redshift Hive SparkSQL Impala Stinger/Tez
TPCx-HS	Data Systems	main site	Hadoop
CloudSuite	Batch Data Analytics	main site	MapReduce
MineBench	Batch Data Analytics	main site, Data Generator

Projects From NIST ¶

Possible Projects From NIST* (http://bigdatawg.nist.gov/_uploadfiles/M0399_v2_8471652990.doc)¶
Title	Category	Data Sets	Technologies
Fingerprint Matching	Batch Data Analytics	NIST Special Database 27a (Free) NIST Special Database 14, 29, 30 (non-Free)	Apache Hadoop Spark HBase
Human and Face Detection from Video (simulated streaming data)	Streaming Data Analytics	OpenCV, INRIA Person Dataset	Apache Hadoop Spark OpenCV Mahout MLlib
Live Twitter Analysis	Streaming Data Analytics	Live Twitter feed	Apache Storm HBase Twitter’s Search and Streaming APIs, D3.js Tableau
Big data Analytics for Healthcare Data/Health informatics	Batch Data Analytics	Medicare Part-B in 2014	Apache Hadoop Spark HBase Mahout Lucene/Solr MLlib
Spatial Big data/Spatial Statistics/Geographic Information Systems	Batch Data Analytics	Uber Ride Sharing GPS Data	Apache Hadoop Spark GIS-tools Mahout MLlib
Data Warehousing and Data mining	Batch Data Analytics	2010 Census Data Products: United States	Apache Hadoop Spark HBase MongoDB Hive Pig Mahout Lucene/Solr MLlib

*Reference URL of these projects: http://bigdatawg.nist.gov/_uploadfiles/M0399_v2_8471652990.doc

2015 Fall Suggested Projects¶
Title	Data set	Software	Category
NIST Fingerprint (a subset of): NFIQ PCASYS MINDTCT BOZORTH3 NFSEG SIVV	NIST Special Database 27A [4GB]	NIST Biometric Image Software (NBIS) v5.0 [userguide]	Batch Data Analytics
Hadoop Benchmark (each) - TeraSort Suite	Teragen	hadoop-examples.jar	Batch Data Analytics
Hadoop Benchmark (each) - DFSIO (HDFS Performance)		hadoop-mapreduce-client-jobclient	Batch Data Analytics
Hadoop Benchmark (each) - NNBench (NameNode Perf.)		hadoop-mapreduce-client-jobclient	Batch Data Analytics
Hadoop Benchmark (each) - MRBench (MapReduce Perf.)		src/test/org/apache/hadoop/mapred/MRBench.java	Batch Data Analytics

Projects from Other Sources ¶

Projects From Other Sources¶
Title	Category	Data Sets	Technologies
MapReduce Implementation for Longest Common Substring Problem	Batch Data Analytics	Escherichia coli K-12	Python Amazon MapReduce
MapReduce Implementation for GFF Parsing	Batch Data Analytics		Python Disco Amazon EC2 MapReduce

Examples from the previous class
- List of Project 2015 Fall (In Progress)

List of Datasets ¶

Note

We are currently working on this; any software and/or details are subject to change without notice. This is for reference only.

Examples from the previous class
- List of Datasets 2015 Fall

Note

There is no direct support on datasets.

Note

Inform the Course Team if you need large datasets. These will be prepared and made downloadable via /share/project2/FG491 on india.futuresystems.org

List of Technologies ¶

Note

We are currently working on this; any software and/or details are subject to change without notice. This is for reference only.

ABDS and HPC Technologies and Software Stacks
Examples from the previous class
- List of Technologies 2015 Fall
- List of Technologies (2015 Spring)

Note

There is no direct support on Analytics software.

Details on Software Submission ¶

Code submissions should be made on GitHub and include a README file.

Source code on GitHub: https://github.iu.edu/bdossp-sp16/sw-project-template

README includes:

Test instructions
List of data sources
List of technologies used

Details on Final Report ¶

The final report concludes your team’s work and describes your findings and results. The following sections should be included:

Description of your project
Problem statement
Purpose and objectives
Results
Findings
Implementation
References
- original source of code snippets
- original source of datasets

The final report should satisfy the following guidelines:

4 - 6 pages
Times New Roman 12 point – spacing 1.1 in Microsoft Word
Figures can be included
Proper citations must be included
Material may be taken from other sources, but it must amount to at most 25% of the report and must be cited
The level should be similar to a publishable paper or technical report

Details on Grading Criteria ¶

Proposal
- Clear statement
- Quality and Breadth
- Interest
Code
- Reproducibility
- Executable (Most weighted)
- Instruction of Installation
- Instruction of Configuration
- Datasets
- Acknowledgements
- Gee whiz factor
Report
- Related Work
- Completeness
- Level of insight

FAQ ¶

Q. Is the use of FutureSystems required?

A. No, it is not required, but your project must be deployable and runnable on FutureSystems Kilo, and you should provide detailed instructions on how to do so. Ideally, running ansible-playbook site.yml should be all that is needed to deploy, after booting and editing the inventory.txt file.
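
A minimal sketch of wrapping that one-command deployment in a script is shown below. It assumes the inventory file is passed explicitly with the `-i` option of `ansible-playbook`; whether your project needs that depends on your Ansible configuration, which this page does not specify. The helper names `playbook_command` and `deploy` are hypothetical.

```python
import subprocess


def playbook_command(playbook="site.yml", inventory="inventory.txt"):
    """Build the ansible-playbook argument list for the given inventory."""
    return ["ansible-playbook", "-i", inventory, playbook]


def deploy(dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = playbook_command()
    print(" ".join(cmd))
    if dry_run:
        return 0
    # Returns the exit status of ansible-playbook (0 on success).
    return subprocess.run(cmd, check=False).returncode
```

Calling `deploy()` only prints the command; `deploy(dry_run=False)` runs it and requires Ansible to be installed and the inventory to list your booted instances.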

Q. I need more time to complete code development. May I have an extension?

A. An extension can be approved upon request. Send an extension request email to the course email with the subject [Project Extension] and an expected completion date.

Q. Our team wants to change a topic or scope of a project after project proposal or presentation, is it allowed?

A. Topic should be close to what you proposed earlier. Please contact Dr. Fox or Course Email if you change a topic or a scope of your project significantly. Also inform if you change team members. These changes would be approved upon request.

Q. Is a report or survey type of final project allowed?

A. No, only software projects are allowed in this class.

Q. I found there is a similar project that I proposed, should I keep working on my project?

A. Consult with the Course Team to work out how your project differs in detail. You may be asked to focus on a specific area in order to avoid similarity.

Q. I can’t make an oral presentation because I have a business trip (or a conference).

A. Schedule a meeting in Week 11 or Week 13 with the Course Team.

Q. What does it mean that there is no direct support on Datasets and Analytics software?

A. We will provide support for accessing a dataset under 500 GB.

Questions & Support ¶

Course Email: bdosspcoursehelp@googlegroups.com
Google Hangout (voice & screen share): upon request

Useful Links ¶

Analytics tools
Scheduler
- Running Marathon with Apache Mesos
Database
GUI Tools
Python Tools
- Pydoop
- IPython Notebook (Jupyter)
Visualization
- Highcharts Getting Started
- Pyplot Tutorial