Project Guidelines


  • NIST Fingerprint Example (03/09/2016)
  • HBase is now supported (03/02/2016)
  • Examples are under development (03/02/2016)
  • Projects, datasets, and technologies from the past are available (02/26/2016)

Important Dates

  • Project Proposal: March 18th
  • Oral Presentation: Week 12 - April 1st, 2nd (Tentative)
  • Progress Checkup: Week 14 - April 15th, 16th (Tentative)
  • Final Submission: April 29th


If a time conflict prevents you from giving your presentation, schedule a meeting with the Course Team.

Team Coordination

Teams of up to 3 members are recommended, but individual projects are allowed.

Project Expectation (Grade)

The final project counts for 60% of the semester grade; assignments make up the remaining 40%.

  • 60% Final project
    • 10% Proposal
    • 10% Presentation
    • 30% Source code
    • 10% Report

Project Style

  • Basic
  • Bonus

You do not need a strong background or programming skills in HPC or Hadoop to complete a final project. We have noticed, however, that some students have difficulty learning Linux systems, shells, and scripting, and improving their parallel programming skills in general. You therefore have two options for starting your project, Basic and Bonus, depending on your experience in these areas.

Basic Project

A Basic project starts from an existing project and extends its scope with minimal code development. For example, take existing Hadoop benchmark tools and run them on Hadoop clusters with different system configurations, then compare the results. Try increasing the number of data nodes or master nodes, or adding ZooKeeper with different settings, and measure the differences. Comparing performance across software versions, settings, or configurations shows you where to focus in order to optimize or improve Hadoop throughput. Choose a Basic project if you are not competent in a programming language such as Java or Python. Note that starting from an existing project does not mean that you can simply search for a popular project on the internet, download it, and execute it. You need to present new findings and cite the original source of every project you referenced in your final project and reports.

  • Minimal code writing
  • Start from existing projects

Bonus Project

If you are working on a Bonus project, you are required to write code or scripts to implement your idea. Installation and configuration should be done with Ansible Playbooks. For example, take the NIST Facial Recognition software and run it on Hadoop clusters, changing serial calculations so that they execute in parallel. You may need to write map and reduce functions in Java, Python, or Scala. Write Ansible Playbooks so that your software packages can be installed and configured with a few commands. If data analytics is the area that interests you, you may try to develop new techniques to improve performance, or implement parallel algorithms for complex face detection. Developing parallel programs will be involved in most cases. There are other possibilities as well. For instance, take hadoop-ansible-stacks, which consists of the basic components of Hadoop, and add new software tools by writing new playbooks in roles and addons. You could add Hive, or update Spark to the latest release, using parameters or definitions in YAML. If you focus on managing systems and software deployments, think about how to manage traffic by adding or removing nodes, or how to apply new patches to particular nodes. Bonus points are given for exceptional project results.

  • Ansible is required
  • Extensive code and script writing is welcome
  • Using GitHub Issues is mandatory to communicate with AIs for your projects
  • Bonus points

Project Choice

  • Deployment
  • Benchmark (Performance Test)
  • Parallelization
  • Analytics
  • Created Own (upon approval)


A Deployment project focuses on automated software deployment to multiple nodes using automation and configuration management tools such as Ansible, Chef, Puppet, Salt, or Juju. For example, you can work on deploying a Hadoop cluster on 10 medium virtual instances, a sharded MongoDB cluster, or a file system such as NFS or Gluster. Ansible is recommended and supported in the class.


  • Deploying Hadoop clusters
  • Deploying cluster managers (e.g. Mesos)


A Benchmark project focuses on testing a system’s performance by putting stress on different components. For a hardware benchmark, file systems, CPUs, or memory can be tested and measured. For a software benchmark, APIs, message queues, load balancers, or any application can be tested and measured. HiBench, Big Data Benchmark, and built-in tools such as TeraSort are available for benchmarking Hadoop.


  • HiBench
  • Storm Benchmark
  • Big Data Benchmark for Big Bench
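
A benchmark ultimately reduces to timing a well-defined workload. The sketch below, assuming a local Linux file system and a hypothetical helper named measure_write_throughput, times sequential writes to a temporary file and reports the rate in MB/s. It only illustrates the measurement pattern and is not a substitute for the benchmark suites listed above.

```python
import os
import tempfile
import time


def measure_write_throughput(total_mb=8, block_kb=64):
    """Write roughly total_mb of random data and return the rate in MB/s.

    The data is written in blocks of block_kb kilobytes to a temporary
    file, then flushed and synced so that the timing includes the cost
    of pushing the data to the storage device.
    """
    block = os.urandom(block_kb * 1024)
    blocks = max(1, (total_mb * 1024) // block_kb)
    with tempfile.NamedTemporaryFile() as handle:
        start = time.perf_counter()
        for _ in range(blocks):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    written_mb = blocks * len(block) / (1024 * 1024)
    return written_mb / elapsed


if __name__ == "__main__":
    rate = measure_write_throughput()
    print(f"Sequential write: {rate:.1f} MB/s")
```

Run the measurement several times and vary one parameter at a time (for example the block size) so that differences in the results can be attributed to a single cause.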


A Parallelization project focuses on building efficient parallel software stacks, including MPI and Hadoop clusters. For example, you may find that writing map and reduce functions for something like WordCount is relatively easy, but applying them in practice to large datasets is not that simple. Think about how to load your dataset into the Hadoop file system or a database and run your jobs in a distributed fashion.


  • Pig
  • Spark
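
To make the WordCount example concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop or Spark; the function names are hypothetical and the in-memory shuffle stands in for the grouping that a real framework performs across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop runs MapReduce jobs", "Spark runs jobs too"]
    print(word_count(sample))
```

The logic stays this simple at any scale; the practical difficulty lies in getting a large dataset into distributed storage and letting the framework split the map and reduce work across nodes.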


An Analytics project focuses on developing algorithms for different problems based on the datasets and topics that you choose. In this project you will be required to develop algorithms that improve parallelism or performance, rather than, for example, developing a new algorithm for face recognition.


  • Faunus Graph analytics
  • Ibis

Created Own

You can develop your own project idea and, upon approval, make it your class project. In your project proposal, describe your idea, tools, and topics, and make a clear statement of the problems you identified.

Project Requirement

  • Installation/Configuration by Ansible playbook or relevant tools (Ansible Roles)
  • Reproducibility - runnable on a Linux distribution
  • Sample Dataset - up to 480GB per team
  • Up to 12 m1.medium VM instances are available to each team
  • Software Stacks similar to Software Layers

Project Proposal

Please submit your project proposal by the due date. The RST file proposal.rst is provided in the project template repository. Fork this repository and write your proposal in that file under the ‘docs’ directory. The RST Quick Reference and the Online RST Editor may help with the format. A project proposal is typically 1-2 pages long, and its description section should contain:

  • the nature of the project and its context
  • the technologies used
  • any proprietary issues
  • specific aims you intend to complete
  • and a list of intended deliverables (artifacts produced)

Oral Presentation

You are required to demonstrate your project during the presentation week. A clear statement of the problem is necessary, along with your schedule, plan, team member roles, and the resources you will use.

  • A student will use Adobe Connect to give a presentation which will be recorded.
  • 3-5 minutes per team.
  • Presentation can be substituted with written reports upon approval. 1-2 page progress report(s) need to be included.

Presentation Guideline

  • Demonstrate the following criteria:
    • team members (roles)
    • problem definition
    • list of technologies
    • list of development tools, languages
    • list of datasets and their availability
    • schedule
    • resources to use
  • All presentations will be recorded.

Progress Checkup

The following activities will be evaluated:

  • Code development in a project repository
  • Participation of team members
  • Software installation
  • Dataset preparation

List of Possible Projects


We are currently working on this list; any software and details are subject to change without notice. It is provided for reference only.

  • Big Data Analytics Stack
Software Layers

Layer | Supported | In Progress | Optional
--- | --- | --- | ---
Scheduling Layer | YARN | | Mesos
Database Layer | HBase | MongoDB, MySQL | CouchDB, PostgreSQL, Memcached, Redis
Analytics Layer | Java | MLlib, Python | BLAS, LAPACK, Mahout, MLbase, R
Data Processing Layer | Hadoop MapReduce, Spark, Pig | Storm, Flink | Tez, Hama, Hive

You may consider working on the Big Data Analytics Stack using Ansible Playbooks. The default configuration of the stack is YARN + HDFS + Java + Hadoop MapReduce, Spark, and Pig. You can develop a new addon for one of the optional software packages and attach it to your stack. See big-data-stack and Ansible Roles for more details.

Projects from Software Deployments

Projects related to the Hadoop stack either extend its functionality or use its current features. This repository is intended to define a simple, easily deployable, customizable data analytics stack built on Hadoop. Currently, deployment is done to a virtual cluster running on OpenStack Kilo on FutureSystems.

Title | Category | Data Sets | Technologies
--- | --- | --- | ---
big-data-stack | Software Deployments | n/a | Ansible

Projects Derived from Benchmarking Sets

There are many benchmark sets such as BigDataBench, HiBench, Graph 500, BigBench, LinkBench, MineBench, BG Benchmark, Berkeley Big Data Benchmark, TPCx-HS, and CloudSuite.

BigDataBench, ICT, Chinese Academy of Sciences

Title | Category | Data Sets | Technologies
--- | --- | --- | ---
Amazon Movie Reviews | Batch Data Analytics | 8 million reviews | Hadoop, Spark, MPI
Google web graph | Batch Data Analytics | Webgraph from Google, 2002 | Hadoop, Spark, MPI
Facebook Social Network | Batch Data Analytics | Facebook data | Hadoop, Spark, MPI
Genome sequence data | Batch Data Analytics | .cfa sample data (unstructured text file) | Work Queue (master/worker framework)

Wang, Lei, et al. “Bigdatabench: A big data benchmark suite from internet services.” High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 2014.

Storm, Hadoop, Hive, Mahout from Intel and Yahoo

Title | Category | Data Sets | Technologies
--- | --- | --- | ---
Storm Benchmark | Batch Data Analytics | | Storm
Big Data Benchmark for Big Bench | Batch Data Analytics | | Hadoop, Hive, Mahout

Title | Category | Data Sets | Technologies
--- | --- | --- | ---
Micro Benchmarks (Sort, WordCount, TeraSort, EnhancedDFSIO) | Batch Data Analytics | | Hadoop
Web Search (Nutch Indexing, Page Rank) | Batch Data Analytics | | Mahout
Machine Learning (Bayesian Classification, K-means Clustering) | Batch Data Analytics | | Mahout
OLAP Analytical Query (Hive Join, Hive Aggregation) | Batch Data Analytics | | Hive

Other Benchmarking Sets

Title | Category | Data Sets | Technologies
--- | --- | --- | ---
Graph 500 | Batch Data Analytics | main site | MPI
BigBench | Batch Data Analytics | main site | MapReduce, Hadoop
LinkBench | Batch Data Analytics | main repo | Java, MySQL
BG Benchmark | Batch Data Analytics | main site | MongoDB, HBase, VoltDB
Berkeley Big Data Benchmark | Data Systems | main site | Redshift, Hive, SparkSQL, Impala, Stinger/Tez
TPCx-HS | Data Systems | main site | Hadoop
CloudSuite | Batch Data Analytics | main site | MapReduce
MineBench | Batch Data Analytics | main site, Data Generator |

Projects From NIST

Possible Projects From NIST

Title | Category | Data Sets | Technologies
--- | --- | --- | ---
Fingerprint Matching | Batch Data Analytics | NIST Special Database 27a (Free); NIST Special Database 14, 29, 30 (non-Free) | Apache Hadoop, Spark, HBase
Human and Face Detection from Video (simulated streaming data) | Streaming Data Analytics | OpenCV, INRIA Person Dataset | Apache Hadoop, Spark, OpenCV, Mahout, MLlib
Live Twitter Analysis | Streaming Data Analytics | Live Twitter feed | Apache Storm, HBase, Twitter’s Search and Streaming APIs, D3.js, Tableau
Big data Analytics for Healthcare Data/Health informatics | Batch Data Analytics | Medicare Part-B in 2014 | Apache Hadoop, Spark, HBase, Mahout, Lucene/Solr, MLlib
Spatial Big data/Spatial Statistics/Geographic Information Systems | Batch Data Analytics | Uber Ride Sharing GPS Data | Apache Hadoop, Spark, GIS-tools, Mahout, MLlib
Data Warehousing and Data mining | Batch Data Analytics | 2010 Census Data Products: United States | Apache Hadoop, Spark, HBase, MongoDB, Hive, Pig, Mahout, Lucene/Solr, MLlib

2015 Fall Suggested Projects

Title | Data set | Software | Category
--- | --- | --- | ---
NIST Fingerprint (a subset of): NFIQ, PCASYS, MINDTCT, BOZORTH3, NFSEG, SIVV | NIST Special Database 27A [4GB] | NIST Biometric Image Software (NBIS) v5.0 [userguide] | Batch Data Analytics
Hadoop Benchmark (each) - TeraSort Suite | Teragen | hadoop-examples.jar | Batch Data Analytics
Hadoop Benchmark (each) - DFSIO (HDFS Performance) | | hadoop-mapreduce-client-jobclient | Batch Data Analytics
Hadoop Benchmark (each) - NNBench (NameNode Perf.) | | hadoop-mapreduce-client-jobclient | Batch Data Analytics
Hadoop Benchmark (each) - MRBench (MapReduce Perf.) | | src/test/org/apache/hadoop/mapred/ | Batch Data Analytics

Projects from Other Sources

Projects From Other Sources

Title | Category | Data Sets | Technologies
--- | --- | --- | ---
MapReduce Implementation for Longest Common Substring Problem | Batch Data Analytics | Escherichia coli K-12 | Python, Amazon, MapReduce
MapReduce Implementation for GFF Parsing | Batch Data Analytics | | Python, Disco, Amazon EC2, MapReduce

List of Datasets


We are currently working on this list; any software and details are subject to change without notice. It is provided for reference only.


There is no direct support for datasets.


Notify the Course Team if you need a large dataset. Such datasets will be prepared and made available for download via /share/project2/FG491 on

List of Technologies


We are currently working on this list; any software and details are subject to change without notice. It is provided for reference only.


There is no direct support for Analytics software.

Details on Software Submission

Submit your code on GitHub, including a README file.

The README should include:

  • Test instructions
  • List of data sources
  • List of technologies used

Details on Final Report

The final report concludes your team’s work and describes your findings and results. The following sections should be included:

  • Description of your project

  • Problem statement

  • Purpose and objectives

  • Results

  • Findings

  • Implementation

  • References
    • original source of code snippets
    • original source of datasets

The final report should satisfy the following guidelines:

  • 4 - 6 pages
  • Times Roman, 12 point, line spacing 1.1 in Microsoft Word
  • Figures can be included
  • Proper citations must be included
  • Material may be taken from other sources but that must amount to at most 25% of original work and must be cited
  • The level should be similar to a publishable paper or technical report

Details on Grading Criteria

  • Proposal
    • Clear statement
    • Quality and Breadth
    • Interest
  • Code
    • Reproducibility
    • Executable (Most weighted)
    • Installation instructions
    • Configuration instructions
    • Datasets
    • Acknowledgements
    • Gee whiz factor
  • Report
    • Related Work
    • Completeness
    • Level of insight


Q. Is use of FutureSystems required?

A. No, it is not required, but your project must be deployable and runnable on FutureSystems Kilo, and you should provide detailed instructions on how to do so. Ideally, running ansible-playbook site.yml should be all that is needed to deploy, after booting and editing the inventory.txt file.
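
As a rough illustration of that single-command deployment, the Python sketch below assembles an ansible-playbook invocation and, by default, only prints it. Passing inventory.txt through the -i option is an assumption (your inventory may instead be configured elsewhere), and the helper names build_deploy_command and deploy are hypothetical.

```python
import shlex
import subprocess


def build_deploy_command(playbook="site.yml", inventory="inventory.txt"):
    """Return the ansible-playbook command line as a list of arguments."""
    # Assumption: the inventory file is passed explicitly with -i.
    return ["ansible-playbook", "-i", inventory, playbook]


def deploy(dry_run=True, playbook="site.yml", inventory="inventory.txt"):
    """Print the deployment command, or execute it when dry_run is False.

    Returns 0 for a dry run, otherwise the exit code of ansible-playbook.
    """
    command = build_deploy_command(playbook, inventory)
    if dry_run:
        print(shlex.join(command))
        return 0
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Dry run only; set dry_run=False on a machine with Ansible installed.
    deploy(dry_run=True)
```

Documenting the exact command in your README, together with the steps needed to prepare the inventory file, lets graders reproduce your deployment without guessing.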

Q. I need more time to complete code development. May I have an extension?

A. An extension can be approved upon request. Send an email to the course email address with the title [Project Extension] and an expected completion date.

Q. Our team wants to change the topic or scope of our project after the project proposal or presentation. Is that allowed?

A. The topic should be close to what you proposed earlier. Please contact Dr. Fox or the course email if you change the topic or scope of your project significantly. Also let us know if you change team members. These changes can be approved upon request.

Q. Is a report or survey type of final project allowed?

A. No, only software projects are allowed in this class.

Q. I found a project similar to the one I proposed. Should I keep working on my project?

A. Consult with the Course Team to work out the differences in detail. You may be asked to focus on a specific area in order to avoid similarity.

Q. I can’t give an oral presentation because I have a business trip (or a conference).

A. Schedule a meeting in Week 11 or Week 13 with the Course Team.

Q. What does it mean that there is no direct support on Datasets and Analytics software?

A. We will provide support for accessing a dataset under 500 GB.
Questions & Support