GEDIVA
Data are everywhere, especially given the increasing commitment by governments and corporations to making their processes transparent. The flip side is that we are overwhelmed by the sheer amount of data available and unable to turn it into useful information. However, many studies have shown that when data are analyzed, they can be used to gain a deeper understanding of a domain or even to make transformative changes based on the patterns found. Thus the main challenge we face is how to programmatically turn those data into information we can use. These web sites are examples of trying to start that process: Swivel, ManyEyes, and a Visualization Gallery. Google has also jumped into the mix, announcing its own Data Explorer.
Specification
A typical architecture for many program designs divides a program's execution into three stages: input, in which data is provided to the program; process, in which that data is transformed; and output, in which the program displays the results of the transformation. This input/process/output (IPO) model of programming is used in simple programs like this one as well as in million-line programs that forecast the weather or predict stock market fluctuations. Your final program should be clearly separated into three independent modules, each containing one or more classes that make it flexible enough to accommodate a variety of options without requiring either of the other modules to change. To do this, you must carefully design an Application Programming Interface (API) that defines the result of each step, and how to interact with it, so that the next step can receive it independently.
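As a rough illustration of the three-module separation described above, here is a minimal Python sketch. Every name in it (Table, DataSource, Transform, Renderer, run_pipeline, and the three small implementations) is hypothetical and chosen only for this example; the assignment does not prescribe these interfaces, and your team's API may look quite different.

```python
from typing import Protocol

# Assumption: the shared result of each step is a simple table,
# a list of rows where each row maps a column name to a string value.
Row = dict[str, str]
Table = list[Row]


class DataSource(Protocol):
    """Input stage: anything that can produce a Table."""
    def read(self) -> Table: ...


class Transform(Protocol):
    """Process stage: takes a Table and returns a new Table."""
    def apply(self, table: Table) -> Table: ...


class Renderer(Protocol):
    """Output stage: turns a Table into something displayable."""
    def render(self, table: Table) -> str: ...


def run_pipeline(source: DataSource, transforms: list[Transform],
                 renderer: Renderer) -> str:
    """Connect the stages; each one only knows about Table."""
    table = source.read()
    for transform in transforms:
        table = transform.apply(table)
    return renderer.render(table)


# Hypothetical implementations, one per stage.
class InMemorySource:
    def __init__(self, rows: Table) -> None:
        self.rows = rows

    def read(self) -> Table:
        return list(self.rows)


class SortBy:
    def __init__(self, column: str, reverse: bool = False) -> None:
        self.column = column
        self.reverse = reverse

    def apply(self, table: Table) -> Table:
        return sorted(table, key=lambda row: row[self.column],
                      reverse=self.reverse)


class PlainTextTable:
    def render(self, table: Table) -> str:
        if not table:
            return ""
        columns = list(table[0])
        lines = ["\t".join(columns)]
        lines += ["\t".join(row[c] for c in columns) for row in table]
        return "\n".join(lines)
```

Because run_pipeline depends only on the three interfaces, a new source, transform, or renderer can be swapped in without touching the other two modules, which is the property the specification asks for.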
Write a program to allow users to import different data formats and visualize them in different ways.
- Input: read data from a variety of sources, such as a file on the current computer or a location on the web, and in at least two different formats. For example, a CSV file or an HTML table of data.
- Process: filter and order data in a variety of ways. For example, sorting based on different data columns or filtering out out-of-range data. Note, it should be possible to reverse every ordering and filter.
- Output: display the data in a variety of ways. For example, as a bar graph or a data table. Note, users should be able to display multiple data sets at once.
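The three requirements above can be sketched in a few small functions. This is only one possible shape, using the Python standard library; the function names (load_text, read_csv, read_html_table, sort_by, keep_in_range, bar_graph) are made up for this example, the loader assumes UTF-8 text, and the HTML reader assumes a single simple table whose first row holds the column names.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def load_text(location):
    """Input: fetch raw text from the web or from a local file (UTF-8 assumed)."""
    if location.startswith(("http://", "https://")):
        with urlopen(location) as response:
            return response.read().decode("utf-8")
    with open(location, encoding="utf-8") as handle:
        return handle.read()


def read_csv(text):
    """Input format 1: CSV text whose first row holds the column names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


class _TableParser(HTMLParser):
    """Collects the cell text of every <tr> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def read_html_table(text):
    """Input format 2: an HTML table, returned in the same shape as read_csv."""
    parser = _TableParser()
    parser.feed(text)
    parser.close()
    rows = [[cell.strip() for cell in row] for row in parser.rows if row]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def sort_by(table, column, key=str, reverse=False):
    """Process: order rows by one column; reverse=True flips the ordering."""
    return sorted(table, key=lambda row: key(row[column]), reverse=reverse)


def keep_in_range(table, column, low, high, invert=False):
    """Process: keep rows with low <= value <= high; invert=True keeps the rest."""
    return [row for row in table
            if (low <= float(row[column]) <= high) != invert]


def bar_graph(table, label, value, width=20):
    """Output: a text bar graph, scaled so the largest value fills `width`."""
    values = [float(row[value]) for row in table]
    biggest = max(values, default=0)
    lines = []
    for row, number in zip(table, values):
        length = round(width * number / biggest) if biggest > 0 else 0
        lines.append(f"{row[label]:<10} {'#' * length}")
    return "\n".join(lines)
```

Both readers return the same list-of-dictionaries structure, so the process and output functions never need to know which format the data came from. A real View would draw graphics rather than text, and could call an output function once per data set to show several at once.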
The Input phase can be viewed as the back-end, or Model, of the program, while the Output phase is the front-end, or View. The Process phase is mostly in the Model, but can have components in the View as well. For this project, one pair of students will work exclusively on the Model and one pair will work only on the View. Your pairs will agree on an API between you that cannot be changed after the first week. If it must be changed, the change and its reasons must be clearly described in a separate document turned in with the final version of the project.
Extensions
These extensions are intended to stretch your design further and to differentiate your program from others in order to capture the global data visualization market. Your team should agree on one area of extensions to focus on if you want to be considered for a grade in the A range. These extensions must further the good design of your program and not simply be hacks of code added at the last minute. If you do not have time to implement an extension, partial extra credit may be given for an excellent justification of how your design either already supports adding such a feature or would need to be changed to support it.
- Support other data types besides row-column data tables
  - Text-based data: produce interesting visualizations of text data, such as Tag Clouds or Text Trees
  - Graph-based data, like that used in social networking sites
  - Image-based data, especially useful for geo-located data
- Support different output formats than just graphics
  - Images, to be included on a website
  - HTML, to be displayed as a web page
- Allow users to customize a visualization via imported templates
  - A background to display behind the data
  - A color palette that swaps the colors displayed
- Allow users to interact with the visualized data
  - Animate the data over some variable such as time (or provide a slider to let the user control the animation)
  - Sort the data displayed on different criteria, like in most column data displayed on the web
  - Change the values displayed in the axes (or swap the current axes)
  - Change the displayed data to produce "what-if" scenarios
- Allow users to manage their data sites
  - Organize data sites into history, favorites, or other organizational structure (like tags)
  - Alert users when the data they are using has changed on the original web site
- Create an aggregation language for users
  - Textual, like SQL, the Structured Query Language for accessing database systems
  - Graphical, like Yahoo Pipes for combining web sources
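To give a feel for the textual language option, here is a deliberately tiny Python sketch. The syntax is invented for this example and is far simpler than SQL: steps are separated by "|", and only "filter <column> <op> <number>" and "sort <column> [asc|desc]" are understood. It covers filtering and ordering only; real aggregation (sums, counts, joins across sources) is left out. It assumes a table is a list of dictionaries mapping column names to string values, and it sorts by string comparison.

```python
import operator

# Comparison operators accepted by the hypothetical "filter" step.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def run_query(table, query):
    """Apply each '|'-separated step of the query to the table, left to right."""
    for step in query.split("|"):
        words = step.split()
        if not words:
            continue  # ignore empty steps such as a trailing '|'
        if words[0] == "filter" and len(words) == 4 and words[2] in OPS:
            _, column, op, number = words
            compare, limit = OPS[op], float(number)
            table = [row for row in table
                     if compare(float(row[column]), limit)]
        elif words[0] == "sort" and len(words) == 2:
            column = words[1]
            table = sorted(table, key=lambda row: row[column])
        elif (words[0] == "sort" and len(words) == 3
              and words[2] in ("asc", "desc")):
            column = words[1]
            table = sorted(table, key=lambda row: row[column],
                           reverse=words[2] == "desc")
        else:
            raise ValueError(f"cannot understand step: {step.strip()!r}")
    return table
```

For example, run_query(rows, "filter pop > 100 | sort city desc") keeps the rows whose pop column exceeds 100 and then orders them by city in descending order. If your Process module already exposes filters and orderings through a clean API, a language like this should need little more than a parser on top of it.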
However, note that the amount of extra credit will be in proportion to the amount of intellectual effort needed to implement the option. For example, adding yet another way to filter key words would not be worth very much because your design should already support it. Of course, a well-tested, perfectly working program that has fewer features (but plenty of clear paths to easy expansion) is always worth more than the leaky kitchen sink. In short, to maximize your grade, you should implement enough variety in your program to clearly demonstrate that your design supports further such extensions.
Resources
- Josh Bloch on API Design
- A Data Type to Exploit Online Data Sources by M. Thornton and S. Edwards
- On the Criteria to be used in Decomposing a System into Modules by D.L. Parnas
- MapReduce: Simplified Data Processing on Large Clusters by J. Dean and S. Ghemawat