Starfish
    For the latest updates and releases of Starfish, please visit the
Starfish website.
    Starfish is a workflow management and query processing system built on top of multiple storage engines that supports optimization at multiple levels. The main contributions of Starfish lie in two aspects: optimization of multi-query processing and smart
use of multiple storage engines.
    Most current optimization strategies for query processing focus on improving
the performance of an individual query or workflow. In reality, however, multiple queries are usually submitted together as a batch, which gives us more
opportunities for optimization at a global level, and this has not yet been explored extensively. In Starfish, we optimize multiple queries by
exploiting their correlations, choosing the right partitioning method, and reusing computation and intermediate data. In detail, we manage and optimize at three levels:
the job level, which focuses on parameter tuning for a single job; the workflow level, which optimizes the execution plan and picks the right operators, much as
Pig does; and the workload level, which merges nodes and operators
from different execution plans to reuse intermediate computation and results, so that we may achieve a globally better plan even if some individual workflows become
sub-optimal. Being aware of the entire data life cycle, we can make smart decisions for both the current and future workload.
    The other key contribution of Starfish is the
smart use of multiple storage engines. It has long been argued that databases outperform MapReduce-like systems by supporting schemas, indexing,
partitioning, and data-level statistics. However, these runtime performance gains come at the high cost of the loading process, which may outweigh the benefits
for ad-hoc queries. Starfish combines the distributed file system, column- and row-oriented storage, and other engines so that, at run time, data resides in the
storage engine appropriate to the specific context, and the execution engine makes the most of that engine to speed up query processing. This project also
builds on our past and ongoing work in join optimization,
cost modeling, and automatic configuration tuning.
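    As an illustration of the workload-level idea above, the following is a minimal sketch of detecting shared scans across a batch of submitted jobs. The Job class, its fields, and the findSharedScans method are hypothetical and do not reflect Starfish's actual plan representation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: group the jobs in a batch by the dataset their leaf scan reads.
public class SharedScanDetector {

    static class Job {
        final String name;
        final String inputPath;   // dataset the job's leaf scan reads (assumed field)
        Job(String name, String inputPath) {
            this.name = name;
            this.inputPath = inputPath;
        }
    }

    // Any group with more than one job is a candidate for running the scan once
    // and feeding its output to all the workflows in that group.
    static Map<String, List<Job>> findSharedScans(List<Job> batch) {
        Map<String, List<Job>> byInput = new HashMap<>();
        for (Job job : batch) {
            byInput.computeIfAbsent(job.inputPath, k -> new ArrayList<>()).add(job);
        }
        byInput.values().removeIf(group -> group.size() < 2);
        return byInput;
    }
}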

Figure 1. Starfish ecosystem
    In this project, I mainly work on computation reuse and on workload integration and optimization. I am also developing an interface called
Lastword that takes declarative queries and MapReduce jobs as input, represents them as logical plans, and translates queries into SQL statements for
processing in an RDBMS when necessary.
Publication:
Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu.
Starfish: A Self-tuning System for Big Data Analytics.
In Proceedings of the Conference on Innovative Data Systems Research (CIDR '11), January 2011.
Paper
Slides
Adaptive MapReduce
    MapReduce, the idea of processing large-scale data in parallel, and its open-source implementation Hadoop are becoming more and more popular. Even though it is easy to get started (you can immediately process terabyte-scale data on a cluster just by writing a small piece of code), there is more we can explore beyond simply expanding the cluster to make it run more efficiently.
Adaptive MapReduce aims to perform such tuning automatically, without the job submitter even being aware of it. Achieving this self-optimization first requires a good understanding of the characteristics of the cluster and of how Hadoop behaves. Characteristics of the data (e.g., the degree of skew) must also be considered to avoid load imbalance.
    Currently, we are focusing on learning data characteristics and building a cost model
for the MapReduce environment. Extensive experiments are ongoing to capture how Hadoop behaves, how the cluster responds to different configurations,
and how much parallelism we can exploit with given resources. Based on these results, a cost model is being built to help balance the workload across
nodes. Beyond this general work, I focus in particular on parallel joins. Building on join approaches we developed earlier, I am now
investigating 1) how to remove the memory limitation and make map-side joins more widely applicable; 2) how to make good use of semi-joins for data partitioning and
skew handling; 3) how to tune the parameters of a join task on the fly; and 4) how to choose a join plan from the plan space automatically. Many of these
goals are meaningful because neither Pig nor Hive has realized them.
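    As a rough illustration of the kind of cost estimate involved, here is a minimal back-of-the-envelope sketch. The throughput parameters, the selectivity term, and the formula itself are illustrative assumptions, not the project's actual model.

// Illustrative only: estimate the wall-clock seconds a node spends on one wave of
// tasks as map read time + shuffle time + reduce time, driven by measured throughputs.
public class SimpleJobCostModel {

    static double estimateSeconds(long inputBytes,
                                  double mapBytesPerSec,      // measured map read/process rate
                                  double shuffleBytesPerSec,  // measured network shuffle rate
                                  double reduceBytesPerSec,   // measured reduce process rate
                                  double mapSelectivity) {    // shuffled bytes / input bytes
        double shuffledBytes = inputBytes * mapSelectivity;
        return inputBytes / mapBytesPerSec
                + shuffledBytes / shuffleBytesPerSec
                + shuffledBytes / reduceBytesPerSec;
    }
}

    An estimate of this kind, computed per node, is the sort of quantity that would let a planner shift input splits toward faster or less loaded nodes.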
    This work is led by
Prof. Shivnath
Babu and is ongoing. We recently received an AWS research grant from Amazon, which will further support our project. For the latest updates, click
HERE.
Efficient Joins in Hadoop 
[PDF1] 
[PDF2]
    Join is among the most costly operations in any data processing paradigm, so we should explore options beyond the default join approach and choose well among those methods for a particular join task. As a relatively new technology, MapReduce and its open-source implementation Hadoop expose us to terabyte-scale data processing. Our work focuses on implementing additional join approaches and seeing how they perform in a given context.
    Our experiments show that network bandwidth is usually the biggest performance bottleneck, so we can benefit greatly by reducing the amount of data transferred over the network. One way to do this is the map-side join, which involves only the map phase (known as the replication join in the database community) and thus avoids transferring data from mappers to reducers. However, if the replicated table cannot fit into memory, the map-side join fails (we do not want to swap buckets between memory and disk). One solution is to first run a semi-join to eliminate the records that will not appear in the join result. The semi-join gives us all the keys that appear in both tables, as well as by-products such as information about the data distribution. In this project, we use the semi-join result only as a mask for filtering out useless records. Even after filtering, the small table may still not fit into memory; in that case we fall back to the default reduce-side approach but use the semi-join result to eliminate useless records on the map side, thereby reducing the amount of data transferred to the reduce side.
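    To make the map-side branch of this approach concrete, here is a minimal sketch of a mapper whose in-memory build table has already been shrunk by the semi-join filter. The local file name, the tab-separated column layout, and the class name are illustrative assumptions, not the project's actual code, and the fallback reduce-side path is not shown.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: a map-side (replication) join over a semi-join-filtered small table.
public class FilteredMapJoinMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Map<String, String> buildTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumption: the small table, already filtered by the semi-join key list, has been
        // shipped to every node (e.g. via the distributed cache) under this illustrative
        // local name, one "key <TAB> value" record per line.
        BufferedReader in = new BufferedReader(new FileReader("small_table_filtered.txt"));
        String row;
        while ((row = in.readLine()) != null) {
            String[] kv = row.split("\t", 2);
            buildTable.put(kv[0], kv[1]);
        }
        in.close();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assume the join key is the first tab-separated column of the big table.
        String[] fields = line.toString().split("\t", 2);
        String match = buildTable.get(fields[0]);
        if (match != null) {
            // Only matching records are emitted, and nothing is shuffled to reducers.
            context.write(new Text(line.toString() + "\t" + match), NullWritable.get());
        }
    }
}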
    We built a simple planner to choose the join method likely to be most efficient for a given situation. We also implemented a demo system with a SQL-like interface that supports an arbitrary number of projection and selection predicates.
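    The planner's decision can be sketched as a simple rule chain like the one below; the method names and the memory-based test are illustrative, not the actual planner.

// Illustrative only: pick a join method based on whether the (possibly filtered)
// small table fits in a mapper's memory.
public class SimpleJoinPlanner {

    enum JoinMethod { MAP_JOIN, SEMI_JOIN_FILTERED_MAP_JOIN, REPARTITION_JOIN }

    static JoinMethod choose(long smallTableBytes,
                             long filteredSmallTableBytes,
                             long mapperMemoryBytes) {
        if (smallTableBytes <= mapperMemoryBytes) {
            return JoinMethod.MAP_JOIN;                       // whole table fits in memory
        }
        if (filteredSmallTableBytes <= mapperMemoryBytes) {
            return JoinMethod.SEMI_JOIN_FILTERED_MAP_JOIN;    // fits after semi-join filtering
        }
        return JoinMethod.REPARTITION_JOIN;                   // fall back to the default reduce-side join
    }
}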
    This is a tentative first step on MapReduce join approaches. It is being shaped and improved as part of our later work on
Adaptive Hadoop.
    This work is in cooperation with Liang Dong.
Trace Driven Prefetching 
[PDF]
    Many computer programs (e.g., object-oriented programs) contain data structures whose contents are not stored sequentially in memory, including pointer-based structures such as linked lists and graphs. Two elements that are logically adjacent may reside at arbitrary memory locations, which increases caching complexity and the cache miss rate. Building on previous work on prefetching based on spatial locality or fixed access patterns, we decided to use a Markov model to capture the memory access pattern and keep that pattern updated dynamically. The Markov model traces the data access pattern and revises itself over time; we simulate it with a table stored in a buffer. To avoid polluting the real cache, we add an extra cache for prefetched data and make it as fast as L1, so there is no extra latency even when we miss the L1 cache.
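    A minimal, simulation-level sketch of the table-driven Markov predictor described above; the class and method names are illustrative, not the simulator's actual code.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: for each observed address, count how often each successor
// address followed it, and predict the most frequent successor.
public class MarkovPrefetcher {

    private final Map<Long, Map<Long, Integer>> transitions = new HashMap<>();
    private long previous = -1;

    // Record one memory access and update the transition counts.
    public void access(long address) {
        if (previous != -1) {
            transitions.computeIfAbsent(previous, k -> new HashMap<>())
                       .merge(address, 1, Integer::sum);
        }
        previous = address;
    }

    // Predict the most likely next address after `address`, or -1 if unknown.
    // The simulated prefetch buffer would fetch this line ahead of time.
    public long predictNext(long address) {
        Map<Long, Integer> successors = transitions.get(address);
        if (successors == null) return -1;
        return successors.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(-1L);
    }
}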
    This work is in cooperation with James Wu and Matthew Fulmer.
Interests Mining and Personalized Recommendation Service
    When you google "apple", you probably get about 38,600,000 results for this keyword. Are you patient enough to check each item returned on the first 5 pages? Probably not! One frustrating thing is that most of the items may have nothing to do with what you want. Are you looking for the next-generation iPod from Apple, or for a fruit store near your house?
   
There are three main deficiencies in many current systems, and they are the focus of this thesis. The first is how to discover and categorize a user's interests. Many current systems rely on the keywords users type. However, it is sometimes difficult to express what you want in just a few keywords, so keywords are less reliable than expected (remember "apple"?). Other methods are also used, such as link analysis, which examines the page a user clicks on and infers interest from its content. But this method ignores the time spent on the page: a page a user glances at for only a few seconds reveals nothing about the user's interests. The second problem is how to store interests and keep them up to date. A user naturally has many interests at the same time, some stronger than others, and how this multi-level information is organized plays an important role in performance. Furthermore, few systems today account for changing interests, yet a user's interests change all the time, so it is critical for a system to adapt to this change and comply with the new interests. The last problem is how to provide service to a user according to his or her interests. After obtaining a user's interests, for example as keywords, it is pointless to simply resubmit those keywords to the search box and return the results, because the user has already done exactly that. It is boring to review things you have already seen, rather than something different but more relevant!
   
This work attacks the problems above to improve the personalized service system; the current demo is built on the literature search engine SemreX, which focuses only on computer science. To begin with, we use a multi-level, tree-shaped VSM model to represent users' interests accurately. Each node of the tree is a VSM node: the root represents a user, the inner nodes represent interest categories, and the leaf nodes represent specific interests. The interest catalog is initially based on the ACM classification of computing terms. To detect users' interests, we take multiple factors into account, such as the keywords the user types, the keywords in the titles the user clicks, the content of the target pages, and the time the user spends on each page. We assign a different weight to each factor and compute an interest degree. With these values at hand, we add nodes to the tree or update relevant nodes by increasing their interest degree. To model changing interests, the model periodically halves the interest degree of every node, and nodes whose degree falls below a certain threshold are removed from the tree.
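    A minimal sketch of the interest-tree bookkeeping described above; the class, field, and method names are illustrative and not taken from the SemreX code base.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: each node carries an interest degree that is reinforced by new
// signals, halved periodically, and pruned once it falls below a threshold.
public class InterestNode {
    final String label;                 // user, interest category, or specific interest
    double degree;                      // accumulated interest weight
    final List<InterestNode> children = new ArrayList<>();

    InterestNode(String label, double degree) {
        this.label = label;
        this.degree = degree;
    }

    // Called when a new signal (query keyword, clicked title, dwell time) arrives.
    void reinforce(double weight) {
        degree += weight;
    }

    // Periodic aging: halve every degree, then drop children below the threshold.
    void decay(double threshold) {
        degree /= 2.0;
        for (InterestNode child : children) {
            child.decay(threshold);
        }
        children.removeIf(child -> child.degree < threshold);
    }
}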
   
Based on this interest hierarchy, we can now provide personalized services. Specifically, when a user searches for a keyword, we first detect the category (or categories) the keyword belongs to and filter out irrelevant results. In case the user does want something he or she is not currently interested in, we keep those results in a separate panel that the user can open on demand. Meanwhile, we automatically recommend items related to what the user is looking for. To surface items a user is interested in but that our system has not yet detected, we use the interest information of other users with a similar background (obtained at registration) and recommend items from the same interest categories. An evaluation mechanism for the literature is added to help users find the most valuable items more efficiently.
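    The category-based filtering can be sketched as follows; the classes and names are illustrative, not the demo system's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: split search results into a main panel (matches a current
// interest category) and a secondary panel the user can still open.
public class PersonalizedFilter {

    static class Result {
        final String title;
        final String category;   // category detected for this search result (assumed field)
        Result(String title, String category) {
            this.title = title;
            this.category = category;
        }
    }

    static class FilteredResults {
        final List<Result> mainPanel = new ArrayList<>();
        final List<Result> secondaryPanel = new ArrayList<>();
    }

    static FilteredResults split(List<Result> results, Set<String> userInterestCategories) {
        FilteredResults out = new FilteredResults();
        for (Result r : results) {
            if (userInterestCategories.contains(r.category)) {
                out.mainPanel.add(r);
            } else {
                out.secondaryPanel.add(r);
            }
        }
        return out;
    }
}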
   
This work was done in cooperation with Siming Tao, under the supervision of Prof. Pingpen Yuan.