Welcome

    Hi, I am Gang Luo, an alumnus of the Department of Computer Science at Duke University, where I completed my M.S. degree. I now work at Two Sigma Investments LLC as a software engineer focusing on data engineering.
    My research interests are databases and large-scale data processing. In particular, I am interested in applying proven techniques from traditional database systems to new data processing frameworks such as MapReduce, to make them more efficient and adaptive in cloud computing environments. My advisor was Prof. Shivnath Babu.
    Before coming to Duke, I received my Bachelor's degree in Computer Science & Technology from Huazhong University of Science and Technology. My undergraduate research advisor was Prof. Pingpeng Yuan.

Resume  [PDF]

Education

  • M.S. candidate in Computer Science, Duke University, 2009-present
  • B.E. in Computer Science & Technology, Huazhong University of Science and Technology, 2009

    Projects

  • Adaptive Hadoop
  •     A project I am currently working on that aims to provide a self-managing and self-optimizing MapReduce environment, so that users can run joins and aggregations, or plug in their own programs, without worrying about optimization.
  • Efficient Joins in Hadoop [Report]
  •     A course project in which we explored two additional join methods for the MapReduce framework. A demo was implemented that dynamically chooses the best option for the given tables.
  • Trace Driven Prefetching [Report]
  •     A course project in which we implemented a Markov-based prefetching mechanism for caches. The prefetcher aims to improve cache performance for programs with pointer-intensive data structures.
  • Interests Mining and Personalized Recommendation Service
  •     This was my undergraduate research topic: a Web application that detects, manages, and extends a user's interests and recommends the content most relevant to them.

    Professional Experience

  • Intern, PINGAN Insurance Company, China, Oct. 2008 - Jan. 2009
  •     As a software engineer, I worked on development and debugging for an online insurance sales system.
  • Member, Electrical and Electronic Technology Innovation Centre, China, Feb. 2006 - Sep. 2007
  •     Participated in workshops and training on embedded system technology. Worked on projects including embedded digit recognition, a secure access control system, and an environmental pollution monitoring and management system.

    Teaching

  • CPS 216: Data-Intensive Computing Systems, Teaching Assistant, Fall 2010

    Publication

  • Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. Starfish: A Self-tuning System for Big Data Analytics. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR '11), Asilomar, California, USA, January 2011.

    Courses

  • CPS 212: Distributed Information Systems, Fall 2010, Instructor: Jeff Chase
  • CPS 296.2: Advanced Topics in Networking, Fall 2010, Instructor: Xiaowei Yang
  • CPS 210: Operating Systems, Spring 2010, Instructor: Jeff Chase
  • CPS 296.1: Advanced Topics: Data Management and Processing, Spring 2010, Instructor: Jun Yang
  • GS 320A: Academic Writing I, Spring 2010, Instructor: Diane Bryson
  • CPS 216: Advanced Database Systems, Fall 2009, Instructor: Shivnath Babu
  • ECE 252: Advanced Computer Architecture I, Fall 2009, Instructor: Daniel J. Sorin
  • CPS 230: Design and Analysis of Algorithms, Fall 2009, Instructor: Pankaj K. Agarwal
  • CPS 300: Introduction to Graduate Study, Fall 2009, Instructor: Jun Yang

    Computer Skills

  • Languages: C, C++, Java
  • Database: Oracle, MySQL
  • Big Data: MapReduce, LINQ, HBase, Hive, Pig
  • Web: JSP, JavaScript, AJAX, Weblogic, Tomcat
  • Others: Regular expressions, Bash scripting, system programming in Linux
  • For Fun: 3Ds Max, Bryce, Poser

    Projects


    Starfish

        For the latest updates and releases of Starfish, please visit the Starfish website.

        Starfish is a workflow management and query processing system built on top of multiple storage engines that supports optimization at multiple levels. The main contributions of Starfish lie in two aspects: optimization of multi-query processing and smart use of multiple storage engines.
        Most current optimization strategies for query processing focus on improving the performance of an individual query or workflow. In reality, however, multiple queries are usually submitted together as a batch, which opens up opportunities for optimization at a global level that have not yet been explored extensively. In Starfish, we try to optimize multiple queries by exploiting correlations among them, choosing the right partitioning method, and reusing computation and intermediate data. Concretely, we perform management and optimization at three levels: the job level, which focuses on parameter tuning for a single job; the workflow level, which optimizes the execution plan and picks the right operators, much as Pig does; and the workload level, which merges nodes and operators from different execution plans so that intermediate computation and results can be reused. At the workload level we may reach a global optimum even if some individual workflows are sub-optimal, and by being aware of the entire data life cycle we can make smart decisions for both current and future workloads.
        The other key contribution of Starfish is the smart use of multiple storage engines. It has long been argued that databases outperform MapReduce-like systems by exploiting schemas, indexing, partitioning, and data-level statistics. However, those runtime performance gains come at the cost of an expensive loading process, which may outweigh the benefits for ad-hoc queries. Starfish combines the distributed file system, column- and row-based storage, and other engines so that, at run time, data resides in the storage engine appropriate to the specific context, and the execution engine makes the most of that engine to speed up query processing. This project also takes advantage of our past and ongoing work on join optimization, cost modeling, and automatic configuration tuning.
    Figure 1. Starfish ecosystem

        In this project, I mainly work on computation reuse and on workload integration and optimization. I am also developing an interface called Lastword that takes declarative queries and MapReduce jobs as input, represents them as logical plans, and translates a query into SQL for processing in an RDBMS when necessary.
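        As a deliberately simplified, hypothetical illustration of computation reuse (not the actual Starfish planner, which merges operators across execution plans), the sketch below checks whether an earlier workflow already materialized a shared intermediate dataset on HDFS before deciding to recompute it; the class and method names are assumptions made for illustration.

// Hypothetical sketch: reuse a shared intermediate dataset across workflows
// by checking whether it already exists on HDFS before recomputing it.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReuseIntermediate {
    public static void runWithReuse(JobConf produceIntermediate,
                                    JobConf consumeIntermediate,
                                    Path intermediate) throws Exception {
        FileSystem fs = intermediate.getFileSystem(produceIntermediate);
        if (!fs.exists(intermediate)) {
            // No earlier workflow materialized this dataset: compute it once.
            JobClient.runJob(produceIntermediate);
        }
        // Every workflow that needs the dataset reads the shared copy.
        JobClient.runJob(consumeIntermediate);
    }
}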

    Publication:
    Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. Starfish: A Self-tuning System for Big Data Analytics. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR '11), January 2011.   Paper   Slides

    Adaptive MapReduce

        MapReduce, together with its open-source implementation Hadoop, has become an increasingly popular way to process large-scale data in parallel. Even though it is simple to start with (you can process terabyte-scale data on a cluster just by writing a small piece of code), there is more we can do to make it run efficiently than simply expanding the cluster. Adaptive MapReduce aims to perform such tuning automatically, without the job submitter even being aware of it. Achieving this self-optimization first requires a good understanding of the cluster's characteristics and of how Hadoop behaves. The characteristics of the data (e.g., the level of data skew) must also be considered to avoid load imbalance.
        Currently, we are focusing on learning data characteristics and building a cost model for the MapReduce environment. Extensive experiments are ongoing to capture how Hadoop behaves, how the cluster responds to different configurations, and how much parallelism we can exploit for a given set of resources. Based on these, a cost model is being built to help balance the workload across nodes. Beyond this general work, my particular focus in the project is parallel joins. Building on the join approaches we developed earlier, I am now investigating: 1) how to remove the memory limitation and make map-side joins more widely applicable; 2) how to make good use of semi-joins for data partitioning and skew handling; 3) how to tune the parameters of a join task on the fly; and 4) how to choose a join plan from the plan space automatically. Many of these goals are worthwhile because neither Pig nor Hive has realized them.
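        To make the parameter-tuning part concrete, here is a minimal sketch of the kind of per-job knobs such a tuner could set programmatically through Hadoop's (0.20-era) JobConf API; the class name, the parameters chosen, and their values are illustrative assumptions, not the output of our cost model.

// Minimal sketch: submit an identity job with a few per-job tuning knobs set
// explicitly instead of relying on the cluster-wide defaults.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TunedJobSubmit {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TunedJobSubmit.class);
        conf.setJobName("tuned-identity-job");

        // Knobs a cost model could pick per job (values here are illustrative).
        conf.setNumReduceTasks(20);                      // reduce-side parallelism
        conf.set("io.sort.mb", "200");                   // map-side sort buffer, in MB
        conf.set("mapred.compress.map.output", "true");  // shrink shuffle traffic

        // Default identity mapper/reducer over text input, just to be runnable.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}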
        This work is led by Prof. Shivnath Babu and is ongoing. We recently received an AWS research grant from Amazon, which will better support the project. For the latest updates, click HERE.

    Efficient Joins in Hadoop [PDF1] [PDF2]

        Join is one of the most costly operations in any data processing paradigm, so it pays to explore options beyond the default join approach and to choose well among those methods for a particular join task. As a relatively new technology, MapReduce and its open-source implementation Hadoop let us process terabyte-scale data. Our work focuses on implementing additional join approaches and seeing how they perform in a given context.
        Our experiments show that network bandwidth is usually the biggest performance bottleneck, so reducing the amount of data transferred over the network pays off. One way to do this is a map-side join, which involves only the map phase (known as the replication join in the database community) and thus avoids transferring data from mappers to reducers. However, if the replicated table cannot fit in memory, the map-side join fails (we do not want to swap buckets between memory and disk). One solution is to first run a semi-join to eliminate records that cannot appear in the join result. The semi-join gives us all the keys that appear in both tables, as well as by-products such as information about the data distribution. In this project, we use the semi-join result only as a mask for filtering out useless records. Even after filtering, the small table may still not fit in memory; in that situation we fall back to the default approach, but use the semi-join result to discard useless records on the map side and thus reduce the amount of data shuffled to the reduce side.
        We built a simple planner to choose the join method that is likely to be the most efficient in a given situation. We also implemented a demo system with a SQL-like interface that supports an arbitrary number of projection and selection predicates.
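        The sketch below illustrates the map-side (replication) join described above, assuming both tables are tab-separated text with the join key in the first field; the class name and the configuration key "join.small.table" are made up for illustration, and this is not the exact code from our report.

// Minimal sketch of a map-side (replication) join: the small table is loaded
// into memory once per map task, so the join completes in the map phase and
// nothing has to be shuffled to reducers.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapSideJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<String, String>();

    @Override
    public void configure(JobConf conf) {
        try {
            // Load the (semi-join-filtered) small table into memory.
            Path small = new Path(conf.get("join.small.table"));
            FileSystem fs = small.getFileSystem(conf);
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(small)));
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) {
                    smallTable.put(kv[0], kv[1]);
                }
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("cannot load small table", e);
        }
    }

    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Probe the in-memory table; records with no match are dropped here,
        // which is exactly the filtering that a semi-join result enables.
        String[] kv = record.toString().split("\t", 2);
        if (kv.length < 2) {
            return;
        }
        String match = smallTable.get(kv[0]);
        if (match != null) {
            out.collect(new Text(kv[0]), new Text(kv[1] + "\t" + match));
        }
    }
}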
        This was a first exploration of MapReduce join approaches; it is being shaped and improved as part of our later work on Adaptive Hadoop.
        This work is in cooperation with Liang Dong.

    Trace Driven Prefetching [PDF]

        Many computer programs (e.g., object-oriented programs) contain data structures whose contents are not stored sequentially in memory. These include pointer-based data structures such as linked lists and graphs. Two elements that are logically adjacent may reside at arbitrary memory locations, which increases caching complexity and the cache miss rate. Building on previous work on prefetching for spatial locality or fixed access patterns, we decided to use a Markov model to capture the memory access pattern and to update that pattern dynamically as the program runs. We use a table stored in a buffer to simulate the Markov model. To avoid polluting the real cache, we add a separate cache for prefetched data and make it as fast as L1, so that there is no extra latency even when we miss the L1 cache.
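        A minimal, hypothetical sketch of the prediction table is shown below: a bounded map from an address to counts of the addresses that have followed it, consulted on every access to pick the likeliest successor to prefetch. The names and the eviction policy are illustrative; the real project drove such a table from memory traces inside a cache simulator.

// Hypothetical sketch of a Markov prefetch table: remember which address
// tends to follow which, and predict the most frequent successor.
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MarkovPrefetcher {
    /** For each seen address, how often each successor address followed it. */
    private final Map<Long, Map<Long, Integer>> transitions;
    private long previous = -1;

    public MarkovPrefetcher(final int maxEntries) {
        // Bound the table like a fixed-size hardware buffer: evict the least
        // recently touched entry once the table is full.
        transitions = new LinkedHashMap<Long, Map<Long, Integer>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Map<Long, Integer>> e) {
                return size() > maxEntries;
            }
        };
    }

    /** Record one memory access and return an address to prefetch, or -1. */
    public long access(long address) {
        if (previous != -1) {
            Map<Long, Integer> successors = transitions.get(previous);
            if (successors == null) {
                successors = new HashMap<Long, Integer>();
                transitions.put(previous, successors);
            }
            Integer count = successors.get(address);
            successors.put(address, count == null ? 1 : count + 1);
        }
        previous = address;

        // Predict the successor seen most often after this address so far.
        Map<Long, Integer> successors = transitions.get(address);
        long best = -1;
        int bestCount = 0;
        if (successors != null) {
            for (Map.Entry<Long, Integer> e : successors.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    best = e.getKey();
                }
            }
        }
        return best;
    }
}

        In a trace-driven simulation, access() would be called for each memory reference in the trace, and the returned address, if any, would be loaded into the separate prefetch cache mentioned above.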
        This work is in cooperation with James Wu and Matthew Fulmer.

    Interests Mining and Personalized Recommendation Service

        When you google "apple", you would probably get about 38,600,000 items associated with this keyword. Are you patient enough to check each of the items returned in the first five pages? Probably not. Frustratingly, most of the items may have nothing to do with what you want. Do you want the next-generation iPod from Apple, or are you looking for a fruit store near your house?
        This thesis focuses on three main deficiencies of many current systems. The first is how to discover and categorize a user's interests. Many systems rely on the keywords that users type, but it is often difficult to express what you want in just a few keywords, so keywords are not as reliable as expected (remember "apple"?). Other methods exist, such as link analysis, which examines the target page of a link the user clicks and infers interest from its content; but this method ignores the time spent on the page, and a page the user glances at for only a few seconds may reveal nothing about the user's interests. The second problem is how to store interests and keep them up to date. A user naturally has many interests at the same time, some stronger than others, and how this multi-layer information is organized plays an important role in performance. Furthermore, few systems pay attention to changes in interest, yet a user's interests change all the time, so it is critical for a system to adapt to such changes. The last problem is how to deliver a service to the user according to his or her interests. After obtaining a user's interests, for example as keywords, it is pointless to simply feed those keywords back into the search box and return the results, because the user has already done exactly that; it is boring to review things you have seen before rather than something different but more relevant.
        Our work attacks the problems above to improve personalized service; the current demo system is built on SemreX, a literature search engine that focuses on computer science. To begin with, we use a multi-level VSM model, shaped like a tree, to represent a user's interests accurately. Each node of the tree is a VSM node: the root represents the user, inner nodes represent interest categories, and leaf nodes represent specific interests. The interest catalog is initially based on the ACM classification of computing terms. To detect a user's interests, we take several factors into account, such as the keywords the user types, the keywords in the titles the user clicks, the content of the target pages, and the time the user spends on each page. We assign a different weight to each factor and compute an interest degree; with these values in hand, we add new nodes to the tree or update existing ones by increasing their interest degree. To model changing interests, the model periodically halves the interest degree of every node, and nodes whose degree falls below a certain threshold are removed from the tree.
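        The sketch below gives a simplified, hypothetical picture of one node in that interest tree: new evidence raises a node's interest degree, a periodic decay halves every degree, and nodes that fall below a threshold are pruned. The class and method names are illustrative and are not taken from the SemreX code.

// Hypothetical sketch of one node in the multi-level interest tree: degrees
// grow when evidence arrives and are periodically halved so that stale
// interests fade out and can be pruned.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InterestNode {
    private final String topic;   // e.g. an ACM category or a specific interest
    private double degree;        // accumulated interest degree
    private final List<InterestNode> children = new ArrayList<InterestNode>();

    public InterestNode(String topic) {
        this.topic = topic;
    }

    /** Add weighted evidence (query keyword, clicked title, dwell time, ...). */
    public void reinforce(double weight) {
        degree += weight;
    }

    public InterestNode addChild(String childTopic) {
        InterestNode child = new InterestNode(childTopic);
        children.add(child);
        return child;
    }

    /** Periodic aging: halve every degree and prune leaf nodes below a threshold. */
    public void decay(double threshold) {
        degree /= 2.0;
        for (Iterator<InterestNode> it = children.iterator(); it.hasNext();) {
            InterestNode child = it.next();
            child.decay(threshold);
            if (child.degree < threshold && child.children.isEmpty()) {
                it.remove();
            }
        }
    }

    public String getTopic() {
        return topic;
    }

    public double getDegree() {
        return degree;
    }
}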
        Based on this interest hierarchy, we can supply personalized services to users. Specifically, when a user searches for a keyword, we first detect the category (or categories) the keyword belongs to and filter out irrelevant results. In case the user does want something outside his or her current interests, we keep the filtered-out results in a separate panel that the user can reach if desired. Meanwhile, we automatically recommend items related to what the user is looking for right now. To surface items a user is interested in but that our system has not yet detected, we use the interest information of other users with a similar background (obtained at registration) and recommend items from the same interest categories. A rating mechanism for the literature is also added to help users find the most valuable items more efficiently.
        This work was done in cooperation with Siming Tao, under the supervision of Prof. Pingpeng Yuan.

    Miscellaneous

    Presentations:

  • Introduction to Pig. Teaching for CPS216. Nov. 2010
  • MapReduce Online. Presentation for CPS296. Oct. 2010
  • Joins in MapReduce. Project presentation. May 2010
  • Pig. Group discussion. Apr. 2010
  • Introduction to LINQ. Group discussion. Mar. 2010

    Contact

    Department of Computer Science
    Duke University
    Durham, NC 27708
    Office: N303A North Building
    Email: gang AT cs.duke.edu
    Phone: 919-660-6546   (office)
    Fax: 919-660-6519
     