CS+: CompSci Projects Beyond the Classroom

What is CS+?

CS+ is a ten-week summer program exclusively for Duke undergraduates to get involved in computer science research projects with faculty in a fast-paced but supportive community environment. Students participate in teams of 3-4 and are jointly mentored by a faculty project lead and a graduate student mentor. The experience is meant as a rich entry point into computer science research and applications beyond the classroom.

Logistics

  • Applications are closed - Only students enrolled at Duke University are eligible to apply.
  • The 2022 program will run for 10 weeks from Monday, May 23 through Friday, July 29, 2022.
  • We are currently planning for and requiring in-person attendance, following Duke guidelines for summer programs.
  • We expect students to participate in this program full-time (40 hours/week). You cannot take summer courses or do other internships/fellowships while doing CS+.
  • Participants will receive a stipend of $5,000 to cover expenses.
  • The deadline to apply is Feb 18, 2022. We begin evaluating applications in early March; applications are reviewed and offers are made on a rolling basis.

If you have questions about the program, please email csplus@cs.duke.edu.

FAQs

What is the difference between Code+, Data+, and CS+?  All three “plus” programs have the same model: students collaborate in teams on a project in tech/data for the same 10 weeks of the summer and receive a stipend of the same amount. We also partner to provide some common events (talks, social events, final poster fair, etc.) in order to create a larger ecosystem of students working in tech and data over the summer; over 100 students participated in 2019 across all three programs. Each program has its own application.

  • CS+ focuses on projects in computer science research and applications and is run by the Department of Computer Science. Project leads are typically computer science faculty.
  • Data+ focuses on interdisciplinary data science projects from all over the university, and is run by Rhodes I.I.D. in Gross Hall. Project leads are typically faculty from diverse areas of the university, with frequent additional participation from community and/or industry partners.
  • Code+ focuses on projects in software and product development. It is run by Duke OIT and takes place at the American Tobacco Campus in downtown Durham. Project leads are professional IT developers, with an emphasis on gaining real-world development experience.

Do I apply to the program, or can I pick the projects I want to be a part of?  You can apply specifically to the projects and faculty of interest to you.

How much background do I need?  CS+ is intended for students who have some computer science experience, but students do not need to be computer science majors or rising seniors in order to apply. We welcome and encourage applications from rising 2nd and 3rd year students who have completed the introductory course sequence in computer science and have skills and interests that make them a good fit for their projects. Feel free to reach out to individual project leaders to discuss background for specific projects.

CS+ Projects 2022

Lead: Xiaowei Yang

Description: Web caching is an essential Internet service that speeds up page loading. Websites today usually employ Content Distribution Networks (CDNs) as a third-party cache service. This practice requires website developers to correctly configure web content as cacheable or non-cacheable. A configuration mistake may leak one user's private data to other users. We plan to conduct a measurement study of web cache misconfigurations that may lead to private data leakage. These include HTTP header misconfigurations and CDN misconfigurations. The challenge of this project is to develop a methodology to detect private data on a website. We have a preliminary method that differentiates private and public requests by inspecting the responses received by authenticated and unauthenticated accounts. However, this method may produce a high false-positive rate because dynamic requests are prevalent in the web ecosystem. We expect to improve this method in this project.
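
The preliminary method described above can be illustrated with a minimal sketch. It assumes responses have already been captured for the same URL with and without authentication. The Response container, the 0.9 similarity threshold, and the simplified Cache-Control check are all illustrative assumptions, not the project's actual methodology.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Response:
    """Captured HTTP response (hypothetical container for this sketch)."""
    body: str
    headers: dict = field(default_factory=dict)


def looks_private(authed: Response, anon: Response, threshold: float = 0.9) -> bool:
    """Guess that a URL serves private data when the authenticated and
    unauthenticated bodies differ noticeably (threshold is an assumption)."""
    similarity = SequenceMatcher(None, authed.body, anon.body).ratio()
    return similarity < threshold


def is_cacheable(resp: Response) -> bool:
    """Simplified check of Cache-Control: real shared caches follow more
    detailed rules (heuristic freshness, Vary, CDN-specific settings)."""
    headers = {k.lower(): v for k, v in resp.headers.items()}
    cache_control = headers.get("cache-control", "").lower()
    directives = {d.strip().split("=")[0] for d in cache_control.split(",") if d.strip()}
    if directives & {"no-store", "private", "no-cache"}:
        return False
    return bool(directives & {"public", "max-age", "s-maxage"})


def flag_leak_candidate(authed: Response, anon: Response) -> bool:
    """Flag URLs whose content looks private yet is marked as cacheable."""
    return looks_private(authed, anon) and is_cacheable(authed)
```

A dynamic page (for example, one embedding a timestamp or a rotating advertisement) would also differ between the two requests, which is the source of the false positives mentioned above.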

Goals/Deliverables: The ideal deliverable is a project write-up that can be published at a systems or networking conference such as USENIX NSDI, ACM SIGCOMM, or ACM IMC.

Background/Prerequisites: Students who are interested in this project should be familiar with at least one programming language and with web development. Students will learn about network protocols and network security and put that knowledge into practice in this project.

Lead: Xiaowei Yang

Description: IP anycast refers to the routing practice where routers announce the same IP address prefix from multiple network locations. In part due to its inherent routing support and performance, anycast is commonly used by popular network services such as DNS, content delivery networks (CDNs), and distributed denial of service (DDoS) mitigation services. Anycast helps these services distribute traffic loads across multiple network locations and/or reduce latency between servers and clients. A key challenge for deploying anycast services effectively is that the mappings between client networks and anycast sites (i.e., anycast catchments) are determined by the policy-based routing decisions of the Internet’s inter-domain routing protocol, the Border Gateway Protocol (BGP), rather than by service providers’ goals such as minimizing latency and balancing the load. Several measurement studies have revealed that some anycast catchments exhibit unexpectedly inflated latency, and that increasing the number of anycast sites in a deployment (in an attempt to reduce the distance between clients and sites) counter-intuitively increases the average latency for clients and disrupts attempts to balance load. In this project, we will conduct active BGP experiments as well as passive measurements to understand how various factors, including external BGP routing factors and internal structural factors of an anycast network, affect the performance of an anycast network. This understanding will help network operators deploy a performant anycast network.

Goals/Deliverables: The ideal deliverables include a project write-up that can lead to a publication at a high-quality networking conference, including but not limited to Internet Measurement Conference, ACM SIGCOMM, and USENIX NSDI.

Background/Prerequisites: Python data analysis on large volumes of data, Internet measurement, and an undergraduate-level networking class such as CS/ECE 356.

Leads: Danyang Zhuo, Matthew Lentz

Description: There is growing interest in deploying machine learning models on the edge of the Internet, including distributed video analytics and IoT. Our project’s goal is to enable efficient ML workflow serving on the edge. Given a workflow of ML models, our optimization framework decides which components should be deployed at the sensor, on the edge, or in the cloud. The decisions have to take into consideration the computation power of edge devices (e.g., cellphones, cameras, edge servers) and the required amount of edge-cloud communication. In addition, the optimization framework may transform the workflow to enable better decomposition and deployment of components on the computing infrastructure. This project is a hybrid of theory and practice; it involves both algorithm design and implementation of the algorithms to evaluate their performance on real-world workloads. Our existing framework consists of a simple, yet powerful, optimizer. We need to invent optimization algorithms to further enhance ML model placement. We also need to build an end-to-end system to support running complex ML workflows.
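
To make the placement decision concrete, here is a toy sketch that brute-forces the assignment of a linear ML pipeline to the sensor, edge, and cloud tiers. The cost model (compute time plus per-hop transfer time), the rule that data never flows back toward the sensor, and all numbers are assumptions for illustration; this is not the project's existing optimizer.

```python
from itertools import product

TIERS = ["sensor", "edge", "cloud"]


def placement_latency(stages, placement, speed, hop_cost):
    """Total latency of one placement.

    stages:    list of (name, work, input_size) tuples, in pipeline order
    placement: tier index for each stage (0=sensor, 1=edge, 2=cloud)
    speed:     work units per second for each tier
    hop_cost:  seconds per data unit for sensor->edge and edge->cloud
    """
    total = 0.0
    prev = 0  # raw data originates at the sensor
    for (_name, work, input_size), tier in zip(stages, placement):
        if tier < prev:
            return float("inf")  # assumption: data never moves back down
        total += input_size * sum(hop_cost[prev:tier])  # ship the input upward
        total += work / speed[tier]                     # run the stage
        prev = tier
    return total


def best_placement(stages, speed, hop_cost):
    """Exhaustively try every assignment; fine only for tiny pipelines."""
    candidates = product(range(len(TIERS)), repeat=len(stages))
    best = min(candidates, key=lambda p: placement_latency(stages, p, speed, hop_cost))
    return [TIERS[t] for t in best], placement_latency(stages, best, speed, hop_cost)


# Hypothetical three-stage video pipeline.
stages = [("decode", 1, 100), ("detect", 50, 100), ("classify", 20, 1)]
speed = [1, 10, 100]    # sensor, edge, cloud
hop_cost = [0.01, 0.1]  # sensor->edge, edge->cloud
tiers, latency = best_placement(stages, speed, hop_cost)
print(tiers, round(latency, 2))
```

Exhaustive search grows exponentially with the number of stages, which is one reason better optimization algorithms are needed for realistic workflows.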

Goals/Deliverables: We expect there to be several types of deliverables at the end of the summer:

  • Algorithms/Designs: Algorithmic enhancements to our optimization framework at both the logical layer (ML workflow structure and transformations) and physical layer (placement of ML workflow compute tasks on infrastructure)
  • Real-world Implementation: Translating theory to practice through implementation of the algorithms/designs on top of our existing distributed ML serving system

Background/Prerequisites:

  • Fluent in Python programming;
  • Knowledge of low-level programming languages (C, C++, Rust) is a plus;
  • General knowledge about machine learning; and
  • Prior experience running simple ML models using PyTorch or TensorFlow is preferred but not required.

Lead: Maciej Mazurowski

Description: In this project we ask the question: "Will a convolutional neural network trained on medical imaging data from one institution perform well on data from another institution?" This is a crucial question when it comes to practical implementation of machine learning algorithms in medicine. In this project, students will be exposed to concepts from the field of deep learning and become familiar with medical imaging data. You will have an opportunity to implement and test convolutional neural networks and will work as part of a team that includes multiple experts in machine learning. The project will conclude with a collaborative paper.

Goals/Deliverables: We will share code and write a collaborative paper at the end of the project.

Background/Prerequisites: The students should have strong programming skills in Python and some experience in writing code involving convolutional neural networks.

Lead: Rong Ge

Description: Recent work [LCDR18] has shown (remarkably) that it is possible to do machine translation in an entirely unsupervised way through alignment of text embedding spaces across languages. This line of work has sparked significant interest in understanding how text embedding spaces compare to one another, and several works [AMJ18, SRV18, VRS20] have considered measures for comparing embedding spaces across languages. However, seemingly little work has been done on understanding how embedding spaces differ within a single language when trained with different methods or using different random initializations. The recent results of [AZL21] have shown that different model initializations can lead to learning different features of the input data, and the goal of this project is to understand whether this is also the case in the context of learning text embeddings. Namely, we hope to first apply some of the existing metrics in the aforementioned works to understand how much randomness affects learned text embedding spaces, and we also hope that these initial experiments will lead to new insights for (a) comparing embeddings, (b) training better text embedding models, or (c) both.
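
As a simple illustration of what comparing two embedding spaces can mean, the sketch below computes the average overlap between each word's nearest neighbors in two spaces. This is a generic measure chosen for brevity, not one of the metrics from the cited works, and the toy vectors are made up. It is unaffected by rotating a whole space, which matters because two training runs can yield rotated versions of similar geometry.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbors(emb, word, vocab, k):
    """The k words in vocab (excluding word itself) most similar to word."""
    others = [w for w in vocab if w != word]
    others.sort(key=lambda w: cosine(emb[word], emb[w]), reverse=True)
    return set(others[:k])


def neighbor_overlap(emb_a, emb_b, k=1):
    """Mean Jaccard overlap of k-nearest-neighbor sets over the shared vocabulary."""
    vocab = sorted(set(emb_a) & set(emb_b))
    scores = []
    for w in vocab:
        na = nearest_neighbors(emb_a, w, vocab, k)
        nb = nearest_neighbors(emb_b, w, vocab, k)
        scores.append(len(na & nb) / len(na | nb))
    return sum(scores) / len(scores)


# Toy 2-D "embeddings" (hypothetical values).
run_1 = {"cat": (10, 0), "dog": (9, 1), "car": (0, 10), "bus": (1, 9)}
# Same geometry rotated by 90 degrees, as a different random seed might produce.
run_2 = {w: (-y, x) for w, (x, y) in run_1.items()}
# A space where the neighborhoods are genuinely different.
run_3 = {"cat": (10, 0), "dog": (0, 10), "car": (9, 1), "bus": (1, 9)}
```

Here neighbor_overlap(run_1, run_2) is 1.0 because rotation preserves every cosine similarity, while neighbor_overlap(run_1, run_3) is 0.0 because every word's nearest neighbor changes.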

References:

  • [AMJ18] David Alvarez-Melis and Tommi Jaakkola. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1881–1890, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
  • [AZL21] Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning, 2021.
  • [LCDR18] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only, 2018.
  • [SRV18] Anders Søgaard, Sebastian Ruder, and Ivan Vulić. On the limitations of unsupervised bilingual dictionary induction, 2018.
  • [VRS20] Ivan Vulić, Sebastian Ruder, and Anders Søgaard. Are all good word vector spaces isomorphic?, 2020.

Goals/Deliverables: A report (potentially turned into a research paper).

Background/Prerequisites: Familiarity with linear algebra and machine learning; willingness to learn a deep learning framework (most likely PyTorch).

Lead: Kamesh Munagala

Description: The project will involve a combination of modeling, algorithm design, and data analysis. Potential topics include designing algorithms for societal decision making, machine learning, or pricing problems whose outcomes are fair to participants. This will involve delving into what it means to be fair, possibly via data analysis in a concrete application domain.

Goals/Deliverables: Work towards a research paper.

Background/Prerequisites: Mathematical sophistication, and some programming knowledge (Python or Java).

Lead: Pankaj Agarwal

Description: Multi-robot systems are widely deployed in logistics, in a variety of civil engineering and nature-preservation tasks, and in agriculture, to name a few areas. With the broad progress in robotics, planning the motion of multi-robot systems with performance guarantees is of increasing importance. One aspect of the autonomy of a multi-robot system is its collision-free motion capability, namely the ability of its constituent robots to navigate in their workspace without colliding with obstacles or with other robots. The basic motion-planning problem for a team of robots is to plan such collision-free paths for the robots between given free start and final positions. Specifically, we are given a family of robots, each modeled as a simple shape (e.g., a disc), moving in a planar environment. Given initial and final placements for each robot, the goal is to find a good-quality motion plan for all the robots. The quality of a motion plan is measured by the sum of the lengths of the paths or by the maximum time taken by a robot to reach its destination. The overall goal is to design scalable algorithms for finding high-quality motions.
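
The two quality measures can be sketched for piecewise-linear paths as follows. The sketch assumes every robot moves at unit speed, so the longest path stands in for the longest travel time, and it checks disc collisions only at shared waypoint indices, which is a coarse simplification. Function names and sample coordinates are illustrative.

```python
import math


def path_length(path):
    """Length of a piecewise-linear path given as a list of (x, y) waypoints."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


def plan_cost(paths):
    """Return (sum of path lengths, longest path length) for a motion plan."""
    lengths = [path_length(p) for p in paths]
    return sum(lengths), max(lengths)


def discs_collide(paths, radius):
    """Coarse check: do any two disc robots overlap at the same waypoint index?

    Robots that have finished their path are assumed to wait at their goal.
    Collisions between waypoints and with obstacles are not checked here.
    """
    steps = max(len(p) for p in paths)
    for t in range(steps):
        positions = [p[min(t, len(p) - 1)] for p in paths]
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                if math.dist(positions[i], positions[j]) < 2 * radius:
                    return True
    return False


# Two hypothetical robots with radius 1.
plan = [
    [(0, 0), (3, 4)],
    [(10, 0), (10, 3), (14, 3)],
]
total, longest = plan_cost(plan)
print(total, longest, discs_collide(plan, radius=1))
```

The two measures can favor different plans: shortening one robot's detour reduces the sum but leaves the maximum unchanged unless that robot was the slowest.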

Goals/Deliverables: Adapt existing algorithms, implement them, and test their efficacy and efficiency.

Background/Prerequisites: Basic knowledge of algorithms and data structures and strong coding skills.

Lead: Fred Dietrich, Associate Professor, Molecular Genetics and Microbiology

Description: In recent years experimental methods to determine three-dimensional protein structures have improved significantly, and in July 2021 the source code for AlphaFold from DeepMind was released. AlphaFold is considered by many to be a significant improvement in the ability to predict protein structures. These developments open up new areas to explore in biological and biomedical research. In the past the fields of genetics and protein structure determination have barely overlapped. This may well change over the next few years. For “Project Protein Fold” we intend to ask students to explore the following big question: Can protein structure, either experimentally determined or predicted, inform us about the nature of polymorphism found in human disease?

Goals/Deliverables: Develop a pipeline that can extract polymorphism data for specific human genes, generate predicted protein structures for each allele of each gene, and determine the spectrum of structural changes resulting from common, rare, disease-causing, non-disease-causing, and potential but not observed alleles. This information is publicly available from several sources. The primary challenge here is selecting an appropriate algorithm to compare structures. A secondary challenge is deciding which portion of the protein sequence is necessary to use in this analysis. These are both non-trivial problems. A final issue is how to present the results in a form that is comprehensible to those working in the area of human genetics.

Background/Prerequisites: Programming skills, especially in Python; comfort working with data; some experience in algorithms and/or machine learning is helpful; experience in computational biology is a plus.

Lead: Brandon Fain

Description: As data-driven algorithmic systems become increasingly pervasive in the real world and make decisions/optimizations that directly impact humans, one fundamental concern is that these systems be designed with fairness in mind. For example: When building a classifier to decide who should receive a loan (i.e., who is “credit worthy”) how do we ensure that different demographic groups are treated equitably, keeping in mind past bias in this and other areas of societal decision making? There is an established literature in machine learning and computational economics that has developed substantially over the last decade addressing these kinds of issues from an algorithmic perspective.

One algorithmic problem for which fairness is less well understood is reinforcement learning, where the consequences of the actions of the algorithmic decision maker may be initially unclear and may impact the future state of the world. This problem was introduced in an ICML 2017 paper by Jabbari et al., but we propose to follow up in a very different model; rather than constraining the policy space for fairness, we will consider the problem of multi-objective reinforcement learning where the objectives represent different persons or groups of persons to whom the algorithm wishes to be fair. See recent work, for example, in ICML 2020 by Siddique et al. For example, you can imagine a reinforcement learning AI that acts as a personal assistant whose decisions impact multiple individuals, or a loan-granting algorithm that learns over time but seeks to be fair across demographic groups. Multi-objective reinforcement learning has been studied in classical and modern deep learning forms (see, e.g., JMLR 2014 by Van Moffaert and Nowé and ICML 2019 by Bellemare et al., respectively). Our goal is to synthesize research from reinforcement learning and algorithmic fairness to design an efficient and effective reinforcement learning algorithm that learns fair policies.
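
One way to see why the multi-objective view matters is to compare how different welfare functions rank the same candidate policies when each policy yields one expected return per group. The sketch below is only an illustration: the policy names, the returns, and the choice of welfare functions are assumptions, and it does not reproduce the method of any cited paper.

```python
import math


def utilitarian(returns):
    """Total return across groups; ignores how the total is distributed."""
    return sum(returns)


def egalitarian(returns):
    """Return of the worst-off group (max-min fairness)."""
    return min(returns)


def nash_welfare(returns):
    """Sum of log returns; assumes every return is strictly positive."""
    return sum(math.log(r) for r in returns)


def pick_policy(policy_returns, welfare):
    """Choose the policy whose per-group returns maximize the welfare function."""
    return max(policy_returns, key=lambda name: welfare(policy_returns[name]))


# Hypothetical expected returns for two groups under three candidate policies.
policy_returns = {
    "favor_a": (10.0, 1.0),
    "balanced": (5.0, 5.0),
    "favor_b": (2.0, 7.0),
}

for welfare in (utilitarian, egalitarian, nash_welfare):
    print(welfare.__name__, "->", pick_policy(policy_returns, welfare))
```

With these numbers the utilitarian objective selects the policy that favors one group, while the other two objectives select the balanced policy, so the definition of fairness chosen in the modeling step directly changes which policy a learner would converge to.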

Goals/Deliverables:

  1. Extensive literature review in modern reinforcement learning and algorithmic fairness culminating in a substantial annotated bibliography.
  2. Model of the problem informed by the literature review along with motivating examples and appropriate definition(s) of fairness.
  3. Creation of one or more novel algorithms based on modifications to existing techniques.
  4. Implementation of algorithms in code.
  5. Experimentation on selected or novel constructed environments.
  6. Final goal is a research manuscript to be submitted for publication during AY 22-23 at an AI or algorithmic fairness conference.

Background/Prerequisites: Experience with data-driven algorithms. Mathematical maturity sufficient to read and assess mathematical models and proofs. Experience with AI/ML in Python is a plus. Students may be more interested in either the theory or the experiments/application; a mix would be good.

Leads: Xiaobai Sun, Nikos Pitsianis, Dimitris Floros

Description: The project mission is four-fold:

  1. To create an online platform for collecting, archiving and sharing mycelium sightings and information from all regions and countries, for later categorizing and mycelium network analysis;
  2. To facilitate better education and understanding of mycelium networks and their impacts on local and global environments and to create environment-friendly materials and industries;
  3. To complement the existing data sites and databases by and for professional mycelium researchers;
  4. To engage computer science students in broad-impact data science activities and enable research inquiries and investigations that were previously impossible.

Goals/Deliverables: The expected deliverable is an open-source website that citizen scientists can contribute to and use for collecting and analyzing mycelium data and networks.

Prerequisites and Learning Objectives: Students will make use of the basic technologies for relational databases and web interfaces; advanced students will become familiar with model-view-controller (MVC) frameworks and the OpenStreetMap API.

