CS+: CompSci Projects Beyond the Classroom

CS+ is a ten-week summer program exclusively for Duke undergraduates to get involved in computer science research projects with faculty in a fast-paced but supportive community environment. Students participate in teams of 3-4 and are jointly mentored by a faculty project lead and a graduate student mentor. The experience is meant as a rich entry point into computer science research and applications beyond the classroom.

Logistics:

  • The program will run for ten weeks from Tuesday, June 1, 2021 (the day after Memorial Day) through Friday, August 6, 2021.
  • We anticipate the program will be virtual.
  • Participants will receive a stipend of $5,000 to cover expenses.
  • We will begin evaluating applications in early March. Applications and offers will be rolling.

If you have questions about the program or issues with the application, please email csplus@duke.edu.


FAQs:

What is the difference between Code+, Data+, and CS+?  All three “plus” programs have the same model: students collaborate in teams on a project in tech/data for the same 10 weeks of the summer and receive a stipend of the same amount. We also partner to provide some common events (talks, social events, final poster fair, etc.) in order to create a larger ecosystem of students studying in tech and data over the summer; over 100 students participated in 2019 across all three programs. Each program has its own application.

  • CS+ focuses on projects in computer science research and applications and is run by the Department of Computer Science. Project leads are typically computer science faculty.
  • Data+ focuses on interdisciplinary data science projects from all over the university, and is run by Rhodes I.I.D. in Gross Hall. Project leads are typically faculty from diverse areas of the university, with frequent additional participation from community and/or industry partners.
  • Code+ focuses on projects in software and product development and is run by Duke OIT at the American Tobacco Campus in downtown Durham. Project leads are professional IT developers, and the emphasis is on gaining real-world development experience.

Do I apply to the program, or can I pick the projects I want to be a part of?  You can apply specifically to the projects and faculty of interest to you.

How much background do I need?  CS+ is intended for students who have some computer science experience, but students do not need to be computer science majors or rising seniors in order to apply. We welcome and encourage applications from rising 2nd and 3rd year students who have completed the introductory course sequence in computer science and have skills and interests that make them a good fit for their projects.

Summer 2021 Projects

Leads: Cynthia Rudin, Sudeepa Roy, Alex Volfovsky

Description: The Almost Matching Exactly lab designs software tools for matching in causal inference. However, our code and interfaces are not perfect, and we would love to have users try out the code on applications. A self-motivated team of students can help us to improve, and at the same time, learn how to troubleshoot and apply this type of code to real problems.

Outcomes: This is up to the students, but we expect them to be troubleshooting code, writing code, and designing cool applications.

Skills: Coding, communication, machine learning, and causal inference.

Lead: Debmalya Panigrahi

Description: Recent progress in both combinatorial and algebraic techniques in graph algorithms has given us hope that some of the hardest, and longest standing, challenges in the field might finally be within our reach. This includes, for a graph on n vertices and m edges, the following problems:

  1. find all pairs min-cuts in an undirected graph faster than n-1 max-flows
  2. find a global min-cut of a directed graph in o(mn) time
  3. find the reliability of a graph in near-linear time

All these problems have been open for at least two decades, some of them even for more than five decades. But, for each of these problems, there is increasing evidence that we are close to finally solving them. For (1), Li and Panigrahi recently showed that the problem can be solved if one allows approximations. For (2), Panigrahi and others recently gave an o(mn) algorithm for vertex connectivity, which takes us halfway to directed connectivity. For (3), Karger recently showed that the problem is amenable to techniques from randomized connectivity algorithms, and near-linear time algorithms are already known for this category. This project will explore one of these (or related) problems. The project will be theoretical in nature.
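For orientation on the first problem in the list above, here is a minimal Python sketch of the naive baseline: it computes the min-cut value for every pair of vertices by running one max-flow per pair. The classical approach needs only n-1 max-flows, and the open question is whether one can do better still. The function names and the toy graph are illustrative assumptions, not part of the project.

```python
from collections import deque
from itertools import combinations


def max_flow(n, edges, s, t):
    """Edmonds-Karp max-flow on an undirected graph.

    By max-flow/min-cut duality the returned value equals the s-t min-cut.
    edges is a list of (u, v, capacity) triples with integer capacities.
    """
    # cap[u][v] holds the residual capacity from u to v.
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
        cap[v][u] += c  # undirected edge: capacity in both directions
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck


def all_pairs_min_cuts(n, edges):
    """Naive baseline: one max-flow per vertex pair, n(n-1)/2 in total."""
    return {(s, t): max_flow(n, edges, s, t)
            for s, t in combinations(range(n), 2)}


# Toy graph (made up for illustration): 4 vertices, weighted edges.
toy_edges = [(0, 1, 3), (0, 2, 1), (1, 2, 1), (2, 3, 4)]
cuts = all_pairs_min_cuts(4, toy_edges)
global_min_cut = min(cuts.values())
```

In an undirected graph the global min-cut is the smallest of the pairwise values, which is why the last line takes a minimum over all pairs.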

Outcomes: If the project leads to a new result, then it will be published in a research paper. Students will also learn about the process of theoretical research.

Skills: Design and analysis of algorithms, formal proofs, and reading theoretical computer science research papers.

Lead: Kristin Stephens-Martinez

Description: CS101 Reviewer App is a web application that provides an online quiz tool to students enrolled in CS101 at Duke University. It enables students to quiz themselves on CS101 topics with carefully designed questions that check for specific misunderstandings of the content. A recent feature is an auto-generated quiz that chooses which topics to focus on for the student based on their past performance. This project has the potential to go in different directions. We have ideas to improve the app, such as adding different question types, improving the quiz-generation algorithm, or adding automated hints based on the student's wrong answers. We also have data analysis needs that will inform future features.
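To make the idea of choosing topics from past performance concrete, here is a minimal Python sketch. It is not the app's actual algorithm, which is not described here; the function name, the data layout, and the "lowest accuracy first" rule are all assumptions for illustration.

```python
def pick_topics(history, k):
    """Return the k topics with the lowest past accuracy.

    history is a list of (topic, was_correct) pairs (hypothetical layout).
    Ties are broken alphabetically so the result is deterministic.
    """
    attempts, correct = {}, {}
    for topic, was_correct in history:
        attempts[topic] = attempts.get(topic, 0) + 1
        correct[topic] = correct.get(topic, 0) + (1 if was_correct else 0)
    accuracy = {t: correct[t] / attempts[t] for t in attempts}
    return sorted(accuracy, key=lambda t: (accuracy[t], t))[:k]


# Made-up answer history for one student.
history = [
    ("loops", True), ("loops", False),
    ("lists", False), ("lists", False),
    ("strings", True),
]
focus = pick_topics(history, 2)
```

A real selection rule would likely also weigh how recent each answer is and how many attempts a topic has had.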

Outcomes: If students want to improve the app, they will modify the existing app codebase and produce new code. If they want to do data analysis, they will need to write code to clean and analyze the data and produce a small report describing what they did and explaining their results.

Skills: Web app development for a Python app OR data analysis in either Python or R.

Leads: Kamesh Munagala and Brandon Fain

Description: Fairness is an emerging concept within algorithm design. This project will consider settings where multiple participants use the solution to a certain optimization routine, and this solution provides them with different utility. The question we ask is how should these routines be designed so that different demographic groups obtain comparable utility. Though this question sounds abstract, we will work with well-defined notions of fairness, and well-defined optimization problems. The research will be largely theoretical and analytical in nature, but there will be opportunity to test out the resulting procedures on datasets.
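As a toy illustration of the question, the Python sketch below places one facility on a line and compares two objectives: minimizing the total distance over all clients, and minimizing the worst average distance over groups. The min-max group objective is only one possible fairness notion, chosen here as an assumption; the client positions, group labels, and function names are made up.

```python
def group_costs(facility, clients):
    """Average distance to the facility per group.

    clients is a list of (position, group) pairs.
    """
    totals, counts = {}, {}
    for pos, grp in clients:
        totals[grp] = totals.get(grp, 0.0) + abs(pos - facility)
        counts[grp] = counts.get(grp, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def utilitarian_choice(candidates, clients):
    """Pick the candidate minimizing the total distance over all clients."""
    return min(candidates, key=lambda f: sum(abs(p - f) for p, _ in clients))


def minmax_fair_choice(candidates, clients):
    """Pick the candidate minimizing the worst group's average distance."""
    return min(candidates, key=lambda f: max(group_costs(f, clients).values()))


# Made-up instance: a large group near 0 and a small group at 10.
clients = [(0, "A"), (0, "A"), (0, "A"), (1, "A"), (10, "B")]
candidates = range(11)

best_total = utilitarian_choice(candidates, clients)
best_fair = minmax_fair_choice(candidates, clients)
```

On this instance the total-distance objective places the facility at 0, leaving group B at distance 10, while the min-max objective places it at 5, where the two groups' average distances are 4.75 and 5.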

Outcomes: Algorithmic insights and a research paper.

Skills: Background in design and analysis of algorithms. Please list algorithms and related courses you have taken.

Lead: Rong Ge

Description: Recently, self-supervised learning has become a popular way to do unsupervised learning (learning without labels). In self-supervised learning, the algorithm hides some information from the input and tries to predict the hidden information. Empirically, such learning algorithms are successful in many domains such as natural language processing and image understanding.

Traditionally, unsupervised learning problems are often solved using latent variable models. The goal of this project is to try to understand why self-supervised learning might perform better.

Outcomes: The project will start by experimenting with self-supervised learning ideas on data generated from some traditional latent variable models, such as HMMs (Hidden Markov Models) or topic models. Then the goal is to systematically change the setting to find scenarios where self-supervised learning can outperform traditional latent variable models, and understand why.
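A minimal Python sketch of the first step might look as follows: sample sequences from a small discrete HMM, then hide one observation per sequence to create a self-supervised prediction task. The HMM parameters, the mask token, and the function names are assumptions for illustration, and no learning model is included.

```python
import random


def sample_hmm(length, start, trans, emit, rng):
    """Sample (hidden states, observations) from a discrete HMM.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][k]  : probability of emitting symbol k in state i
    """
    states, observations = [], []
    state = rng.choices(range(len(start)), weights=start)[0]
    for _ in range(length):
        states.append(state)
        symbol = rng.choices(range(len(emit[state])), weights=emit[state])[0]
        observations.append(symbol)
        state = rng.choices(range(len(trans[state])), weights=trans[state])[0]
    return states, observations


def mask_one(observations, rng, mask_token=-1):
    """Hide one observation; return (masked sequence, position, target)."""
    pos = rng.randrange(len(observations))
    masked = list(observations)
    masked[pos] = mask_token
    return masked, pos, observations[pos]


# Toy two-state HMM with "sticky" transitions (made-up parameters).
rng = random.Random(0)
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]

states, obs = sample_hmm(20, start, trans, emit, rng)
masked, pos, target = mask_one(obs, rng)
```

A self-supervised learner would be trained to predict target from masked, without ever seeing the hidden states; a latent variable approach would instead try to fit start, trans, and emit directly.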

Skills: Math: probability, calculus, and willingness to learn new things. Machine learning: experience with, or willingness to learn, standard deep learning packages (e.g., PyTorch).

Leads: Bhuwan Dhingra (lead), Ashwin Machanavajjhala, Lavanya Vasudevan, Jun Yang

Description: Misinformation about vaccines has led to vaccination hesitancy, which is listed by the WHO as a top-10 threat to global health. This project aims at developing automated tools that assist health workers and the public in dispelling myths about vaccines. A key observation is that for interventions to be effective, they must address the individuals’ specific concerns and be perceived as credible, which means they must be highly contextualized and personalized.

To help pinpoint the specific concerns, we have created a taxonomy of common misconceptions about vaccines, collected a corpus of articles containing vaccine misinformation, and worked on developing techniques for labeling articles with specific misconceptions. We plan to curate a corpus of intervention articles that help dispel specific misconceptions and are also diverse enough to appeal to individuals with different backgrounds. Putting these techniques together, our ultimate goal is to develop tools/apps that can be used by health workers or deployed alongside web/social media platforms to combat vaccine misinformation.

One specific aim of this summer will be to develop NLP techniques for identifying different categories of vaccine misinformation from text. A secondary aim will be to apply these techniques to a corpus of articles and social media posts and develop visualizations that aid in understanding how vaccine misinformation evolves over time and with the introduction of new vaccines such as those for COVID-19.

Outcomes: Given a small amount of labeled data and a corpus of articles containing vaccine misinformation, we expect students to train multiple machine learning models that classify sentences, paragraphs or whole articles into our taxonomy of common misconceptions about vaccines. Specifically, the students will be expected to adapt existing large-scale language models such as BERT for the task. Using the best model, they will then develop visualizations and tools that help understand how the misconceptions change over time, both in our corpus and in a separate collection of social media data. Students will complete a research report on their experiments and also produce an extensible codebase that will aid further research after the summer. If appropriate, the report may also be submitted as a research paper to a conference.

Skills:

  • Ability to survey papers on machine learning and natural language processing
  • Running machine learning models in Python, both off-the-shelf implementations and with minor modifications
  • Data visualization and preparation in Python

Leads: Jun Yang, Sudeepa Roy, Kristin Stephens-Martinez

Description: The goal of our project is to create an interactive debugger called I-Rex for SQL, which is a ubiquitous query language for accessing and modifying data stored in relational databases. SQL can quickly get complex in practice, and it is a challenge for novices to learn and debug. I-Rex allows users to interactively “trace” through highly complex SQL queries (e.g., those involving aggregation, nesting, and correlation), understand how they execute, and debug wrong queries.
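To give a feel for what tracing a query means, the Python sketch below uses the standard sqlite3 module to step through a query with aggregation and a correlated subquery by hand: for each outer row it evaluates the subquery and records whether the row passes the filter. This is not how I-Rex works internally; the table, the data, and the trace function are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment(student TEXT, course TEXT, grade INTEGER);
INSERT INTO enrollment VALUES
  ('ann', '316', 90), ('bob', '316', 70), ('cy', '316', 80),
  ('ann', '216', 60), ('bob', '216', 95);
""")

# The query to understand: students who scored above their course average.
FULL_QUERY = """
SELECT e.student, e.course
FROM enrollment e
WHERE e.grade > (SELECT AVG(e2.grade)
                 FROM enrollment e2
                 WHERE e2.course = e.course)
ORDER BY e.student, e.course
"""


def trace(conn):
    """Evaluate the correlated subquery once per outer row.

    Returns one (student, course, grade, course_avg, passes) tuple per row.
    """
    outer_rows = conn.execute(
        "SELECT student, course, grade FROM enrollment "
        "ORDER BY student, course"
    ).fetchall()
    steps = []
    for student, course, grade in outer_rows:
        (course_avg,) = conn.execute(
            "SELECT AVG(grade) FROM enrollment WHERE course = ?", (course,)
        ).fetchone()
        steps.append((student, course, grade, course_avg, grade > course_avg))
    return steps


steps = trace(conn)
result = conn.execute(FULL_QUERY).fetchall()
```

Each entry in steps shows the intermediate value that the one-shot query hides, which is the kind of information a novice needs when a query returns unexpected rows.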

As the need for data manipulation and analysis becomes ever more important to more people, tools like I-Rex are sorely needed. We plan to deploy I-Rex in our courses (CompSci 216/316/516) in Fall 2021. We are looking for help to improve the backend so I-Rex supports all of SQL and to make it robust. We are also looking for help on the frontend to improve both usability and effectiveness. Finally, we are also interested in anyone who wants to help evaluate how well I-Rex helps novices learn relational querying.

Outcomes: The desired deliverables include a fully working I-Rex system and a clean codebase with proper documentation. If students make progress on related research problems, there are opportunities for writing research/demonstration papers.

Skills: Knowledge/experience with at least one of the following areas; must be able to learn quickly as needed:

  • SQL (CompSci 316 or CompSci 516 would suffice) and Python/Java programming
  • Frontend design and implementation (e.g., JavaScript, Web frameworks like Flask, Apache)

Lead: Alberto Bartesaghi

Description: Cryogenic electron microscopes – or cryo-EM for short – allow researchers to peer at the microscopic shape of cellular proteins like never before. These machines blast proteins with a 300,000-volt beam of electrons so that highly sensitive detectors underneath can tease out their shapes based on the interaction that occurs. Being able to “see” proteins – life’s crucial building materials – can help determine how they work. Recognizing protein structure and function is essential for scientists trying to design better drugs to tackle some of the world’s most devastating diseases, including HIV, cancer, COVID-19 and Alzheimer’s disease. A 300,000-volt electron beam is, however, extremely damaging to the proteins it is trying to image. To help protect the samples in the machine, researchers cryogenically freeze them to help maintain their integrity and use very low electron doses to prevent structural damage, which results in extremely noisy images.

An emerging modality of cryo-EM called cryo-electron tomography (cryo-ET) uses computerized tomography principles to provide an accurate representation of the 3D molecular architecture of entire cells. The mining of the rich information contained in the native cellular environment is hindered by the crowded nature of cells populated by many different molecular species. The accurate detection of individual molecules in 3D is a critical step towards allowing the visualization of these molecular machines at high-resolution. Motivated by recent advances in deep neural network approaches for object detection in natural images and autonomous navigation, this project seeks to apply these methods to detect the position of macromolecules within 3D images of frozen hydrated cells with the ultimate goal of understanding cellular function and disease at the molecular level.

Outcomes: As part of this project, students will write computer code that will take as input 3D volumes of cells and automatically detect the location of multiple molecular species so they can later be extracted and used for high-resolution 3D visualization. Students will carry out the development in a dedicated high-performance computing (HPC) environment and at the end of the project will write a research paper to describe their approach and present results obtained on real datasets.

Skills: Knowledge of Python and background or interest in deep learning, image processing or computer vision.

Lead: Xiaowei Yang

Description: As global Internet traffic grows, more and more content networks depend on IP anycast to serve their global requests from multiple content caches. Unlike DNS-based content load balancing, the anycast network distributes clients' requests at the mercy of the inter-domain routing protocol, the Border Gateway Protocol (BGP)[3]. Previous work measured the real performance and benefits of the anycast network and observed highly skewed and sub-optimal load distribution[1,2].

In order to understand what causes the inefficiency of IP anycast, we propose to measure to what extent network providers optimize the anycast network in the wild. Unlike previous anycast measurement projects, which focus on application-level performance, we focus on mining the control plane, i.e., BGP prefix configuration parameters of various routers for different anycast service providers.

Outcomes: The ideal deliverables include a project write-up that can lead to a publication at a high-quality networking conference, including but not limited to Internet Measurement Conference, ACM SIGCOMM, and USENIX NSDI.

Skills: Required:

  1. Familiarity with common Linux commands;
  2. Familiarity with Python programming and the requests package; and
  3. Familiarity with data processing in Python, such as regex and pandas.

Knowledge of BGP and inter-AS routing and previous experience in network measurement are preferred. Students who took a previous offering of CS356 should have sufficient background knowledge to participate in this project.
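As a taste of the kind of data processing involved, the Python sketch below parses route records with a regular expression and summarizes, per prefix, the origin ASes seen and how many routes use AS-path prepending. The "prefix|AS path" line format is a simplified assumption rather than the output of any particular tool, and the prefixes and AS numbers come from the ranges reserved for documentation.

```python
import re
from collections import defaultdict

# Hypothetical record format: "<IPv4 prefix>|<space-separated AS path>".
LINE_RE = re.compile(
    r"^(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\|(?P<path>\d+(?: \d+)*)$"
)


def parse_route(line):
    """Return (prefix, AS path as a list of ints), or None if malformed."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("prefix"), [int(a) for a in match.group("path").split()]


def strip_prepending(path):
    """Collapse consecutive repeats of the same AS number."""
    collapsed = []
    for asn in path:
        if not collapsed or collapsed[-1] != asn:
            collapsed.append(asn)
    return collapsed


def summarize(lines):
    """Per prefix: route count, set of origin ASes, count of prepended paths."""
    summary = defaultdict(lambda: {"routes": 0, "origins": set(), "prepended": 0})
    for line in lines:
        parsed = parse_route(line)
        if parsed is None:
            continue  # skip lines that do not look like route records
        prefix, path = parsed
        entry = summary[prefix]
        entry["routes"] += 1
        entry["origins"].add(path[-1])  # the origin AS is last on the path
        if len(strip_prepending(path)) < len(path):
            entry["prepended"] += 1
    return dict(summary)


# Made-up records for illustration.
sample = [
    "192.0.2.0/24|64500 64501 64496",
    "192.0.2.0/24|64502 64496 64496 64496",
    "198.51.100.0/24|64500 64497",
    "not a route",
]
report = summarize(sample)
```

The same per-prefix summary could be built with pandas by loading the parsed records into a DataFrame and grouping by prefix.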


Recent Years' Summer Research Projects:

2020   2019