The third annual Computer Science Undergraduate Project Showcase celebrated student inquiry in computer science. Students submitted 16 videos on projects from mentored research, class projects, and independent work. The projects showcased students creatively applying skills and concepts from their computer science coursework.
2020 Undergrad Research Projects
Team: Maya Choudhury, Joyce Er, Jianchao, JJ Jiang, Emily Liu, Andres Montoya-Aristizabal, Dora Pekec, Rahul Sengottuvelu, Charlie Todd, Siyi Xu, Matthew O'Boyle, David Yoon, Jackson Proudfout, Daniel Zhou, Felicia Chen, Arthur Zhao
Abstract: Our society is struggling with an unprecedented number of falsehoods, hyperbole and half-truths that do harm to democracy, health, economy and national security. Fact-checking is a vital defense against this onslaught. Despite the rise of fact-checking efforts globally, fact-checkers find themselves increasingly overwhelmed, and their messages struggle to reach some segments of the public.
Building on a collaboration between Public Policy and Computer Science at Duke on computational journalism and fact-checking, this project seeks to leverage the power of data and computing to help make fact-checking and dissemination of fact-checks to the public more effective, scalable and sustainable.
Team: Cameron King, Christine Yang
Abstract: Over the course of the independent study, we built, tested and launched a social app called Parrot. Parrot lets students find out who may like them back by sending each other anonymous chirps. If two people unknowingly chirp each other, it's a pair. In this study we explored how to manage this app, including data storage and security, iterative UX design and improving app performance. Parrot was initially made using Swift and Xcode, but is now cross-platform for both iOS and Android. Learn more about Parrot.
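The pairing rule described above (two people who chirp each other form a pair) can be sketched in a few lines. This is only an illustration under assumptions: the function name `find_pairs` and the representation of chirps as (sender, recipient) tuples are hypothetical and are not taken from the Parrot codebase.

```python
def find_pairs(chirps):
    """Return the set of mutual pairs from a list of (sender, recipient) chirps.

    Assumption: a chirp is a plain (sender, recipient) tuple; the real app
    stores chirps in its own backend, which is not described in the abstract.
    """
    sent = set(chirps)
    # A pair exists only when both directions of the chirp are present.
    return {frozenset((a, b)) for a, b in sent if a != b and (b, a) in sent}


chirps = [("ana", "ben"), ("ben", "ana"), ("cam", "ana")]
pairs = find_pairs(chirps)
```

Here only "ana" and "ben" form a pair; the one-way chirp from "cam" stays anonymous.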
Team: Caroline Wang, Bin Han, Feroze Mohideen, Bhrij Patel
Abstract: In recent years, academics and investigative journalists have criticized certain commercial risk assessments for their black-box nature and their failure to satisfy competing notions of fairness. Since then, the field of interpretable machine learning has created simple yet effective algorithms, while the field of fair machine learning has proposed various mathematical definitions of fairness. However, studies from these fields are largely independent, despite the fact that many applications of machine learning to social issues require both fairness and interpretability.
We explore the intersection by revisiting the recidivism prediction problem using state-of-the-art tools from interpretable machine learning, and assessing the models for performance, interpretability, and fairness. Unlike previous works, we compare against two existing risk assessments (COMPAS and the Arnold Public Safety Assessment) and train models that output probabilities rather than binary predictions. We present multiple models that beat these risk assessments in performance, and provide a fairness analysis of these models. Our results imply that machine learning models should be trained separately for separate locations, and updated over time.
Student: Joseph DeChicchis
Abstract: Although augmented reality (AR) devices and developer toolkits are becoming increasingly ubiquitous, current AR devices lack a semantic understanding of the user’s environment. Semantic understanding in an AR context is critical to improving the AR experience because it aids in narrowing the gap between the physical and virtual worlds, making AR more seamless as virtual content interacts naturally with the physical environment. A granular understanding of the user’s environment has the potential to be applied to a wide variety of problems, such as visual output security, improved mesh generation, and semantic map building of the world. This project investigates semantic understanding for AR by building and deploying a system which uses a semantic segmentation model and Magic Leap One to bring semantic understanding to a physical AR device, and explores applications of semantic understanding such as visual output security using reinforcement learning trained policies and the use of semantic context to improve mesh quality.
Student: Belanie Nagiel
Abstract: IP Anycast is a network addressing and routing methodology where network operators are able to announce the same IP address prefix from sites in different locations. Each of these sites can provide the same service to the end users, and clients’ requests are automatically distributed among the sites according to BGP. Recently, anycast has also been leveraged in a new mechanism called triangular anycast forwarding. In this mechanism, users are directed first to an instance of an anycast address instead of directly to the services that they request. The anycast instance can filter out any unwanted or dangerous traffic and then forward only the legitimate traffic on to the actual service that the user is requesting. Whether or not triangular forwarding is being implemented, latencies between a client and the different anycast instances can vary. The set of sites chosen to advertise the anycast address prefix can have a large impact on latency, so choosing the optimal set of anycast sites can help decrease one-way and round-trip times to anycast sites. In this paper, we examine the performance and reliability of both anycast systems and triangular anycast forwarding systems in terms of the control over load balancing, route stability, and quality of paths chosen. We find that, with regard to the quality of paths chosen from end users to the anycast instances, only 45.23% of the chosen paths were the optimal choice in terms of latency. Additionally, we find that the average difference in time between the chosen path and the best path is 54.51 milliseconds, and the maximum difference can exceed 500 milliseconds.
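The path-quality figures above (share of optimal choices, average gap, maximum gap) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the client and site names, the latency values, and the choice to average the gap over all clients, including those already on the best path. It is not the measurement code or data from the study.

```python
from statistics import mean

# Hypothetical measurements: latency (ms) from each client to each anycast
# site, plus the site that routing actually selected for that client.
latency_ms = {
    "client_a": {"site_1": 20.0, "site_2": 35.0, "site_3": 80.0},
    "client_b": {"site_1": 90.0, "site_2": 40.0, "site_3": 45.0},
    "client_c": {"site_1": 15.0, "site_2": 60.0, "site_3": 30.0},
    "client_d": {"site_1": 70.0, "site_2": 25.0, "site_3": 55.0},
}
chosen_site = {
    "client_a": "site_1",
    "client_b": "site_3",
    "client_c": "site_2",
    "client_d": "site_2",
}


def path_quality(latency_ms, chosen_site):
    """Return (fraction of optimal choices, mean gap in ms, max gap in ms)."""
    gaps = []
    for client, sites in latency_ms.items():
        best = min(sites.values())
        # Gap between the chosen site's latency and the lowest available one.
        gaps.append(sites[chosen_site[client]] - best)
    optimal_fraction = sum(1 for g in gaps if g == 0) / len(gaps)
    return optimal_fraction, mean(gaps), max(gaps)


optimal_fraction, mean_gap, max_gap = path_quality(latency_ms, chosen_site)
print(f"optimal: {optimal_fraction:.0%}, mean gap: {mean_gap:.2f} ms, max gap: {max_gap:.2f} ms")
```

In this toy data set two of the four clients are routed to their lowest-latency site, so half of the chosen paths are optimal.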
Team: Junyu Liang, Webster Bei Yijie
Abstract: Machine Learning (ML), especially Deep Learning (DL), has become a new darling of technology companies. Indeed, ML has replaced traditional algorithms in many tasks; the best known include content recommendation, facial recognition, and fraud detection. The rise in popularity of ML/DL is accompanied by growth in the computing power of chips developed in recent years as well as the availability of the large-scale datasets that are necessary for “learning”. ML algorithms have been widely deployed in industry with the help of a variety of toolkits developed by industry leaders as well as researchers in academia.
Similar to the adoption of any significant technology in history, ML is on its way to becoming more beginner-friendly. In companies that now rely heavily on ML in their production environments, foolproof and streamlined procedures for taking raw data to actionable insights or revenue-generating functionalities were developed long ago and are continuously optimized to ensure efficiency and reliability.
This is, however, not true for small to medium businesses and independent developers who wish to exploit the potential of machine learning while remaining focused on the primary objectives that their products target. For those companies, all-in-one solutions such as Google Cloud AI & Machine Learning lack the flexibility and privacy that they ask for. In addition, building their own machine learning pipeline takes too much effort to be worthwhile. In view of such needs, the Slaking Machine Learning platform was developed.
Slaking is built to reduce the overhead that small to medium companies encounter when building a machine learning pipeline. The Slaking Machine Learning platform is a software suite deployed on a Kubernetes cluster. It is reliable and easily scalable through distributed training and multiple instances of the inference endpoint backed by load balancing. With a simple user interface, Slaking enables users to create a full machine learning pipeline (model construction, training, deployment to a public-facing inference endpoint) within a few clicks. For ML beginners, Slaking also provides a drag-and-drop visual model editor so that they can create ML models without writing code.
Student: Anshul Shah
Abstract: Computer Science 101 is an extremely popular class at Duke University, regularly enrolling over 200 students divided into around 10 lab sections. Before this project was developed, gauging student comprehension by means other than general test-taking was challenging. As a result, the initial goal of the project was to generate hundreds of questions on various concepts for students to answer. After collecting the data, the course staff would be able to answer a variety of questions, from lab-level statistics on certain topics to individual student progression over time.
First and foremost, the online tool serves as an aid to the entire course curriculum, spanning topics from the first to last units of study. Students can use the tool to supplement their learning in lecture and lab throughout the semester. The questions are in the 'What Will Python Display' format, where students are given a snippet of code and must select the correct output.
Further, the site supports a number of features to improve study habits and allow students more flexibility in learning the concepts. Students have complete control over the format of the quiz, from the number of questions to how many and which concepts are present on the quiz. This allows students to mix various concepts in a single quiz, simulating an important aspect of the test-taking experience.
The site also allows students to analyze their own response history by leveraging the data collected from each quiz. Specifically, students can see their “mastery” level for each concept, which helps them optimize studying by moving on to another concept after achieving mastery. The project can be found at https://cs101-quiz.cs.duke.edu/. You will need a Duke NetID and password to view the site.
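The abstract does not say how the “mastery” level is computed, so the following is only a sketch of one plausible definition: the fraction of correct answers among a student's most recent responses for each concept. The window size, the threshold, and the function name `mastery_levels` are all assumptions introduced for illustration.

```python
from collections import defaultdict, deque

WINDOW = 5        # assumed: only the 5 most recent answers per concept count
THRESHOLD = 0.8   # assumed: 80% recent accuracy counts as "mastered"


def mastery_levels(responses, window=WINDOW):
    """Map each concept to the fraction of correct answers in its recent window.

    `responses` is a chronological list of (concept, is_correct) pairs.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    for concept, is_correct in responses:
        recent[concept].append(is_correct)  # old answers fall out of the window
    return {concept: sum(r) / len(r) for concept, r in recent.items()}


history = [
    ("loops", False), ("loops", True), ("loops", True),
    ("loops", True), ("loops", True), ("loops", True),
    ("dictionaries", True), ("dictionaries", False),
]
levels = mastery_levels(history)
mastered = [concept for concept, level in levels.items() if level >= THRESHOLD]
```

With a sliding window, an early mistake on "loops" stops counting once enough recent answers are correct, which matches the idea of moving on after achieving mastery.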
Student: Jackson Hubbard
Abstract: Over the course of the past two semesters, I have created an iOS application that utilizes 3D scanning technology to allow athletic medicine professionals and team doctors to scan the anatomies of the athletes they work with. The app was developed for Protect3d, a startup leveraging 3D technologies to create custom-fitting athletic protective devices. In order for the company to deliver its products to users across the country, it was necessary to develop an application that allows any user to harness the power of 3D scanning. In the fall, I completed an independent study and successfully created a minimum viable product of the app. This spring semester, I have continued working on the app, making improvements such as the addition of a backend server, as well as many new features. Towards the beginning of the spring semester, the app was released to 3 collegiate athletic programs in the area and was used to create custom-fitting pads for numerous athletes on teams such as lacrosse, football, and soccer. The app is on schedule to be released to athletic programs across the country towards the end of the summer.
The app gathers information from the user, provides custom scanning instructions, and then performs a 3D scan. This process takes less than two minutes from start to finish. This 3D scan model, as well as the information that was gathered about the athlete who was scanned and their injury, is then sent to the startup by creating an “order” that is sent through the cloud. The app also displays the real time status of the user’s orders that are being modeled and printed by the startup.
The app is written primarily in Swift and uses Firebase for its backend. The 3D scanning portion of the app is written in Objective-C and implements a 3D Scanning SDK. When I started this project in September, I did not know anything about creating iOS apps or about 3D Scanning. I am very proud of the progress I have made and how the app I developed has played a role in getting athletes back onto the field with custom-fitting protective gear!
Team: Christopher Suh, Andre Wang, Henry Zhou, Isaac Zhang
Abstract: In this work, we consider one of the most difficult challenges faced in natural language generation: that of generating text that is highly constrained. We focus in particular on limerick generation, which requires rhyming constraints, constraints on the meter, constraints on the number of syllables in each line, grammatical constraints, a fixed number of lines in the poem, along with the expectation of a reasonable storyline, and perhaps some humor. To address these challenges, we present a new framework for generating limericks, which leverages a wide variety of NLP tools but does not always use these tools in the way they were originally intended. These tools, and their modifications to suit our purposes, are incorporated into our algorithm, Diverse Trajectory Search (DTS), which allows for broad and efficient exploration of the space of limericks. The resulting limericks satisfy poetic constraints and have thematically coherent storylines, which are sometimes even funny (when we are lucky).
Team: Sriram Gollapudy, Jason Lee, Angikar Ghosal, and Kush Gulati
Abstract: Scheduling courses at Duke is done primarily intra-departmentally, and there is little data available on how often courses offered by different departments may be correlated (i.e., how often students would try to take both courses in the same semester) or on how often two courses may have been offered. This results in suboptimal student schedules and suboptimal class schedules each semester, as well as possible logjams towards the end of Duke students’ careers, leading to extra semesters of classes. We set out to alleviate these problems via three different deliverables in partnership with the Duke Registrar.
First, we leveraged historical course catalog data to create visualizations showcasing the distribution of classes across different time periods. Then, we created a schedule planner for students that visualizes how likely two or more classes are to collide. Lastly, we were given access to roster data, which tells us which students took which courses. We are using this to create a course scheduling assistant for the registrar. With a metric built from this and other data, the registrar can weigh the significance of clashes between classes to determine the optimal course schedule for students.
The last of these deliverables is still primarily in its research phase. We needed to figure out a good metric to determine how much a class “clashed” with another. We used the naive approach of class correlations: given one class, what was the probability that a student took another given class in the same semester? By using clustering algorithms, we wanted to validate how accurate our metrics were; specifically, we expected the number of clusters to roughly correspond to the number of majors in our dataset, without lining up exactly with majors due to overlap. Our class correlations formed a high-dimensional data set, so we reduced the dimensionality via t-distributed Stochastic Neighbor Embedding (t-SNE). We then ran k-Means and DBSCAN on these outputs for various sets of classes, such as all Biology, Psychology, and Chemistry classes and all CS, Math, and Stats classes, to see how close they came to ground truth (3 or so clusters). DBSCAN output 3 clusters for both sets, and k-Means was optimized with 3 clusters, but due to noise we also attempted to use various Trinity codes to generate a heuristic to refine our metric. This heuristic accounts for the course number (which denotes how advanced the class is), the class’s relative position among other courses in the same department, other departments in which the class is cross-registered, the overall area of study the class belongs to, and other general education requirements. It is presently being scaled to all classes offered at Duke.
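The class-correlation idea can be sketched as a conditional co-enrollment probability computed from roster rows. The roster format, the course codes, and the function name `co_enrollment` below are hypothetical; the sketch shows one straightforward way to estimate such a probability and is not the team's actual metric.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical roster rows: (student_id, semester, course).
roster = [
    ("s1", "F19", "CS201"), ("s1", "F19", "MATH221"),
    ("s2", "F19", "CS201"), ("s2", "F19", "MATH221"),
    ("s3", "F19", "CS201"), ("s3", "F19", "BIO201"),
    ("s4", "F19", "BIO201"), ("s4", "F19", "CHEM101"),
]


def co_enrollment(roster):
    """Estimate P(student takes B | student takes A) within the same semester."""
    schedules = defaultdict(set)
    for student, semester, course in roster:
        schedules[(student, semester)].add(course)

    taken = defaultdict(int)     # number of schedules containing course A
    together = defaultdict(int)  # number of schedules containing both A and B
    for courses in schedules.values():
        for a in courses:
            taken[a] += 1
        for a, b in permutations(courses, 2):
            together[(a, b)] += 1
    return {(a, b): n / taken[a] for (a, b), n in together.items()}


prob = co_enrollment(roster)
```

Note that the measure is asymmetric: in this toy roster every MATH221 student also takes CS201, but only two of the three CS201 students take MATH221. Each course's row of probabilities could then serve as the high-dimensional vector that is reduced and clustered.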
Student: Felicia Chen
Abstract: Vaccine hesitancy was named one of the top 10 global health threats by the WHO in 2019. However, there is still no personalized intervention to combat vaccine hesitancy. One challenge here is that it is difficult to pick out vaccine myths from individual sentences, and sentence-by-sentence rebuttal may not work well.
To address this challenge, we built a taxonomy for vaccine misinformation, which served as a knowledge hierarchy of common anti-vaccination tropes. Our taxonomy is broken down into five key categories. We can then label anti-vaccination articles into categories in our taxonomy, which helps us with determining more personalized and effective interventions. We tried three different approaches to labeling: basic counting, weighted by TF-IDF, and word embeddings.
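Two of the labeling approaches named above, basic counting and TF-IDF weighting, can be contrasted in a short sketch. The taxonomy categories, keywords, and example texts below are placeholders invented for illustration (the abstract does not list the five real categories), and the weighting shown uses a plain inverse document frequency, which may differ from the variant the project used.

```python
import math
from collections import Counter

# Hypothetical taxonomy: category -> keywords. These two categories are
# placeholders, not the five categories from the project.
taxonomy = {
    "safety": {"toxin", "injury", "mercury"},
    "conspiracy": {"coverup", "profit", "pharma"},
}

articles = [
    "pharma profit drives the coverup and the mercury story",
    "mercury is a toxin and causes injury",
    "the profit motive of pharma",
]


def tokenize(text):
    return text.lower().split()


def idf_table(docs):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def label(article, idf, weighted=True):
    """Score each category by keyword counts, optionally weighted by IDF."""
    counts = Counter(tokenize(article))
    scores = {}
    for category, keywords in taxonomy.items():
        scores[category] = sum(
            counts[k] * (idf.get(k, 0.0) if weighted else 1.0) for k in keywords
        )
    return max(scores, key=scores.get), scores


idf = idf_table(articles)
best_category, scores = label(articles[0], idf)
```

Weighting by IDF makes rare, distinctive keywords count for more than keywords that appear in most articles, which basic counting treats equally.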
After deciding the type of misinformation an anti-vaccination article propagates, the next challenge was to determine whether or not an article was anti-vaccination. We trained a classifier on anti-vaccination and news articles related to vaccination.
Finally, we surveyed participants to determine whether they preferred the default CDC intervention or our personalized intervention for anti-vaccination articles.
Team: Sachit Menon, Alex Damian, McCourt Hu, Nikhil Ravi
Abstract: The primary aim of single-image super-resolution is to construct a high-resolution (HR) image from a corresponding low-resolution (LR) input. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present a novel super-resolution algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require training on databases of LR-HR image pairs for supervised learning). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the “downscaling loss,” which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee that our outputs are realistic. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show extensive experimental results demonstrating the efficacy of our approach in the domain of face super-resolution (also known as face hallucination). Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.
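The idea of searching a latent space for images that downscale to the LR input can be illustrated with a toy sketch. Several things here are assumptions rather than the PULSE implementation: the downscaling operator is simple average pooling, the loss is a mean squared difference, `toy_generator` is a hypothetical stand-in for a real generative model, and plain random local search replaces the optimization procedure used in the paper. The constraint on the search space derived from high-dimensional Gaussians is not modeled.

```python
import random


def downscale(image, factor):
    """Average-pool a square grayscale image (list of rows) by `factor`."""
    size = len(image) // factor
    return [
        [
            sum(
                image[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ) / factor**2
            for c in range(size)
        ]
        for r in range(size)
    ]


def downscaling_loss(candidate_hr, lr, factor):
    """Mean squared difference between the downscaled candidate and the LR input."""
    ds = downscale(candidate_hr, factor)
    diffs = [(a - b) ** 2 for row_a, row_b in zip(ds, lr) for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def toy_generator(z):
    """Hypothetical stand-in for a generative model: 16 latents -> 4x4 image."""
    return [[z[4 * r + c] for c in range(4)] for r in range(4)]


def latent_search(lr, factor, steps=2000, step_size=0.1, seed=0):
    """Random local search over the latent vector, keeping only improvements."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(16)]
    best = downscaling_loss(toy_generator(z), lr, factor)
    for _ in range(steps):
        proposal = [v + rng.gauss(0, step_size) for v in z]
        loss = downscaling_loss(toy_generator(proposal), lr, factor)
        if loss < best:
            z, best = proposal, loss
    return z, best


lr_image = [[0.5, -0.5], [0.0, 1.0]]
z, final_loss = latent_search(lr_image, factor=2)
```

The search never looks at the LR image directly when producing a candidate; the LR image only enters through the loss, which is the sense in which the method explores the generator's outputs for ones that downscale correctly.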
Team: Norah Tan, Jessica Centers
Abstract: In the study of bandwidth compression and direct-current-free (DC-free) coding, power spectrum analysis of the write signal produced by a given coding scheme and signaling method is a critical tool. Constrained codes that forbid certain patterns can alleviate inter-symbol interference in magnetic recording systems and inter-cell interference in Flash systems. In this project, based on the finite-state transition diagram (FSTD) representation of the codes, we derived the power density spectrum for general constrained codes with non-return-to-zero (NRZ) signaling and applied the method to two sets of constrained codes: Tx and Ax, where Ax-constrained codes are the asymmetric counterparts of Tx-constrained codes.
Student: Esther Brown
Abstract: Currently, 74% of all mobile phones in the world run on Android's operating system. Android mobile development is therefore essential to creating mobile applications that are accessible to people all over the world. The goal of this project-based independent study was to learn more about the world of Android mobile development. Through the implementation of five mobile applications that increased in functionality and complexity with each new app, this project explored resources that are available to Android mobile developers. The mobile applications implemented as part of this project pull in extensive data using resources such as Google Search developer APIs, the World Health Organization's website, and Statista (a statistics/data portal). In addition, the data used in these apps is continuously updated in real time as new data is reported. The five mobile applications implemented in this project are: 1) a “To-do” app where users can add and remove items from a list of tasks, 2) a movie browsing app that fetches data from Google Search, 3) a recipe and cooking recommendation app that fetches data from The Cooking Channel TV’s website, 4) an app that provides real-time data/statistics on the COVID-19 outbreak around the world, and 5) an app that provides real-time data on all reported earthquakes in the world.
Team: Matthew O'Boyle, Jake Derry
Abstract: The Design Checklist is a tool currently used by students, TAs and professors in CompSci307 & CompSci308 to visualize the static analysis of code. It highlights issues in students' design choices that make the code less extensible.
Student: Bailey Heit
Abstract: Our world is shifting towards a predominantly digital world, one in which humans are constantly connected to their phones, tablets, and computers. In order to proactively monitor and track our activities, researchers must develop an effective model to recognize activities based on eye movement. Areas including the industrial, educational, sports, and health sectors will benefit from this advancement in activity recognition. Already, computers can track how applications are being used and for how long through software technologies and more recently auditory approaches; however, these approaches remain limited. If executed correctly, eye tracking could be the answer that researchers and engineers have been looking for. One potential execution of this is through machine learning classifiers, which can be trained to recognize activities based on eye movement measures.
Eye tracking estimates the gaze of a subject (where the subject’s eyeball is directed) at a given time. At a regular interval, the (x,y) coordinate of the gaze point is recorded. These coordinates can be grouped together into either saccades or fixations. A saccade is a sudden change in direction of eye orientation, lasting between 30 and 120 ms. A fixation is the period between saccades in which the eye is held stable to view an area of interest. Because the raw data is composed of (x,y) coordinates in a plane, there must be a method to group the individual coordinates into clusters of saccades and fixations. This allows us to look at the duration and positioning of fixations and the timestamps of saccades. A fixation filter has been designed to group gaze points that are spatially and temporally close; it is sensitive enough to group nearby points together yet robust enough to a noisy gaze position signal to avoid producing false fixation estimates. The ultimate goal of this filter is to estimate the spatial position of fixations as well as saccades. The final product will be a series of saccade points, in which the gap between two consecutive points is a fixation with its duration. Low-level measures include features based on fixations and saccades.
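Grouping gaze samples that are spatially and temporally close can be sketched with a dispersion-threshold pass (commonly called I-DT). This is a well-known generic technique swapped in for illustration; the abstract does not specify which algorithm the project's fixation filter uses. The thresholds, the pixel units, and the function name `detect_fixations` are assumptions.

```python
def detect_fixations(samples, max_dispersion=30.0, min_duration=100.0):
    """Group gaze samples into fixations with a dispersion-threshold pass.

    `samples` is a chronological list of (t_ms, x, y) tuples. A window is a
    fixation if it spans at least `min_duration` ms and its dispersion,
    (max x - min x) + (max y - min y), stays within `max_dispersion`
    (assumed here to be in pixels).
    Returns a list of (start_ms, end_ms, centroid_x, centroid_y).
    """
    def dispersion(window):
        xs = [x for _, x, _ in window]
        ys = [y for _, _, y in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Grow an initial window that covers the minimum duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break
        if dispersion(samples[i:j + 1]) <= max_dispersion:
            # Extend the window while the points stay spatially close.
            while j + 1 < n and dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            cx = sum(x for _, x, _ in window) / len(window)
            cy = sum(y for _, _, y in window) / len(window)
            fixations.append((window[0][0], window[-1][0], cx, cy))
            i = j + 1
        else:
            i += 1  # Not a fixation: drop the first sample and retry.
    return fixations


# Synthetic gaze trace sampled every 20 ms: a slightly jittery fixation,
# one in-flight saccade sample, then a second fixation.
samples = (
    [(t, 100.0 + (t // 20) % 2, 100.0) for t in range(0, 201, 20)]
    + [(220, 350.0, 250.0)]
    + [(t, 600.0, 400.0) for t in range(240, 441, 20)]
)
fixations = detect_fixations(samples)
```

The gap between the end of one detected fixation and the start of the next then marks a saccade, and fixation durations and centroids can feed the low-level features used for classification.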
After applying the fixation filter to the data, a machine learning model was implemented to analyze how accurately sedentary activities (i.e., reading and writing) can be classified based on a variety of low-level features derived from existing literature. At an interval of 60 seconds, a random forest model achieves an accuracy score of 0.54. After some modifications for more accurate predictions, the fixation filter and classifiers can be used with future eye tracking data.