Sarah Cohen's talk on 11/18 (see slides from 11/18, attached to the syllabus)

Public interest reported
---
Powerful people and institutions want to hide things, keep them secret. The public needs to know this.

Common traits--for online journalism
---
Multiple sources with varying accuracy.
People in newsrooms have no technical ability or interest in it.
Lots of data is in unstructured format: documents or unstructured data. How do you get information out of court hearings, for example?
Journalists don't have concrete goals; they want to know what's in the information. What's there?
Street reporting meets document/data analysis: you have to talk to people and do the lab work.
Reporters and journalists all suffer from ADD ('attention deficit disorder'): they rarely spend much time on a given data source or question.
Reporters look at data for tips and examples. They don't look for evidence or statistical patterns. So there's no real data mining per se; it's a look for 'nuggets'. Eventually you need a 'smoking gun' to run a real story. But you need to know there's a smoking gun before you can find it. (this is Ola's take)
---
Journalists talk to people, look at real email, look at restaurant health forms. Don't use Google to find what other people say.
Example: 200 million checks sent to farmers (more on that coming).
Data analysis isn't mission critical: the story will still get done, people won't die, nothing bad will happen. So why do data analysis?
The unique identifiers for people are often censored. You can't cross-match on SSNs, for example; they're usually scrambled.
Little or no control over the form of the data.
Is data always good? NO! See the Sunlight Foundation. The data on the stimulus bill, for example, is flawed: you can't always make coherent or reasonable judgments based on the data because you can't (necessarily) trust it.
------
Transparency Myths
Government records are hard to get in usable form.
Examples when it is usable: Recovery Act, campaign finance, crime (see http://recovery.gov/Pages/home.aspx)
The data just doesn't add up. It's not the government's fault: they do open work, and it's not their job to make sense of the data. Do we want the government to spend money on making the information 'clean'? No! We want them to solve important problems like making jobs.
All records should be obtainable via open records laws.
But: data.gov promised 100,000 datasets. Today 728 have been delivered. There are aggregate statistics.
Quote: I hope the term 'quantitative journalism' will be quaint soon.
---
There are incredibly interesting mashups with 'campaign contributions' meets 'boats'. We looked at 'Oakland crime maps'. These are the only two sets of data everyone can get lots of. There's not much more than that widely available: we dig where the hole already is, in these two areas. In fact, most information is hidden: in file cabinets, somewhere they don't know about. Government is incredibly good at gathering information but not so good at doing stuff with the information.
--
Example: Harvesting cash: farm subsidies (rice and cotton sources).
Drought money goes to places without droughts.
Disaster relief goes to places with no disasters.
How do you find this? Look at weather records, drought monitoring, property records, crop estimates.
When should you stop farming in flood areas? After 30 years! That's the average reporting window.
Discrepancy: people get money for farming when they're not farmers. Look to see who doesn't get tax relief for farming (which farmers should get) but does get subsidy payments.
Everyone in Congress is in a farming state: e.g., food stamps are farming?

Story: 2006: 'Forced Out' from the Washington Post. What happens when landlords force tenants out to reclaim the building: they don't want to protect/fix the housing, they want to convert to condos.
Source: file cabinets with tax breaks. But streets 'c'-'f' weren't there.
A spreadsheet was provided with 500 rows of data, where the colors carried meaning. 200,000 housing code complaints (see her slides). Lots of on-the-street work, checking sources and cross-referencing of materials.

Washington Post: open source timeline maker from MIT, SIMILE Timeline? They pushed in lots of Obama data and summarized it into graphics for 100 days. When did Obama reverse something from the Bush administration?
Ultimately they want to publish this: POTUS Tracker, the daily schedule put into a database so the audience can explore it. The Post will have a resource no one else has: lots of data that has been tagged. It gets to trends and news that no one else will get to. The audience can see what the Post wants them to see, but not the Deep Data.

Similar: HeadCount (also from the Washington Post). Lots of information about the folks that are in the Obama administration. Lets the audience look via interactive graphics, but it is also a resource for the news gathering process. Sarah says it took her six weeks to profile all appointees to see who had given money etc. Now the information is gathered every day/week instead of once, so the data is updated constantly.
HeadCount gets 40,000-80,000 hits/day. Many people have visited the site before. It's unique.
Specific strategy: hard, labor intensive, and no one else does it. This is a business decision that is driving access to data, but it also provides benefit to the Post in economic terms.
---
Government leads you to the answer they want you to see: the government wants to look good. So we shouldn't expect more than that.
We want transparency and open data. Sunlight wants 'wholesale' data, not 'retail' data. But these groups have little luck. The crowd will figure out what's in the data. The government is reluctant to do this because they haven't curated the data.
--
What does Sarah do?
Project names: What's New?, The Real Story, Chronologies and social network tools, Text mining of government documents, Audio and video analysis.
What does the SEC do in terms of settlements?
What do all state attorneys general do in aggregate? How do you do this without going to all the sites?
Blogs: one molecule of new information is then re-reported by thousands. To follow the triangle: follow 20-30 blogs, email neighborhood lists, websites, government websites. We have to look at them all. The 'anti-aggregator' will look for new discoveries, filtering out everything we've already seen.
Chronologies and social-network analysis: who is connected to whom. These tools haven't been used in a news environment.
Text mining: going into documents to analyze them. Easley trial: 10,000 pages of 'stuff'; how do we get information from this? Entity extraction: get every name. There's a project in Europe to read 2 million forms/stories from WWI. How do we work with music? Digital humanities.
Most government activities are only available via text and audio with no transcription, so we need video/speech recognition. These are things people are working on. How does the ACLU analyze Guantanamo hearings? Randomly type in a date to see what happens.

What opportunities are there?
DukeEngage project: a middle layer of sense-making on unexplored datasets (local and national).
Hobbyists who translate large datasets.
Apps for America: find ways to use government data.
Sunlight has a 'hackathon' in December to get access to local/state data.
Who works with contracts and grants? Stimulus data (ProPublica, e.g.).
We want to see visualizations that say something new. Many are eye candy, but there's nothing new in them, or we don't trust them.
How do we find ways to get useful, meaningful information from news sites?
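
A rough sketch of the 'anti-aggregator' idea from the project notes above: keep only items that are not re-reports of something already seen. This is not how Sarah's project works (the talk gave no implementation details); it is one simple way to do it, assuming that re-reported items share most of their wording. The function names, the three-word shingle size, the 0.5 similarity threshold, and the sample items are all made up for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word runs in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Overlap between two shingle sets: 0.0 (nothing shared) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def anti_aggregate(items, threshold=0.5):
    """Keep only (source, text) items that don't closely match anything seen earlier.

    threshold is an assumed cutoff; a real system would need to tune it.
    """
    seen = []   # shingle sets of every item looked at so far
    fresh = []  # items that appear to carry new information
    for source, text in items:
        sig = shingles(text)
        is_repeat = any(jaccard(sig, old) >= threshold for old in seen)
        seen.append(sig)
        if not is_repeat:
            fresh.append((source, text))
    return fresh


# Hypothetical items, in the order they were collected.
items = [
    ("city blog", "The council voted to close the Elm Street library branch on Tuesday."),
    ("neighborhood list", "BREAKING: the council voted to close the Elm Street library branch on Tuesday!"),
    ("county site", "Road crews will repave Main Street starting next month."),
]

fresh = anti_aggregate(items)
for source, text in fresh:
    print(source, "->", text)
```

Here the neighborhood-list item is dropped because nearly all of its wording repeats the city blog item; the other two are kept. Word overlap only catches near-verbatim re-reporting. A paraphrased re-report would slip through, so this is a first filter, not a replacement for reading the sources.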