  • The important parts of data analysis

    March 11, 2014  |  Statistics

    There's plenty of software for mucking around with data, but gaining the skills to really get something out of it takes time and experience. Mikio Braun, a postdoc in machine learning, explains.

    For a number of reasons, I don’t think that you can "toolify" data analysis that easily. I wished it would be, but from my hard-won experience with my own work and teaching people this stuff, I'd say it takes a lot of experience to be done properly and you need to know what you're doing. Otherwise you will do stuff which breaks horribly once put into action on real data.

    And I don't write this because I don't like the projects which exists, but because I think it is important to understand that you can't just give a few coders new tools and they will produce something which works. And depending on how you want to use data analysis in your company, this might break or make your company.

    Braun breaks it down into four bullet points worth a read, but the tl;dr version is that analysis isn't simple, and no tool is going to do everything for you. It's simple with simple data, but you can almost always go deeper with more data, and it takes experience to ask the right questions. So try not to be too content with that software output.

  • The New York Times announced their data-driven news site, The Upshot

    March 11, 2014  |  News

    John McDuling for Quartz writes about the FiveThirtyEight replacement.

    David Leonhardt, the Times’ former Washington bureau chief, who is in charge of The Upshot, told Quartz that the new venture will have a dedicated staff of 15, including three full-time graphic journalists, and is on track for a launch this spring. "The idea behind the name is, we are trying to help readers get to the essence of issues and understand them in a contextual and conversational way," Leonhardt says. "Obviously, we will be using data a lot to do that, not because data is some secret code, but because it’s a particularly effective way, when used in moderate doses, of explaining reality to people."

    With the new FiveThirtyEight coming soon, The Upshot, and plenty of smaller bits sprouting up in other areas, this data-driven news thing might be more than a fad. Hey, statisticians, you want to get in on this? Seriously, there's plenty of data to go around.

  • UP Coffee app helps you track and understand caffeine consumption

    March 10, 2014  |  Self-surveillance

    UP Coffee

    How much caffeine can you consume during the day and still fall asleep at night? For some, it's one cup and they're up all night, whereas others don't feel a thing. UP Coffee, an app from Jawbone Labs, helps you understand your own consumption and caffeine tolerance.

    Data entry is straightforward since it's only for caffeine-related beverages, such as coffee and soda. Enter your beverage, and the app tabulates caffeine amounts for you.

    The key, though, is that it doesn't just stop at milligrams. What does 100 milligrams of caffeine mean anyway? Instead, with a focus on sleep, it tells you how much caffeine you've consumed and how many hours you're expected to feel the effects.

    Pair it with your Jawbone UP band and account for an even wider picture, although you don't have to. I've been using the app with neither, and it's still fun to play with. And it kind of makes me want a band.

  • Statistical concepts explained through dance

    March 7, 2014  |  Statistics

    Forget bell curves, jellybeans, and coin flips to explain statistical concepts. Dancing Statistics is a video series that demonstrates variance, correlation, and sampling through choreographed movements. The dance below explains variance.

    Watch the full playlist here. [via infosthetics]

  • Before and after lot vacancy

    March 6, 2014  |  Data Art

    vacated

    Justin Blinder used New York's city planning dataset and Google Street View for a before-and-after view of vacant lots.

    Vacated mines and combines different datasets on vacant lots to present a sort of physical facade of gentrification, one that immediately prompts questions by virtue of its incompleteness: “Vacated by whom? Why? How long had they been there? And who’s replacing them?” Are all these changes instances of gentrification, or just some? While we usually think of gentrification in terms of what is new or has been displaced, Vacated highlights the momentary absence of such buildings, either because they’ve been demolished or have not yet been built. All images depicted in the project are both temporal and ephemeral, since they draw upon image caches that will eventually be replaced.

  • Find new beers to drink

    March 5, 2014  |  Network Visualization

    Beer similarities

    Based on reviews from BeerAdvocate, Beer Viz, a visualization class project, asks you to choose a general style of beer and a beer that you like. Then it shows you beers that are similar, based on appearance, taste, aroma, and overall score. It's like a visual version of the beer recommendation system we saw last year.

  • Basketball movements visualized

    March 4, 2014  |  Mapping

    Tim Duncan movements

    The NBA has been kind of gaga over data the past few years, and they recently announced that all 30 teams would have player tracking installed so they can see where they go at night after games. Wait, no. I mean so that there is data on where each player is on the court at any given time. Fathom Information Design played with some of this data for an Oklahoma City versus San Antonio game, with some sketches.

    Above are the movements of power forward Tim Duncan, who sticks around the middle of the court throughout a game. A guard, on the other hand, runs around the court more. This is obvious if you've watched Duncan play, but sketches like this coupled with spatiotemporal analysis could be interesting.

    Also, I get the sense that there are more people who want to learn from this data than there are people who know how to analyze it, so if you're a statistician on the job hunt, there's that.

  • ProPublica opened a data store

    March 4, 2014  |  Data Sources

    One of the main challenges of any data project is getting the data. It seems obvious, but the effort to get the right data to answer a question seems to catch people off guard. Even data that's "free" to download can be a huge pain that ends up completely useless. ProPublica, the non-profit newsroom, deals with this stuff on a regular basis and hopes that some of their efforts can turn into a source of funding through the Data Store.

    Like most newsrooms, we make extensive use of government data — some downloaded from "open data" sites and some obtained through Freedom of Information Act requests. But much of our data comes from our developers spending months scraping and assembling material from web sites and out of Acrobat documents. Some data requires months of labor to clean or requires combining datasets from different sources in a way that's never been done before.

    In the Data Store you'll find a growing collection of the data we've used in our reporting. For raw, as-is datasets we receive from government sources, you'll find a free download link that simply requires you agree to a simplified version of our Terms of Use. For datasets that are available as downloads from government websites, we've simply linked to the sites to ensure you can quickly get the most up-to-date data.

    For datasets that are the result of significant expenditures of our time and effort, we're charging a reasonable one-time fee: In most cases, it's $200 for journalists and $2,000 for academic researchers.

    I hope it works.

  • Solar time versus standard time around the world

    March 3, 2014  |  Mapping

    How much is time wrong around the world?

    After noticing the late dinner time in Spain, Stefano Maggiolo pointed to relatively late sunsets, by standard time, as one possible reason. Then he mapped the difference between solar time and standard time around the world.

    Looking for other regions of the world having the same peculiarity of Spain, I edited a world map from Wikipedia to show the difference between solar and standard time. It turns out, there are many places where the sun rises and sets late in the day, like in Spain, but not a lot where it is very early (highlighted in red and green in the map, respectively). Most of Russia is heavily red, but mostly in zones with very scarce population; the exception is St. Petersburg, with a discrepancy of two hours, but the effect on time is mitigated by the high latitude. The most extreme example of Spain-like time is western China: the difference reaches three hours against solar time. For example, today the sun rises there at 10:15 and sets at 19:45, and solar noon is at 15:01.
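
    For a rough feel of where numbers like that come from, here is a minimal sketch. It assumes mean solar time only (it ignores the equation of time and daylight saving), and the city longitudes and UTC offsets are approximate values I filled in, not figures from Maggiolo's map. The function names are made up for illustration.

```python
def solar_offset_hours(utc_offset_hours: float, longitude_deg_east: float) -> float:
    """Hours by which clock time runs ahead of mean solar time.

    The earth turns 15 degrees of longitude per hour, so a place's
    "natural" offset from UTC is longitude / 15.
    """
    return utc_offset_hours - longitude_deg_east / 15.0


def clock_time_of_solar_noon(utc_offset_hours: float, longitude_deg_east: float) -> str:
    """Clock time (HH:MM) at which mean solar noon occurs."""
    offset = solar_offset_hours(utc_offset_hours, longitude_deg_east)
    total_minutes = round((12 + offset) * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


# Approximate, assumed values: (UTC offset in hours, longitude in degrees east)
places = {
    "Madrid": (1, -3.7),
    "Kashgar, western China": (8, 76.0),
}

for name, (utc_offset, longitude) in places.items():
    offset = solar_offset_hours(utc_offset, longitude)
    noon = clock_time_of_solar_noon(utc_offset, longitude)
    print(f"{name}: clocks run {offset:.2f} hours ahead of the sun, solar noon near {noon}")
```

    With these inputs, western China comes out close to the three-hour gap mentioned in the quote. The leftover few minutes against the quoted 15:01 are the kind of thing the ignored equation of time would account for.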

  • Why you should buy the bigger pizza

    February 28, 2014  |  Statistical Visualization

    Pizza price

    Because you get more pizza to eat, and if you don't finish it, you'll have breakfast tomorrow. Other than that fine reason, well, it's geometrically the better deal. Planet Money explains with an interactive that shows the price per square inch for 3,678 pizza places across the United States, based on data from Grubhub.

    The math of why bigger pizzas are such a good deal is simple: A pizza is a circle, and the area of a circle increases with the square of the radius.

    So, for example, a 16-inch pizza is actually four times as big as an 8-inch pizza.

    And when you look at thousands of pizza prices from around the U.S., you see that you almost always get a much, much better deal when you buy a bigger pizza.

    You get more pizza, and the business gets more money with minimal extra pizza-making effort. Win-win. Although, keep going on the horizontal axis and I bet that curve starts to curl up. Where can I get a ten-foot pizza?
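
    The geometry is easy to check yourself. This is a small sketch, and the menu prices in it are made up for illustration. They are not from the Grubhub data.

```python
import math


def pizza_area(diameter_in: float) -> float:
    """Area in square inches of a round pizza with the given diameter."""
    radius = diameter_in / 2
    return math.pi * radius ** 2


def price_per_square_inch(price: float, diameter_in: float) -> float:
    """Dollars paid per square inch of pizza."""
    return price / pizza_area(diameter_in)


# Doubling the diameter quadruples the area.
ratio = pizza_area(16) / pizza_area(8)
print(f"16-inch vs. 8-inch area ratio: {ratio:.1f}")

# Hypothetical menu: diameter in inches -> price in dollars.
menu = {8: 8.00, 12: 12.00, 16: 16.00}
for diameter, price in menu.items():
    unit_price = price_per_square_inch(price, diameter)
    print(f"{diameter}-inch at ${price:.2f}: ${unit_price:.3f} per square inch")
```

    Even when the price grows in step with the diameter, as in this made-up menu, the price per square inch keeps dropping because area grows with the square of the radius.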

  • How to Read Histograms and Use Them in R

    February 27, 2014  |  Tutorials

    The histogram is one of my favorite chart types, and for analysis purposes, I probably use it the most. Devised by Karl Pearson (the father of mathematical statistics) in the late 1800s, it's simple geometrically, robust, and allows you to see the distribution of a dataset.

    If you don't understand what's driving the chart though, it can be confusing, which is probably why you don't see it often in general publications.
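
    What drives the chart is just counting: split the range into equal-width bins and tally how many values land in each. Here's a rough sketch of that idea in Python rather than R, with made-up data and a hypothetical helper name, so treat it as an illustration and not the tutorial's code.

```python
import math


def histogram_counts(values, bin_width):
    """Return {left edge of bin: count} for equal-width bins.

    Each bin covers [left, left + bin_width), with edges aligned to
    multiples of bin_width.
    """
    counts = {}
    for value in values:
        left = math.floor(value / bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Made-up heights in inches, for illustration only.
heights = [61, 62, 64, 65, 65, 66, 67, 67, 68, 70, 71, 74]

# Print a text histogram: one bar of '#' per bin.
for left, count in histogram_counts(heights, 5).items():
    print(f"{left:>3}-{left + 5:<3} {'#' * count}")
```

    Change the bin width and the same data can look quite different, which is a big part of why the chart confuses people who don't know what's underneath.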

  • Job Board, February 2014

    February 26, 2014  |  Job Board

    Looking for a job in data science, visualization, or statistics? There are openings on the board.

    Senior Associate Director, Analytics for the University of Chicago in Chicago, Illinois.

    Data Scientist for Thumbtack in San Francisco, California.

    Communications Officer, Measurement and Analysis for the Bill and Melinda Gates Foundation in Seattle, Washington.

    Senior Graphics Editor for The Wall Street Journal in New York, New York.

    Basketball Analyst for the Philadelphia 76ers in Philadelphia, Pennsylvania.

  • Game theory to win game shows

    February 26, 2014  |  Statistics

    I like how a little bit of game theory has crept into Jeopardy! with contestant Arthur Chu. He bounces around the board in search of Daily Doubles and bets to tie in Final Jeopardy. Chu doesn't know much about game theory himself but applies rules promoted by a past contestant.

    The ultimate champion, Ken Jennings, praises Chu on Slate.

    But in fact, plenty of nice white boys on Jeopardy! have been pilloried by viewers for using Arthur Chu's signature technique: bopping around the game board seemingly at whim, rather than choosing the clues from top to bottom, as most contestants do. This is Chu's great crime, the kind of anarchy that hard-core Jeopardy! fans will not countenance. The technique was pioneered in 1985 by a five-time champ named Chuck Forrest, whose law school roommate suggested it. The "Forrest bounce," as fans still call it, kept opponents off balance. He would know ahead of time where the next clue would pop up; they’d be a second slow.

    I don't watch Jeopardy! much, but it's pretty fun to watch Chu dominate.

    Then there's the most recent RadioLab. The first part talks about a game show called Golden Balls and the prisoner's dilemma, and how a guy — who plays and wins game shows for a living — won this one. The whole show is entertaining as usual, but this first part is of particular interest. After listening to that, watch the Golden Balls clip to see how it played out.

  • An exploration of selfies

    February 25, 2014  |  Data Art

    Selfie City

    Selfiecity, from Lev Manovich, Moritz Stefaner, and a small group of analysts and researchers, is a detailed visual exploration of 3,200 selfies from five major cities around the world. The project is both a broad look at demographics and trends and a chance to look closer at the individual observations.

  • Near-real-time global forest watch

    February 24, 2014  |  Mapping

    Global forest watch

    Global Forest Watch uses satellite imagery and other technologies to estimate forest usage, change, and tree cover (among other things). These estimates, and any action based on them, used to be slow. Now they're near-real-time.

    This is about to change with the launch of Global Forest Watch—an online forest monitoring system created by the World Resources Institute, Google and a group of more than 40 partners. Global Forest Watch uses technologies including Google Earth Engine and Google Maps Engine to map the world’s forests with satellite imagery, detect changes in forest cover in near-real-time, and make this information freely available to anyone with Internet access.

    Many layers and high granularity. Take your time with this one.

  • A human-readable explorer for SEC filings

    February 21, 2014  |  Statistical Visualization

    SEC filings

    Maris Jensen just made SEC filings readable by humans. The motivation:

    But in the twenty years since, despite hundreds of millions invested in rounds of contracted EDGAR modernization efforts and interactive data false starts, the SEC's EDGAR has remained almost untouched. In 2014, the SEC is quite literally doing less with SEC filings than their predecessors had planned for 1984. Data tagging is the red-headed stepchild of the Commission -- out of hundreds of forms, only about a dozen are filed as structured data -- and the first program to automate the selection of SEC filings for review, the Division of Economic and Risk Analysis (DERA)'s 'Robocop', has been 'aspirational' for years. The academics in the division responsible for the SEC's interactive data initiatives write papers about information asymmetry, using EDGAR data they repurchase in usable form for millions each year, but do nothing to fix it. Companies are chastised for insufficient and inefficient disclosure, while the SEC fails to help retail investors navigate corporate disclosures at all.

    Look up a company and see their financials, ownership, influences, and board members, among other things typically not so straightforward to look up.

  • Using slime mold to find the best motorway routes

    February 20, 2014  |  Mapping

    This is all sorts of neat. Researchers Andrew Adamatzky and Ramon Alonso-Sanz are using a slime mold, P. polycephalum, to find the most efficient road routes, which could provide guidance on how to rework existing ones. P. polycephalum is a single-celled organism that forages for food through various branches, and when it finds the most efficient food source, it backs away from the others. The video above is a sped-up version of it in action. Adamatzky and Alonso-Sanz put a map underneath.

    We cut agar plates in a shape of Iberian peninsula, place oat flakes at the sites of major urban areas and analyse the foraging network developed. We compare the plasmodial network with principle motorways and also analyse man-made and plasmodium networks in a framework of planar proximity graphs.

    [via infosthetics]

  • Why we think of north pointing up

    February 19, 2014  |  Mapping

    Claudius Ptolemy world map

    Nick Danforth for Al Jazeera delves into the history books for why north is typically on the top of our maps. There's no single reason for it, but Ptolemy might have had something to do with it.

    The north's position was ultimately secured by the beginning of the 16th century, thanks to Ptolemy, with another European discovery that, like the New World, others had known about for quite some time. Ptolemy was a Hellenic cartographer from Egypt whose work in the second century A.D. laid out a systematic approach to mapping the world, complete with intersecting lines of longitude and latitude on a half-eaten-doughnut-shaped projection that reflected the curvature of the earth. The cartographers who made the first big, beautiful maps of the entire world, Old and New — men like Gerardus Mercator, Henricus Martellus Germanus and Martin Waldseemuller — were obsessed with Ptolemy. They turned out copies of Ptolemy's Geography on the newly invented printing press, put his portrait in the corners of their maps and used his writings to fill in places they had never been, even as their own discoveries were revealing the limitations of his work.

    Ptolemy put north on top, although we don't know why he put it there.

  • A visual explanation of conditional probability

    February 18, 2014  |  Statistics

    Conditional probability

    Victor Powell, who has visualized the Central Limit Theorem and Simpson's Paradox, most recently provided a visual explainer for conditional probability.

    Two bars, one blue and one red, represent two events that can happen together or independently of the other. When a ball hits a bar the corresponding event occurs. What is the probability that one event occurs given that the other does and vice versa? If the probability of both events increases and decreases, how does that change the separate probabilities? Sliders and options let you experiment, and the visual and counters change to help you learn.

    A fun one to tinker with.
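
    If you'd rather tinker in code, here is a minimal simulation in the same spirit. It is my own sketch, not Powell's implementation: balls drop at a uniform random position between 0 and 1, each bar is an interval I picked arbitrarily, and an event occurs when a ball lands inside that bar.

```python
import random


def simulate(bar_a, bar_b, n_balls=100_000, seed=0):
    """Drop balls uniformly on [0, 1) and estimate probabilities from counts.

    bar_a and bar_b are (left, right) intervals. Event A occurs when a
    ball lands inside bar_a, and likewise for B.
    """
    rng = random.Random(seed)
    hits_a = hits_b = hits_both = 0
    for _ in range(n_balls):
        x = rng.random()
        in_a = bar_a[0] <= x < bar_a[1]
        in_b = bar_b[0] <= x < bar_b[1]
        hits_a += in_a
        hits_b += in_b
        hits_both += in_a and in_b
    nan = float("nan")
    return {
        "P(A)": hits_a / n_balls,
        "P(B)": hits_b / n_balls,
        "P(A and B)": hits_both / n_balls,
        # Conditional probability: restrict to balls that hit the other bar.
        "P(A|B)": hits_both / hits_b if hits_b else nan,
        "P(B|A)": hits_both / hits_a if hits_a else nan,
    }


# Arbitrary example bars that overlap on [0.5, 0.6).
estimates = simulate(bar_a=(0.2, 0.6), bar_b=(0.5, 0.7))
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

    With these bars, half of B sits under A but only a quarter of A sits over B, so P(A|B) and P(B|A) come out different even though the overlap is the same. Slide the intervals around and watch the two drift apart.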

  • Surviving on minimum wage

    February 17, 2014  |  Infographics

    Surviving on minimum wage

    As most of us know, it's not easy getting by on minimum wage, and in some places it's not possible. The New York Times provides a calculator to see how challenging it can be.

    A simple visual on the right shows dollars made per year, one box per dollar colored green initially and then red to signal debt. It's a good way to make the numbers more relatable. Select a state, enter expenses, and watch dollars disappear. Most likely you'll end up in the red early.

Unless otherwise noted, graphics and words by me are licensed under Creative Commons BY-NC. Contact original authors for everything else.