Data Science should be about more than data

There are lots of “top 10 skills” lists for data science and analytics. The ones I’ve seen are all missing something huge.

Here’s an example:

Business Broadway – Top 10 Skills in Data Science

Modeling barely appears here. Almost all the items concern the collection and analysis of data (no surprise there). Just imagine for a moment what it would be like if science consisted purely of observation, with no theorizing.

What are you doing with all those data points and the algorithms that sift through them? At some point, you have to understand whether the relationships that emerge from your data make any sense and answer relevant questions. For that, you need ways of thinking and talking about the structure of the phenomena you’re looking at and the problems you’re trying to solve.

I’d argue that one’s literacy in data science is greatly enhanced by knowledge of mathematical modeling and simulation. That could be system dynamics, control theory, physics, economics, discrete event simulation, agent-based modeling, or something similar. The exact discipline probably doesn’t matter, so long as you learn to formalize operational thinking about a problem, and pick up some good habits (like balancing units) along the way.

Snow is Normal in Montana

In this case, I think it’s quite literally Normal a.k.a. Gaussian:

Normally distributed snow

Here’s what I think is happening. On windless days with powder, the snow dribbles off the edge of the roof (just above the center of the hump). Flakes drift down in a random walk. The railing terminates the walk after about four feet, by which time the distribution of flake positions has already reached the Normal you’d expect from the Central Limit Theorem.
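For anyone who wants to poke at the idea without waiting for a snowstorm, here is a minimal sketch of the random-walk story in Python. It is not a measurement of my railing: the step size, number of steps, and number of flakes are all made-up assumptions, and each flake simply takes a series of small uniform horizontal nudges before it lands.

```python
import random
import statistics


def simulate_flakes(n_flakes=5000, n_steps=100, seed=0):
    """Drop n_flakes; each takes n_steps small random horizontal nudges."""
    rng = random.Random(seed)
    positions = []
    for _ in range(n_flakes):
        x = 0.0  # every flake starts at the roof edge
        for _ in range(n_steps):
            # Hypothetical nudge, uniform in [-1, 1], arbitrary units.
            x += rng.uniform(-1.0, 1.0)
        positions.append(x)
    return positions


def summarize(positions):
    """Return mean, standard deviation, and the share within one sd of the mean."""
    mean = statistics.fmean(positions)
    sd = statistics.pstdev(positions)
    within_1sd = sum(abs(x - mean) <= sd for x in positions) / len(positions)
    return mean, sd, within_1sd


if __name__ == "__main__":
    mean, sd, within_1sd = summarize(simulate_flakes())
    print(f"mean={mean:.2f}  sd={sd:.2f}  share within 1 sd={within_1sd:.3f}")
```

If the Central Limit Theorem is doing its job, the standard deviation should land near the square root of (steps / 3), since a uniform nudge on [-1, 1] has variance 1/3, and roughly 68% of the flakes should fall within one standard deviation of the center.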

Enough of the geek stuff; I think I’ll go ski the field.

Rats leaving a sinking Sears

Sears Roebuck & Co. was a big part of my extended family at one time. My wife’s grandfather started in the mail room and worked his way up to executive, through the introduction of computers and the firebombing in Caracas. Sadly, its demise appears imminent.

Business Insider has an interesting article on the dynamics of Sears’ decline. Here’s a quick causal loop diagram summarizing some of the many positive feedbacks that once drove growth, but now are vicious cycles:

sears_rats_sinking_ships_corr

h/t @johnrodat

CLD corrected, 1/9/17.

Remembering Jay Forrester

I’m sad to report that Jay Forrester, pioneer in servo control, digital computing, System Dynamics, global modeling, and education has passed away at the age of 98.

forresterred

I’ve only begun to think about the ways Jay influenced my life, but digging through the archives here I ran across a nice short video clip on Jay’s hope for the future. Jay sounds as prescient as ever, given recent events:

“The coming century, I think, will be dominated by major social, political turmoil. And it will result primarily because people are doing what they think they should do, but do not realize that what they’re doing are causing these problems. So, I think the hope for this coming century is to develop a sufficiently large percentage of the population that have true insight into the nature of the complex systems within which they live.”

I delve into the roots of this thought in Election Reflection (2010).

Here’s a sampling of other Forrester ideas from these pages:

The Law of Attraction

Forrester on the Financial Crisis

Self-generated seasonal cycles

Deeper Lessons

Servo-chicken

Models

Market Growth

Urban Dynamics

Industrial Dynamics

World Dynamics

A textbook death spiral

NPR has a nice article on self-regulation in the textbook industry. It turns out that textbook prices are up almost 100% from 2002, yet student spending on texts is nearly flat. (See the article for concise data.)

Here’s part of the structure that explains the data:

Starting with a price increase, students have a lot of options: they can manage textbooks more intensively (e.g., sharing, brown), they can simply choose to use fewer (substitution, blue), they can adopt alternatives that emerge after a delay (red), and they can extend the life of a given text by being quick to sell it back, or an agent can do that on their behalf by creating a rental fleet (green).

All of these options help students to hold spending to a desired level, but they have the unintended effect of triggering a variant of the utility death spiral. As unit sales (purchasing) fall, the unit cost of producing textbooks rises, due to the high fixed costs of developing and publishing the materials. That drives up prices, prompting further reductions in purchasing – a vicious cycle.
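The core loop is simple enough to sketch in a few lines of Python. This is a toy, not a calibrated model: the budget, fixed cost, variable cost, and markup below are invented numbers, and I assume students hold spending exactly at a budget while the publisher reprices each year at a markup over last year's unit cost.

```python
def simulate_textbooks(budget=100.0, fixed_cost=40.0, variable_cost=20.0,
                       markup=1.2, initial_price=30.0, years=15):
    """Iterate the price -> purchasing -> unit cost -> price loop.

    All parameter values are hypothetical. Returns a list of
    (year, price, units, spending) tuples.
    """
    price = initial_price
    history = []
    for year in range(years):
        units = budget / price                      # students hold spending at the budget
        unit_cost = fixed_cost / units + variable_cost  # fixed costs spread over fewer units
        history.append((year, price, units, price * units))
        price = markup * unit_cost                  # publisher reprices for next year
    return history


if __name__ == "__main__":
    for year, price, units, spending in simulate_textbooks():
        print(f"year {year:2d}  price {price:7.2f}  units {units:5.2f}  spending {spending:6.2f}")
```

With these made-up numbers the price climbs from 30 toward about 46 while spending stays flat, which is the qualitative pattern in the data. The loop gain here is markup × fixed_cost / budget = 0.48, so the spiral levels off; push fixed_cost up to 90 and the gain exceeds 1, at which point the price runs away.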

This isn’t quite the whole story – there’s more to the supply side to think about. If publishers are facing a margin squeeze from rising costs, are they offering fewer titles, for example? I leave that as an exercise.

Doing our bit for the cure … and the cause

I have a soft spot for breast cancer research, but I have to admit that it seemed a little silly when I started getting hay with pink baling twine.

But now it seems the Susan G. Komen foundation for breast cancer has really jumped the shark, with pink drill bits from oilfield service company Baker Hughes. Funding cancer care with revenue derived in part from pumping carcinogens into the ground, providing pinkwash for that practice, seems like rather unsystemic thinking. What’s next, pink cigarettes?

Not so fast?

Maybe Baker Hughes is deriving some enlightenment from the relationship. In a less-noticed bit of news:

As part of our ongoing commitment, we have adopted a new policy with respect to the information that we provide about the chemistry contained within our hydraulic fracturing fluid systems. Beginning October 1, 2014, Baker Hughes will provide a complete, detailed, and public listing of all chemical constituents for all wells that the company fractures using its hydraulic fracturing fluid products.

The end is here

Facebook is down.

Runaway positive feedback is the culprit:

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.

The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.

It’s faintly ironic, since positive feedback of a different sort is responsible for Facebook’s success.
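Here is a minimal sketch of that kind of loop, in Python. It is my own caricature, not Facebook's system: I assume a fixed database capacity, a steady base load, and a hypothetical "gain" that says how many extra queries each failed query generates in the next period (via the deleted cache keys).

```python
def simulate_load(gain, base_load=80.0, capacity=100.0, spike=200.0, steps=10):
    """Track query load after a transient spike.

    Every number here is a hypothetical illustration. Queries beyond
    capacity fail, and each failure spawns `gain` extra queries next period.
    """
    load = spike
    history = [load]
    for _ in range(steps):
        failures = max(0.0, load - capacity)  # queries the databases can't serve
        load = base_load + gain * failures    # failures feed back into demand
        history.append(load)
    return history


if __name__ == "__main__":
    print("gain 0.5:", simulate_load(0.5))
    print("gain 1.5:", simulate_load(1.5))
```

With a gain below 1 the spike dies out and load settles back to the base level. With a gain above 1 and a big enough spike, load grows every period even though the base load is well under capacity, which is why the only way out is to cut traffic.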

The dynamics of UFO sightings

The Economist reports on UFO sightings:

UFOdata

This deserves a model:

UFOs

UFOs.vpm (Vensim published model, requires Pro/DSS or the free Reader)

The model is a mixed discrete/continuous simulation of an individual sleeping, working and drinking. This started out as a multi-agent model, but I realized along the way that sleeping, working and drinking is a fairly ergodic process on long time scales (at least with respect to UFOs), so one individual with a distribution of behaviors over time or simulations is as good as a population of agents.
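The ergodicity shortcut can be illustrated with a toy in Python. To be clear, this is not the Vensim model above: it is a two-state Markov chain with invented probabilities, in which whether someone drinks tonight depends only on whether they drank last night. The point is just that one individual followed for a long time gives about the same answer as a crowd sampled on a single night.

```python
import random

# Hypothetical transition probabilities, chosen only for illustration.
P_DRINK_AFTER_SOBER = 0.3
P_DRINK_AFTER_DRINKING = 0.1


def next_night(drank_last_night, rng):
    """Return True if the individual drinks tonight."""
    p = P_DRINK_AFTER_DRINKING if drank_last_night else P_DRINK_AFTER_SOBER
    return rng.random() < p


def time_average(n_nights, rng):
    """Share of drinking nights for ONE individual followed over many nights."""
    drank, total = False, 0
    for _ in range(n_nights):
        drank = next_night(drank, rng)
        total += drank
    return total / n_nights


def ensemble_average(n_agents, burn_in, rng):
    """Share of MANY individuals drinking on a single night, after a burn-in."""
    total = 0
    for _ in range(n_agents):
        drank = False
        for _ in range(burn_in):
            drank = next_night(drank, rng)
        total += drank
    return total / n_agents


if __name__ == "__main__":
    rng = random.Random(1)
    print("one person, many nights:", time_average(20000, rng))
    print("many people, one night: ", ensemble_average(5000, 20, rng))
```

Both estimates should hover around the chain's stationary value of 0.25 (that is, 0.3 / (0.3 + 0.9)). The shortcut breaks down as soon as individuals influence each other, which is the contagion case.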

The model replicates the data somewhat faithfully:

UFOdistribution

The model shows a morning peak (people awake but out and about) and a workday dip (inside, lurking near the water cooler), but the data do not. This suggests to me that:

  • Alcohol is the dominant factor in sightings.
  • I don’t party nearly enough to see a UFO.

Actually, now that I’ve built this version, I think the interesting model would have a longer time horizon, to address the non-ergodic part: contagion of sightings across individuals.

h/t Andreas Größler.