Autospeak-Straight Talk contains articles covering digital and social media marketing, social communities, and events marketing.


Everything You Know About Social Is Wrong (Statistically Speaking)

(Posted on May 20, 2014 at 11:56AM )
There are a lot of disparate social media statistics out there. Facebook grows while declining. Google+ is bigger than Twitter except it’s not. Pinterest is the best social selling platform… unless Instagram is. Snapchat is the fastest growing social network… unless it’s Tumblr.

Many people want to bombard you with social media statistics that support their opinions and agendas, and you may be persuaded by their confidence. But they’re almost certainly wrong.
The reason that there is so much conflicting data isn’t that there is so much variation in what people are doing. A lot of the misinformation about social media is generated (deliberately or not) by poor statistical practice.

What I want to do in this post is talk about some of the common statistical errors that cause people to draw inaccurate conclusions about who, what, why, where and how people are using social media.

Sample Population
The premise of most social media statistics is to make a generalized statement about the behavior of an entire population of users. It may surprise you to learn that the statistical analysis itself often isn't the cause of poor statistics; more often, the problem is that the sample population isn't representative of the general population.

A good example of this is Stephen Wolfram’s Facebook research from last year. It’s a fascinating study of Facebook connections, but it’s drawn from a population of Wolfram Alpha users (Wolfram Alpha is a computational search engine) who volunteered their Facebook data. To say that a bunch of people using a computational search engine are representative of all social media users is in all likelihood wrong. The fact that people volunteered their data is another source of bias that I’ll touch on when I discuss sampling method.

Another good example of a non-representative population is IBM’s Black Friday and Cyber Monday reports. They are a well-known bellwether of online customer sentiment in the holiday shopping season, yet their data is drawn from 800 retail sites that use IBM’s software.
I mention the Wolfram and IBM studies because I draw from them frequently. They are clearly not representative studies, because their sample populations are different from the general populations they want to represent. Yet I make some assumptions about those populations and draw conclusions anyhow. Specifically, I assume that the typical Wolfram Facebook user is likely less engaged than the typical Facebook user, and conclude that the ties described in the study are at least directionally accurate. I assume that IBM’s software is used diversely enough that the data may have some applicability to similar verticals.

I could be entirely wrong about the assumptions I make about these studies. And these are among the better social media studies (aside from the research by the Pew Research Internet Project, which is oftentimes randomized and controlled). The reason there is inherent bias in so much research about social media is that a statistically significant study that is both randomized and representative is expensive to run.

Sampling method
Sampling error is a huge impediment to the accuracy of all studies. If people volunteer information, that causes sampling bias. If you only ask a certain subset of a population, that causes sampling bias. If you use weighted results and make poor assumptions, that causes sampling bias. If a statistician deliberately decides who gets asked questions, that causes sampling bias.

This is probably most notably seen in Nate Silver’s FiveThirtyEight poll aggregation, which interprets the sampling bias of individual political polls and weights them to create a more accurate meta-analysis.

The gold standard for research is the double-blind, randomized, controlled study. Double-blind means that neither the researcher nor the subject knows ahead of time who gets what treatment (or question). Randomized means that anyone (in a properly representative population) could be chosen, and controlled means that the study is set up to isolate variables and determine causation.

Here’s what you need to know about this: nobody studying social media can afford to do this. If they did, they would tell you (BIG TIME). Between sample population and sampling method, nearly every study about social media is inaccurate to some degree.

The jump to conclusions board
Once in a while someone will controvert something I say about social media by replying that “correlation is not causation.” It’s a concept introduced in Statistics 101: the fact that something happens concurrently with something else doesn’t mean that one causes the other. For instance, there’s a strong correlation between swimsuit purchases and sunburns. Neither causes the other; nicer weather probably causes both.

The ironic thing about social media zealots arguing that correlation isn’t causation is that many social media studies don’t even show correlation. We established that somewhat above, but let’s take it a step further:

Many studies show raw percentages of observed events without reporting the size of the sample or the calculated margin of error. Statistical confidence levels (the most common being 90 percent, 95 percent and 99 percent) communicate how likely it is that the same result would hold if the research were done again. Of course, you probably don’t hear much about confidence levels with most social media studies, because they don’t meet any threshold of accuracy (and because poor data going in makes the point moot anyhow).
Drug research is a good example of causation, where a drug researcher must show positive outcomes with 95 percent confidence before it can be approved for consumer use. That doesn’t mean that 95 percent of people have positive outcomes, just that it is probable that positive outcomes are caused by the drug. And the data has to be good. It’s expensive.
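To make the margin-of-error point concrete, here is a minimal sketch of the standard calculation for a sample proportion. The survey numbers (60 percent of 400 respondents) are hypothetical, and the formula assumes a simple random sample, which, as noted above, most social media studies don't have:

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a sample proportion.

    p_hat: observed proportion, n: sample size,
    z: z-score (1.96 corresponds to 95 percent confidence).
    Assumes a simple random sample -- exactly what most
    social media studies lack.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical survey: 60% of 400 respondents say they use Facebook daily
moe = margin_of_error(0.60, 400)
print(f"+/- {moe:.1%}")  # roughly +/- 4.8 percentage points
```

Notice that the margin shrinks only with the square root of the sample size, which is one reason rigorous studies are expensive: quadrupling the sample only halves the error.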

Correlation, on the other hand, is oftentimes expressed as a correlation coefficient: a number between -1 and 1 that communicates how strongly two variables move together. A value of -1 shows perfect negative correlation (me playing drums would correlate negatively with my wife’s mood, for example), 1 shows perfect positive correlation (me doing dishes would correlate positively with my wife’s mood), and 0 shows no correlation (me changing a diaper has no correlation with the Earth’s rotation).

Correlation coefficients are most notably used in Searchmetrics’ annual/bi-annual ranking-factor research, which assesses different SEO variables to determine what aspects of a web page make for higher search rankings (specifically in Google). The coefficients they report tend to be .30 and below. Since .3 is a fairly weak correlation, you might wonder whether this data is useful at all. It is, but it can lead to some very bad conclusions.

Since almost any unrelated variable (worldwide meatloaf consumption, say) might show weak correlation to search engine rankings, coefficients this small can produce false positives. Facebook shares, for example, show relatively strong correlation to search results even though Google can’t index most of Facebook. And once again, this is one of the best examples of correlation in social media research. Outside of these analyses, you’ll rarely see correlation coefficients reported at all.

Point being that while it’s true that correlation doesn’t imply causation, what some people think is correlation isn’t.

Mean versus median
Seven statisticians are sitting alone in a bar. Each has a net worth, after sizable college debt, of $200,000. Bill Gates strolls into the bar, and the mean net worth of everyone in the bar jumps to about $7 billion. The median net worth is still $200K. That’s the difference between mean and median: the median is the midpoint of all values; the mean is the average of all values.
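The bar scenario above works out like this (Gates' net worth here is a hypothetical round figure chosen to match the "about $7 billion" mean, not a sourced number):

```python
from statistics import mean, median

# Seven statisticians, each worth $200K after college debt,
# plus Bill Gates at an assumed ~$56 billion (hypothetical figure).
net_worths = [200_000] * 7 + [56_000_000_000]

print(f"mean:   ${mean(net_worths):,.0f}")    # about $7 billion
print(f"median: ${median(net_worths):,.0f}")  # still $200,000
```

A single outlier drags the mean four orders of magnitude away from what a typical person in the bar is worth, while the median doesn't budge.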

You’ll see this oftentimes with follower/fan counts or time on site. Using a mean instead of a median is a way to artificially inflate a statistic by letting the biggest outliers (superusers, for example) skew the aggregate.

When I see “average” or “mean” I always make the assumption that the statistic is less fantastic than it purports to be.

Is the metric given what you really want to know?
Most of the big social networks tout a statistic called “monthly active users” (MAU). To be counted, a person must log in at least once in a 30-day window. How important is that statistic? Not at all.

People pay their phone bill once every month. That fact doesn’t tell you how they pay it, how much they pay or who they pay. Point being, the easiest metrics to gather are also the most useless. Because rigorous statistical analysis is expensive, most people serve up the equivalent of MAU. Marketers need to be smart about which statistics are relevant to them and which are just noise.

There is a special place in hell for infographics
Infographics are immensely popular, and many are extraordinarily inaccurate. A good way to sum up what’s wrong with social media statistics is to look at how data is collected in infographics, because amalgamating bad statistics multiplies how poorly informed a reader ends up.

Say an infographic has five data points about a web property.

  • Point one shows an Alexa or Quantcast rating, which is biased in its sample population.
  • Point two shows a statistic from Facebook showing the reach of my Facebook page, based upon monthly active users.
  • Point three shows a demographic from Pew generalized across all websites.
  • Point four shows a study about my vertical, which is derived from a population of users of a particular e-commerce platform.
  • Point five shows an opinion poll of people who volunteered their opinion using a Sodahead widget.
None of these points gives you a great idea of who, what, when, where, why or how people are using this property, and in aggregate they obscure your insight even further. And the infographic is probably constructed with a very specific point of view, which is even more dangerous, because bad statistics are regularly cherry-picked to support specific narratives about products and services.

What I wanted to point out in this piece is that there is inaccuracy in nearly every statistic that you read and see about social media. It’s important for marketers to understand this and to vet these statistics and studies before accepting them as truth. Infographics in particular have a specific agenda and a tendency to mash bad data together to make it worse.

Jim Dougherty is an expert on social media and technology who blogs at Leaders West. For more marketing advice from Jim, click here.

Image: Jan Willem-Reusink, John Lester, Kathleen Deggelman (Creative Commons)