Avoiding Data Sampling Places Quality Over Quantity

Data sampling is widespread in analytics and is standard practice for several of the major players. It’s important to be aware of the potential dangers of sampling and the negative financial impact it can have on your organization.

What is data sampling in analytics?

Data sampling is the practice of analyzing a subset of your traffic data and using it to estimate the overall results. Instead of gathering all the data, you only get access to a limited sample, meaning that any analysis you carry out afterwards is an estimate extrapolated from the patterns in that sample.

The aim of data sampling is to speed up reporting time while still being able to uncover all the meaningful and valuable information in the larger data set.

How does data sampling work?

Data sampling methods are split into two categories, probability and non-probability sampling:

Probability sampling is where random samples are chosen from a larger population using various statistical methods (including stratified, systematic, multistage and cluster sampling). By selecting random numbers that correspond to points (users) in a data set, you ensure that everyone in your set has an equal chance of being selected.
Probability sampling allows you to obtain a sample that is broadly representative of the population. It can remove certain systematic errors and sampling bias and tends to be more reliable.
Non-probability sampling is where a data sample is specifically chosen by an analyst. This removes the randomization and means that some points of the population have no chance of being selected.
Non-probability sampling means you have a lower chance of creating a sample that accurately represents the larger population. However, it is far less complex than probability sampling, as well as being faster and cheaper.
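
To make the difference concrete, here is a minimal Python sketch. The visit data and the function names are made up for illustration. It compares a simple random sample (one form of probability sampling) with a convenience sample that just takes the first records available (one form of non-probability sampling).

```python
import random

def simple_random_sample(population, k, seed=None):
    """Probability sampling: every element has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def convenience_sample(population, k):
    """Non-probability sampling: simply take the first k elements available."""
    return population[:k]

def mobile_share(sample):
    """Share of visits in the sample that came from a mobile device."""
    return sample.count("mobile") / len(sample)

# Hypothetical data: 1,000 visits, the first 800 on desktop, the last 200 on mobile.
visits = ["desktop"] * 800 + ["mobile"] * 200

random_share = mobile_share(simple_random_sample(visits, 100, seed=42))
convenience_share = mobile_share(convenience_sample(visits, 100))

print(f"true mobile share:  {mobile_share(visits):.2f}")
print(f"random sample:      {random_share:.2f}")
print(f"convenience sample: {convenience_share:.2f}")
```

In this contrived data set the convenience sample contains no mobile visits at all, while the random sample will usually land somewhere near the true 20% share.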

The free version of Google Analytics uses probability sampling, and your data is aggregated and delivered to you as a random data set. This means that the standard reports it provides, including the Audience, Acquisition, Behavior and Conversion reports, are all based on sampled data. GA data is also sampled when you create a custom report. It’s impossible to be sure whether your reports reflect your overall traffic and any meaningful trends, or whether the selected set is giving you accurate information. Downstream, this lack of visibility hinders decision-making and has a direct impact on business efficiency, especially for larger organizations. This is also why Google encourages users to upgrade to its premium offer.

What are the limits of data sampling?

1. Representative samples

In statistics, the standard rule is that whenever a population of behavioral data is studied, the sample must be representative. If you limit that sample, you might not see real patterns, because the figures you are shown are already estimates, and you could miss opportunities you would have noticed with the whole picture.

Example: If your site generates 50 million hits on average per month and 50,000 visits a day, sampling can limit you to 10 million hits per month and 10,000 visits a day or less. This makes it impossible to obtain a decent representation of all the data, and the more your website grows, the more inaccurate your reports will become.

2. Limited sample quota

Sampling also doesn’t account for cumulative data, as the sample is different every day. This means that cumulative results are not displayed for the month, quarter or year. Here are a couple of practical examples:

Example 1: Data collection cutoff once you have reached your sample quota

Imagine your production department releases updates, including flash offers, on Wednesday and Friday at 5pm. On Wednesday, if you reach your sample quota at 6pm, your updates will only partly be taken into consideration. On Friday, if you reach your quota at 4pm, your updates will not be considered at all, even though the behavior of visitors who reach your site at 5pm differs considerably from that of visitors who arrive at 4pm.
If you then publish your sales newsletter on a Tuesday morning, it will be impossible to compare Tuesday’s sample (quota reached at 11am) with Wednesday’s or Friday’s sample, or to add them together. You simply won’t be able to draw any meaningful insight from:

three different populations who make different requests;
who are incited by completely different things;
and who represent a different share of the audience on the reference day.
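
The cutoff effect can be sketched in a few lines of Python. Everything here is an assumption made for illustration: a flat 1,000 hits per hour, and daily quotas picked so that collection stops during the 6pm hour on Wednesday and the 4pm hour on Friday. The function name is hypothetical too.

```python
def collect_until_quota(hourly_hits, quota):
    """Keep hits hour by hour until the quota is used up.

    Returns (hits kept per hour, hour in which the quota ran out or None).
    """
    kept = []
    remaining = quota
    cutoff_hour = None
    for hour, hits in enumerate(hourly_hits):
        taken = min(hits, remaining)
        kept.append(taken)
        remaining -= taken
        if remaining == 0 and cutoff_hour is None:
            cutoff_hour = hour
    return kept, cutoff_hour

# Hypothetical traffic: a flat 1,000 hits per hour over 24 hours.
hourly_hits = [1_000] * 24
RELEASE_HOUR = 17  # updates go live at 5pm

# Hypothetical quotas, chosen so the cutoff falls at 6pm and at 4pm.
for day, quota in {"Wednesday": 18_500, "Friday": 16_500}.items():
    kept, cutoff = collect_until_quota(hourly_hits, quota)
    share = sum(kept[RELEASE_HOUR:]) / sum(hourly_hits[RELEASE_HOUR:])
    print(f"{day}: quota reached during hour {cutoff}, "
          f"{share:.0%} of post-release hits collected")
```

With these made-up numbers, only a fraction of Wednesday’s post-release traffic is collected and none of Friday’s, so the two days cannot be compared on equal terms.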

This can also apply to the total number of cumulative hits over a given period. For example, if in November you only retain 10 million hits out of 20 million and in December only 10 million hits out of 100 million, the 20 million hits retained are clearly not representative of the total of 120 million. Nor is it possible to derive a meaningful average number of hits.
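
As a quick check of the arithmetic, the short Python sketch below uses the November and December figures from the example. The variable names are illustrative only.

```python
# Figures from the example above: hits retained vs. hits actually generated.
retained = {"November": 10_000_000, "December": 10_000_000}
actual = {"November": 20_000_000, "December": 100_000_000}

# Share of each month's real traffic that made it into the sample.
monthly_share = {m: retained[m] / actual[m] for m in retained}
for month, share in monthly_share.items():
    print(f"{month}: {share:.0%} of hits retained")

# Share of the combined traffic that made it into the combined sample.
combined_share = sum(retained.values()) / sum(actual.values())
print(f"Combined: {combined_share:.1%} of hits retained")

# December's weight in the real traffic vs. its weight in the retained sample.
december_real_weight = actual["December"] / sum(actual.values())
december_sample_weight = retained["December"] / sum(retained.values())
print(f"December weight: {december_real_weight:.0%} of real traffic, "
      f"{december_sample_weight:.0%} of the sample")
```

November is retained at 50% and December at only 10%, so December makes up half of the combined sample even though it accounts for most of the real traffic.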

Example 2: Using a percentage of sampled data

Now imagine your history shows 14 million hits and 360,000 visits in a typical month, and that staying within your sampling quota means collecting only 70% of that data. Seasonal variations make the effect more pronounced. If December traffic is twice that of any other month, a 70% collection rate would exceed the quota, so the rate is cut to 35%, and no further data is collected once that 35% limit has been reached. Conversely, if February is a weak month (half of a normal month), there is no point in sampling at all, since the real volume is below the quota.
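
The percentages in this example can be reproduced with a small Python sketch. The quota of 9.8 million hits is an assumption, chosen only because it gives a 70% rate on a 14-million-hit month. The function name is hypothetical as well.

```python
def collection_rate(expected_hits, quota_hits):
    """Fraction of hits that can be collected without exceeding the quota."""
    return min(1.0, quota_hits / expected_hits)

NORMAL_MONTH = 14_000_000  # hits in a typical month, from the example
QUOTA = 9_800_000          # assumed quota: 70% of a typical month

rates = {
    "Normal month": collection_rate(NORMAL_MONTH, QUOTA),
    "December (x2)": collection_rate(2 * NORMAL_MONTH, QUOTA),
    "February (x0.5)": collection_rate(NORMAL_MONTH // 2, QUOTA),
}

for label, rate in rates.items():
    print(f"{label}: collect {rate:.0%} of hits")
```

The same fixed quota gives a 70% rate in a normal month, 35% in December and 100% in February, so the share of traffic you actually see changes with the season.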

The importance of a comprehensive data set

Your analytics solution should be able to collect and measure every single interaction a user has with your digital platforms, at any moment, all the time. And during periods of heavy traffic requiring strategic analysis (such as during sales or major events), it’s even more critical that your solution can capture all data, without missing a beat.

Let’s say you’re running an important promotion and your campaign includes TV spots to drive traffic to your website. In the minutes following your ad broadcast, your site receives a huge surge in visits, but your analytics solution’s collection server can’t handle the traffic volume and fails. Not only are you missing a big piece of data, but that piece happens to be a crucial one, because it indicates whether your TV ad is driving the intended results and how well it’s generating ROI. You now have an incomplete and, therefore, inaccurate view of your campaign performance because part of the data was never collected.

Your data must be complete and rich enough to answer very specific questions from all different departments of your company, such as:

How did different campaigns perform in a set location and month?
How about particular products?
How did sales for a particular product compare between smartphone users and desktop users?

If your data lacks a certain layer of information, such as geolocation data or information about the device used, you’re missing out on a valuable piece of the picture.

Which solutions can you use to avoid data sampling?

Using small, sampled data sets can significantly undermine decision-making within your organization. Although sampled data can highlight general trends, the smaller your sample, the less representative it is of the truth. This is particularly the case when carrying out granular analysis on small, sampled data sets.
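
One way to see why granular analysis suffers is to look at the margin of error on a rate measured from a sample. The Python sketch below uses the standard normal-approximation formula for a proportion. It assumes a simple random sample, and the 2% conversion rate and the sample sizes are hypothetical.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p measured on n items.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which assumes
    a simple random sample.
    """
    return z * math.sqrt(p * (1 - p) / n)

CONVERSION_RATE = 0.02  # hypothetical 2% conversion rate

# Hypothetical sample sizes: a whole site, one campaign, one small segment.
for n in (1_000_000, 10_000, 500):
    moe = margin_of_error(CONVERSION_RATE, n)
    print(f"n = {n:>9,}: {CONVERSION_RATE:.1%} +/- {moe:.2%}")
```

The margin of error grows as the sample shrinks, so a figure that is precise at site level can become too noisy to act on once you drill down to a small segment of sampled data.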

In order for your data-driven decisions to be truly accurate, they must be based on data that is complete, comprehensive and sufficiently rich. Your analytics tool must therefore collect all necessary data, and also provide the right processing and enrichments that will enable you to translate this data into action. When data is missing or corrupt, you risk making strategic decisions based on skewed information that doesn’t fully reflect reality.

Piano doesn’t use any sampled data whatsoever, which allows you to act in the confidence that your decisions are based on complete, reliable and accurate information.

The five criteria of comprehensive data

Zero data sampling: Sampled data may highlight general trends, but the smaller the samples, the less representative they are of reality.
Data control procedures: As an integral part of good data governance, regular procedures (e.g., automatic testing) allow you to check for the presence of all tags.
Complete audits: These are essential, especially after a major change to your site and/or apps.
Service level agreements (SLAs): Your web analytics provider is contractually obligated to guarantee you a data collection rate close to 100%.
Domain-first measurements: A collection solution that uses your own domain name lets you recover traffic blocked by ad blockers or ITP (Intelligent Tracking Prevention).
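
As an illustration of the automatic testing mentioned under data control procedures, here is a minimal Python sketch that checks whether a set of pages includes an analytics tag. The tag URL, the page contents and the function names are all hypothetical. A real check would fetch your live pages and look for your provider’s actual tag.

```python
from html.parser import HTMLParser

class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def has_tag(html, marker):
    """True if any script on the page loads a URL containing the marker."""
    parser = ScriptSrcCollector()
    parser.feed(html)
    return any(marker in src for src in parser.sources)

def pages_missing_tag(pages, marker):
    """Return the URLs of pages where the tag was not found."""
    return [url for url, html in pages.items() if not has_tag(html, marker)]

# Hypothetical tag URL and page contents, for illustration only.
TAG_MARKER = "analytics.example.com/tag.js"
pages = {
    "/home": '<html><head><script src="https://analytics.example.com/tag.js"></script></head></html>',
    "/checkout": '<html><head><script src="/static/app.js"></script></head></html>',
}

print(pages_missing_tag(pages, TAG_MARKER))
```

Run on a schedule or after each release, a check along these lines flags pages that have silently stopped sending data.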