Aggregation is the combination of datasets to create a set that's greater than the sum of its parts (often referred to as 'big data'). This is where exponential possibilities begin, but so does exponential complexity.
In the driverless car example, the aggregation stage begins when the car arrives home and syncs, uploading its raw data to the manufacturer's server. How will the manufacturer combine data of so many different types, from so many different sources?
'Big data' gets a lot of media attention, but what is it really? Big data is the aggregation of data points into large datasets, followed by analyzing those datasets to find patterns. It's called 'big' because this strategy involves merging many different kinds and sources of data to run machine learning processes that were not possible before.
The technology of big data covers the collection of datasets and the taxonomies for organizing and storing them, as well as the analysis of those datasets.
Big data's default strategy (or mental model) is to bring together as many data points as possible to help the company do things better, faster, and/or cheaper.
It's important to distinguish the mental model from the technology model. You can use big datasets for more than just optimizing a business or organization—data at that scale can create fundamentally new forms of value (we'll explore this in other parts of the Digital Fluency Guide).
Having the technologies of big data at your company doesn't guarantee that you'll be able to create that kind of value; you also need the right thinking in place.
Here, we use 'little data' to mean specific data points about an individual. Little data's technology is often the same as big data's but sometimes includes additional, specialized tools to build profiles about people, called social graphs, that represent and organize the many facets of their identity, behavior, and social networks.
Little data can be used for 'creepy' purposes like highly targeted ad positioning. This often creates an 'uncanny valley' effect, where something is happening in a machine system that isn't clear to users, such as insights into a user's interests that show up in ads that are too targeted. This is why people sometimes say, "I think Facebook is listening to my microphone!" when actually Facebook is using past browsing behavior to predict a user's future needs. It's essential to clarify how predictions are happening and to give users an opportunity to provide feedback when analysis of little data is wrong or non-consensual.
Little data strategies can also be used in ways that better serve the original discloser of the data, though. For example, when an app or organization feeds information back to a user about themselves, and the user consents to data collection for that purpose, it creates value for the user with their own data, rather than only using it to sell more products or services.
A good example of a little data strategy is the Apple Watch. Apple could focus on providing more targeted and/or intrusive ad capabilities to developers, but instead it emphasizes using the watch for health, pairing insights about simple steps users can take to care for themselves with gentle encouragement. Providing insight to users based on their own activity can earn a high degree of loyalty.
Other examples of benign and mutually beneficial little data strategies include the use of recommendation engines (which you can read more about in the Analyze and Use stages) to provide insight into new content or social connections users might be interested in.
You can read more about little data in this Harvard Business Review article by Mark Bonchek, or this summary.
This "Dear Apple" video shows real users of the Apple Watch who have written to Apple to share how the device has changed their lives. Each user had a positive experience based on little data—the data about them as an individual. Watch it to experience what little data feels like versus the more generic strategies of big data.
One of the most useful metaphors when thinking about aggregating data is the concept of data rivers and data lakes.
Data rivers or streams are flows of data from a lot of different places. You can also think of them as pipelines. Data lakes are where data from rivers or streams gather, so that it can be accessed more easily and analyzed as a whole.
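To make the metaphor a bit more concrete, here is a minimal sketch in Python. The stream names, record fields, and file path are invented for illustration, and a real pipeline would use dedicated streaming and storage tools rather than a local file:

```python
# A minimal sketch of the river/lake metaphor, assuming simple in-memory
# "streams" of records. Real pipelines would use dedicated streaming and
# storage infrastructure; none of that is shown here.
import json
import time

def flow_into_lake(streams, lake_path="data_lake.jsonl"):
    """Append records from several named streams into one shared store,
    tagging each record with its source and arrival time."""
    with open(lake_path, "a") as lake:
        for source_name, stream in streams.items():
            for record in stream:
                record["_source"] = source_name        # where it came from
                record["_ingested_at"] = time.time()   # when it arrived
                lake.write(json.dumps(record) + "\n")

# Hypothetical streams: telemetry from a car and events from a mobile app.
flow_into_lake({
    "car_telemetry": [{"speed_kph": 42}, {"speed_kph": 57}],
    "mobile_app":    [{"screen": "home", "duration_s": 12}],
})
```

Because everything lands in one place with a source tag, the lake can later be analyzed as a whole.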
One tactic is to save every last bit of data you have access to, so you can run analysis on it in the future as the organization's thinking or analytics technologies catch up—but this approach can be quite costly and hard to justify. Some organizations focus on retaining only the data they absolutely need, keeping costs low and letting others do the work of gathering and warehousing data if they're not focused on becoming a data company themselves.
Take, for example, the ‘social graph’ that exists on the backend of Facebook for each user. It's an aggregation of many data points: location information from your phone, message history, comments, likes, photos, articles you've shared, and even things you've done in third-party apps (that you logged into with Facebook) are all brought together and carefully cross-referenced. Data lakes can get a little creepy.
Facebook can also run future analyses on these data lakes, even if they don't know what questions they will ask of the data when they first collect or store it. This is because they have very robust metadata, which allows them to ask new questions about old data more easily.
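As a rough illustration of what 'robust metadata' might look like, here is a hedged sketch; every field name and value below is hypothetical rather than Facebook's actual schema:

```python
# A sketch of storing data points with enough metadata that questions no one
# imagined at collection time can still be asked later. All field names and
# values are hypothetical.
from dataclasses import dataclass

@dataclass
class DataPoint:
    user_id: str         # unique identifier for the person
    value: dict          # the raw observation itself
    source: str          # e.g. "mobile_app", "third_party_login"
    collected_at: str    # ISO 8601 timestamp
    schema_version: int  # lets old and new records be interpreted correctly
    consent_scope: str   # what the user agreed this data could be used for

points = [
    DataPoint("u1", {"liked_page": "cycling"}, "mobile_app",
              "2021-03-02T10:00:00Z", 2, "ads"),
    DataPoint("u1", {"checkin": "cafe"}, "location_history",
              "2021-05-11T08:30:00Z", 2, "research"),
]

# A question asked long after collection: which records may be used for
# research, and from which sources did they come?
research_ok = [p for p in points if p.consent_scope == "research"]
print({p.source for p in research_ok})
```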
There is a business consideration to storing data indefinitely: maintaining these data lakes or subscribing to these data rivers costs a lot of money. That's why big data companies were often originally funded by venture capital—it takes a long time to gather a critical mass of data and even longer to find (and sell) use cases for it. Furthermore, it is not always ethical or legal to gather huge swaths of data or store it indefinitely in some central cloud, for reasons of data sovereignty mentioned earlier.
Data hygiene is the practice of checking, correcting, labeling, and normalizing data. Common activities include verifying accuracy, standardizing formats, de-duplicating records, and correcting errors.
Data hygiene is important in each stage of the data supply chain, but especially in aggregation. Are you sure the data is accurate? Is it standardized or normalized in some way that permits comparison to other datasets? Have you created data catalogs that let you know what datasets are available inside your organization or from partners, and are there taxonomies that define how that data fits together?
For example, what is the exact format of the date and time? Are names stored in the same format in each dataset?
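As a sketch of what that normalization can look like in practice (the accepted date formats and the name-casing rule below are assumptions, not a standard):

```python
# A minimal sketch of format normalization, assuming dates arrive in a few
# known formats and names should be compared in a consistent casing.
from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y %H:%M"]  # assumed inputs

def normalize_date(raw):
    """Convert a date string in any known format to ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def normalize_name(raw):
    """Trim extra whitespace and standardize casing."""
    return " ".join(raw.split()).title()

print(normalize_date("03/11/2021"))        # -> '2021-11-03'
print(normalize_name("  SMITH,   jane "))  # -> 'Smith, Jane'
```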
One key element of every data aggregation effort is the establishment of unique identifiers. While a simple name may be enough in a small business context, an email ID, phone number, or tax identification number might be needed in a larger dataset. The more complex the data, the more specific the taxonomy needs to be. While it’s tempting to avoid tedious taxonomy discussions, it’s important to plan for them as you go, or you might end up with a mountain of un-sortable, un-verifiable data.
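Here is a minimal sketch of that idea, assuming a normalized email address is a good-enough unique identifier for the datasets involved (it often isn't, which is exactly why those taxonomy discussions matter):

```python
# A sketch of merging records about the same person from two hypothetical
# datasets, keyed on a normalized email address.
def merge_by_email(*datasets):
    merged = {}
    for dataset in datasets:
        for record in dataset:
            key = record["email"].strip().lower()  # normalize the identifier
            clean = {**record, "email": key}       # keep the normalized form
            merged.setdefault(key, {}).update(clean)
    return merged

crm = [{"email": "Jane@Example.com", "name": "Jane Smith"}]
support = [{"email": "jane@example.com ", "open_tickets": 2}]

print(merge_by_email(crm, support))
# {'jane@example.com': {'email': 'jane@example.com', 'name': 'Jane Smith',
#  'open_tickets': 2}}
```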
It's also necessary to correct for errors. Error correction of data is a topic worthy of its own guide, and you can partner with skilled data scientists, who have many strategies for error identification and mitigation.
Data taxonomies are classification systems for your data. They allow you to provide specific categories for each record within your dataset. Examples of taxonomies include the Dewey Decimal System used to organize topics in libraries and research, the North American Industry Classification System (NAICS), or the World Health Organization's International Classification of Diseases (ICD).
A well-designed taxonomy helps you and your organization rigorously track what data you have, or could have, and also helps organize your metadata (data about data).
However, taxonomies are often designed around our current way of thinking. Because of this, we might miss categories or attributes of data because we can't see all future use cases, or we might introduce biases (like categories of 'race'). For this reason, and to discover new trends, datasets are often also organized using something called a 'folksonomy', which allows topics and classifications to emerge organically from a group of people interacting with the data.
An example of a folksonomy is using tags (or #hashtags) to indicate that particular pieces of content relate to trends or specific users. A well-designed approach to data has room for both comprehensive taxonomies as well as emergent folksonomies.
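A small sketch of how the two can coexist in one dataset; the category codes, titles, and tags below are invented for illustration:

```python
# A sketch contrasting a fixed taxonomy (predefined category codes) with an
# emergent folksonomy (free-form tags applied by users).
from collections import Counter

TAXONOMY = {
    "541511": "Custom Computer Programming Services",  # NAICS-style codes
    "722515": "Snack and Nonalcoholic Beverage Bars",
}

records = [
    {"title": "Espresso bar opens downtown", "category": "722515",
     "tags": ["coffee", "smallbusiness", "downtown"]},
    {"title": "New app studio hiring", "category": "541511",
     "tags": ["hiring", "coffee", "remote"]},
]

# Taxonomy: every record fits a predefined category.
for r in records:
    print(TAXONOMY[r["category"]], "-", r["title"])

# Folksonomy: trends emerge from whatever tags people actually used.
print(Counter(tag for r in records for tag in r["tags"]).most_common(3))
# 'coffee' surfaces as the most common tag, even though no taxonomy
# category anticipated it.
```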
One of the challenges of combining datasets is that researchers or other data collectors often attempt to anonymize the data before sharing or selling it. This makes it quite difficult to integrate various datasets without accidentally introducing duplicates. At the same time, the use of unique identifiers to avoid duplicate data can sometimes increase the risk of de-anonymizing the data.
It's essential to carefully consider the best practices for anonymizing data at every point in the supply chain so far, so you don't inadvertently include biases or expose the original discloser of data to harm.
Many organizations set data ethics principles which include standards for anonymization and prevention of de-anonymization. You can read more about informed consent and doing no harm in our article here.
There is a critical opportunity at this stage for us to attempt to anonymize or de-identify data before it's stored permanently, but that’s much harder than it sounds. Going back to the autonomous car example, some pre-processing would happen on the car before uploading to a central server. This reduces the size of the dataset that's transmitted and can protect the users’ privacy by de-identifying key elements before data is stored on the server.
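One common technique at this step is pseudonymization: replacing raw identifiers with a keyed hash before the data is uploaded. The sketch below illustrates the idea with hypothetical field names; note that this is pseudonymization rather than true anonymization, since records remain linkable and whoever holds the key can still re-identify them:

```python
# A sketch of de-identifying a record on the device before upload, using a
# keyed hash (HMAC) to replace the raw identifier. Field names are made up.
import hmac
import hashlib

SECRET_KEY = b"keep-this-on-the-device"  # assumption: managed securely

def de_identify(record):
    pseudonym = hmac.new(SECRET_KEY, record["driver_id"].encode(),
                         hashlib.sha256).hexdigest()
    clean = dict(record)
    clean["driver_id"] = pseudonym
    clean.pop("home_address", None)      # drop fields the server never needs
    return clean

raw = {"driver_id": "jane.smith@example.com",
       "home_address": "12 Oak St",
       "avg_speed_kph": 48.2}
print(de_identify(raw))
```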
As awareness of COVID-19 rose, many of us obsessively refreshed the website dashboards hosted by Johns Hopkins or the New York Times. They gathered information from many different countries’ public health systems and other data sources, normalized that data, and transparently de-duplicated reported cases to get as accurate a case count as possible.
The COVID-19 pandemic is a critical example of a situation where aggregating data requires guarding against bad analysis caused by double counts or vague classifications.
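One simple way to avoid double counting when several sources overlap is to take each region's numbers from only one source, in a stated order of preference. The sketch below illustrates the idea; the source names and counts are invented:

```python
# A sketch of aggregating case counts from overlapping sources without
# counting any region twice: each region is taken from the most trusted
# source that reports it.
SOURCE_PRIORITY = ["national_ministry", "university_tracker", "news_scrape"]

reports = {
    "national_ministry":  {"Country A": 1200},
    "university_tracker": {"Country A": 1185, "Country B": 430},
    "news_scrape":        {"Country B": 455, "Country C": 90},
}

totals = {}
for source in SOURCE_PRIORITY:            # most trusted source first
    for region, count in reports[source].items():
        totals.setdefault(region, count)  # keep the first (preferred) value

print(totals)                 # {'Country A': 1200, 'Country B': 430, 'Country C': 90}
print(sum(totals.values()))   # 1720, each region counted exactly once
```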
Decisions about taxonomies and aggregation can have life-or-death impacts; a choice by the Centers for Disease Control in the US about how to track breakthrough infection cases (where a vaccinated person gets sick) posed a bit of a challenge. While the specificity of the CDC's definition—which only counts breakthrough infections resulting in hospitalization and/or death—does help them focus on serious cases, it is currently difficult to know how many breakthrough cases are occurring that don't result in hospitalization or death. However, the fact that there are clear definitions of thresholds for reporting does mean that other data sources could be aggregated with CDC numbers to provide a clearer picture with a lower risk of erroneous duplicates.