Chapter 2 | Data Supply Chain Guidebook

Acquire

Acquisition is the first step in data's journey through various systems. Data can come from many places—sensors or cameras, systems like website analytics software, or humans (in the form of text, surveys, audio, or video).

What’s the mental model here?

When discussing structured data sources, we may think about 'data entry,' like entering survey results into rows in a spreadsheet.

We can also think about data sources that aren't actively gathered or structured for the particular use you have in mind, like 'dumps' of social media posts, news stories, and even video streams.

These are all data sources, but they are not all equally useful for a specific purpose.

Consider ethics from the very beginning

There are data ethics and informed consent considerations throughout the data supply chain. Good modeling of a ‘data supply journey’ is critical for operating mindfully and ethically, which also makes better business sense.

In the acquisition phase, it's important to ensure we have permission to collect the data we're gathering and that we have established informed consent—making sure that the users who give us permission know what this might mean for them in the future.

Example

An autonomous vehicle captures raw data from onboard sensors

In the acquisition stage, the car captures raw data from onboard sensors, like cameras or speed sensors. It's just bits and bytes. No work has been done in terms of processing or thinking about it.

Exercise

What data do you already have? Consider customer records, images, transactions, web traffic data, accounting files, surveys, or call recordings.

  • Are there new places you could get data from?
  • Consider any digital properties your company has, like apps or websites: is there analytics data that you could take a deeper look at?
  • The data you need might not be digitized yet. For example, customer feedback might only exist in analog form, such as an unrecorded conversation, but it is still a source of data you could start gathering.

Visit the Data Sources Explorer for examples of different types of data that can be acquired.

Data Catalogs

Data catalogs are what they sound like: exhaustive (or at least comprehensive) lists of the datasets available from an organization or other source. For example, scientific researchers might need a list of all the medical statistics datasets they could access; a company might need various lists of customer information; or an app developer might need data about users.

No matter your context, it's important to create or contribute to data catalogs if you wish to collaborate with others.

For larger organizations, cataloging data inside (or near) the company can be quite an effort in and of itself, especially if data wasn't intentionally gathered for future use or in digital formats (such as old accounting records).

A good way to check your organization's data capabilities is to see whether or not catalogs of available data are accessible to anyone in the organization who needs them.
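
As a rough illustration of what a catalog can look like at its simplest, the sketch below keeps a list of dataset entries and searches them by keyword. The entry fields (`name`, `owner`, `format`, `tags`), the sample datasets, and the `find_datasets` helper are all hypothetical, chosen only for this example; real catalog tools record far more detail.

```python
# A tiny in-memory data catalog. Every entry and field name here is made up
# for illustration and does not come from any particular catalog product.
catalog = [
    {"name": "customer_records", "owner": "sales",
     "format": "database table", "tags": ["customers", "contacts"]},
    {"name": "web_traffic", "owner": "marketing",
     "format": "analytics export", "tags": ["web", "analytics"]},
    {"name": "accounting_1998", "owner": "finance",
     "format": "paper ledger", "tags": ["accounting", "not digitized"]},
]


def find_datasets(entries, keyword):
    """Return the names of entries whose name or tags contain the keyword."""
    keyword = keyword.lower()
    return [
        entry["name"]
        for entry in entries
        if keyword in entry["name"].lower()
        or any(keyword in tag.lower() for tag in entry["tags"])
    ]


print(find_datasets(catalog, "accounting"))  # ['accounting_1998']
```

Even a list this small answers the question of whether anyone who needs a dataset can find out that it exists and who owns it.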

Data Marketplaces

Data is often accessible via third-party marketplaces, which provide space for many parties to lease data to each other, including both broad and highly focused datasets.

For example, on Amazon Web Services' marketplace, users can source and sell data on COVID-19, real estate, satellite imagery, healthcare claims, traffic, and many other topics.

You can read more about data marketplaces in Creating Value with Data.

Metadata

Metadata is data about another piece of data. We use it to understand, sort, and validate datasets to increase their usefulness. For example, the data in an MP3 is a recording of music, but the information about the artist and song name is metadata—additional data about the core data.

Other common examples of metadata include the send and receive dates of emails, the unique address of a server, or information about which app was used to post a particular message to Twitter. For example, because Twitter shows which app posted a tweet, a recent president was found to be using an unsecured phone to post there.

Additional examples of metadata include when a computer file is created or modified, the number of times a post has been viewed on social media, or the number of times a song has been played on Spotify.
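
To make the file example concrete, here is a minimal sketch that reads some of the metadata an operating system keeps about a file, using Python's standard library. The `file_metadata` helper and the throwaway text file are assumptions made for this illustration.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Return a few metadata fields the operating system keeps about a file."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        # st_mtime is the last-modified time in seconds since the epoch.
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    example_path = handle.name

example_metadata = file_metadata(example_path)
os.remove(example_path)

print(example_metadata["size_bytes"])  # 5
```

The file's content is the word "hello"; its size and modification time are metadata about that content.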

Validating data at the point of acquisition

When acquiring data, we need to record as much as we can about its provenance (or source), accuracy, and consent for use. Confirming these details gives us more confidence that the data is correct and enough information to compare datasets.

For example, if you get satellite images from 10 different providers, you'd want to make sure they all carry the metadata needed to synchronize them. You need to know exactly what time and date each image was captured and that the location assigned to it is accurate. Otherwise, the results of the analysis you apply later won't be very trustworthy.
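
One way to picture this kind of check is the sketch below, which inspects a single image's metadata record before accepting it. The field names (`provider`, `captured_at`, `latitude`, `longitude`) and the rules are assumptions made for illustration, since real providers each use their own formats.

```python
from datetime import datetime

# Hypothetical fields we assume every provider should supply.
REQUIRED_FIELDS = ("provider", "captured_at", "latitude", "longitude")


def _in_range(value, low, high):
    """True if value is a number between low and high, inclusive."""
    return isinstance(value, (int, float)) and low <= value <= high


def validate_image_metadata(record):
    """Return a list of problems found in one image's metadata (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems

    try:
        captured_at = datetime.fromisoformat(record["captured_at"])
    except (TypeError, ValueError):
        problems.append("captured_at is not an ISO 8601 timestamp")
    else:
        # Without a UTC offset, images from different providers
        # cannot be lined up on a common timeline.
        if captured_at.tzinfo is None:
            problems.append("captured_at has no timezone offset")

    if not _in_range(record["latitude"], -90, 90):
        problems.append("latitude out of range")
    if not _in_range(record["longitude"], -180, 180):
        problems.append("longitude out of range")
    return problems


good = {"provider": "A", "captured_at": "2021-03-04T10:15:00+00:00",
        "latitude": -33.45, "longitude": -70.66}
bad = {"provider": "B", "captured_at": "2021-03-04T10:15:00",
       "latitude": 123.0, "longitude": -70.66}

print(validate_image_metadata(good))  # []
print(validate_image_metadata(bad))
# ['captured_at has no timezone offset', 'latitude out of range']
```

Checks like these only confirm that the metadata is present and plausible. Whether the recorded time and location are actually accurate still has to be confirmed with the provider.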

It is also important to note whether the data you acquire is raw (unedited from its original form) or has been processed already. For example, you may wish to analyze data from security cameras facing a parking lot, and assume the recordings you receive are complete. However, many security cameras process images before uploading them to the internet, in order to reduce file sizes. This may mean that the video file has already been trimmed, compressed, or edited in some way that removes useful information, like recordings of subtle movements that didn't trigger a 'motion detecting' threshold.

Understanding the basics of machine learning is important in order to ensure that data collected hasn't been over-processed. While it’s often not cost-effective (or even possible) to manually assess every data point yourself, trusting under-trained humans or machines can introduce biases so early in the data supply chain that you might not catch them.

Attributes of Data

At the point of data acquisition, it's important to capture enough metadata (data about the data) for the data to be useful in the future. Attributes to consider include:

  • The source of the data
  • How structured the data is
  • Details about the sample of the data (for example, if web traffic is the source, was it on all of your websites, or just mobile websites in Chile?)
  • Any initial filtering applied at the point of acquisition, like motion detection for security cameras
  • The original discloser's intended use for their data
  • Rights, permissions, and restrictions on gathering or using the data
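
The attributes listed above could be recorded alongside each dataset in a simple structure like the sketch below. The `AcquisitionRecord` class, its field names, and the sample values are hypothetical. They mirror the list for illustration and are not a standard schema.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class AcquisitionRecord:
    """Metadata captured when a dataset is acquired (illustrative fields only)."""
    source: str
    structure: str           # e.g. "structured" or "unstructured"
    sample_description: str  # which slice of the world the data covers
    intended_use: str        # what the original discloser expected
    initial_filtering: list = field(default_factory=list)
    usage_rights: list = field(default_factory=list)

    def missing_attributes(self):
        """Return the names of text attributes that were left blank."""
        return [
            name
            for name, value in asdict(self).items()
            if isinstance(value, str) and not value.strip()
        ]


record = AcquisitionRecord(
    source="parking lot security camera",
    structure="unstructured",
    sample_description="one camera, daytime only",
    intended_use="",  # not yet confirmed with the discloser
    initial_filtering=["motion detection"],
    usage_rights=["internal analysis only"],
)

print(record.missing_attributes())  # ['intended_use']
```

Flagging blank attributes at acquisition time makes gaps visible while it is still possible to ask the source about them.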


There are many, many more factors to consider when analyzing a data set. For a more comprehensive list of attributes, explore the "Attributes of Data" gallery.


Exercise

Where Does Data Come From?

Browse through the Data Sources Explorer.

Look for one or two data types that surprise you or spark your interest. Then take a moment to search online for information about the types you selected. Note what you find in the worksheet.

  1. What type of data does this source collect?
  2. Where could this data come from?
  3. Who could be collecting the data?
  4. What are the risks or ethical concerns in capturing this data?

The Data Sources Explorer

Explore some of the many types of data that can be acquired in the Data Sources Explorer, a gallery of dataset types.


Recap

  • Data can be collected from a wide range of sources, but we always need to attend to the ethics of informed consent at the point of data's origin.
  • Being too narrow about which data counts as useful during collection can mean missing valuable metadata. Try to capture enough data to provide context and to enable future uses that haven't come up yet.
  • Verification and classification should happen close to when data is collected if possible, to test that you have accurate sources and enough metadata to organize data sets later.