Acquisition is the first step in data's journey through various systems. Data can come from many places—sensors or cameras, systems like website analytics software, or humans (in the form of text, surveys, audio, or video).
What’s the mental model here?
When discussing structured data sources, we may think about 'data entry,' like entering survey results into rows in a spreadsheet.
We can also think about data sources that aren't actively gathered or structured for the particular use you have in mind, like 'dumps' of social media posts, news stories, and even video streams.
These are all data sources, but they are not all equally useful for a specific purpose.
There are data ethics and informed consent considerations throughout the data supply chain. Good modeling of the 'data supply journey' is critical for operating mindfully and ethically, which also tends to make better business sense.
In the acquisition phase, it's important to ensure we have permission to collect the data we're gathering and that we have established informed consent—making sure that the users who give us permission know what this might mean for them in the future.
In the acquisition stage, the car captures raw data from onboard sensors, like cameras or speed sensors. At this point it's just bits and bytes; no processing or interpretation has happened yet.
Data catalogs are what they sound like: exhaustive (or at least comprehensive) lists of what datasets are available from an organization or other source. For example, scientific researchers might need a list of all the medical statistics datasets they could access; a company might need various lists of customer information; or an app developer might need data about users.
No matter your context, it's important to create or contribute to data catalogs if you wish to collaborate with others.
For larger organizations, cataloging data inside (or near) the company can be quite an effort in and of itself, especially if data wasn't intentionally gathered for future use or in digital formats (such as old accounting records).
A good way to check your organization's data capabilities is to see whether or not catalogs of available data are accessible to anyone in the organization who needs them.
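To make the idea concrete, here is a minimal sketch of what a single catalog entry might look like in Python. The field names (name, description, owner, and so on) are illustrative assumptions, not a standard; real catalog tools define their own schemas.

from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """One illustrative record in an organization's data catalog."""
    name: str         # human-readable dataset name
    description: str  # what the dataset contains and how it was gathered
    owner: str        # team or person responsible for the dataset
    format: str       # e.g., CSV, Parquet, scanned PDFs
    updated: str      # when the dataset was last refreshed
    access: str       # how to request access

# A tiny catalog; in practice this would live in a searchable tool or database.
catalog = [
    CatalogEntry(
        name="customer-support-tickets",
        description="Anonymized support tickets, 2019 to present",
        owner="Customer Success",
        format="CSV",
        updated="2023-01-15",
        access="data-requests@example.com",  # hypothetical contact
    ),
]

# Anyone in the organization can answer "what data do we have?" with a search.
matches = [entry for entry in catalog if "support" in entry.name]
print(matches[0].owner)  # -> Customer Success

Even a simple structure like this makes the "is our data cataloged?" check above something you can actually test.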
Data is often accessible via third-party marketplaces, which provide space for many parties to lease data to one another, offering both broad and highly focused datasets.
For example, on Amazon Web Services' marketplace, users can source and sell data on COVID-19, real estate, satellite imagery, healthcare claims, traffic, and many other topics.
You can read more about data marketplaces in Creating Value with Data.
Metadata is data about another piece of data. We use it to understand, sort, and validate datasets to increase their usefulness. For example, the data in an MP3 is a recording of music, but the information about the artist and song name is metadata—additional data about the core data.
Other common examples of metadata include the send and receive dates of emails, the unique address of a server, or info about which app was used to post a particular message to Twitter. For example, a recent president was found to be posting from an unsecured phone because Twitter's metadata shows which app posted each tweet.
Additional examples of metadata include when a computer file is created or modified, the number of times a post has been viewed on social media, or the number of times a song has been played on Spotify.
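Some of this metadata is easy to inspect yourself. The sketch below uses only Python's standard library to read a file's size and last-modified timestamp, the same kind of 'data about data' described above.

import os
from datetime import datetime, timezone

def file_metadata(path: str) -> dict:
    """Return basic metadata the filesystem keeps about a file."""
    stats = os.stat(path)
    return {
        "size_bytes": stats.st_size,
        # st_mtime is the last-modified time, in seconds since the epoch.
        "modified": datetime.fromtimestamp(stats.st_mtime, tz=timezone.utc),
    }

print(file_metadata(__file__))  # inspect this script's own metadata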
When acquiring data, we need to record as much as we can about its provenance (or source), accuracy, and consent for use. Recording these details lets us verify that the data is correct and gives us enough information to compare datasets.
For example, if you get satellite images from 10 different providers, you'd want to make sure they all carry the metadata needed to synchronize them. You need to know exactly when each image was captured and that the location assigned to it is accurate. Otherwise, the results of any analysis you apply later won't be very trustworthy.
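As a sketch of that kind of check: before merging imagery from several providers, you might verify that every record carries the metadata fields you need to line them up. The record format here is a made-up example, not any provider's actual schema.

# Hypothetical metadata records from two imagery providers.
images = [
    {"provider": "A", "captured_at": "2022-06-01T10:15:00Z",
     "lat": 37.7749, "lon": -122.4194},
    {"provider": "B", "captured_at": "2022-06-01T10:15:00Z",
     "lat": 37.7749},  # missing longitude -- can't be synchronized
]

REQUIRED = {"captured_at", "lat", "lon"}

def usable(record: dict) -> bool:
    """An image is usable only if all alignment metadata is present."""
    return REQUIRED.issubset(record)

for img in images:
    if not usable(img):
        print(f"Provider {img['provider']}: missing "
              f"{REQUIRED - img.keys()}, excluding from analysis")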
It is also important to note whether the data you acquire is raw (unedited from its original form) or has been processed already. For example, you may wish to analyze data from security cameras facing a parking lot, and assume the recordings you receive are complete. However, many security cameras process images before uploading them to the internet, in order to reduce file sizes. This may mean that the video file has already been trimmed, compressed, or edited in some way that removes useful information, like recordings of subtle movements that didn't trigger a 'motion detecting' threshold.
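One way to catch this is to compare a recording's actual length against what you expected to receive. This sketch assumes the OpenCV library (cv2) is installed and that the camera was supposed to record continuously; the filename, expected duration, and tolerance are all illustrative assumptions.

import cv2  # pip install opencv-python

def recorded_seconds(path: str) -> float:
    """Estimate a video's duration from its frame count and frame rate."""
    cap = cv2.VideoCapture(path)
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frames / fps if fps else 0.0

expected = 8 * 60 * 60  # an assumed 8-hour overnight recording
actual = recorded_seconds("parking_lot_cam.mp4")  # hypothetical file
if actual < expected * 0.95:  # arbitrary 5% tolerance
    print(f"Only {actual / 3600:.1f}h of {expected / 3600:.0f}h expected: "
          "the camera may have dropped footage below its motion threshold.")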
Understanding the basics of machine learning also helps you ensure that the data you collect hasn't been over-processed. It's often not cost-effective (or even possible) to manually assess every data point yourself, but trusting under-trained humans or machines can introduce biases so early in the data supply chain that you might not catch them.
At the point of data acquisition, it's important to consider many attributes of the data so that you capture enough metadata (data about the data) for it to be useful in the future.
There are many, many more factors to consider when analyzing a dataset. For a more comprehensive list of attributes, explore the "Attributes of Data" gallery.
Browse through the Data Sources Explorer, a gallery of dataset types.
Look for one or two data types that surprise you or spark your interest. Then take a moment to search online for information about the types you selected. Note what you find in the worksheet.