The next stage of the data supply chain is storage: recording data to a trusted location that is both secure and easily accessible for further manipulation.
Where we store data is important. Data storage is often some version of the cloud, or perhaps a specific server, or it could be a flash drive or local memory on a sensor. Wherever we store it, we need to make sure that we understand how it will connect with other systems down the data supply chain.
Data needs to be stored somewhere secure (when it is 'at rest') and transmitted through secure channels (when it is 'in motion').
We also need to respect the ethical restrictions and security standards that could reasonably be expected of us as professionals, many of which are imposed by law or contract.
In the automotive case, raw data might be stored in an unprocessed form in the vehicle's local memory—think of it as a hard drive in the car. Later, it will be uploaded to the car company's cloud servers.
Most of us think about data as a file in a filing cabinet or a row in a spreadsheet. This is data at rest: static, staying the same until a user modifies it.
Normally when you secure data at rest, you secure the perimeter. This is like locking a file cabinet or password-protecting a spreadsheet.
The problem with data at rest is that it's hard to synchronize across multiple systems. Imagine ten co-workers, all with their own copies of tomorrow’s presentation and sales figures. That’s a lot of distinct versions.
Another mental model for data is data in motion. Water is a good metaphor for data in motion: a ‘stream’ like a live video feed or a ‘flow’ of stock market data.
Data in motion is dynamic or 'live', and it has to be secured in multiple ways. We secure the ‘pipes’ that the data passes through, just like locking the file cabinet in the prior example. We must also authenticate the users who access it and, where needed, encrypt the data itself while it is in transit. This is akin to signing a document and sealing it in an envelope before sending it through the mail.
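To make the 'signing' half of that analogy concrete, here is a minimal sketch in Python using only the standard library. It attaches a keyed signature (HMAC-SHA256) to a message so the receiver can detect tampering. The names `SECRET_KEY`, `sign_message`, and `verify_message`, and the sample vehicle fields, are illustrative assumptions rather than part of any particular system. The 'sealing' half (encryption) is deliberately left out; it is typically handled by the secure channel itself.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret for illustration only; a real system would
# load this from a key-management service, never from source code.
SECRET_KEY = b"example-shared-secret"


def sign_message(payload: dict, key: bytes = SECRET_KEY) -> dict:
    """Serialize the payload and attach an HMAC-SHA256 signature."""
    # Sorted keys and fixed separators give the same bytes for the same data.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_message(message: dict, key: bytes = SECRET_KEY) -> bool:
    """Return True only if the body still matches its signature."""
    expected = hmac.new(
        key, message["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, message["signature"])


if __name__ == "__main__":
    msg = sign_message({"vehicle_id": "demo-001", "speed_kph": 62})
    print(verify_message(msg))
```

A receiver holding the same key can call `verify_message` on whatever arrives; a message altered along the way, or signed with a different key, fails the check.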
Data architectures need to be built to synchronize and/or stream data across systems. However, even some tech-centric organizations didn’t build their infrastructure with today’s data volumes in mind. Speed and robustness are hard to build in ahead of demand.
Data in motion is much harder to wrap our heads around than data at rest. But it's a more useful mental model because we don't just want information about things that happened months ago that we retrieve from a file cabinet. We want to look at what's happening in the present to get the best insights into future opportunities.
When people refer to the cloud, they usually mean cloud storage. The term 'cloud' can refer to any kind of central storage for data. (Cloud storage can also be coupled with applications and algorithms that run on central servers, often referred to as 'cloud apps' or 'cloud computing'.)
Centralizing data in the cloud is useful for ensuring everyone has access to the most up-to-date content without distributing every new version to every user.
Applications which run in the cloud can also be helpful because they allow near-instantaneous and synchronized manipulation of centralized data. Most forms of cloud storage and computing are beyond the scope of this guide, but it is important to understand whether the datasets you are considering can be stored in some central way. There are also disadvantages to server- or cloud-based storage, like increased security risks, ongoing subscription costs, and dependence on third-party vendors.
Where, geographically, is your data (or your users' data) stored?
What is the legal jurisdiction of the systems it passes through?
Even if systems are not storing data, or we're not sure if they're storing it, we should be mindful of the ‘citizenship’ of data. Different rights might apply to data depending on where it was gathered, manipulated, and/or consumed. Sometimes those rights are even applied retroactively by policymakers who didn’t anticipate particular usages.
This is where examples like the GDPR (the European Union's General Data Protection Regulation) or California's privacy laws come in. Because those policies require that certain rights be afforded to the people who generate the data, and impose restrictions on that data’s use, companies that did not store metadata on where data came from, or that did not establish consent or could not reaffirm user consent, had to abandon or destroy data in their systems.
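One way to picture the metadata in question is a stored record that carries its own origin and consent information. The sketch below is in Python, and every name in it (`StoredRecord`, `usable_records`, the jurisdiction codes) is a hypothetical chosen for illustration. It also assumes a deliberately simplified rule, that a record is usable if consent was recorded or if it was collected somewhere that does not require consent. Real legal tests are more nuanced and need legal advice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class StoredRecord:
    """A stored item plus the provenance metadata discussed above."""
    record_id: str
    subject_id: str        # the person the data describes
    collected_in: str      # jurisdiction where the data was gathered
    stored_in: str         # jurisdiction of the system holding it
    consent_given_at: Optional[datetime] = None  # None = no consent on file


def usable_records(records, jurisdictions_requiring_consent):
    """Keep records with consent on file, or collected where (by the
    simplifying assumption of this sketch) consent is not required."""
    return [
        r for r in records
        if r.consent_given_at is not None
        or r.collected_in not in jurisdictions_requiring_consent
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    records = [
        StoredRecord("r1", "s1", "EU", "US", consent_given_at=now),
        StoredRecord("r2", "s2", "EU", "US"),  # no consent recorded
    ]
    for record in usable_records(records, {"EU"}):
        print(record.record_id)
```

A dataset stored without fields like `collected_in` and `consent_given_at` cannot be filtered this way at all, which is the situation the paragraph above describes.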
Some jurisdictions grant their citizens a “right to be forgotten”, or a right to have their data erased. But what if you're Google and you have data about someone in thousands, or even millions, of locations around the world? How do you locate that data, recall it, and properly delete or dispose of it? While there are technical solutions, they only work if there is a good data taxonomy and logging strategy.
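As a rough illustration of why logging matters here, the Python sketch below keeps an index of every place a person's data has been written, so that an erasure request can be fanned out to each location. The class `DataLocationLog`, its methods, and the system names are all hypothetical. A production system would also need durable storage, backups, and audit trails that this sketch ignores.

```python
from collections import defaultdict


class DataLocationLog:
    """Hypothetical index of where each person's data has been written."""

    def __init__(self):
        # subject_id -> set of (system, key) pairs
        self._locations = defaultdict(set)

    def record_write(self, subject_id, system, key):
        """Log that data about subject_id now lives at (system, key)."""
        self._locations[subject_id].add((system, key))

    def locations_for(self, subject_id):
        """Return every logged location for a subject, in sorted order."""
        return sorted(self._locations.get(subject_id, set()))

    def erase(self, subject_id, delete_fn):
        """Call delete_fn(system, key) for each logged location,
        then forget the subject in the log itself."""
        erased = []
        for system, key in self.locations_for(subject_id):
            delete_fn(system, key)
            erased.append((system, key))
        self._locations.pop(subject_id, None)
        return erased


if __name__ == "__main__":
    log = DataLocationLog()
    log.record_write("user-42", "eu-db", "row:1001")
    log.record_write("user-42", "us-cache", "k:abc")
    # The lambda stands in for a real per-system delete call.
    log.erase("user-42", lambda system, key: print("deleting", system, key))
```

The erasure step can only reach locations that were logged at write time, so any copy made without a `record_write` call is invisible to it.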
The key question to ask here is, "What sovereignties does our data pass through, and what are their laws?"