Attributes of Data

Because data comes in so many forms, it can be hard to compare sources in a data marketplace or to prioritize spending on your own direct research. We've assembled a list of dimensions of data that can be assigned numeric values, so that you can create a scorecard or rubric by which to compare data sources.

There are many contexts from which you can assess a dataset's fitness for purpose. Be explicit about which you use so that you evaluate consistently. Examples of lenses that you might use:

  • Suitability for internal use (such as optimizing marketing spend)
  • Suitability for enabling data-centric digital offerings, like a mobile app
  • Suitability for aggregation of datasets for analysis
  • Suitability for aggregation of datasets for direct sale
  • Suitability for rigorous scientific study, such as meta-analysis of many health studies' sample groups
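The scorecard idea above can be sketched as a simple weighted rubric. The attribute weights and the 1–5 scores below are illustrative assumptions, not values from this guide; adjust both to match the lens you choose.

```python
# Illustrative weights over a subset of the attributes described below.
# Weights sum to 1.0; per-dataset scores are on a 1-5 scale.
WEIGHTS = {
    "breadth": 0.20,
    "depth": 0.20,
    "frequency": 0.15,
    "quality": 0.25,
    "availability": 0.20,
}

def score(dataset_scores: dict[str, float]) -> float:
    """Weighted average of a dataset's attribute scores."""
    return sum(WEIGHTS[attr] * dataset_scores[attr] for attr in WEIGHTS)

# Hypothetical scores for two candidate sources (see the telecom
# example under Breadth): one telecom vs. an all-telecom aggregate.
telecom_a = {"breadth": 2, "depth": 5, "frequency": 4, "quality": 4, "availability": 3}
all_telecoms = {"breadth": 4, "depth": 3, "frequency": 2, "quality": 5, "availability": 4}

print(f"Telecom A:    {score(telecom_a):.2f}")     # 3.60
print(f"All telecoms: {score(all_telecoms):.2f}")  # 3.75
```

Re-running the same rubric under a different lens (say, suitability for direct sale, where Originality and Legal/Ethics dominate) will often rank the same sources differently, which is why being explicit about the lens matters.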

Breadth

How comprehensive the sample of the data set is

  • How broad is the dataset? Does it cover several elements of your problem space? Does it apply to only one demographic?
  • Example: customer data from one telecom versus all telecoms

Depth

The variety and number of different data points in the set

  • Does the set provide many useful data points, or just a few?
  • Example: does a dataset about your customer show only that they logged in to your website, or does it capture everything they did there?

Frequency

The time distance between data points

  • How often does the data get updated?
  • If the data is live, how often does it refresh?
  • Update frequency varies enormously between datasets; compare tweets with a daily newspaper, a weekly newspaper, or a quarterly or annual report

Processing

The amount of error correction and labeling performed on the data set before distribution

  • How processed is the data?
  • Has it been error-corrected? If so, are you clear about how it was error-corrected so that you can avoid downstream accuracy, bias or forensic problems?
  • Example: 'guessing' gender based on a user's submitted name vs. the user directly reporting it

Structuring

The degree to which attributes of the data have been defined and categorized

  • Some estimates say only 5% of available data is structured
  • How structured is the data set?
  • Is the structure easy to integrate with your own?

Research Cost

The cost of acquiring and verifying the data in its original form

  • What is the cost of acquiring the original data?
  • If you are acquiring your own data, estimate not just the initial cost but also verification and update costs
  • If you are acquiring third-party data, be cautious about expensive-to-research data, because it may not be updated often or at all

Quality

How accurate the original data is

  • How accurate is the data?
  • Can you verify the data with another method and get the same results?
  • If there are errors, are they critical (e.g., fundamentally wrong values) or merely a processing need (such as inconsistent telephone number formats)?

Bias

The prejudices replicated in the data by human and/or machine factors

  • Are there any intentional, unintentional or inherent biases in the data?
  • Example: data about user activity gathered only from mobile applications may skew towards people who can afford large data plans or have the space to download many apps
  • Example: census data that does not recognize certain ethnicities may force respondents into umbrella categories such as 'Asian' or 'Caucasian' that do not represent their identities well

Availability

The ease of accessing the data set

  • How broadly available is the data?
  • Can you get access to the data easily? If so, can your competitors?
  • Is the data available in a public marketplace or through a data subscription service which is easy for you to integrate with your tools?

Originality

How unique the data set is compared to other data sets

  • How original is the source? Can you use data that your competitors cannot access to create value?
  • How novel (unusual) is the data source?
  • How innovative is the potential usage of it?
  • Example: anyone can access public records, but not everyone has access to your in-app analytics or customer chats
  • Example: the use of satellite data imagery of parking lots to predict retail store traffic was once very novel but is now becoming more commonplace.

Technology

How advanced the digital tools necessary to access and analyze the data are

  • Are any special technologies needed to access or process the data, such as natural language processing?
  • Is the data released faster, processed better, and/or delivered in a novel way because of the technologies used by the source or the analysts?

Legal/Ethics

Ease of access and use of the data relative to restrictions of law, regulation and ethical standards

  • Are there legal implications to the access or use of this data?
  • Has consent been established with the original discloser of the data, such as a website visitor or person recorded on a street camera?
  • Are you restricted to using the data only in certain jurisdictions?
  • Are you allowed to sell derivative datasets?
  • Are you exposing yourself to liability by accessing or storing this data, such as personally identifiable information?
  • Read more in our section on Data Ethics

Investment Type Suitability

If investing, how well matched your data is to the types of investments you can make

  • If you are using data to guide investment decisions, is the data well-suited to the investment types you are considering?
  • Example: if the data set you're looking at does not include people without bank accounts, but your investment is in micro-finance initiatives in sub-Saharan Africa, the data will not be very useful

Time Frequency of Investment Strategy

The degree to which the data frequency aligns with your intended investment frequency

  • If you are using data to guide investment decisions on a particular timeline, is the data updated regularly enough to be of value to you?
  • Example: annual reports may not be timely enough to give you an edge if you are engaging in algorithmic, high-frequency stock trading (HFT) but could be sufficient if you are making longer-term investments

Analysis Cost

The cost of turning the data into useful information (distinct from processing costs which precede analysis)

  • How much will it cost to analyze the data once you have it?
  • Consider machine analysis costs, such as image recognition or language translation
  • Consider human analysis costs, such as specialized data scientists

Recap

When evaluating data sets, it's helpful to prioritize attributes. Assessing the potential analysis cost of data can be time-intensive or difficult, so it's often better to start from a dataset's alignment with basic criteria such as time frequency and geography before attempting to compute more advanced ones. Evaluating data sets for unknown future uses is very challenging, so it may help to engage in design thinking and/or practical futurism exercises to focus your evaluation.