On Data Projects

Definition and Value

By Santi on June 21, 2019

Definition and Comparison

A data (science) project is a kind of information technology project where analyzing and modeling data are the most important activities. The objective of a software project is to build a digital product, while the objective of a data project is to understand and process data so as to make it actionable.

Science aims at knowledge; technology aims at action, at altering the world. Along the same lines, a data project seeks to know (and to transform data so that it is easier to know), while a software project seeks to create a digital artifact that can make changes in the physical or virtual world.

All software projects use data, but not every software project is a data project. Let's suppose we want to build an e-commerce website: lots of data will be needed and processed, but its analysis will not be the project's objective. Designing the architecture, creating a suitable user experience, implementing the API, and optimizing deployment are some of the tasks that will take much of the resources, but none of these tasks is directly related to data.

In some cases, however, the boundaries are blurry. Even then, it is a good idea to separate data-intensive activities into a different project. The skills required for data analysis and for software development differ, which has to be a key consideration when staffing.


Data can be internal or external (to the company), proprietary or public.

The objective of a data project can be:

  1. Gaining insight to make informed decisions (Research)
    • Internal (people, resources, processes)
    • Customers
    • Competition
    • Providers
  2. Producing a trained model to be used in a product (Product)
  3. Preparing data for sale (Sale)


Examples, one for each objective:

  1. Optimize product mix for a key customer segment
  2. Create an image classifier for dog breeds
  3. Create an input pipeline to clean and store information from a tracking pixel (an invisible image embedded in digital products in order to track users)


Information technology projects are always risky, and data projects particularly so: some estimates put their failure rate between 60 and 85%. The principal problems are seldom related to the technology itself: having a clear goal and integrating data analytics into organizational culture are more complex processes than any machine learning algorithm. The upside, on the other hand, can be enormous: from creating amazing products with unbeatable quality and unmatchable features to targeting a new, unexplored market segment, the potential to optimize existing revenue streams and generate new ones is substantial.

Any project has to create value, as measured by business metrics. Starting a "Machine Learning Project" or a "Deep Learning Project" is a common mistake: the methodology (machine learning, deep learning) should not be predetermined; it has to be the result of assessing the options for achieving the project's goal. Often, the newest techniques provide little or no advantage over a much simpler solution. Even if state-of-the-art deep learning is required, many off-the-shelf, cost-effective services exist that can drastically reduce implementation costs.
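As a rough illustration of assessing the options before committing to a methodology, the sketch below scores candidate approaches on held-out data and keeps the simplest one whose error is within a tolerance of the best. Everything in it is an assumption made for illustration: the function names (fit_mean, fit_line, pick_simplest), mean absolute error standing in for a real business metric, the 5% tolerance, and the toy numbers.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Simplest candidate: always predict the training average."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Slightly richer candidate: least-squares straight line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mae(model, xs, ys):
    """Mean absolute error of a fitted model on (xs, ys)."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def pick_simplest(candidates, train, test, tolerance=0.05):
    """Return the first (simplest) candidate close enough to the best.

    candidates: list of (name, fit_function), ordered simplest first.
    train, test: (xs, ys) tuples.
    """
    scores = [(name, mae(fit(*train), *test)) for name, fit in candidates]
    best = min(score for _, score in scores)
    for name, score in scores:
        if score <= best * (1 + tolerance):
            return name, scores


# Toy, made-up data: even positions for training, odd ones held out.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
train = (xs[0::2], ys[0::2])
test = (xs[1::2], ys[1::2])

candidates = [("mean", fit_mean), ("line", fit_line)]
chosen, scores = pick_simplest(candidates, train, test)
print(chosen, scores)
```

The ordering of the candidate list carries the design choice: a more complex option is selected only when it beats the simpler ones by more than the tolerance on the metric that matters to the project.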

Start Small

Automate and Update

Automate manual, error-prone and time-consuming processes related to data entry. Update legacy systems, defined as old tools that are hard to use, integrate and understand. These are not data projects, but have tremendous power to improve data usage in the organization.

Understand Customers

Keep the CRM updated with quality data. Pick and track customers' KPIs. Run experiments with cross-selling and upselling, pricing, and assortment optimization.
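One possible way to read the outcome of such an experiment is a two-proportion z-test on conversion rates, sketched below with the standard library only. This is a common textbook technique picked here for illustration, not a method the post prescribes; the function name two_proportion_z and the visitor and conversion counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and variant (b).

    Returns the z statistic and a two-sided p-value based on the
    normal approximation, which assumes reasonably large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up counts: 1000 customers saw the current price, 1000 the new one.
z, p_value = two_proportion_z(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be noise, but the decision should still be judged against the KPI the experiment was meant to move.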

Avoid or End Doomed Projects

Passing on a data project can be the best choice. In general, end or avoid anything without a specific business goal and metric. Make sure projects are aligned with the corporate strategy.
