The data science process (i.e., creating data-driven products such as recommendation engines, fraud detection systems, and chatbots) is somewhat similar to what goes on in a kitchen when a new menu is being prepared. A data science bootcamp in London compares data science to the activities in a restaurant kitchen. Let’s begin.
A chef has to create something new that will appeal to guests and then offer it again and again throughout the season. Even with a recipe, getting the right taste, texture, or appearance often takes several attempts and some experimentation, i.e., learning from trial and error. The prerequisites are ingredients (data), existing intuition (technical knowledge and professional expertise), and the right equipment (tools).
Business and Data Understanding
Why do we have to cook at all? For whom? For what occasion? What dishes are we talking about? And how do we know if we succeeded? All of these questions, related to the business goal, potential value, stakeholders, and expected outcome, are also the starting point of any data science project (this is usually called the business understanding or problem formulation phase). Obviously, cooking spaghetti bolognese is not the same as preparing a sauce-soaked timpano. Still, a number of ingredients (meat, pasta, and tomato sauce), preparation steps, and kitchen utensils (tools) are common to both dishes. What differs is the combination and proportions of the ingredients, the equipment used and its settings, and the order and timing of the preparation steps.
In the Data Understanding phase, the goal is to select, from among the dishes that potentially fit the business goal, one for which the appropriate ingredients (data) are available and that can actually be prepared (i.e., the time, the skills, the tools, etc. are available).
Data Preparation
A dish does not consist of random ingredients. The ingredients need to be prepared, usually in a specific order, and treating the same ingredients differently can have a massive impact on the outcome. For example, for a dessert, the egg whites must first be separated from the yolks and then whipped until stiff, while for an omelet, the whole eggs are beaten directly. The same applies to data.
Let’s get the ingredients first. Ingredients and data alike can come from different sources (supermarket, wholesaler, producer, etc., or data warehouses, cloud storage, APIs, etc.) and arrive in different forms and packaging (data formats). The process of data ingestion is about gathering all the ingredients and putting them in a usable form on the work surface so that cooking can start.
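As a rough illustration of ingestion, the sketch below pulls records from two differently packaged sources into one common shape. Everything in it is an assumption made for the example: the two sources are inlined as strings rather than read from a real warehouse or API, and the field names `name` and `quantity` are hypothetical.

```python
import csv
import io
import json

# Hypothetical raw inputs; in practice these would come from files, a warehouse, or an API.
CSV_SOURCE = "name,quantity\ntomato,12\npasta,5\n"
JSON_SOURCE = '[{"name": "egg", "quantity": 30}]'


def ingest_csv(text):
    """Parse CSV text into a list of records with typed fields."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["name"], "quantity": int(row["quantity"])} for row in reader]


def ingest_json(text):
    """Parse a JSON array into the same record shape as ingest_csv."""
    return [{"name": item["name"], "quantity": int(item["quantity"])} for item in json.loads(text)]


# All ingredients end up in one usable form on the "work surface".
records = ingest_csv(CSV_SOURCE) + ingest_json(JSON_SOURCE)
print(records)
```

Whatever the packaging, every later step only has to deal with one record shape.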
Just like data, ingredients can vary in quality. A chef will always check the quality of the ingredients, discard some, and even change suppliers if necessary. All this is the goal of the Data Cleaning process.
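A minimal sketch of such a quality check might look like the following. The sample records and the rules in `is_usable` (a non-empty name and a non-negative integer quantity) are assumptions chosen for illustration; real cleaning rules depend on the data and the business goal.

```python
# Hypothetical raw records; empty, missing, and negative values stand in for bad-quality data.
raw_records = [
    {"name": "tomato", "quantity": 12},
    {"name": "", "quantity": 5},        # missing name
    {"name": "egg", "quantity": None},  # missing quantity
    {"name": "pasta", "quantity": -3},  # implausible value
    {"name": "basil", "quantity": 4},
]


def is_usable(record):
    """Return True if the record passes the (assumed) quality checks."""
    quantity = record.get("quantity")
    return bool(record.get("name")) and isinstance(quantity, int) and quantity >= 0


# Keep only the records that pass inspection and count what was thrown away.
clean_records = [r for r in raw_records if is_usable(r)]
discarded = len(raw_records) - len(clean_records)
print(f"kept {len(clean_records)} records, discarded {discarded}")
```

Counting the discarded records matters as much as dropping them: a supplier (data source) that produces many rejects is a candidate for replacement.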
Modeling, Evaluation & Deployment
And now let’s cook! While the type of dish already limits the type of cookware, there is still plenty of room for experimentation. Similar to a cook who tries many different alternatives before reaching the desired consistency, taste, or appearance, data scientists try different model versions, each with slight variations, to find the best combination of ingredients, intermediate products, and cookware. This corresponds to the modeling phase.
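The idea of trying several model versions and keeping the best one can be sketched with a deliberately tiny stand-in. Here each "model version" is just a threshold rule on a single feature, and the validation data is made up; a real project would compare actual models, but the select-by-score loop has the same shape.

```python
# Hypothetical labelled validation data: (feature value, label). Values are made up.
validation = [(2.5, 0), (4.0, 0), (5.5, 1), (9.0, 1)]


def accuracy(threshold, data):
    """Share of points classified correctly by the rule 'predict 1 if x > threshold'."""
    correct = sum(1 for x, label in data if int(x > threshold) == label)
    return correct / len(data)


# Each candidate threshold is one "model version" to taste-test.
candidates = [2.0, 3.5, 4.5, 6.5]
scores = {t: accuracy(t, validation) for t in candidates}

# Keep the version with the highest validation accuracy.
best_threshold = max(scores, key=scores.get)
print(scores)
print("best threshold:", best_threshold)
```

Recording the score of every variant, not only the winner, makes it easier to return to an earlier version if the chosen one disappoints later.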
Taste is subjective, and what the chef likes may not always be what the guests want or are willing to order. The art of the chef is to understand the tastes of the guests and adjust the dish if necessary. The same applies to data-driven products. A dish or a model may work well in a controlled environment but poorly in production, where it is exposed to all sorts of guests. The goal of the evaluation process is to get feedback on performance and, if necessary, to adjust or change the dish. This can be done for a specific group of guests or for a specific occasion. The idea is to avoid wasting time and to evaluate the product as early as possible.
Bringing a new dish from a restaurant’s kitchen to the dining room requires several things. Of course, the menu needs to change so that guests can find, understand, and order the new dish (i.e., incorporating the new data-driven product into the current portfolio may require new UX decisions). A price must be determined. The server should know how to describe and sell the dish to the guests. The chef and the kitchen team should be able to prepare the dish in time. The restaurant needs to ensure that feedback is collected on an ongoing basis, either directly from the guests or through the server. This corresponds to the deployment process.
Conclusion
Just like in the kitchen, the different phases or processes in data science are not independent of each other. Usually, there are many iterations. A phase may fail (e.g., not enough ingredients, the wrong ingredients for the dish, or guests who don’t order the new dish), and an adjustment is then needed (order new ingredients, change the dish, reorganize the menu, etc.). Also, the kitchen needs to be well organized to keep up during busy times, avoid waste, and maintain high standards of quality and hygiene. Recipes must be written down and updated as necessary to ensure guests are served the same dish every time they order it.