Discovering the ‘Science’ in Data Science – Part 2
Updated: Aug 15, 2021
In our previous article, we saw the basic idea of Data Science. In this article, let us consider the steps a scientist follows to study and understand the natural world, and then attempt to follow the same steps to study and understand data.
Typically, the following steps are undertaken when a scientist wants to explore the natural world.
Define Objective
This is where you get acquainted with the purpose of the scientific study. In most cases, this objective translates into a problem statement or question. Let us consider this example. Objective: Determine the shape of the Earth.
Acquire Information
At this stage, you come up with various ways to make observations. These observations must provide detailed information about the problem/objective. In our example, you look at the horizon from a seashore and observe its shape.
Formulate Hypothesis
This is the most crucial step. Here you make an educated guess about the solution or the answer to the question. How can you do this? You employ logical reasoning to make a statement that can be tested, and either proved or disproved, by an experiment. Remember, a hypothesis is an educated guess based on observations and the inferences drawn from them. In our example, you observed that there is a curvature in the horizon. Hence the hypothesis: the Earth is spherical in shape.
Conduct Experiment
Here you may need engineering skills to design an experiment that can measure the independent and dependent variables accurately. Einstein was famous for his ‘thought experiments’. If interested, you can read more about his famous five thought experiments here. Experimentation is the fuel that drives the machine. In our example, let’s keep things simple by observing and recording a lunar eclipse with the naked eye on a clear night. It will clearly show the spherical shadow of the Earth on the Moon. To be more certain, let’s measure this phenomenon in both hemispheres and at widely separated points on the Earth.
Analyze Results
I believe this is the most exciting part of the process – ‘the proof of the pudding is in the tasting’. However, a word of caution: in any scientific endeavor, one must have the courage to accept both success and failure. At this point, you review the measured values, use mathematical or statistical models to make sense of the data, and tabulate it in an understandable manner. If the tabulation and subsequent analysis are inconclusive, you decide whether further experimentation is needed, iterating between experimentation and analysis until you get conclusive results. In our example, we review the multiple measurements of the lunar eclipse made across the globe and record the finding that the shadow is spherical in every one of them. Additionally, we could diagram the data, analyze photographs, etc., to support our analysis.
Draw Conclusion
If a scientifically arrived-at conclusion is to stand up to scrutiny, it is critical that every step above is not just followed but followed without bias – personal or professional. From the analysis, an unbiased and obvious conclusion should then emerge naturally. Here we either prove or disprove the hypothesis and support the conclusion with our analyses. The way the hypothesis is framed and the way the experiment is conducted, and the data collected, tabulated, and analyzed, determine both the success of the experiment and the credibility of the conclusion. In our example, this is where we would proclaim that the Earth is spherical in shape. Note: the simplistic method of measurement we used indicates the Earth is spherical. In reality, the Earth is an ‘oblate spheroid’, and we would need more advanced techniques to prove that.
Data Science (Scientific) Method
Now that we know how science is applied to study the natural world, let’s attempt to apply the same method to a Data Science problem and see how far we can go. Let’s take the universal problem of every CEO: boosting sales. We will review an existing solution and apply the above steps to see if they fit.
Boosting Sales using Machine Learning, written by Per Harald Borgen, is the article we will refer to. The article covers two important concepts in Machine Learning: Natural Language Processing (NLP) and Prediction.
1. Define Objective
The key objective here is to target the right leads, which improves conversion and hence boosts sales. Typically, the objective for finding a solution to a business problem will be outlined in a few sentences and may sometimes be ambiguous. It is extremely critical to study and understand the objective, and then ask the right questions to break the problem definition into multiple smaller parts. Having the required domain knowledge is a big advantage, since it helps in asking the relevant questions.
2. Acquire Information
This involves data capture, collection, and preparation for downstream use. As mentioned in the article, the company information was extracted from a website through an API, although some preparatory work went into coming up with the input URL. More often than not, data integration is hard due to a lack of standards, and it is extremely difficult to templatize. In our example, the data size is not that large. However, large-scale data extraction and preparation require proper data engineering practices; especially when dealing with big data, an advanced technology stack is needed to build a pipeline that processes the data. In the article, the author also talks about data cleaning. This is a critical step that helps improve data quality: we get rid of superfluous text and other noise in the data. Tokenization, stemming, and removal of stop words are common NLP steps to clean data and improve its quality.
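To make these cleaning steps concrete, here is a minimal, dependency-free sketch of tokenization, stop-word removal, and (very crude) stemming. The stop-word list and suffix rules below are illustrative assumptions, not what the article's author used; a real pipeline would rely on a library such as NLTK or spaCy for proper stemming and stop-word lists.

```python
import re

# A tiny illustrative set of English stop words; a real pipeline would use
# a much fuller list (e.g. from NLTK or spaCy).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "for", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Very crude suffix stripping; libraries ship proper stemmers."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

tokens = clean("The company is building analytics tools for retailers")
# tokens -> ['company', 'build', 'analytic', 'tool', 'retailer']
```

The cleaned tokens, rather than the raw description text, are what gets fed into the vectorization step described next.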
Next, the data is transformed from text to vectors to create a numeric representation that is more suitable for use in prediction algorithms. Vectorization of words/phrases/sentences produces numeric data that is easy to use in mathematical operations. Gathering and processing data to make it suitable for machine learning algorithms also takes a lot of time. This is important to understand, as people often (erroneously) think that Data Science is just about using algorithms for classification and prediction, not about data preparation.
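As a rough illustration of vectorization, here is a minimal bag-of-words sketch in plain Python. The documents and vocabulary here are made up for illustration; real pipelines typically use a library vectorizer such as scikit-learn's `CountVectorizer` or `TfidfVectorizer`.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each unique token across the corpus to a column index."""
    vocab = sorted({token for doc in documents for token in doc.split()})
    return {token: i for i, token in enumerate(vocab)}

def vectorize(document, vocab):
    """Turn one document into a term-count vector over the vocabulary."""
    counts = Counter(document.split())
    return [counts.get(token, 0) for token in vocab]

docs = ["cloud analytics platform", "retail analytics software"]
vocab = build_vocabulary(docs)        # {'analytics': 0, 'cloud': 1, ...}
vectors = [vectorize(d, vocab) for d in docs]
# vectors -> [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```

Each description becomes a fixed-length numeric vector, which is exactly the form a prediction algorithm can consume.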
3. Formulate Hypothesis
In our example, we want to pick the companies that are better leads than others and thus have a higher chance of converting into customers. Since we have the company descriptions, how would we go about figuring out which companies would be the better leads? This leads us to the hypothesis: given a company description, we can predict the likelihood of that company being a potential customer. The work involved is to build a classifier.
4. Conduct Experiment
This happens to be one of the most interesting steps in the Data Science journey, and it generates the most value if done right. Understanding which problems can be solved by which algorithms gives you a head start. You will have to test multiple algorithms and pick the best-performing one. Typically, you divide the available data into a ‘training dataset’ and a ‘test dataset’. Going back to our example, the author split the data 70% for training and 30% for testing. Since this is a classification problem, a Random Forest algorithm was picked. A model is trained using this algorithm on the training dataset and then evaluated against the test dataset. The training phase is analogous to the experimental setup required for conducting an experiment in the natural world; the testing phase is analogous to conducting the experiment and recording the results.
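The 70/30 split described above can be sketched in a few lines. This is a stand-alone illustration using only Python's standard library and placeholder data; in practice (and most likely in the referenced article's workflow) one would use scikit-learn's `train_test_split` together with a classifier such as `RandomForestClassifier`.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the records and split them into train and test sets."""
    shuffled = records[:]                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed for reproducibility
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Stand-ins for (vectorized description, label) pairs.
data = list(range(100))
train, test = train_test_split(data)           # 70 records vs. 30 records
```

Shuffling before splitting matters: if the records arrive sorted (say, by industry), a naive head/tail split would train and test on systematically different companies.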
5. Analyse Results
At this point, you review the predictions made by the model on the test data and determine the level of accuracy. For a classifier, it is better to use a confusion matrix to measure the performance of the model. If you are not satisfied with the results, you can iterate steps 4 and 5 with different algorithms and parameters. The model is then deployed into a production environment, where it faces real data and makes predictions. The accuracy of these predictions is measured to understand whether the expected outcomes are realized. In our example, the author tweaked the algorithm's parameters to reach an acceptable level of accuracy.
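For a binary classifier like the one in our example, a confusion matrix can be computed directly from the actual and predicted labels. Here is a minimal sketch; the labels below are made up for illustration, and a library such as scikit-learn provides this via `confusion_matrix`.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (0/1)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}

# Illustrative labels: 1 = potential customer, 0 = not a potential customer.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)   # 0.75 for these labels
```

The matrix shows not just how often the model is right, but how it is wrong: a false positive here means chasing a lead that will never convert, while a false negative means missing a real customer.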
6. Draw Conclusion
Just like experimenting in the natural world, here too you prove or disprove the hypothesis based on the results. If the model can maintain good accuracy when it faces real data over a period of time, the hypothesis is proved. If not, it is disproved and it is back to the drawing board. As you may have noticed by now, the steps undertaken to solve a Data Science problem are very similar to how a scientist would undertake a scientific inquiry into the natural world. To some extent, the techniques used for processing and analyzing data are the same in both cases. The IT world is slowly moving from a ‘programming’ culture to a ‘learning’ culture. And the learning is not just for humans but for machines too. Machines have evolved from simple transistors and capacitors to learning, and eventually to thinking. In the future, machines will be intelligent enough to collaborate with other machines and with humans to develop collective intelligence, very similar to how human beings have evolved through learning over thousands of years. Humans and machines could then collaborate to solve complex unsolved mysteries such as the origin of life and the origin of the universe. This is the promise of the ‘SCIENCE’ in Data Science.
Also read our blog on Discovering the ‘Science’ in Data Science – Part 1