I had a very interesting conversation with my children yesterday. One of them asked me to define “big data”, and from there we discussed the whole concept of “data science”. These terms have become very common in the last few years, yet many people struggle to understand what they actually mean. This is by no means a minor issue, since many companies are actively looking for new employees who have a background in big data and data science. If these terms are not well defined, the result is a great deal of confusion and disappointment for both the individual seeking work and the potential employer.

Let me start by saying that the whole purpose of big data and data science is to answer questions. Everybody has different questions, and these questions depend on the particular business or set of activities that a person is engaged in. Understandably, the managers of Walmart have different questions from the managers of a major hospital. There are some universal topics, such as how to manage inventory and how best to compensate a good employee. On the other hand, Walmart deals with customers seeking specific products to purchase, whereas hospitals deal with individuals who have an illness and are seeking appropriate care.

One might actually consider it strange that a person with experience in big data and data science could potentially work at both institutions. This is possible because, ultimately, data is data: a set of points of information that are collected in order to be analyzed and then used to answer specific questions. The types of analysis are quite universal. So the issue will often be how to formulate the question in such a way that it can be answered by the universal approaches of data science.

Ultimately, what people want is new knowledge. They want to know things they did not know before. They want to know things that their competition does not know. The frustrating part is that the answers to the questions being asked may often be hidden in trillions of pieces of information that are collected, sometimes on a daily basis. Sorting through this information and finding the answers to the various questions being posed is impossible without high-powered computers and advanced algorithms for doing the analysis. That is where the data scientist comes in. The purpose of the data scientist is to use a set of tools that allow him or her to process a large quantity of data and to generate a set of summaries. These summaries may be text or visuals, such as graphs and infographics, and they present the necessary answers in an intuitive and easy-to-understand way.

I have delayed defining the term “data science” on purpose. This is a very difficult term to define, and many people have their own definition of the field. Once again, this can lead to significant confusion. A company or hospital can be looking for someone to analyze its huge quantities of data and actually hire someone who is formally recognized as a data scientist. However, when the data scientist is actually presented with the database and computer system that holds all of the data, the data scientist may take a step back and say “Oh, I have no experience with that specific system”. How can this happen?

Imagine you need someone to transport an item from one location to another. You advertise for a driver. An individual comes and states that he is a driver and can transport the object. You then present the driver with your truck and ask him to take the package. The driver steps back and says that he has no experience driving a truck. He can drive a regular car and even a minivan. But no trucks. Did the driver lie? No. Simply put, the term “driver” is so broad that the employer must be more specific in the description of the task.

The same goes for anyone looking for a data scientist. A data scientist can have a tremendous amount of experience working with huge systems. But it could easily be that the data scientist’s experience is with systems that are totally different from the one a specific company or hospital is using. This makes it all the more difficult to find the proper people to do this kind of work. Generally speaking, a data scientist has to have a good general background in various types of databases, various types of analysis tools, different types of presentation tools and, as time goes on, a good level of comfort with advanced learning systems like IBM Watson. As you can see, this is a huge skillset to master. This is also one of the reasons why data scientists are very hard to find and very expensive to employ.

In a hospital setting, it is basically a requirement to have at least one person whose sole job is to constantly review new data and to extract from it new, practical and usable knowledge. For example, the data scientist in a hospital might determine that there is an increase in the number of infections following surgeries that are done in a specific operating room. This type of information is incredibly difficult to isolate without advanced data collection and analysis tools. However, the data scientist is specifically experienced in doing this type of extraction. And understandably, knowing that a particular operating theater is the source of infections is critical for the well-being of the entire hospital. So the data scientist is a critical component of the normal functioning of the hospital.
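To make the operating room example a little more concrete, here is a minimal sketch in Python of what the simplest version of such an analysis might look like. Everything in it is an assumption made for illustration: the room names, the handful of made-up surgery records, the function names and the cutoff of 1.5 times the overall infection rate. A real hospital would have far more data and would need proper statistical methods before drawing any conclusions.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration:
# (operating room, whether an infection followed the surgery)
surgeries = [
    ("OR-1", False), ("OR-1", False), ("OR-1", True), ("OR-1", False),
    ("OR-2", False), ("OR-2", False), ("OR-2", False), ("OR-2", False),
    ("OR-3", True), ("OR-3", True), ("OR-3", False), ("OR-3", True),
]


def infection_rates(records):
    """Return {room: infections / surgeries} for each operating room."""
    totals = defaultdict(int)
    infections = defaultdict(int)
    for room, infected in records:
        totals[room] += 1
        if infected:
            infections[room] += 1
    return {room: infections[room] / totals[room] for room in totals}


def rooms_above_overall(records, factor=1.5):
    """Return rooms whose infection rate exceeds `factor` times the overall rate.

    The factor of 1.5 is an arbitrary cutoff chosen for this sketch.
    """
    overall = sum(1 for _, infected in records if infected) / len(records)
    rates = infection_rates(records)
    return sorted(room for room, rate in rates.items() if rate > factor * overall)


rates = infection_rates(surgeries)
flagged = rooms_above_overall(surgeries)
print(rates)
print(flagged)
```

With this made-up data, only the third operating room is flagged. The point is not the arithmetic, which is trivial here, but that the question ("which room has an unusual infection rate?") had to be stated precisely before the data could answer it.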

What tends to happen is that an employer finds someone with solid, broad experience with various databases and analysis tools. If the hospital uses a different set of tools, then it is probably worthwhile to send the data scientist for formal training in the specific systems that the employer uses. After a few weeks or months, the data scientist will be capable of applying all of his or her knowledge to the unique system that the employer uses. This is a win-win situation that ends up producing a unique individual or team of data scientists who have a complete picture and control of the data being produced by the employer. With such a “secret weapon”, the employer can outshine the competition with minimal additional investment in resources.

Data scientists can also train an institution’s staff to think about their own data in a different way. The data scientist can get the staff to understand that data, on its own, is useless confetti. But if the staff can learn what questions to ask and how to explore their own environment, the data scientist can then produce pointed answers that are of tremendous value to the staff. As the staff begin to realize that they can get answers to questions that have never been answered before, this generates a level of excitement and satisfaction in the workplace. Once again, this is a win-win situation where the employees feel far more adept at their own tasks, and the employer gets far more productivity out of his or her people.

It is truly critical that any institution that generates significant data comes to understand that the analysis of this data requires much more than sorting some columns in an Excel spreadsheet. In time, I hope, more and more companies will appreciate the critical need for a data scientist to keep up a steady flow of new knowledge, derived from the company’s own databases. Hospitals that make maximal use of data science will be able to identify the various problems that limit their success. The appropriate use of data science is nothing less than revolutionary. But the key is to be “appropriate”. It’s not enough to throw data at the wall and see what sticks. Key members of the staff need to work in tandem with the data scientist to determine the most important questions to ask, so that they can benefit from the answers once they are delivered.

I hope this explanation makes the whole concept behind big data and data science a bit clearer. If not, you are free to ask me further questions. I would also suggest that reading up on these concepts, via the huge number of online articles on these topics, would definitely be worth your time.

Thanks for listening

My website is at http://mtc.expert