Introduction to Data Science (Notes)

Data science, also known as data-driven science, is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining (6).

Data Science is a superset of the fields of statistics and machine learning (1). The DSI (Data Science Initiative, 2015) website gives an idea of what Data Science is:

“This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and interdisciplinary applications.”

Data Science vs Statistics:

According to the Data Science Association’s “Professional Code of Conduct” (2):

“Data Scientist” means a professional who uses scientific methods to liberate and create meaning from raw data.

“Statistics” means the practice or science of collecting and analyzing numerical data in large quantities.

Opinions differ on how the two relate. Some say that Data Science is nothing more than a re-branding of statistics. Karl Broman, Univ. Wisconsin (3), for example, says:

“When physicists do mathematics, they don’t say they’re doing number science. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics. … You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term ‘statistics’.”

On the other hand, some scientists say that Data Science is a superset of Statistics. Vincent Granville, at the Data Science Central Blog (4), says:

“Data Science without statistics is possible, even desirable”

Andrew Gelman, Columbia University (5), says:

“Statistics is the least important part of data science”

Data Scientist (n.): A person who is better at statistics than any software engineer and better at software engineering than any statistician.

The activities of Greater Data Science (GDS) are classified into six divisions (1):

1. Data Exploration and Preparation (exploring basic properties and unexpected features, finding and fixing anomalies and artifacts in datasets)
2. Data Representation and Transformation (representing data received in different formats or from different sources in a common format)
3. Computing with Data (using languages such as R and Python to perform computations on data)
4. Data Modeling (defining the models, properties and parameters used for data analysis)
5. Data Visualization and Presentation (representing data with plots, histograms and charts so that the user can easily extract useful information from datasets)
6. Science about Data Science (data scientists do science about data science when they identify commonly occurring analysis workflows)
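
To make the first five divisions concrete, here is a minimal Python sketch that runs through them on a tiny made-up dataset. Everything in it is an assumption for illustration only: the data (hours studied vs. test score), the names CSV_SOURCE, JSON_SOURCE, load_records, clean, fit_line and text_bar_chart, and the choice of a straight-line fit as the model. It uses only the Python standard library and is not meant as a reference workflow.

```python
import csv
import io
import json
import statistics

# Hypothetical raw inputs: the same kind of record arriving in two formats.
# The CSV contains one missing score and one obviously wrong score (640).
CSV_SOURCE = "hours,score\n1,52\n2,55\n3,\n4,61\n5,640\n"
JSON_SOURCE = '[{"hours": 6, "score": 67}, {"hours": 7, "score": 70}]'


def load_records(csv_text, json_text):
    """Representation and transformation: bring both sources into one list of dicts."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["score"]) if row["score"] else None
        records.append({"hours": float(row["hours"]), "score": score})
    for item in json.loads(json_text):
        records.append({"hours": float(item["hours"]), "score": float(item["score"])})
    return records


def clean(records, max_score=100.0):
    """Exploration and preparation: drop missing and out-of-range scores."""
    return [
        r for r in records
        if r["score"] is not None and 0 <= r["score"] <= max_score
    ]


def fit_line(records):
    """Modeling: least-squares fit of score = slope * hours + intercept."""
    xs = [r["hours"] for r in records]
    ys = [r["score"] for r in records]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def text_bar_chart(records, width=40):
    """Visualization: one text bar per record, scaled to the largest score."""
    top = max(r["score"] for r in records)
    return "\n".join(
        f"{r['hours']:>4.0f}h | {'#' * round(width * r['score'] / top)} {r['score']:.0f}"
        for r in records
    )


# Computing with data: run the steps in order and report the results.
records = clean(load_records(CSV_SOURCE, JSON_SOURCE))
slope, intercept = fit_line(records)
mean_score = statistics.fmean([r["score"] for r in records])
print(f"kept {len(records)} records, mean score {mean_score:.1f}")
print(f"fitted line: score = {slope:.2f} * hours + {intercept:.2f}")
print(text_bar_chart(records))
```

With this made-up data, the cleaning step drops the row with the missing score and the row with the score of 640, leaving five records, and the fitted line comes out as score = 3.00 * hours + 49.00. In practice each step would usually rely on dedicated libraries, but the order of the steps mirrors the divisions listed above.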

Conclusion:

Data Science is the study of tools and techniques for analyzing large amounts of data to extract useful insights and information. In my opinion, it is a mixture of statistical and computing methods. It involves six key activities: data exploration and preparation, representation and transformation, computing with data, modeling, visualization and presentation, and science about data science. The scope and impact of this science will expand enormously in the coming decades as scientific data and data about science itself become ubiquitously available.

References:

(1) “50 Years of Data Science” by David Donoho, September 18, 2015, version 1.0

(2) http://www.datascienceassn.org/code-of-conduct.html

(3) https://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/

(4) http://www.datasciencecentral.com/profiles/blogs/data-science-without-statistics-is-possible-even-desirable

(5) http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/

(6) https://en.wikipedia.org/wiki/Data_science