Category: Artificial Intelligence

  • Review Rating Classification System Using Artificial Intelligence

Our research article, "Multi Class Review Rating Classification Using Deep Recurrent Neural Networks", was published in the international journal Neural Processing Letters (Impact Factor: 2.591) on 15 October 2019. In this post, we briefly discuss the objectives, short summary, key contributions, and main findings presented in the article.

    Short Summary – Abstract:

This paper presents a gated-recurrent-unit (GRU) based recurrent neural network (RNN) architecture, titled DSWE-GRNN, for the multi-class review rating classification problem. Our model incorporates domain-specific word embeddings and does not depend on the reviewer's information, because we usually don't have enough reviews from the same user to measure that user's leniency towards a specific sentiment. The RNN-based architecture captures the hidden contextual information from the domain-specific word embeddings to effectively and efficiently train the model for review rating classification. In this work, we also demonstrate that the downsampling technique for data balancing can be very effective for the model's performance. We have evaluated our model over two datasets, i.e. the IMDB dataset and the Hotel Reviews dataset. The results demonstrate that our model's performance (accuracy) is comparable with, or even better than, the four baseline methods used for sentiment classification in the literature.

    https://link.springer.com/article/10.1007/s11063-019-10125-6

    Problem Statement:

Nowadays there are many online platforms, such as weblogs, Facebook, and Instagram, where users express their opinions or sentiments about products, services, applications, or any other type of entity. The problem is that we, as humans, cannot analyze this large amount of data. Thus, there is a need to design and train a deep neural network based model which can automatically classify textual data (reviews) into multiple classes, i.e. numbered ratings (1-5 stars). The model predicts star ratings from the textual reviews, so the rating system is reviewer-independent while still making accurate predictions.

    Data Description:

The authors used two benchmark datasets, i.e. the IMDB dataset and the Hotel Reviews dataset. The IMDB dataset consists of 50,000 data samples (reviews) divided into ten classes, whereas the Hotel Reviews dataset comprises 14,895 data samples (reviews) divided into five classes. A data sample is a movie review in the IMDB dataset and a hotel review in the Hotel Reviews dataset.

    Proposed Model:

The authors propose a model titled Domain Specific Word Embeddings with Gated Recurrent Neural Networks (DSWE-GRNN), which combines gated recurrent units (GRUs), domain-specific word embeddings (DSWE), reviewer/author independence, and a down-sampling technique for data balancing. The motivation behind using the GRNN architecture is that it can capture the contextual information from the given text while training the model using word embeddings. Moreover, the GRNN model is time-efficient compared to other recurrent neural network architectures such as the Long Short Term Memory network (LSTM). The presented model does not depend on reviewer-specific attributes.
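To make the architecture concrete, here is a minimal sketch of a GRU-based multi-class rating classifier written with the keras package for R. This is an illustration of the general idea only, not the authors' implementation: the vocabulary size, embedding dimension, sequence length, and layer sizes below are assumed values, and in the actual model the embedding layer would carry the domain-specific word embeddings.

[r]
# Minimal sketch (not the paper's code): GRU-based review rating classifier.
# All sizes below are assumed values for illustration.
library(keras)

vocab_size    <- 20000  # assumed vocabulary size
embedding_dim <- 100    # assumed word-embedding dimensionality
max_len       <- 200    # assumed maximum review length (in tokens)
num_classes   <- 5      # e.g. 1-5 star ratings

model <- keras_model_sequential() %>%
  # embedding layer; in DSWE-GRNN this would be initialized with
  # the domain-specific word embeddings
  layer_embedding(input_dim = vocab_size, output_dim = embedding_dim,
                  input_length = max_len) %>%
  layer_gru(units = 64) %>%                                 # gated recurrent unit layer
  layer_dense(units = num_classes, activation = "softmax")  # one output per rating class

model %>% compile(
  optimizer = "adam",                      # assumed optimizer choice
  loss      = "categorical_crossentropy",  # standard multi-class loss
  metrics   = c("accuracy")
)
[/r]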

[Figure: DSWE-GRNN model architecture, from "Multi Class Review Rating Classification Using Deep Recurrent Neural Networks"]

    Baseline Methods For Comparison:

The following benchmark baseline methods were used for the evaluation of the proposed model.

1. WE-SimpleNN: word embeddings are used as input to a simple feed-forward neural network.
2. CNN: the convolutional neural network (CNN), widely considered a state-of-the-art composition architecture for text sentiment classification.
3. LSTM: the Long Short-Term Memory network (LSTM), a type of recurrent neural network (RNN), was also implemented and its results compared with the proposed model.
4. CNN-LSTM: a combination of a convolutional neural network (CNN) and a Long Short-Term Memory network (LSTM) was also implemented and its results compared with the proposed model.

    Results and Discussion:

Comparing the accuracy of each baseline method with the proposed model (DSWE-GRNN) on each dataset reveals some interesting patterns. WE-SimpleNN and the convolutional neural network (CNN) give similar accuracies on the IMDB dataset. The LSTM and CNN-LSTM methods gave relatively poor accuracy on the Hotel Reviews dataset compared with the IMDB dataset, because the Hotel Reviews dataset contains fewer domain-specific keywords to steer the network towards the possible decision classes. The proposed method outperforms all of the baseline methods on both datasets; its accuracies appear in the last row of the table below.

Method      | IMDB Dataset (Accuracy) | Hotel Reviews Dataset (Accuracy)
WE-SimpleNN | 0.8675                  | 0.8024
CNN         | 0.8645                  | 0.7877
LSTM        | 0.8360                  | 0.8092
CNN-LSTM    | 0.8268                  | 0.7843
DSWE-GRNN   | 0.8780                  | 0.8132

    Conclusion:

The authors have introduced a gated recurrent neural network based model (DSWE-GRNN) with domain-specific word embeddings. The proposed model encodes data samples into domain-specific word embeddings, which act as feature vectors for training the gated recurrent neural network (GRNN). The proposed model was evaluated over two benchmark datasets, and the results were compared with four baseline methods. The results, as presented in the table above, clearly demonstrate that the proposed model achieves state-of-the-art performance on both datasets. Further analysis shows that:

1. The gated recurrent neural network (GRNN) model efficiently encodes the training samples by incorporating contextual information into the model.
2. Domain-specific word embeddings and the downsampling technique for data balancing dramatically boost the performance of the model.
3. The GRNN-based model is time-efficient compared with other recurrent neural network models such as the LSTM.
4. The proposed model (DSWE-GRNN) outperforms the baseline models used for the multi-class review rating classification problem.
5. The proposed model will help build a more intelligent review rating system that is machine- and data-dependent rather than reviewer/user-dependent.
  • 10 Data Science Career Paths

Data Science is an interdisciplinary field of scientific algorithms, tools, and technologies for extracting useful insights from data. This blog post lists 10 data science career paths along with their job responsibilities and required skills, tools, and technologies.

    Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.

    http://blog.codoplex.com/introduction-data-sciences-notes/

    According to Zeeshan-ul-Hassan Usmani, the following is a list of 10 data science career paths with required skills, tools and technologies.

    1. Data Science Developer
    2. Machine Learning Developer
    3. Data Scientist
    4. Data Engineer
    5. Data Analyst
    6. Data Scavenger
    7. Data Visualization Expert
    8. Data Story Teller
    9. Big Data Expert
    10. Data Science Researcher

    10 Data Science Career Paths

#  | Title                      | Responsibilities                                                                                       | Required Skills
1  | Data Science Developer     | To implement algorithms using programming languages                                                    | Python, R, SQL, Jupyter Notebook, RStudio, etc.
2  | Machine Learning Developer | Machine translation, deep learning, computer vision                                                    | Scikit-learn, NLP, TensorFlow, feature extraction, classification, etc.
3  | Data Scientist             | To implement models for future predictions                                                             | A/B testing, causal inference, correlation, pattern matching, etc.
4  | Data Engineer              | To make sure that the data is ready to be fed into ML algorithms/models                                | Data pipelines, automated feeding of data to ML algorithms, ETL, SQL, etc.
5  | Data Analyst               | To clean and analyze data                                                                              | Data cleaning, data wrangling, data classification, SQL, ETL, etc.
6  | Data Scavenger             | To collect and prepare data                                                                            | Data collection, web scraping, data repositories, data extraction, Selenium, urllib, Scrapy, etc.
7  | Data Visualization Expert  | To display data in a meaningful way                                                                    | Seaborn, Tableau, Plotly, Matplotlib, QlikView, etc.
8  | Data Story Teller          | To present the data as a useful story for evaluating its commercial analytical value                   | R, Matplotlib, data visualization, etc.
9  | Big Data Expert            | To extract and analyze huge amounts of data                                                            | Scala, MongoDB, Apache Spark, Hadoop, Weka, R, etc.
10 | Data Science Researcher    | To find new possibilities, algorithms, tools, and technologies to utilize data efficiently and effectively | Math, statistics, linear algebra, calculus, data-driven policy making

  • An Introduction To Deep Learning – AI

Deep learning is a branch of machine learning based on artificial neural networks, which automatically adjust their learned rules/patterns whenever they make a wrong prediction.

Following are some definitions of deep learning from various sources.

    Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.

    https://machinelearningmastery.com/what-is-deep-learning/

    Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.

    https://en.wikipedia.org/wiki/Deep_learning

    Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost.

    https://www.mathworks.com/discovery/deep-learning.html

    The Concept – Deep Learning

• A deep learning model consists of more than two layers of artificial neurons (nodes)
• Each node holds weighted numbers which contribute to the overall calculation/interpretation of the output from the given input
• Each node starts with random weights; the model adjusts them in small steps whose size is controlled by the Learning Rate
• The weights of each node are adjusted whenever the predicted value of the target class doesn't match the actual value of the class
• The process is repeated until either the defined number of epochs/iterations is completed or the model has learned the hidden parameters well enough to predict the target class with the maximum success rate (a minimal sketch of this weight-update loop follows this list)
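As a concrete, hand-rolled illustration of these bullet points (a toy sketch in base R, not a full deep network), here a single artificial neuron learns the hidden rule y = 2x by gradient descent; the learning rate controls the size of each weight correction:

[r]
set.seed(42)
x <- runif(100)   # input samples
y <- 2 * x        # target values (the hidden rule is y = 2x)

w  <- runif(1)    # the weight starts at a random value
lr <- 0.1         # learning rate: size of each correction step

for (epoch in 1:50) {             # repeat for a fixed number of epochs
  y_pred <- w * x                 # forward pass: current prediction
  error  <- y_pred - y            # how wrong the prediction is
  grad   <- mean(2 * error * x)   # gradient of the mean squared error w.r.t. w
  w      <- w - lr * grad         # adjust the weight against the gradient
}
w  # after training, w is close to the true value 2
[/r]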

    Types of Deep Neural Networks

    • Feed Forward Neural Networks (FFNNs)
    • Convolutional Neural Networks (CNNs)
    • Recurrent Neural Networks (RNNs)
    • Long Short Term Memory Network (LSTM)
    • Gated Recurrent Neural Network (GRNN)
    • Self Organizing Maps (SOMs)
    • Boltzmann Machines
    • Auto Encoders

  • An Introduction To Machine Learning – AI

    Machine learning is a branch of Artificial Intelligence which deals with providing systems the ability to learn and improve from data samples without explicit programming instructions.

    Following are some definitions from the literature.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

    https://expertsystem.com/machine-learning-definition/

    Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence.

    https://en.wikipedia.org/wiki/Machine_learning

    Machine learning techniques are used to automatically find the valuable underlying patterns within complex data that we would otherwise struggle to discover. The hidden patterns and knowledge about a problem can be used to predict future events and perform all kinds of complex decision making.

    https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0

    The Concept:

Generally, computers solve a problem by running a set of predefined rules/instructions. We have the data (input) and a set of rules/instructions (program), and we get the result (output). This is called traditional programming. In machine learning, we have the data (input) and we also know the result (output), and we train the system to learn the hidden rules so that it can predict the output for an unknown input using those learned rules.
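A small illustration of this contrast in base R (the Celsius-to-Fahrenheit rule is just a stand-in for a "hidden rule"):

[r]
# Traditional programming: we write the rule ourselves.
to_fahrenheit <- function(celsius) celsius * 9 / 5 + 32
to_fahrenheit(25)  # 77

# Machine learning: we give the system inputs and outputs,
# and it learns the rule (a linear model recovers the
# coefficients 32 and 1.8 from example data).
celsius    <- c(0, 10, 20, 30, 40)
fahrenheit <- c(32, 50, 68, 86, 104)
learned_rule <- lm(fahrenheit ~ celsius)
coef(learned_rule)                               # approx. intercept 32, slope 1.8
predict(learned_rule, data.frame(celsius = 25))  # predicts an unseen input: 77
[/r]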

    5 Steps Process of Machine Learning

1. Data Collection – data is collected from multiple sources
2. Data Preparation – removing unwanted fields and making sure the data is ready for analysis
3. Model Training – the model is trained on the prepared data
4. Evaluation – the results are analysed and evaluated to check the performance of the model
5. Tuning – the model features are tuned/adjusted to maximize performance (a compact end-to-end sketch in R follows this list)
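Here is a toy-sized walk-through of the five steps in base R; the built-in iris data set stands in for collected data, and the model and features are arbitrary choices for illustration:

[r]
data(iris)                                       # 1. Data Collection (built-in data set)
iris_prepared <- na.omit(iris[, c("Sepal.Length", "Petal.Length")])  # 2. Data Preparation
model <- lm(Petal.Length ~ Sepal.Length,
            data = iris_prepared)                # 3. Model Training
summary(model)$r.squared                         # 4. Evaluation: goodness of fit
model_tuned <- lm(Petal.Length ~ poly(Sepal.Length, 2),
                  data = iris_prepared)          # 5. Tuning: adjust the features
summary(model_tuned)$r.squared
[/r]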

    Machine Learning Types/ Approaches

• Supervised Learning (we have inputs and outputs, i.e. labelled data, and the model learns the rules; a tiny R example follows this list)
• Unsupervised Learning (we only have the inputs and no outputs, i.e. unlabelled data, and the model learns the hidden structure)
• Semi-supervised Learning (we have a mixture of labelled and unlabelled data)
• Reinforcement Learning (the model learns by trial and error, correcting its learned rules based on rewards and penalties as it repeats the process)
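To make the first two types concrete, here is a minimal sketch in base R (the linear model and k-means clustering are arbitrary example algorithms, applied to the built-in iris data):

[r]
data(iris)

# Supervised: inputs AND outputs (labels) are given; the model learns the mapping.
supervised <- lm(Petal.Length ~ Sepal.Length, data = iris)
coef(supervised)  # the learned rule: intercept and slope

# Unsupervised: only inputs are given; the model finds hidden structure.
# k-means groups the measurements into 3 clusters, with the labels withheld.
unsupervised <- kmeans(iris[, 1:4], centers = 3)
table(unsupervised$cluster, iris$Species)  # clusters largely recover the species
[/r]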

    Reference:

  • An Introduction to Artificial Intelligence – AI

Artificial intelligence is a field of computer science which deals with tools and techniques to train machines so that they can solve complex problems with the maximum success rate. These machines learn hidden parameters from the available dataset and then solve unknown problems based on the learned patterns.

    What is Artificial Intelligence?

    “In computer science, artificial intelligence, sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans.”

    Wikipedia

    “The study of how to make computers do things at which, at the moment, people are better.”

    Rich and Knight, 1992, 2009
• According to Tesler's theorem, "Intelligence is whatever machines haven't done yet", also quoted as "AI is whatever hasn't been done yet" and known as the AI Effect
• The AI field was founded on the assumption, presented at the Dartmouth Conference, that human intelligence "can be so precisely described that a machine can be made to simulate it"
• Some people consider AI a threat to humanity and a possible cause of mass unemployment

    Concept:

An artificially intelligent system learns the patterns in the data and forms general rules for the solution of a particular problem. Later, when a new instance of the problem occurs, it matches the patterns in the new instance against the learned patterns to predict the possible solution. If the predicted solution matches the actual/true solution, all is well; otherwise, the system optimizes/tweaks the learned patterns to correct its parameters, thus improving itself for future instances.

    Types of Artificial Intelligence Models:

• Search and Optimization: intelligently searching through many possible solutions
• Logic Based: mainly used for knowledge representation and problem solving
• Probability Based: mainly used when information is incomplete or uncertain
• Classification Based: matching patterns to classify data instances into different decision classes/categories (see the small example after this list)
• Artificial Neural Network Based Models: inspired by the neurons in the human brain; a learning algorithm adjusts the weights of the neurons until the system successfully predicts the possible solution based on the data
• Deep Feed Forward Neural Networks: neural networks with multiple hidden layers that can learn the hidden parameters in depth
• Deep Recurrent Neural Networks: mainly used for sequential data (words/characters in text, frames in video, etc.)
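As a small example of the classification-based approach, the following R snippet uses k-nearest-neighbour matching from the class package (an arbitrary choice for illustration) to classify iris flowers by matching new instances against learned patterns:

[r]
# k-nearest-neighbour classification with the 'class' package
# (install once with install.packages("class")).
library(class)
data(iris)
set.seed(1)
train_idx <- sample(nrow(iris), 100)         # random train/test split
train <- iris[train_idx, 1:4]                # training measurements
test  <- iris[-train_idx, 1:4]               # unseen instances to classify
predicted <- knn(train, test, cl = iris$Species[train_idx], k = 3)
mean(predicted == iris$Species[-train_idx])  # fraction classified correctly
[/r]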

    Sub Fields of Artificial Intelligence:

    • Evolutionary Computation
    • Vision
    • Robotics
    • Expert Systems
    • Speech Processing
    • Natural Language Processing
    • Machine Learning
    • Neural Networks
    • Deep Learning

    Applications of Artificial Intelligence:

    • Healthcare: predicting possible medicine
    • Automotive: self driving cars
    • Finance and Economics: fraud prevention
    • Cyber-security: spam filtration
    • Government: mass surveillance using face recognition
    • and many more…

  • Code Example To Generate Word Cloud Using R – Data Analysis

Word clouds help us understand and visualize the important keywords in a given textual data set. R is a powerful programming language used for data exploration and visualization. The following code snippet can be used to generate a word cloud in R.

[r]

# install required packages (run once)
install.packages("tm")            # text mining
install.packages("wordcloud")     # word cloud generation
install.packages("RColorBrewer")  # color palettes for the word cloud

library(tm)            # loading tm package
library(RColorBrewer)  # loading RColorBrewer package
library(wordcloud)     # loading wordcloud package

text_data <- read.csv("data.csv")          # reading data from the CSV file
text <- text_data$col_name                 # extracting the column 'col_name'
text_corpus <- Corpus(VectorSource(text))  # creating a corpus from the text
inspect(text_corpus)                       # to view the corpus data

# transformer that replaces a given pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
formatted_text <- tm_map(text_corpus, toSpace, "/")       # replacing '/' with space
formatted_text <- tm_map(formatted_text, toSpace, "@")    # replacing '@' with space
formatted_text <- tm_map(formatted_text, toSpace, "\\|")  # replacing '|' with space
formatted_text <- tm_map(formatted_text, content_transformer(tolower))      # converting to lowercase
formatted_text <- tm_map(formatted_text, removeWords, stopwords("english")) # removing stopwords
formatted_text <- tm_map(formatted_text, removePunctuation)                 # removing punctuation marks
formatted_text <- tm_map(formatted_text, stripWhitespace)                   # collapsing extra whitespace

# the following builds a table containing the word frequencies
text_tdm <- TermDocumentMatrix(formatted_text)
text_m <- as.matrix(text_tdm)
text_v <- sort(rowSums(text_m), decreasing = TRUE)
text_d <- data.frame(word = names(text_v), freq = text_v)
head(text_d, 10)  # visualize the first 10 entries

set.seed(1234)  # for reproducible word placement
# this call actually creates the word cloud
wordcloud(words = text_d$word, freq = text_d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

[/r]

    Word Cloud using R

  • Useful R Packages for Data Analysis

R is a powerful programming language used for exploring and analyzing data effectively. R provides many built-in functions for data analysis, and many additional R packages extend this functionality. Following are some useful R packages which can be installed for specific tasks.

    Twitter Data Analysis:

install.packages("rtweet")  # Twitter data collection and analysis; see https://rtweet.info

    Text Mining:

install.packages("tm")         # for text mining

install.packages("SnowballC")  # for text stemming

install.packages("wordcloud")  # word-cloud generator

install.packages("stopwords")  # for multilingual stop words

    Colors:

install.packages("RColorBrewer")  # to add colors

    Visualization:

install.packages("ggplot2")  # for data visualization functions

  • Useful R Functions – Exploratory Data Analysis

    R is a programming language used for statistical analysis and exploratory data analysis projects. According to the official website:

    R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. [source]

    Following are some useful R functions which can be used for data exploration and visualization.

    To read data from CSV file:

data_obj <- read.csv("data.csv")  # or readr::read_csv("data.csv") if the readr package is installed

In the above line, data_obj is the object in which your data will be saved, and data.csv is the file from which the data will be read.

    To get frequency of unique column values:

    some_variable <- as.data.frame(table(data_obj$col_name))

In the above line, some_variable is the variable in which you want to save the new data, data_obj is the object in which your raw data is saved, and col_name is the column name in your raw data.

    To sort the data in Highest to Lowest (Descending Order) values in a column:

    sorted_values_descending <- data_obj[order(-data_obj$col_name), ]

In the above line, sorted_values_descending is the variable in which you want to save the sorted data, data_obj is the object in which your raw data is saved, and col_name is the column used for sorting.

    To sort the data in Lowest to Highest (Ascending Order) values in a column:

    sorted_values_ascending <- data_obj[order(data_obj$col_name), ]

In the above line, sorted_values_ascending is the variable in which you want to save the sorted data, data_obj is the object in which your raw data is saved, and col_name is the column used for sorting.

    To get some rows from the data (top to bottom):

    some_rows <- head(data_obj, 10)

In the above line, some_rows is the variable in which you want to save the rows, data_obj is the object in which your raw data is saved, and 10 is the number of rows you want to get.

    To get some rows from the data (bottom to top):

    some_rows <- tail(data_obj, 10)

In the above line, some_rows is the variable in which you want to save the rows, data_obj is the object in which your raw data is saved, and 10 is the number of rows you want to get.

    To Create Bar Plot (using geom_col()) from 2 Columns using ggplot2 package:

g <- ggplot2::ggplot(data_obj, ggplot2::aes(x = col1, y = col2)) + ggplot2::geom_col() + ggplot2::xlab("x axis label here") + ggplot2::ylab("y axis label here") + ggplot2::ggtitle("Plot title here")

In the above line, data_obj is the object in which your raw data is saved, col1 is the column in data_obj used for the x axis, and col2 is the column used for the y axis. Please make sure that the ggplot2 package is installed before using this function.

    To Add Count on each Bar (geom_col()) in R using ggplot2 Package:

g <- g + geom_text(data = data_obj, aes(label = col_name, y = col_name), colour = "red", size = 2.5)

    Scatter Plot in R using ggplot2 package:

g <- ggplot(data_frame, aes(x = col_x, y = col_y)) + geom_point(col = "steelblue", size = 3)

# change the point color using a variable (the mapping must go inside aes())

g <- ggplot(data_frame, aes(x = col_x, y = col_y)) + geom_point(aes(col = col_y), size = 3)

    To Remove Legend from Plots in R using ggplot2 Package:

g <- g + theme(legend.position = "none")

    To Change Legend Title in R using ggplot2 Package:

g <- g + labs(col = "No of attacks")

    To Change Legend Labels and Colors in R using ggplot2 Package:

g <- g + scale_color_manual(name = "Legend Title", labels = c("NA", "High", "Low", "low", "Medium"), values = c(" " = "blue", "High" = "red", "Low" = "yellow", "low" = "yellow", "Medium" = "orange"))

    Make x-axis Label Texts Vertical in R using ggplot2 Package:

    g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1))

    To Draw Best Fitting Line from Scatter Plot in R using ggplot2 Package:

g <- g + geom_smooth(method = "lm", col = "firebrick")

# lm stands for linear model

    To Limit x-axis and y-axis in R using ggplot2 Package:

    g <- g + xlim(0, 100) + ylim(0,10)

    To Add Titles and Labels in R using ggplot2 Package:

g <- g + labs(title = "some title here", subtitle = "subtitle here", y = "y-axis label here", x = "x-axis label here", caption = "caption text here")

    To Customize Axis Labels in R using ggplot2 Package:

g <- g + scale_x_continuous(breaks = seq(0, 0.1, 0.01), labels = sprintf("%1.2f%%", seq(0, 0.1, 0.01))) + scale_y_continuous(breaks = seq(0, 1000000, 200000), labels = function(x){ paste0(x/1000, "K") })

    Bar Plot (using geom_bar()) to Pie Chart Using ggplot2 Package:

bp <- ggplot2::ggplot(data_obj, ggplot2::aes(x = "", y = col_name, fill = var_name)) + ggplot2::geom_bar(width = 1, stat = "identity", show.legend = TRUE) + ggplot2::xlab("x axis label here") + ggplot2::ylab("y axis label here") + ggplot2::ggtitle("Plot title here")

pie_chart <- bp + ggplot2::coord_polar("y", start = 0)

    To Create Data Frame from Variables in R:

var_data_frame <- data.frame(x = c("Male", "Female", "Children"), y = c(total_male, total_female, total_children))

var_data_frame is the variable in which the new data frame will be stored, x and y are the column names of the data frame, c() is used to create the values of each column, and total_male, total_female, total_children are variables holding the counts. Its output will be:

x          y
Male       120
Female     200
Children   500

    To Sum Values of a Column in R:

    total_col_values <- sum(data_obj$col_name, na.rm = TRUE)

na.rm = TRUE is used to ignore missing values.

    To Exclude Empty / Missing Values from Data Frame in R

data_obj <- data_obj[!(is.na(data_obj$col_name) | data_obj$col_name == ""), ]

    To Map Locations Data in R using ggplot2 Package:

pakistan <- map_data("world", "Pakistan")  # requires the maps package

map <- ggplot() + geom_polygon(data = pakistan, aes(x = long, y = lat, group = group), fill = "green", color = "black") + geom_point(data = pk_data, aes(x = Longitude, y = Latitude)) + xlab("") + ylab("") + ggtitle("Title here")

  • Limitations of Social Media Analysis for Participatory Urban Planning Process

In a previous post we discussed how a social media participatory process can help city designers, planners, and administrators in the decision-making process. In this post we discuss the limitations/shortcomings of the social media participatory process for urban planning. This post is a short summary of the paper titled 'Missing intentionality: the limitations of social media analysis for participatory urban design' by Luca Simeone.

    Objective:

The objective of this case study was to find the limitations of social media analysis for the participatory urban planning process. The author analysed what city inhabitants publish on their social media profiles to perceive what they think and how they live in the urban environment, and documented the shortcomings/limitations found during this social-media-based urban planning process.

    Data Set Used:

They collected data from four different social media channels, i.e. Facebook, Twitter, Foursquare, and Flickr. This data was then analyzed to identify the most congested areas by tracking the number of contributions originating from specific geographic locations.

    Methodology:

They proposed a method consisting of the following steps:

1. Collect data from social media channels
2. Apply multiple strategies (like text mining) to analyze this data
3. Choose the appropriate urban planning tasks, e.g. gauging users' feelings towards local government policies, or finding the best place to build a hostel for university students
4. Visualize the results to see the patterns
5. Make decisions based on the results

    Limitations:

1. Not all city inhabitants have equal access to the technologies and skills needed to post geo-located contributions
2. Access to social media data is restricted by privacy policies
3. Users lack the intention to take part in a participatory process

The author emphasizes that we analyse social media data without knowing users' intentions. A participatory process, however, implies that users should be willing to participate in the design process, or should at least know that their activities are being tracked for planning purposes. In the author's view, this is the main limitation of the social-media-based participatory process for urban planning.

    Conclusion:

Social media analysis can be used to support urban design and the decision-making process. But at present there are certain limitations, of which the users' lack of intentionality is the major shortcoming of this process.

    References:

    https://www.researchgate.net/publication/274890298_Missing_intentionality_the_limitations_of_social_media_analysis_for_participatory_urban_design

  • Social Media Participation in Urban Planning

Social media generates a huge amount of data every second, and this data can be used to make important decisions about any particular topic. This post is a short summary of a research paper titled 'SOCIAL MEDIA PARTICIPATION IN URBAN PLANNING: A NEW WAY TO INTERACT AND TAKE DECISIONS' by E. López-Ornelas, R. Abascal-Mena, and S. Zepeda-Hernández.

Textual analysis is performed on social media data to learn the sentiments (opinions) of people about a topic. This type of analysis is very important for organizations and institutions when making decisions. In this paper, the authors used the installation of a new airport in Mexico City as a case study to highlight the importance of conducting a study of this nature.

    Urban Planning and Social Media:

Urban planning deals with the efficient utilization of land and the design of the urban environment, including air and water infrastructure management. On the other hand, social media and online web-based communities are great sources for collecting public opinion about any topic. This paper studies the different steps/phases needed to extract public opinion from social media (Twitter) and analyses the public opinion about the new airport in Mexico City.

    Design Process For Information Extraction from Social Media:

This paper proposes a design process that will help urban planners analyze the information contained in social networks. To extract and analyse social media data, the authors proposed the following four steps:

1. Structure Analysis: used to identify information generators, the amount of available information, and the linkage between different information generators.
2. Sentiment Analysis: used to know the tendency of opinion about a particular topic.
3. Community Detection: important for simplifying the social network and identifying groups of users sharing the same information.
4. User Identification: used to identify the users who are actually generating the information (opinion) about the topic.

    Results:

This analysis shows that when the announcement was made about the installation of the new airport in Mexico City, there was a large concentration of opinions rejecting the project. A new analysis made two weeks after the announcement showed a large number of opinions approving the project.

    Future Work:

The authors of this paper used Twitter data to extract public opinion. As future work, they proposed analysing data from other social networking websites and comparing the results across them. This would also help identify which social networking website generates better results.

Moreover, mapping the locations of participants will help identify biased opinions and hence produce better results.

    Conclusion:

Social media can be used as a medium that supports participatory processes. It can be used in areas like urban planning to help make decisions based on public opinion.

    References:

    https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLII-4-W3/59/2017/isprs-archives-XLII-4-W3-59-2017.pdf

  • 5 Steps involved in Data Analysis Process

Data science deals with large amounts of data, and data scientists analyse that data to extract useful information from it. This data analysis process involves 5 steps (1). In this post we discuss those 5 steps and explore some of the challenges faced during each step.

In a previous post we concluded that data science is a mixture of computing methods and statistical methods. In both data science and statistics, the core objective is to analyse data. But in data science we automate some of the steps involved in the data analysis process, which is the major difference between the two fields.

    5 steps involved in data analysis process:

These 5 steps of the data analysis process are described in a paper titled 'Enterprise Data Analysis and Visualization: An Interview Study' by Sean Kandel and colleagues. It is an interview study in which the authors interviewed 35 data analysts from 25 organizations. They describe it as follows:

    “To better understand the enterprise analysts’ ecosystem, we conducted semi-structured interviews with 35 data analysts from 25 organizations across a variety of sectors, including healthcare, retail, marketing and finance. Based on our interview data, we characterize the process of industrial data analysis and document how organizational features of an enterprise impact it”

    The 5 data analysis steps mentioned in the paper are as follows:

    1. Discovery
    2. Wrangling
    3. Profiling
    4. Modeling
    5. Reporting

    Discovery:

The first step in the data analysis process is to discover/collect the data for analysis. Data can be gathered from multiple sources such as database tables, log files, spreadsheets, or online sources. The challenges in this phase include finding relevant data and interpreting certain fields in the database tables.

    Wrangling:

Once the data is collected, the next step is wrangling, or cleaning, the available data. Data manipulation and the integration of data obtained from multiple sources are the main tasks performed in this phase. Some of the issues data scientists face here are processing semi-structured data, e.g. data received from log files, and integrating data obtained from diverse sources.

    Profiling:

Before using the available data in any analysis, we need to make sure that there are no issues in it. Data may have quality issues like missing, erroneous, or extreme values which can affect the analysis results. In this phase, the data analyst makes sure that there are no anomalies in the data to be used in the analysis.
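As a small illustration of profiling, the following base R snippet checks a data frame (the built-in airquality data, used here purely as an example) for missing and extreme values:

[r]
data(airquality)
colSums(is.na(airquality))  # count missing values per column
summary(airquality$Ozone)   # min/max/quartiles reveal extreme values
boxplot(airquality$Ozone)   # visual check for outliers
[/r]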

    Modeling:

In this phase, the data analyst decides on the features, scale, and statistical method to be used for the analysis. Issues faced during this phase include selecting relevant features and the data-size/scaling limitations of data analysis tools.

    Reporting:

In this final step, the insights gained from the analysis are reported. Points to consider in this phase include effectively communicating the assumptions involved in the analysis, and the limitations of static reports (i.e. no interactive way to check the results).

    Conclusion:

The data analysis process involves 5 phases, namely discovery, wrangling, profiling, modeling, and reporting. Some analyses may exclude certain steps depending on their nature. Some of the issues faced by data analysts during each phase were also discussed in this post.

    References:

(1) 'Enterprise Data Analysis and Visualization: An Interview Study' by Sean Kandel and colleagues

  • Introduction to Machine Learning (Notes)

Machine learning is basically training machines (computers) by feeding them huge amounts of data, so that they can predict/extract useful information based on previously available data. For example, for a computer to recognize handwriting, we need to train it on a large number of different handwriting samples. Machine learning helps maximize profit and minimize cost by adding business intelligence to previously available data.

Machine learning evolved from pattern recognition; it applies algorithms that can learn from data and then make predictions, and it is closely related to computational statistics (thank you, Wikipedia). Examples of machine learning include character recognition in handwriting, facial recognition, and automatic spam filtering.

    “Machine Learning is about using the data you already have to make predictions. This sounds really fancy, but most of the time, the ‘prediction’ is really just a label,” says Hillary (Data Scientist).

Have you ever wondered why you see an ad related to a product you recently searched for? Were you amazed to see a recommendation list for your favorite clothing brand? This is machine learning. Companies, organizations, and brands keep track of the activities, behavior, likes, and dislikes of their customers to train machines (computers) in their back offices. Based on these data instances, they can recommend related products to customers and thus maximize their profit.

    According to Forbes, “Machine Learning Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on LinkedIn. Data scientist roles have grown over 650% since 2012, but currently, 35,000 people in the US have data science skills, while hundreds of companies are hiring for those roles.”

    The language used as the basis for many machine learning algorithms is Python. It’s powerful, easy for beginners and has well-supported documentation.

    Important Machine Learning Concepts

    1. Association Rules
    2. Classification
    3. Pattern Recognition
    4. Outlier Detection
    5. Compression
    6. Regression
    7. Supervised Learning
    8. Unsupervised Learning
    9. Document Clustering
    10. Density Estimation
    11. Reinforcement Learning
    12. Probably Approximately Correct Learning (PAC Learning)
    13. Learning Multiple Classes
    14. Model Selection
    15. Optimization Procedure
    16. Geometric Model
    17. Simple Linear Classifier
    18. Nearest Neighbor Classifier
    19. Clustering
    20. Probabilistic Model
    21. Feature Extraction
    22. Feature Selection
    23. Bayes Rule

    References:

    http://news.codecademy.com/what-is-machine-learning/

    https://en.wikipedia.org/wiki/Machine_learning#Applications

    https://www.forbes.com/sites/louiscolumbus/2017/12/11/linkedins-fastest-growing-jobs-today-are-in-data-science-machine-learning/#3b727e6451bd

  • Introduction to Data Science (Notes)

    Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.

Data Science is a super-set of the fields of statistics and machine learning (1). The DSI (Data Science Initiative, 2015) website gives us an idea of what Data Science is:

    “This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and interdisciplinary applications.”

    Data Science vs Statistics:

    According to Data Science Association’s “Professional Code of Conduct” (2):

    “Data Scientist” means a professional who uses scientific methods to liberate and create meaning from raw data.

    “Statistics” means the practice or science of collecting and analyzing numerical data in large quantities.

There is a difference of opinion: some say that Data Science is nothing but a re-branding of statistics. For example, Karl Broman, Univ. of Wisconsin (3), says:

    “When physicists do mathematics, they don’t say they’re doing number science. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics. … You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term ‘‘statistics’’ .”

On the other hand, some scientists say that Data Science is a super-set of statistics. For example, Vincent Granville, at the Data Science Central blog (4), says:

    “Data Science without statistics is possible, even desirable”

    Andrew Gelman, Columbia University (5) says:

    “Statistics is the least important part of data science”

    Data Scientist (n.): A person who is better at statistics than any software engineer and better at software engineering than any statistician.

    The activities of Greater Data Science (GDS) are classified into 6 divisions:

1. Data Exploration and Preparation (exploring basic properties and unexpected features, finding and fixing anomalies and artifacts in datasets)
2. Data Representation and Transformation (representing data received from different kinds of formats/sources in a common format)
3. Computing with Data (using languages like R and Python to perform computations on data)
4. Data Modeling (defining properties/parameters for data analysis)
5. Data Visualization and Presentation (representing the data using plots, histograms, and charts so that the user can easily extract useful information from datasets)
6. Science about Data Science (data scientists are doing science about data science when they identify commonly occurring analyses)

    Conclusion:

Data Science is the study of tools and techniques to analyze large amounts of data and extract useful insights/information. In my opinion, it is a mixture of statistical methods and computing methods. It involves some key activities, i.e. data exploration and preparation, representation and transformation, computation, modeling, visualization and presentation, and science about data science. The scope and impact of this science will expand enormously in the coming decades as scientific data, and data about science itself, become ubiquitously available.

    References:

    (1) “50 Years of Data Science” by David Donoho, September 18, 2015, version 1.0

    (2) http://www.datascienceassn.org/code-of-conduct.html

    (3) https://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/

    (4) http://www.datasciencecentral.com/profiles/blogs/data-science-without-statistics-is-possible-even-desirable

    (5) http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/

    (6) https://en.wikipedia.org/wiki/Data_science