Useful R Functions – Exploratory Data Analysis

R is a programming language used for statistical analysis and exploratory data analysis projects. According to the official website:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. [source]

The following R functions are useful for data exploration and visualization.

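To read data from a CSV file:

data_obj <- readr::read_csv("data.csv")

In the line above, data_obj is the object in which your data will be saved, and data.csv is the file from which the data will be read. read_csv() comes from the readr package, so make sure it is installed; base R offers read.csv() as an alternative.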
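As a minimal sketch of the same step with base R only, the snippet below writes a small CSV to a temporary file and reads it back with read.csv(). The file contents and the column names city and attacks are made up for illustration.

```r
# Write a small, hypothetical CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,attacks", "Lahore,12", "Karachi,30", "Quetta,7"), csv_path)

# read.csv() is the base R counterpart of readr::read_csv().
data_obj <- read.csv(csv_path, stringsAsFactors = FALSE)

str(data_obj)   # shows the column names and their types
nrow(data_obj)  # number of data rows that were read
```

Running str() right after loading is a quick way to confirm that numeric columns were not read as text.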

To get the frequency of each unique value in a column:

some_variable <- as.data.frame(table(data_obj$col_name))

In the line above, some_variable is the variable in which you want to save the new data, data_obj is the object holding your raw data, and col_name is the column name in your raw data.
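A small sketch of what this produces, using a made-up column of severity labels. The renaming step is optional and the names value and count are only illustrative.

```r
# Hypothetical raw data with repeated values in one column.
data_obj <- data.frame(col_name = c("High", "Low", "High", "Medium", "High", "Low"))

# table() counts each unique value; as.data.frame() turns the counts into two columns.
freq <- as.data.frame(table(data_obj$col_name))

# The default column names are Var1 and Freq; rename them for readability.
names(freq) <- c("value", "count")
print(freq)
```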

To sort the data from highest to lowest (descending order) by a column:

sorted_values_descending <- data_obj[order(-data_obj$col_name), ]

In the line above, sorted_values_descending is the variable in which you want to save the sorted data, data_obj is the object holding your raw data, and col_name is the column used for sorting. The minus sign works for numeric columns only.

To sort the data from lowest to highest (ascending order) by a column:

sorted_values_ascending <- data_obj[order(data_obj$col_name), ]

In the line above, sorted_values_ascending is the variable in which you want to save the sorted data, data_obj is the object holding your raw data, and col_name is the column used for sorting.
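The two sorts can be tried on a small made-up data frame. This sketch uses order(..., decreasing = TRUE) for the descending case, which is an alternative to the minus sign and also works for text columns.

```r
# Hypothetical data: a label column and a numeric column to sort by.
data_obj <- data.frame(
  name = c("a", "b", "c", "d"),
  col_name = c(5, 2, 9, 2)
)

# Ascending: order() returns the row positions in sorted order.
sorted_values_ascending <- data_obj[order(data_obj$col_name), ]

# Descending: decreasing = TRUE avoids the minus sign.
sorted_values_descending <- data_obj[order(data_obj$col_name, decreasing = TRUE), ]

print(sorted_values_ascending)
print(sorted_values_descending)
```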

To get the first rows of the data:

some_rows <- head(data_obj, 10)

In the line above, some_rows is the variable in which you want to save the rows, data_obj is the object holding your raw data, and 10 is the number of rows you want to get.

To get the last rows of the data:

some_rows <- tail(data_obj, 10)

In the line above, some_rows is the variable in which you want to save the rows, data_obj is the object holding your raw data, and 10 is the number of rows you want to get. The rows are returned in their original order.
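A short sketch of head() and tail() on a made-up 20-row data frame. The negative count in the last call is an extra that the text above does not cover: it drops rows from the end instead of selecting them.

```r
# Hypothetical data frame with 20 rows.
data_obj <- data.frame(id = 1:20, value = (1:20)^2)

first_rows <- head(data_obj, 3)     # rows 1 to 3
last_rows <- tail(data_obj, 3)      # rows 18 to 20, in original order
all_but_last <- head(data_obj, -3)  # everything except the last 3 rows

print(first_rows)
print(last_rows)
```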

To Create Bar Plot (using geom_col()) from 2 Columns using ggplot2 package:

g <- ggplot2::ggplot(data_obj, ggplot2::aes(x = col1, y = col2)) + ggplot2::geom_col() + ggplot2::xlab("x axis label here") + ggplot2::ylab("y axis label here") + ggplot2::ggtitle("Plot title here")

In the line above, data_obj is the object holding your raw data, col1 is the column in data_obj used for the x axis, and col2 is the column in data_obj used for the y axis. Inside aes() the columns are referred to by name, without the data_obj$ prefix. Please make sure that the ggplot2 package is installed before using this function.
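As a hedged, self-contained sketch of the bar plot, the snippet below builds a tiny made-up data frame and plots it. The values, axis labels, and title are illustrative, and the requireNamespace() guard is an addition so that the script does not stop when ggplot2 is missing.

```r
# Hypothetical data: three categories and their totals.
data_obj <- data.frame(
  col1 = c("Male", "Female", "Children"),
  col2 = c(120, 200, 500)
)

# Build the plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  g <- ggplot2::ggplot(data_obj, ggplot2::aes(x = col1, y = col2)) +
    ggplot2::geom_col() +              # bar height comes from the y column
    ggplot2::xlab("Group") +
    ggplot2::ylab("Total") +
    ggplot2::ggtitle("Totals by group")
}
```

Typing g at the console, or calling print(g) in a script, draws the plot.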

To Add Count on each Bar (geom_col()) in R using ggplot2 Package:

g <- g + geom_text(data=data_obj, aes(label=col_name, y=col_name), colour="red", size=2.5)

Scatter Plot in R using ggplot2 package:

g <- ggplot(data_frame, aes(x=col_x, y=col_y)) + geom_point(col="steelblue", size=3)

# change color using a variable (the mapping must go inside aes())

g <- ggplot(data_frame, aes(x=col_x, y=col_y)) + geom_point(aes(col=col_y), size=3)

To Remove Legend from Plots in R using ggplot2 Package:

g <- g + theme(legend.position="none")

To Change Legend Title in R using ggplot2 Package:

g <- g + labs(col="No of attacks")

To Change Legend Labels and Colors in R using ggplot2 Package:

g <- g + scale_color_manual(name="Legend Title", labels = c("NA","High","Low","low","Medium"), values = c(" "="blue","High"="red","Low"="yellow","low"="yellow","Medium"="orange"))

Make x-axis Label Texts Vertical in R using ggplot2 Package:

g <- g + theme(axis.text.x = element_text(angle = 90, hjust = 1))

To Draw Best Fitting Line from Scatter Plot in R using ggplot2 Package:

g <- g + geom_smooth(method="lm", col="firebrick")

# lm stands for linear model

To Limit x-axis and y-axis in R using ggplot2 Package:

g <- g + xlim(0, 100) + ylim(0,10)

To Add Titles and Labels in R using ggplot2 Package:

g <- g + labs(title="some title here", subtitle="subtitle here", y="y-axis label here", x="x-axis label here", caption="caption text here")

To Customize Axis Labels in R using ggplot2 Package:

g <- g + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = sprintf("%1.2f%%", seq(0, 0.1, 0.01))) + scale_y_continuous(breaks=seq(0, 1000000, 200000), labels = function(x){paste0(x/1000, 'K')})
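To see what the label arguments produce, the label vectors can be built on their own in base R. This is a sketch; the variable names x_breaks, x_labels, y_breaks, and y_labels are illustrative. Note that breaks and labels must have the same length.

```r
# Breaks for the x axis: 0, 0.01, ..., 0.10 (11 values).
x_breaks <- seq(0, 0.1, 0.01)

# "%1.2f%%" prints two decimals followed by a literal percent sign.
x_labels <- sprintf("%1.2f%%", x_breaks)

# Breaks for the y axis and labels expressed in thousands.
y_breaks <- seq(0, 1000000, 200000)
y_labels <- paste0(y_breaks / 1000, "K")

print(x_labels)
print(y_labels)
```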

Bar Plot (using geom_bar()) to Pie Chart Using ggplot2 Package:

bp <- ggplot2::ggplot(data_obj, ggplot2::aes(x = "", y = col_name, fill = var_name)) + ggplot2::geom_bar(width = 1, stat = "identity", show.legend = TRUE) + ggplot2::xlab("x axis label here") + ggplot2::ylab("y axis label here") + ggplot2::ggtitle("Plot title here")

pie_chart <- bp + ggplot2::coord_polar("y", start = 0)

To Create Data Frame from Variables in R:

var_data_frame <- data.frame(x = c('Male', 'Female', 'Children'), y = c(total_male, total_female, total_children))

var_data_frame is the variable in which the new data frame will be stored, x and y are the column names of the data frame, c() builds the values of each column, and total_male, total_female, and total_children are existing variables. With those variables set to 120, 200, and 500, the output will be:

x          y
Male       120
Female     200
Children   500
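The same construction as a runnable sketch, with the three totals defined first. The values match the output shown above; the lookup on the last line is an added illustration.

```r
# The totals would normally be computed from your data; here they are set by hand.
total_male <- 120
total_female <- 200
total_children <- 500

# Each argument to data.frame() becomes one column.
var_data_frame <- data.frame(
  x = c("Male", "Female", "Children"),
  y = c(total_male, total_female, total_children)
)

print(var_data_frame)

# Look up one value by matching on the label column.
female_total <- var_data_frame$y[var_data_frame$x == "Female"]
```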

To Sum Values of a Column in R:

total_col_values <- sum(data_obj$col_name, na.rm = TRUE)

na.rm = TRUE tells sum() to ignore missing (NA) values; without it, a single NA makes the result NA.
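A quick sketch of the difference, using a made-up column that contains one missing value:

```r
# Hypothetical column with one missing value.
data_obj <- data.frame(col_name = c(10, NA, 5, 20))

total_with_na <- sum(data_obj$col_name)                   # NA, because one value is missing
total_col_values <- sum(data_obj$col_name, na.rm = TRUE)  # 35, missing value skipped

print(total_with_na)
print(total_col_values)
```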

To Exclude Empty / Missing Values from Data Frame in R:

data_obj <- data_obj[!(is.na(data_obj$col_name) | data_obj$col_name == ""), ]
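The filter can be checked on a small made-up data frame that contains both an NA and an empty string. The intermediate keep vector is only there to make the logic visible.

```r
# Hypothetical data: one NA and one empty string in col_name.
data_obj <- data.frame(
  col_name = c("High", NA, "", "Low"),
  n = 1:4,
  stringsAsFactors = FALSE
)

# TRUE for rows where col_name is neither missing nor empty.
keep <- !(is.na(data_obj$col_name) | data_obj$col_name == "")

cleaned <- data_obj[keep, ]
print(cleaned)
```

Only the rows with "High" and "Low" remain.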

To Map Locations Data in R using ggplot2 Package:

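pakistan <- map_data("world", "Pakistan")

map <- ggplot() + geom_polygon(data = pakistan, aes(x = long, y = lat, group = group), fill = "green", color = "black") + geom_point(data = pk_data, aes(x = Longitude, y = Latitude)) + xlab("") + ylab("") + ggtitle("Title here")

Here pk_data is the data frame holding your locations, with Longitude and Latitude columns. map_data() needs the maps package to be installed.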
