BASICS OF DATA ANALYSIS USING R
R YOU READY? LET’S BEGIN..
R was built for statistical computing and graphics, which makes it a natural choice for analysing data. We can analyse imported or built-in datasets using RStudio, which is free to download from:
https://www.rstudio.com/products/rstudio/download/
I analysed the Orange dataset, which is built into R, using linear regression. This dataset has three columns: Tree (an identifier from 1 to 5 for the tree that was measured), age (in days) and circumference (trunk circumference in mm).
First we load the Orange dataset using data() and look at its first 6 rows using head(). Then we find the correlation coefficient between the columns circumference and age to check whether linear regression is a reasonable fit for the data. The correlation coefficient measures how strong the linear relation between two variables is. Then we plot the relation.
data("Orange")
head(Orange)
cor(Orange$circumference, Orange$age)
plot(Orange$circumference, Orange$age)
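To make the correlation coefficient a little less of a black box, here is a minimal sketch that computes it by hand and compares it with cor(). The variable names x, y, r_manual and r_builtin are my own, chosen just for this illustration.

```r
data("Orange")

x <- Orange$circumference
y <- Orange$age

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# cor() uses the Pearson method by default
r_builtin <- cor(x, y)

# TRUE when both values agree up to floating point tolerance
all.equal(r_manual, r_builtin)
```

A value close to 1 or -1 points to a strong linear relation, while a value near 0 suggests that a straight line will not describe the data well.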
Now comes the part where we apply linear regression to predict the age of the tree from its circumference. In R, lm() is used for linear regression. We will create a model, which I simply named model.
lm(y ~ x, data = dataset) is the general form, where y is the dependent variable and x is the independent variable. In my case x is circumference, which the user supplies as input, and y is age, which is predicted from the circumference of the tree.
model <- lm(age ~ circumference , data = Orange)
summary(model)
# predicting the age of a tree using the linear regression 'model' created above:
predict(model,data.frame("circumference"=100)) #100 as circumference
predict(model,data.frame("circumference"=50770))
The predicted age for a tree with a circumference of 100 comes out to be 798.2035 days. For a circumference of 50770 the predicted age is 396834.8 days. Keep in mind that 50770 lies far outside the circumferences in the dataset (30 to 214), so the second prediction is an extrapolation and should not be trusted too much. Still, isn't this cool? Just knowing the circumference gives you an estimate of the age of the orange tree!
The next thing is to draw the regression line between age and circumference. So we plot the graph again, this time with proper x and y axis labels for circumference and age. Then we use abline() to draw the fitted line through the points in a color of our choice.
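If you are curious where these numbers come from, the sketch below rebuilds a prediction from the model's intercept and slope and compares it with predict(). The names b, new_circ, manual and pred are mine, and the prediction interval at the end is an extra I added on top of the original steps.

```r
data("Orange")
model <- lm(age ~ circumference, data = Orange)

# coef() returns a named vector: "(Intercept)" and "circumference"
b <- coef(model)

# A prediction is simply intercept + slope * input
new_circ <- 100
manual <- b[["(Intercept)"]] + b[["circumference"]] * new_circ
pred <- predict(model, data.frame(circumference = new_circ))

# TRUE: the hand calculation matches predict()
all.equal(manual, unname(pred))

# Observed range of circumference, useful for spotting extrapolation
range(Orange$circumference)

# Several inputs at once, with lower and upper prediction bounds
predict(model, data.frame(circumference = c(50, 100, 150)),
        interval = "prediction")
```

The last call returns a matrix with the columns fit, lwr and upr, so you can see how uncertain each predicted age is.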
plot(Orange$circumference,Orange$age,xlab='Circumference',ylab='Age')
abline(model,col="red",lty=2,lwd=3)
This is how you can do basic linear regression even on a dataset you create or import. For importing a comma separated file (dataset) the code is:
dataset <- read.csv("path.csv")
View(dataset)
attach(dataset)
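Here is a self-contained sketch of the same import step that you can run without having a file ready. It first writes a tiny CSV to a temporary location, so csv_path and the three sample rows are placeholders of mine and not part of the original analysis. I left out View() and attach() here and access the columns with $ and with() instead.

```r
# Write a small, hypothetical CSV file to a temporary path
csv_path <- tempfile(fileext = ".csv")
writeLines(c("circumference,age",
             "30,118",
             "58,484",
             "87,664"), csv_path)

# Read it back; header = TRUE takes the column names from the first line
dataset <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Inspect the structure: column names, types and first values
str(dataset)

# Access a column with $ ...
mean(dataset$age)

# ... or evaluate an expression inside the data frame with with()
with(dataset, cor(circumference, age))
```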
Creating your own dataframe (df) in R:
col1 <- c("val1","val2",..)
col2 <- c("val1", "val2",..)
df <- data.frame(col1,col2)
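As a concrete version of the template above, here is a small sketch with made-up values. The column names tree_id and height_m and all the values are hypothetical, picked only to show the pattern.

```r
# Each vector becomes one column; all vectors must have the same length
tree_id  <- c("A", "B", "C")      # hypothetical labels
height_m <- c(2.1, 3.4, 2.8)      # hypothetical heights in metres

df <- data.frame(tree_id, height_m, stringsAsFactors = FALSE)

# Check the result: 3 rows, 2 columns, named after the vectors
str(df)
nrow(df)
df$height_m
```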
THANKS FOR READING!