
1.1 Data Preparation




In this section we will learn about the process of data preparation. It deserves close attention: it is tedious and time-consuming, but it is also the most important task of all. By a common rule of thumb, nearly 70% of the time spent on a project goes into data preparation. We will learn how to load data, how to find missing values, how to replace missing data, and so on, through a simple project named “Financial Review”.

Project Brief:

You have been hired by the “Future 500” magazine. The stakeholders have supplied you with a list of 500 companies and would like you to create some draft visualizations for their upcoming online publication. They have requested the following charts:
  • A scatter-plot classified by industry showing revenue, expenses, profit
  • A scatter-plot that includes industry trends for the expenses~revenue relationship
  • Box-plots showing growth by industry
Note that the dataset has numerous discrepancies that need to be addressed before analysis can be performed. Have a look at the data by downloading it from the link below:
https://drive.google.com/file/d/0B3RXs3bXRr3oR2c5cnhJZ2laejQ/view?usp=sharing


Let’s import the data into R now:

Save the data file in a folder on your computer, then set the working directory to that same folder so that we can work in it right away.
[Image: Set working directory]
I have saved the file in the directory C:\SAM\R\SA, so I set the working directory to that folder using setwd(). getwd() then displays the current working directory.
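A minimal sketch of this step. The folder C:\SAM\R\SA will not exist on most machines, so a temporary folder (data_dir, a name chosen here for illustration) stands in for it. Note that a Windows path inside an R string needs forward slashes or doubled backslashes.

```r
# In an R string, write the post's folder as "C:/SAM/R/SA" or "C:\\SAM\\R\\SA";
# a single backslash would start an escape sequence.
# A temporary folder stands in here (assumption) so the sketch runs anywhere.
data_dir <- tempdir()

old_dir <- setwd(data_dir)   # setwd() invisibly returns the previous directory
current_dir <- getwd()       # getwd() reports the working directory now in use
print(current_dir)

setwd(old_dir)               # restore the original working directory
```

Saving the value returned by setwd() makes it easy to switch back once the work in the data folder is done.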
Read the data using the following command:
[Image: Reading the CSV file]
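As a sketch, assuming the file is an ordinary comma-separated file read with base R's read.csv(). The file name is not shown in the post, so a small temporary CSV with made-up rows and hypothetical column names stands in for the download.

```r
# Stand-in for the downloaded file (assumption): three made-up rows with
# hypothetical column names, written to a temporary CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Industry,Revenue,Expenses,Profit",
  "1,Retail,100,60,40",
  "2,Health,200,150,50",
  "3,Retail,300,210,90"
), csv_path)

# stringsAsFactors = TRUE turns text columns into factors; since R 4.0.0
# the default is FALSE, which would leave them as character vectors.
fin <- read.csv(csv_path, stringsAsFactors = TRUE)
print(fin)
```

With the real file saved in the working directory, the file's own name would take the place of csv_path.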
This creates a data object, or data frame, named fin. You can see it in the Data pane on the right. Now let’s look at the top few rows of fin using the head() function:
[Image: First few rows using the head() function]
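A short sketch of head(), run on a made-up stand-in data frame (the columns and values are assumptions, not the real dataset):

```r
# Stand-in data frame (assumption): 20 made-up rows, hypothetical columns.
fin <- data.frame(ID = 1:20, Revenue = seq(100, 2000, by = 100))

top_default <- head(fin)      # first 6 rows when no count is given
top_three   <- head(fin, 3)   # first 3 rows
print(top_default)
```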
We can also look at the bottom rows. Let’s view the last 10 rows.
[Image: Last few rows using the tail() function]
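The same idea for the bottom of the table, again on a made-up stand-in data frame (an assumption, not the real data):

```r
# Stand-in data frame (assumption): 20 made-up rows, hypothetical columns.
fin <- data.frame(ID = 1:20, Revenue = seq(100, 2000, by = 100))

last_ten <- tail(fin, 10)   # last 10 rows; tail(fin) alone would give 6
print(last_ten)
```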
Now let’s look at the structure of our data frame using str(). Note that it shows the data type of each column.
[Image: The str() function shows the structure of the dataset]
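A small sketch of what str() reports, using a made-up data frame with one integer, one factor, and one numeric column (all hypothetical):

```r
# Stand-in data frame (assumption): made-up values, hypothetical columns.
fin <- data.frame(
  ID = 1:4,
  Industry = factor(c("Retail", "Health", "Retail", "Software")),
  Revenue = c(100.5, 200, 300.25, 150)
)

str(fin)   # one line per column: name, type, and the first few values

# The same type information as a named character vector
col_types <- sapply(fin, class)
print(col_types)
```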
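The summary() function gives a summary of each column. For numeric columns it reports the minimum, quartiles, median, mean and maximum; for factor columns it reports the count of each level.
[Image: The summary() function gives the overall details of the dataset]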
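As an illustration, here is summary() on a made-up data frame (hypothetical columns) that includes one missing value, since missing data is what this project is about:

```r
# Stand-in data frame (assumption): made-up values with one missing Revenue.
fin <- data.frame(
  Industry = factor(c("Retail", "Health", "Retail", "Software")),
  Revenue = c(100, 200, NA, 400)
)

summary(fin)   # level counts for the factor, quartiles for the numeric column

# For a single numeric column the result also counts the missing values
rev_summary <- summary(fin$Revenue)
print(rev_summary)
```

The NA's entry appears only when a column actually contains missing values, which makes summary() a quick first check for them.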
The complete code looks like this:
[Image: Complete code]
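A possible end-to-end version of the steps in this post, as a sketch rather than the original script. The file name fin_sample.csv, the columns, and the rows are all made up so that the code runs without the real download.

```r
# Stand-in CSV (assumption): made-up rows, hypothetical columns and file name.
csv_path <- file.path(tempdir(), "fin_sample.csv")
sample_rows <- data.frame(
  ID = 1:12,
  Industry = rep(c("Retail", "Health", "Software"), times = 4),
  Revenue = seq(100, 1200, by = 100)
)
write.csv(sample_rows, csv_path, row.names = FALSE)

old_dir <- setwd(dirname(csv_path))   # work inside the data folder
getwd()                               # confirm the working directory
fin <- read.csv(basename(csv_path), stringsAsFactors = TRUE)
setwd(old_dir)                        # restore the previous folder

head(fin)        # first 6 rows
tail(fin, 10)    # last 10 rows
str(fin)         # column types
summary(fin)     # per-column summary
```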
After carefully examining the dataset fin, we see that numerous columns are factors, i.e. categorical variables. We will analyze this further in the next tutorial.
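One way to check which columns are factors is sketched here on a made-up data frame (the column names are hypothetical):

```r
# Stand-in data frame (assumption): made-up values, hypothetical columns.
fin <- data.frame(
  ID = 1:3,
  Industry = factor(c("Retail", "Health", "Retail")),
  State = factor(c("NY", "CA", "NY")),
  Revenue = c(100, 200, 300)
)

is_factor_col <- sapply(fin, is.factor)   # TRUE for each factor column
factor_cols <- names(fin)[is_factor_col]
print(factor_cols)

levels(fin$Industry)   # the distinct categories stored in the factor
```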
