1.12 Replacing Missing Data: Median Imputation Method
So
far we have a lot with missing data and have replaced many of them. If
we look at the excel sheet, we see that those cells marked in green are
already dealt with.
And in this post we are going to learn about how to proxy the missing
value in the Employees and Growth column using median imputation method.
We are using the median values instead of mean because, median is least affected with the outliers and we are taking industry median, because it is very logical to take the median of number of employees belonging to the same industry rather than taking the median of whole column Employees. You can take median based on other variables as well like rows belonging to same state or city, whichever is applicable to your analysis.
So let's start with out with our coding and have a look at the complete.cases().
We see there are two missing rows in here, one in industry Retail and one in Financial Services

The above function will give the result as NA because the column Employee contains NAs or missing values. So, let's take out the NA and then calculate the median.

So, we see that the median of whole employee column is 56. Now, let's take the median of the employees belonging to Retail Industry.
So,
we understand that median for Employees belonging to Retail industry is
much lesser than that of whole column median. Let's save this value in
variable named med_empl_retail
Now, lets replace the missing data with this variable.

Thus we replaced the missing value in Employees column belonging to Retail industry.
Let's repeat the same procedure for Financial Service industry. Looking at complete.cases()
Creating the variable called med_empl_fs to save median of employees of Financial Services industry.

Now replacing the missing data with the value in variable med_empl_fs.

Checking complete.cases() will show us only 4 rows with missing data.
Now, we will apply the same method as above to Growth column.
Quite easy, right??
Here is the complete code.
Have a look at complete.case(), and it shows only 4 rows with missing data.
Now we will also learn to put the value in Revenue and Expenses column. We will follow the similar procedure as above and replace the missing value with the median of Revenue for industry Construction.

Let's check the complete.cases().

We see that Revenue column is filled for rows with the median values belonging to the Industry "Construction". Now let's deal with the column Expenses, but here we have to be very cautious as we don't want to replace all missing data in this column because missing value in Expenses column or row number 15 can be calculated from Revenue and Profit. So, we are performing the operation only on row number 8 and 42. In a way we are lucky that the row number 15 does not belong to the construction industry, hence we can use following code

If this would not have been the case then we would have added one more extra protection code, putting a condition that profit should also be NA, like below

Now we have just one more step is remaining, where we have to derive the values of Profit and Expenses wherever it is missing.
We are using the median values instead of mean because, median is least affected with the outliers and we are taking industry median, because it is very logical to take the median of number of employees belonging to the same industry rather than taking the median of whole column Employees. You can take median based on other variables as well like rows belonging to same state or city, whichever is applicable to your analysis.
So let's start with out with our coding and have a look at the complete.cases().
The above function will give the result as NA because the column Employee contains NAs or missing values. So, let's take out the NA and then calculate the median.
So, we see that the median of whole employee column is 56. Now, let's take the median of the employees belonging to Retail Industry.
Now, lets replace the missing data with this variable.
Thus we replaced the missing value in Employees column belonging to Retail industry.
Let's repeat the same procedure for Financial Service industry. Looking at complete.cases()
Now replacing the missing data with the value in variable med_empl_fs.
Checking complete.cases() will show us only 4 rows with missing data.
Now, we will apply the same method as above to Growth column.
Here is the complete code.
Now we will also learn to put the value in Revenue and Expenses column. We will follow the similar procedure as above and replace the missing value with the median of Revenue for industry Construction.
Let's check the complete.cases().
We see that Revenue column is filled for rows with the median values belonging to the Industry "Construction". Now let's deal with the column Expenses, but here we have to be very cautious as we don't want to replace all missing data in this column because missing value in Expenses column or row number 15 can be calculated from Revenue and Profit. So, we are performing the operation only on row number 8 and 42. In a way we are lucky that the row number 15 does not belong to the construction industry, hence we can use following code
If this would not have been the case then we would have added one more extra protection code, putting a condition that profit should also be NA, like below
Now we have just one more step is remaining, where we have to derive the values of Profit and Expenses wherever it is missing.
Comments
Post a Comment