1.11 Replacing Missing Data: Factual Analysis Method
In this post we continue correcting our dataset using the factual analysis method, i.e. replacing missing data only where the correct value can be determined with 100% certainty. Looking at our Excel sheet, we see that the missing values in the State column can be filled in with 100% accuracy from the city.

Similarly, missing data in the expense column can also be replaced. First of all, let's check all the missing values.
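As a minimal sketch of this check, the snippet below builds a small toy data frame named fin (the column names ID, City, State and Expenses and all values are made up for illustration, not taken from the real dataset) and lists the rows that contain at least one NA.

```r
# Toy stand-in for fin; columns and values are hypothetical
fin <- data.frame(
  ID       = 1:6,
  City     = c("New York", "New York", "San Francisco", "San Francisco", "Boston", "New York"),
  State    = c("NY", NA, NA, "CA", "MA", NA),
  Expenses = c(100, 200, 300, NA, 500, 600),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows without any NA; negate it to get the incomplete rows
incomplete <- fin[!complete.cases(fin), ]
print(incomplete)
```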

This lists the rows of the dataset fin that contain an NA. Next, let's single out the rows that have NA in the State column.
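A sketch of that filter, on the same hypothetical toy data frame, could look like this. Note that comparing with == NA does not work, because any comparison with NA yields NA, so is.na() is used instead.

```r
# Toy stand-in for fin; columns and values are hypothetical
fin <- data.frame(
  ID       = 1:6,
  City     = c("New York", "New York", "San Francisco", "San Francisco", "Boston", "New York"),
  State    = c("NY", NA, NA, "CA", "MA", NA),
  Expenses = c(100, 200, 300, NA, 500, 600),
  stringsAsFactors = FALSE
)

# is.na() returns a logical vector that can be used as a row filter
missing_state <- fin[is.na(fin$State), ]
print(missing_state)
```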

Now we want to fix this by putting NY in the rows whose city is New York and CA (California) in the rows whose city is San Francisco. We could do this one row at a time, based on the ID or row number, because only four rows are missing a state. But if our dataset had many rows with missing values in the State column, that approach would not scale. So we are going to add an extra filter and replace the missing data with the relevant value in one go.
The filter below gives the rows with missing data in state column but with the city name as New York.
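One way to write such a combined filter, shown here on a hypothetical toy data frame, is to join the two conditions with the element-wise & operator.

```r
# Toy stand-in for fin; columns and values are hypothetical
fin <- data.frame(
  ID       = 1:6,
  City     = c("New York", "New York", "San Francisco", "San Francisco", "Boston", "New York"),
  State    = c("NY", NA, NA, "CA", "MA", NA),
  Expenses = c(100, 200, 300, NA, 500, 600),
  stringsAsFactors = FALSE
)

# Rows where State is missing AND City is New York
ny_missing <- fin[is.na(fin$State) & fin$City == "New York", ]
print(ny_missing)
```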

Now let's replace the missing data with the state NY.
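A minimal sketch of the replacement: the same filter selects the rows, the second index selects the State column, and the assignment overwrites only those cells. The toy data frame is hypothetical and assumes State is stored as character (with a factor column, "NY" would need to be an existing level) and that City has no missing values.

```r
# Toy stand-in for fin; columns and values are hypothetical
fin <- data.frame(
  ID       = 1:6,
  City     = c("New York", "New York", "San Francisco", "San Francisco", "Boston", "New York"),
  State    = c("NY", NA, NA, "CA", "MA", NA),
  Expenses = c(100, 200, 300, NA, 500, 600),
  stringsAsFactors = FALSE
)

# Overwrite State only where it is missing and the city is New York
fin[is.na(fin$State) & fin$City == "New York", "State"] <- "NY"
print(fin)
```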

This puts NY in every row where the State value was missing and the city is New York. Now, let's verify it using the row numbers.

We see that it has been done. Let's check complete.cases() once again: it shows two fewer rows.
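Both checks can be sketched on the hypothetical toy data frame as follows. The row numbers 2 and 6 belong to the toy data only; in the real dataset you would use the row numbers shown by the earlier filter.

```r
# Toy stand-in for fin; columns and values are hypothetical
fin <- data.frame(
  ID       = 1:6,
  City     = c("New York", "New York", "San Francisco", "San Francisco", "Boston", "New York"),
  State    = c("NY", NA, NA, "CA", "MA", NA),
  Expenses = c(100, 200, 300, NA, 500, 600),
  stringsAsFactors = FALSE
)

before <- sum(!complete.cases(fin))   # number of incomplete rows before the fix
fin[is.na(fin$State) & fin$City == "New York", "State"] <- "NY"

# Inspect the affected rows directly by row number
print(fin[c(2, 6), ])

after <- sum(!complete.cases(fin))    # number of incomplete rows after the fix
print(before - after)
```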

Follow the same steps for the rows whose city is San Francisco.
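For San Francisco, only the city name in the filter and the replacement value change. Again a sketch on hypothetical toy data:

```r
# Toy stand-in for fin; columns and values are hypothetical
fin <- data.frame(
  ID       = 1:6,
  City     = c("New York", "New York", "San Francisco", "San Francisco", "Boston", "New York"),
  State    = c("NY", NA, NA, "CA", "MA", NA),
  Expenses = c(100, 200, 300, NA, 500, 600),
  stringsAsFactors = FALSE
)

# Overwrite State only where it is missing and the city is San Francisco
fin[is.na(fin$State) & fin$City == "San Francisco", "State"] <- "CA"
print(fin[fin$City == "San Francisco", ])
```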


Let's look at complete.cases() once again: only 6 rows with missing data remain, and we have to deal with only 5 of them, because we decided to leave the row with missing data in the Inception column as it is.
The complete code is:
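Put together, the steps for the State column might look like the sketch below. It runs on a hypothetical toy data frame rather than the real fin dataset, so the column names, values and row counts are assumptions for illustration only.

```r
# Toy stand-in for fin; columns and values are hypothetical
fin <- data.frame(
  ID       = 1:6,
  City     = c("New York", "New York", "San Francisco", "San Francisco", "Boston", "New York"),
  State    = c("NY", NA, NA, "CA", "MA", NA),
  Expenses = c(100, 200, 300, NA, 500, 600),
  stringsAsFactors = FALSE
)

# 1. List all rows with at least one missing value
fin[!complete.cases(fin), ]

# 2. Single out rows with a missing State
fin[is.na(fin$State), ]

# 3. Fill in the state wherever the city determines it with certainty
fin[is.na(fin$State) & fin$City == "New York", "State"] <- "NY"
fin[is.na(fin$State) & fin$City == "San Francisco", "State"] <- "CA"

# 4. Re-check what is still incomplete
remaining <- fin[!complete.cases(fin), ]
print(remaining)
```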
