1.6 What is NA
To deal with missing data, it is very important to understand what NA is. In R, missing values are represented by the symbol NA, which is a reserved constant rather than an ordinary variable. So, let's read a little bit about it by typing a question mark in front of it: ?NA

We see that it stands for Not Available, i.e. a missing value. It is a logical constant of length 1 that contains a missing value indicator.
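The claims from the help page can be confirmed in the console. This is a minimal sketch using only base R functions:

```r
# ?NA opens the help page in an interactive session.
na_class  <- class(NA)    # "logical": NA is a logical constant by default
na_length <- length(NA)   # 1
na_check  <- is.na(NA)    # TRUE: is.na() is the test for missingness

print(na_class)
print(na_length)
print(na_check)
```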
Let's look at it in more detail. We know two logical constants: TRUE, which behaves as 1 in arithmetic and comparisons, and FALSE, which behaves as 0. So what does NA contain? Let's check by running a few comparisons in the console.
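The post's original snippet is not reproduced here, so the comparisons below are only an illustrative sketch of the kind of check meant. The values and variable names are picked for the example.

```r
t_vs_f <- TRUE == FALSE   # FALSE
t_vs_1 <- TRUE == 1       # TRUE: TRUE is coerced to 1
f_vs_0 <- FALSE == 0      # TRUE: FALSE is coerced to 0
t_vs_5 <- TRUE == 5       # FALSE: 1 is not equal to 5
total  <- TRUE + TRUE     # 2: logicals are treated as numbers in arithmetic

print(c(t_vs_f, t_vs_1, f_vs_0, t_vs_5))
print(total)
```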

These logical constants behave as expected. Now, let's try the same kind of comparison with NA.
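As a sketch of such a comparison (the exact expression used in the original post is an assumption), NA can be compared with TRUE:

```r
na_vs_true <- NA == TRUE

print(na_vs_true)          # NA, not FALSE
print(is.na(na_vs_true))   # TRUE: the result itself is missing
```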

You might expect the output to be FALSE, but that is not the case with NA: the output is simply NA.

When we try other comparisons with NA, the result is the same. Whenever a missing value is compared with TRUE, FALSE, or any other value, the answer is unknown, so R gives NA as the output.
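A few more comparisons, chosen here for illustration, show the same behaviour. Since even NA == NA is NA, the sketch also uses is.na(), the base R function for testing whether a value is missing.

```r
comparisons <- c(NA == FALSE, NA == 15, NA == NA, NA > 1)
print(comparisons)   # NA NA NA NA

# Comparing with == cannot find missing values; is.na() can.
flags <- is.na(c(10, NA, 30))
print(flags)         # FALSE  TRUE FALSE
```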
Now, let's locate the missing data in our dataset. For this we look at the top 24 rows of the dataset.
We see that in some places there are NAs, and there are a few empty values as well. We also see some NAs in angle brackets (<NA>) in the Inception column. Now we want to pull out the rows of the dataset that contain NA or missing values. In R, there is a very good way to do this with the function complete.cases(). If you run it on the dataset, it returns TRUE or FALSE for each row, depending on whether the row is complete or not.
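To see what complete.cases() returns, here is a minimal sketch on a small made-up data frame. The name toy and its contents are hypothetical and only stand in for the real dataset.

```r
# Hypothetical stand-in for the article's dataset (not the real data).
toy <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Industry  = c("IT", NA, "Health", "Retail"),
  Employees = c(25L, 36L, NA, 10L),
  stringsAsFactors = FALSE
)

# One logical value per row: TRUE if the row has no NA in any column.
complete <- complete.cases(toy)
print(complete)   # TRUE FALSE FALSE  TRUE
```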
Now let us subset the dataset and extract the rows which are incomplete, i.e. have missing data. For this we put a negation (the ! operator) in front of complete.cases(), so that it returns TRUE for the incomplete rows.
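The subsetting step can be sketched as follows, again on a hypothetical data frame (toy is an assumed name, not the article's data):

```r
# Hypothetical stand-in for the article's dataset (not the real data).
toy <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Industry  = c("IT", NA, "Health", ""),
  Employees = c(25L, 36L, NA, 10L),
  stringsAsFactors = FALSE
)

# !complete.cases() is TRUE for rows with at least one NA;
# the trailing comma keeps all columns.
incomplete <- toy[!complete.cases(toy), ]
print(incomplete)

# Row D has an empty string in Industry, which is not NA,
# so it is not returned.
print(incomplete$Name)   # "B" "C"
```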
On our dataset, this returns only 6 rows with missing data, but the actual dataset has many more incomplete rows. So how come it is picking up only 6 rows?
The answer is that some of the missing data is not marked with NA. The field simply has no value, i.e. it is an empty string, which is not the same as NA, so R treats the row as complete. Rows 14 and 15, for example, have no value in the Industry column, yet R recognizes them as complete rows. The best way to tackle this is to fix the problem at its source, i.e. at the stage of importing the data into R.
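Hence, import the data again, this time passing the na.strings argument to the import function (for example read.csv()).
na.strings = c("") tells R to treat every field that matches a value listed inside c(), here the empty string, as NA while the file is being read.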
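The post's original import call is not shown, so the following is only a sketch of the idea. It assumes read.csv() is the import function and reads from an in-memory string via the text argument instead of a real file. The names csv_text, raw and fin_demo and the CSV contents are hypothetical.

```r
# Hypothetical CSV contents standing in for the article's file.
csv_text <- paste(
  "Name,Industry,Employees",
  "A,IT,25",
  "B,,36",
  "C,Health,",
  "D,Retail,10",
  sep = "\n"
)

# Default import: the empty Industry field stays an empty string.
raw <- read.csv(text = csv_text)

# Import with na.strings: empty fields are read as NA.
fin_demo <- read.csv(text = csv_text, na.strings = c(""))

raw_na   <- sum(is.na(raw$Industry))
fixed_na <- sum(is.na(fin_demo$Industry))
print(raw_na)     # 0
print(fixed_na)   # 1
```

With a real file, the first argument would be the file path instead of text = csv_text.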
Let's run head(fin, 20), and we will see that the empty records now contain NA.
Now run complete.cases() again and check for the incomplete rows.
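As a sketch of this check, the hypothetical CSV below is imported both ways and the number of incomplete rows is compared. The variable names and data are assumptions for illustration, not the article's file.

```r
# Hypothetical CSV contents standing in for the article's file.
csv_text <- paste(
  "Name,Industry,Employees",
  "A,IT,25",
  "B,,36",
  "C,Health,",
  "D,Retail,10",
  sep = "\n"
)

raw      <- read.csv(text = csv_text)
fin_demo <- read.csv(text = csv_text, na.strings = c(""))

# Count the incomplete rows before and after the fix.
n_before <- sum(!complete.cases(raw))
n_after  <- sum(!complete.cases(fin_demo))
print(n_before)   # 1: only the blank numeric field counts as NA
print(n_after)    # 2: the blank Industry field is now NA as well

print(fin_demo[!complete.cases(fin_demo), ])
```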

Now we see that it picks up many more rows than before. You may have noticed that some NAs have angle brackets around them (<NA>) and some do not. Why so?
The answer is pretty simple: in columns that R recognizes as factors, NA is printed as <NA>, whereas in numeric columns such as integers it is printed as plain NA. Thus, the way an NA is printed also helps us tell categorical columns from numeric ones.
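The difference in printing can be reproduced with a tiny data frame. The name demo and its values are made up for this sketch.

```r
# Hypothetical data: one factor column and one integer column.
demo <- data.frame(
  Industry  = factor(c("IT", NA, "Health")),
  Employees = c(25L, NA, 10L)
)

# The factor column shows <NA>; the integer column shows NA.
print(demo)

printed <- capture.output(print(demo))
```

Character columns in a data frame are printed with <NA> as well, so in recent R versions, where read.csv() no longer converts strings to factors by default, text columns still show the brackets.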
In the next post we will learn how to tackle this missing data. Until then, happy coding!