1.5 Dealing with Missing Data

In this tutorial we look at missing data. Missing data is a very common problem in data science and analytics: datasets are sometimes incomplete, errors can creep in while the data is being supplied, or the data was simply never collected in the first place. Dealing with missing data is therefore important. We need to know what options we have and what approaches we can take to fix the data, so here is a quick overview of those options.

Options for dealing with missing data:

  • Predict with 100% accuracy
    • For example, City and State: if we know the city, we can determine the state with 100% accuracy.

  • Leave the record as it is
    • If the field is not very important, we can leave the record as it is.

  • Remove the record entirely
    • Sometimes critical data is missing and cannot be restored. In that case the only option left is to remove the record completely. The drawback is that the analysis is then based on a smaller sample and becomes less significant, which can have certain implications.

  • Replace with mean or median
    • This is a very popular approach, and we are going to see an example of it in our dataset. The mean is a good option, but the median is usually preferred because it is less affected by outliers.

  • Fill in by exploring correlations and similarities
    • We can use more advanced techniques, such as regression and correlation analysis, to predict the missing value.

  • Introduce a dummy variable for "Missingness"
    • In this approach we introduce a new variable that records where data is missing, for example a "yes" flag for missing and a "no" flag for not missing. We then explore the correlation of that variable with the outcome we are looking for.
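
To make two of these options more concrete, here is a minimal sketch in Python (standard library only) of filling a gap with the median and adding a missingness flag. The numbers are made up for illustration and are not from our dataset; `None` stands in for a missing value, and the names `employees`, `filled` and `missing_flag` are just placeholders.

```python
from statistics import mean, median

# Hypothetical employee counts; None marks a missing value, 9000 is an outlier.
employees = [45, 60, None, 52, 9000, 48]

# Only the observed (non-missing) values take part in the statistics.
observed = [x for x in employees if x is not None]
mean_fill = mean(observed)      # pulled far upward by the single outlier
median_fill = median(observed)  # stays close to the typical company

# Missingness dummy: 1 where the original value was missing, 0 otherwise.
missing_flag = [1 if x is None else 0 for x in employees]

# Impute with the median; the flag keeps track of which values were filled in.
filled = [median_fill if x is None else x for x in employees]

print("mean:", mean_fill, "median:", median_fill)
print("filled:", filled)
print("missing flag:", missing_flag)
```

With one large outlier in the column, the mean lands far away from every typical value while the median does not, which is why the median is usually the safer proxy.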

Let's look at the dataset in CSV format and see what data is missing in it.

 


  • In the Employees column, missing data can be replaced with the industry median. Here, we can proxy the missing value with the median number of employees in the Retail industry.

  • In the Industry column we can either keep the record or remove it. The Industry column is quite important for us, so we will remove the record completely.

  • The missing value in the Inception column cannot sensibly be replaced with the column's median or mean. In our case it is not very important, so we can keep it as it is.

  • Empty values in the State column can be predicted easily from the city. Similarly, missing expenses can be calculated as Revenue minus Profit.

  • There is also a row where Revenue, Expenses, Profit and Growth are all missing. We can proxy these values with the industry median.

The following image shows how we are going to deal with these missing values.
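
The decisions above can also be sketched in code. The following is a rough Python illustration, not the implementation used in this series: the rows, the `ID` values, the exact column spellings and the `CITY_TO_STATE` lookup are all hypothetical stand-ins for the real CSV file.

```python
from statistics import median

# Hypothetical rows; None marks a missing value. This is NOT the tutorial's data.
rows = [
    {"ID": 1, "Industry": "Retail", "Employees": 40, "City": "New York",
     "State": None, "Revenue": 900, "Expenses": None, "Profit": 300},
    {"ID": 2, "Industry": "Retail", "Employees": None, "City": "Boston",
     "State": "MA", "Revenue": 500, "Expenses": 350, "Profit": 150},
    {"ID": 3, "Industry": "Retail", "Employees": 60, "City": "Boston",
     "State": "MA", "Revenue": 700, "Expenses": 400, "Profit": 300},
    {"ID": 4, "Industry": None, "Employees": 25, "City": "New York",
     "State": "NY", "Revenue": 300, "Expenses": 200, "Profit": 100},
]

# Hypothetical lookup table used to derive State from City.
CITY_TO_STATE = {"New York": "NY", "Boston": "MA"}

# 1. Industry is important to us, so drop records where it is missing.
rows = [r for r in rows if r["Industry"] is not None]

# 2. Compute the median Employees per industry from observed values only,
#    before filling anything, so imputed values do not influence the median.
observed = {}
for r in rows:
    if r["Employees"] is not None:
        observed.setdefault(r["Industry"], []).append(r["Employees"])
industry_median = {ind: median(vals) for ind, vals in observed.items()}

for r in rows:
    # 3. State can be derived from City.
    if r["State"] is None:
        r["State"] = CITY_TO_STATE.get(r["City"])
    # 4. Expenses can be derived as Revenue minus Profit when both are known.
    if r["Expenses"] is None and None not in (r["Revenue"], r["Profit"]):
        r["Expenses"] = r["Revenue"] - r["Profit"]
    # 5. Employees is proxied with the median of the record's industry.
    if r["Employees"] is None:
        r["Employees"] = industry_median.get(r["Industry"])

for r in rows:
    print(r)
```

The medians are computed before any value is filled in on purpose: otherwise an imputed value could feed into the median used for the next gap.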
 
