1.2 What are Factors? (Refresher)
Continuing from the previous post, we saw that the dataset fin contains lots of variables stored as factors. So let's understand what factors are.
Factors are data objects used to categorize data and store it as levels. They can be built from both strings and integers. They are useful for columns that have a limited number of unique values, such as “Male”, “Female” or TRUE, FALSE. They are also useful in data analysis for statistical modeling.
Factors are created using the factor() function, which takes a vector as input.
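As a minimal sketch of this, the snippet below builds a factor from a character vector. The vector `gender` is a made-up example for illustration and is not part of the fin dataset.

```r
# Hypothetical character vector with a limited set of unique values
gender <- c("Male", "Female", "Female", "Male", "Female")

# Convert the vector to a factor; by default the levels are the sorted unique values
gender_f <- factor(gender)

print(gender_f)    # the values, followed by the list of levels
levels(gender_f)   # the distinct categories
nlevels(gender_f)  # how many categories there are
```

Here the five observations collapse into two levels, which is what makes factors a good fit for columns with few unique values.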
In our example, we see that industry is a factor variable with 8 levels, and in the summary we see that it contains values like IT Service, Health, Software, etc. It also contains one empty category. In the structure we see that there are numbers which act as an identifier for each value in the variable.
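To see those identifiers on a small scale, here is a rough sketch using a made-up `industry` vector. It is only a stand-in with four levels, not the real column from fin, and it includes one empty string to mimic the empty category.

```r
# Hypothetical stand-in for the industry column, including one empty value
industry <- factor(c("IT Service", "Health", "", "Software", "Health"))

levels(industry)      # the categories, including the empty one
as.integer(industry)  # the integer identifier stored for each observation
str(industry)         # compact view showing the levels and the identifiers
summary(industry)     # number of observations in each level
```

Each number returned by as.integer() is simply the position of that observation's value in levels(industry).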

We see that many variables have also been recognized as integers, like ID, Inception, profit, etc. Careful thinking reveals that ID need not be an integer because we are not doing any arithmetic on it. Hence, it is better if it is recognized as a factor. The same concept applies to the variable Inception as well.
We also see that Revenue, Expenses and Growth are recognized as factors, whereas they should be recognized as integers. This happens because of the “$” sign in Revenue, the “,” in Expenses and the “%” sign in Growth, which prevent them from being recognized as integers.
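The effect of those symbols can be reproduced with a few made-up values. This is only a sketch: the numbers below are hypothetical, and stripping the symbols with gsub() is one possible approach, not necessarily the one used later in this series.

```r
# Hypothetical values formatted the way the text describes them
revenue  <- c("$1000", "$2500")
expenses <- c("1,200", "3,400")
growth   <- c("30%", "12%")

# Direct conversion does not work because of the symbols: the result is NA
suppressWarnings(as.numeric(revenue))

# Removing the symbols first lets the text be parsed as numbers
revenue_num  <- as.numeric(gsub("$", "", revenue, fixed = TRUE))
expenses_num <- as.numeric(gsub(",", "", expenses, fixed = TRUE))
growth_num   <- as.numeric(gsub("%", "", growth, fixed = TRUE))

str(revenue_num)
str(expenses_num)
str(growth_num)
```

Note that the sketch starts from character vectors. If the column is already a factor, convert it with as.character() before cleaning it.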
Let’s deal with it now…
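The original code for this step is not shown here, so the following is only a sketch of converting ID to a factor. The data frame below is a tiny made-up stand-in for fin with hypothetical values.

```r
# Hypothetical miniature version of fin; the real dataset has more rows and columns
fin <- data.frame(
  ID        = c(1L, 2L, 3L),
  Inception = c(2011L, 2009L, 2011L),
  Profit    = c(100L, 250L, 75L)
)

str(fin$ID)   # an integer vector before the conversion

# ID is only a label, so store it as a factor
fin$ID <- factor(fin$ID)

str(fin$ID)   # a factor with one level per distinct ID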

Do the same for Inception, then check the result using the function str().
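A sketch of the same step for Inception, again on a small made-up data frame rather than the real fin:

```r
# Hypothetical founding years standing in for the Inception column
fin <- data.frame(Inception = c(2011L, 2009L, 2011L, 2013L))

str(fin)   # Inception is an integer column

# Treat the year as a category instead of a number
fin$Inception <- factor(fin$Inception)

str(fin)                # Inception is now a factor
levels(fin$Inception)   # one level per distinct year
```

Because years repeat across companies, the number of levels is smaller than the number of rows.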

The complete code is as follows

We will learn about the factor variable trap in the next tutorial.