
1.4 GSUB() and SUB()




Hello and welcome back to the advanced course on R programming. In this tutorial, we are going to deal with the factor variables Revenue and Expenses, and learn how to convert them into non-factor variables.



If we look at str() on fin, Expenses holds nothing but dollar amounts, yet R recognizes it as a factor. The same is true of Revenue and Growth. This is mainly due to the word 'Dollars' in the variable Expenses, and the signs '$' and '%' in the variables Revenue and Growth respectively. We have to convert these variables to numeric, and for that we will use the functions sub() and gsub().

We can get the details on these functions from R's built-in help.

What these functions do is look for a pattern and replace it with the desired replacement. The difference between sub() and gsub() is that sub() replaces just the first instance, whereas gsub() replaces all instances. Let's go ahead and try these functions out.
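To make the difference concrete, here is a minimal sketch. The values in x are made up for illustration and are not taken from the course dataset.

```r
# Hypothetical sample values, made up for illustration
x <- c("1,130,700 Dollars", "12,000,500 Dollars")

# ?gsub opens the shared help page for sub() and gsub()

# sub() replaces only the first comma in each element
first_only <- sub(",", "", x)

# gsub() replaces every comma in each element
all_commas <- gsub(",", "", x)

print(first_only)
print(all_commas)
```

Both functions work element by element, so "first instance" means the first match within each string, not the first match in the whole vector.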



We will start with the Expenses column, in which we want to replace " Dollars" (note the space before the word Dollars) with nothing.

Hence, let's run gsub() on fin$Expenses with " Dollars" as the pattern and an empty string as the replacement, and you will see that Dollars is removed from the Expenses column.
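A minimal sketch of this step is shown below. The data frame here is a small hypothetical stand-in for fin with invented values, and stringsAsFactors = TRUE is set explicitly so that the column starts out as a factor, as in the tutorial.

```r
# Hypothetical stand-in for the course's fin data frame (values are invented)
fin <- data.frame(
  Expenses = c("1,130,700 Dollars", "804,035 Dollars"),
  stringsAsFactors = TRUE  # force a factor column, as in the tutorial
)

# Replace " Dollars" (with the leading space) by an empty string
fin$Expenses <- gsub(" Dollars", "", fin$Expenses)

print(fin$Expenses)
```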

Now we have to remove the commas from the same column. We repeat the process, this time with "," as the pattern.

Now you can see that Expenses no longer has commas. Let's check str() on fin again.


We will see that Expenses is no longer a factor; it is now of type character.
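The change of type can be checked with a short sketch like the following, again on a hypothetical stand-in for fin. gsub() coerces its input to character, which is why the column stops being a factor.

```r
# Hypothetical stand-in for fin (values are invented)
fin <- data.frame(
  Expenses = c("1,130,700 Dollars", "804,035 Dollars"),
  stringsAsFactors = TRUE
)
class_before <- class(fin$Expenses)  # factor

# Remove the word first, then every comma
fin$Expenses <- gsub(" Dollars", "", fin$Expenses)
fin$Expenses <- gsub(",", "", fin$Expenses)
class_after <- class(fin$Expenses)   # character

str(fin)
```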

Now let's deal with the variable Revenue. We will use the same gsub() as before, with a slight change.

Please note that '$' is itself a special character in a pattern (it marks the end of a string). To make R treat it as a literal character in the Revenue values, we use an escape sequence, which is two backslashes placed before the '$'.

Let's remove the commas now. Like Expenses, Revenue is now of type character.
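The two Revenue steps can be sketched as follows, using invented values. The fixed = TRUE variant is an alternative that is not part of the original tutorial; it tells gsub() to treat the pattern as plain text, so no escaping is needed.

```r
# Hypothetical Revenue values, made up for illustration
revenue <- factor(c("$9,684,527", "$14,016,543"))

# Unescaped, "$" only matches the end of the string, so nothing is removed
unescaped <- gsub("$", "", revenue)

# Escaped with two backslashes, "$" is matched as a literal character
no_dollar <- gsub("\\$", "", revenue)

# Alternative (not from the tutorial): match the pattern literally
no_dollar_fixed <- gsub("$", "", revenue, fixed = TRUE)

# Finally remove the commas
revenue_clean <- gsub(",", "", no_dollar)
print(revenue_clean)
```

Two backslashes are needed because the backslash is also an escape character inside an R string, so "\\$" in the code reaches the pattern matcher as \$.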

We will have to repeat the same process for the Growth variable as well, this time removing the '%' sign.
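For Growth, a sketch with invented values could look like this. Unlike '$', the '%' sign has no special meaning in a pattern, so it does not need to be escaped.

```r
# Hypothetical Growth values, made up for illustration
growth <- factor(c("30%", "20%", "7%"))

# "%" is an ordinary character in a pattern, so no backslashes are needed
growth_clean <- gsub("%", "", growth)

print(growth_clean)
```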

Now that all three variables are of type character, we can easily convert them to numeric with the function as.numeric().

Now these three variables are actually recognized as numeric, which is exactly what we wanted.
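The sketch below, on invented values, shows why the detour through character matters. Calling as.numeric() on a character vector parses the digits, whereas calling it directly on a factor returns the internal level codes.

```r
# Hypothetical cleaned values, made up for illustration
expenses_chr <- c("1130700", "804035")

# Character to numeric: the digits are parsed as numbers
expenses_num <- as.numeric(expenses_chr)

# Factor to numeric: returns the level codes, not the original values
expenses_fct <- factor(expenses_chr)
level_codes <- as.numeric(expenses_fct)

print(expenses_num)
print(level_codes)
```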

Here is the complete code.
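The whole sequence can be sketched in one place as follows. This is an illustration on a small hypothetical data frame with invented values, not the course's actual fin dataset, which would normally be read from the CSV file.

```r
# Hypothetical stand-in for the course's fin data frame (values are invented)
fin <- data.frame(
  Expenses = c("1,130,700 Dollars", "804,035 Dollars"),
  Revenue  = c("$9,684,527", "$14,016,543"),
  Growth   = c("30%", "20%"),
  stringsAsFactors = TRUE  # force factor columns, as in the tutorial
)

# Expenses: remove " Dollars", then the commas
fin$Expenses <- gsub(" Dollars", "", fin$Expenses)
fin$Expenses <- gsub(",", "", fin$Expenses)

# Revenue: remove the escaped "$", then the commas
fin$Revenue <- gsub("\\$", "", fin$Revenue)
fin$Revenue <- gsub(",", "", fin$Revenue)

# Growth: remove the "%" sign
fin$Growth <- gsub("%", "", fin$Growth)

# All three columns are character now, so convert them to numeric
fin$Expenses <- as.numeric(fin$Expenses)
fin$Revenue  <- as.numeric(fin$Revenue)
fin$Growth   <- as.numeric(fin$Growth)

str(fin)
```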
