What are the most used data import functions in R

Read csv files

> tags = read.csv("C:\\Users\\Ajay Tech\\Downloads\\ml-20m\\ml-20m\\tags.csv")

Read files with any separator

tags = read.table("C:\\Users\\FGT0008\\Downloads\\ml-20m\\ml-20m\\tags.csv",
                  sep="\t",  # tab separator
                  header = TRUE,
                  na.strings = "NA")
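
As a minimal sketch of how the sep argument changes parsing, the snippet below reads the same small, made-up data set once as tab-separated and once as pipe-separated text. The inline strings and column names are assumptions for illustration and are not the contents of tags.csv; the text argument is used so that no file on disk is needed.

```r
# Hypothetical inline data standing in for a file on disk
tab_text  <- "userId\ttag\n1\tfunny\n2\tNA\n"
pipe_text <- "userId|tag\n1|funny\n2|NA\n"

# sep sets the field separator; na.strings marks missing values
tags_tab  <- read.table(text = tab_text,  sep = "\t", header = TRUE,
                        na.strings = "NA", stringsAsFactors = FALSE)
tags_pipe <- read.table(text = pipe_text, sep = "|",  header = TRUE,
                        na.strings = "NA", stringsAsFactors = FALSE)

# Both calls should produce the same data frame
identical(tags_tab, tags_pipe)
```

Only the sep value differs between the two calls, which is why read.table() is the general-purpose choice when the separator is not a comma.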

Read JSON files

# Read JSON from the OpenWeatherMap API
# (fromJSON(file = ...) as used here is assumed to come from the rjson package)
> library(rjson)
> weather = fromJSON(file="http://api.openweathermap.org/data/2.5/weather?q=chicago&APPID=37a81ae1e682ac417883b0a3xxxxxx")

The result is a nested list.

> str(weather)
List of 12
 $ coord     :List of 2
  ..$ lon: num -87.6
  ..$ lat: num 41.9
 $ weather   :List of 2
  ..$ :List of 4
  .. ..$ id         : num 701
  .. ..$ main       : chr "Mist"
  .. ..$ description: chr "mist"
  .. ..$ icon       : chr "50d"
  ..$ :List of 4
  .. ..$ id         : num 721
  .. ..$ main       : chr "Haze"
  .. ..$ description: chr "haze"
  .. ..$ icon       : chr "50d"
 $ base      : chr "stations"
 $ main      :List of 5
  ..$ temp    : num 292
  ..$ pressure: num 1013
  ..$ humidity: num 100
  ..$ temp_min: num 290
  ..$ temp_max: num 294
 $ visibility: num 4828
 $ wind      :List of 2
  ..$ speed: num 1.48
  ..$ deg  : num 258
 $ clouds    :List of 1
  ..$ all: num 1
 $ dt        : num 1.53e+09
 $ sys       :List of 6
  ..$ type   : num 1
  ..$ id     : num 966
  ..$ message: num 0.0043
  ..$ country: chr "US"
  ..$ sunrise: num 1.53e+09
  ..$ sunset : num 1.53e+09
 $ id        : num 4887398
 $ name      : chr "Chicago"
 $ cod       : num 200

The JSON stream looks like this.

{"coord":{"lon":-87.62,"lat":41.88},"weather":[{"id":701,"main":"Mist","description":"mist","icon":"50d"},{"id":721,"main":"Haze","description":"haze","icon":"50d"}],"base":"stations","main":{"temp":291.91,"pressure":1013,"humidity":100,"temp_min":290.15,"temp_max":294.15},"visibility":4828,"wind":{"speed":1.48,"deg":258},"clouds":{"all":1},"dt":1530185760,"sys":{"type":1,"id":966,"message":0.0043,"country":"US","sunrise":1530181076,"sunset":1530235782},"id":4887398,"name":"Chicago","cod":200}
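
Once the JSON has been parsed, the values are reached like any other nested R list. The sketch below builds a small list by hand that mirrors part of the str(weather) output above, so it runs without a network call or an API key; the hand-built list is an assumption standing in for the real response.

```r
# Hand-built list mirroring part of the parsed response shown above
weather <- list(
  coord   = list(lon = -87.62, lat = 41.88),
  weather = list(
    list(id = 701, main = "Mist", description = "mist", icon = "50d"),
    list(id = 721, main = "Haze", description = "haze", icon = "50d")
  ),
  main = list(temp = 291.91, pressure = 1013, humidity = 100),
  name = "Chicago"
)

# Named elements are reached with $, unnamed ones with [[ ]]
weather$name
weather$weather[[2]]$main

# OpenWeatherMap reports temperature in Kelvin by default
temp_celsius <- weather$main$temp - 273.15

# Pull one field out of every entry of the "weather" array
conditions <- sapply(weather$weather, function(w) w$main)
conditions
```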

Read CSV file from the web

# iris.data has no header row, so header = FALSE keeps the first record as data
> iris_data = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header = FALSE)

Scrape HTML tables from the web

If you want to scrape data from HTML tables on the web, you can use the htmltab package. To point it at a specific HTML table, use an XPath expression.

Once you have the XPath, use the following code.

require(htmltab)
xpath = "//*[@id='mw-content-text']/div/table"
country_iso_codes = htmltab(doc = "https://simple.wikipedia.org/wiki/List_of_U.S._states",
                            which = xpath)

And you get a data frame

> country_iso_codes
   Sl no. Abbreviation     State Name        Capital    Became a State
2       1           AL        Alabama     Montgomery December 14, 1819
3       2           AK         Alaska         Juneau   January 3, 1959
4       3           AZ        Arizona        Phoenix February 14, 1912
5       4           AR       Arkansas    Little Rock     June 15, 1836

What is the difference between data frame and data table

A data frame is one of the basic data structures in R. You can perform many manipulations on a data frame, such as selection, subsetting, merging, stacking and unstacking. However, all the data in a data frame is held in memory, and operations on large data frames can be slow.

So there is a limit to how much data you can comfortably manipulate in a data frame. A data table (from the data.table package) is an extension of the data frame that was created with the following in mind:

  • Easy and fast grouping
  • Easy SQL like syntax for filtering
  • Faster joins
  • Index based retrieval
  • Faster file reading
  • Ability to check the amount of memory used
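
A few of these points can be sketched with the data.table package. This assumes data.table is installed; the dept and salary columns are made-up sample data.

```r
library(data.table)  # assumes the data.table package is installed

# Hypothetical sample data
dt <- data.table(dept   = c("ops", "dev", "ops", "dev"),
                 salary = c(50, 70, 55, 80))

# SQL-like filtering: columns are referenced directly inside [ ]
high <- dt[salary > 52]

# Grouping: total salary per department
totals <- dt[, .(total = sum(salary)), by = dept]

# Key-based retrieval: set a key, then look rows up by key value
setkey(dt, dept)
dev_rows <- dt["dev"]

# List the data tables in memory with their approximate size
tables()
```

For fast file reading, the same package provides fread() as a counterpart to read.csv().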

How to find out the unique elements in a vector

Say there are 100 students being graded

# Prepare some sample grades. 
> grades = round(rnorm(100,mean=3.5,sd=0.5),1)
> grades
  [1] 4.0 4.2 2.9 3.6 3.5 3.5 3.6 4.2 3.6 3.2 4.0 3.8 2.8 4.2
 [15] 3.2 4.1 3.8 3.4 3.1 4.2 3.5 2.8 3.5 4.2 3.2 3.0 3.2 3.4
 [29] 3.9 3.2 3.8 3.2 3.3 3.5 3.8 4.0 3.2 3.5 2.7 3.5 3.1 3.3
 [43] 3.8 3.9 4.0 3.8 4.7 2.9 2.6 3.8 4.1 3.1 3.7 4.1 3.7 2.4
 [57] 3.6 2.4 2.8 3.3 3.2 2.9 3.3 3.1 4.0 3.3 3.4 3.5 3.8 4.2
 [71] 3.4 3.3 2.3 3.5 3.5 4.1 4.2 3.7 4.3 2.7 3.7 4.5 3.2 3.3
 [85] 3.3 3.0 3.5 2.5 3.8 3.1 4.4 2.7 3.3 3.7 3.4 3.5 4.1 4.2
 [99] 3.7 4.2

and you want to find out how many unique grades there are.

unique() function

> unique(grades)
 [1] 4.0 4.2 2.9 3.6 3.5 3.2 3.8 2.8 4.1 3.4 3.1 3.0 3.9 3.3 2.7
[16] 4.7 2.6 3.7 2.4 2.3 4.3 4.5 2.5 4.4

table() function

> table(grades)
grades
2.3 2.4 2.5 2.6 2.7 2.8 2.9   3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9   4 4.1 4.2 4.3 4.4 4.5 4.7 
  1   2   1   1   3   3   3   2   5   9   9   5  12   4   6   9   2   5   5   9   1   1   1   1
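
To get the count itself rather than the list of values, wrap unique() in length(). The sketch below uses a short, hand-written grade vector (an assumption, so that the results are reproducible) instead of the random grades above.

```r
# Small hypothetical grade vector (the example above uses 100 random grades)
grades <- c(4.0, 4.2, 2.9, 3.6, 4.2, 3.6, 4.0, 2.9, 4.2)

unique(grades)            # distinct values, in order of first appearance
length(unique(grades))    # number of distinct values

counts <- table(grades)   # how often each value occurs
names(counts)[counts == max(counts)]   # most frequent grade, as a string

duplicated(grades)        # TRUE where a value has already been seen
```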

What is the difference between dataframe and matrix in R

Let’s create an employee table.

install.packages("randomNames")
require(randomNames)
# Get 100 random names (stored as "names", which the data frame below uses)
names = randomNames(100)
# Get 100 random ages
age = round(rnorm(100,mean = 30, sd = 10))

Now, let’s create a data frame with just 2 columns – name and age

> employees = data.frame(names, age, stringsAsFactors=FALSE)
> str(employees)
'data.frame': 100 obs. of  2 variables:
 $ names: chr  "Persons, Shelby" "Taylor, Chukwuma" "Jarvis, Destiny" "Rape, Zachery" ...
 $ age  : num  13 20 31 42 23 37 27 27 20 22 ...

You could do this because data.frame can contain columns of different types. In this case, names is a string and age is a number.

Can you do this with a matrix? No. A matrix can contain only one type of data.

Convert a dataframe to Matrix

If you try to convert this dataframe to a matrix, look at what happens.

> employees_m = data.matrix(employees)
Warning message:
In data.matrix(employees) : NAs introduced by coercion

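What does the matrix contain? In the R version used here, the names (string) column was coerced to NA. (Since R 4.0.0, data.matrix() instead converts character columns to factors and then to their integer codes, so the names are still lost but no NAs appear.)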

> str(employees_m)
 num [1:100, 1:2] NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "names" "age"

As you can see, the data in the names column is gone.

> head(employees_m)
     names age
[1,]    NA  13
[2,]    NA  20
[3,]    NA  31
[4,]    NA  42
[5,]    NA  23
[6,]    NA  37

Here are the differences

1. Matrix is homogeneous but a data frame can be heterogeneous.

2. You can have factors in a data frame but not in a matrix
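
The first difference can be checked directly. In this minimal sketch the two-row employees table is made up for illustration; as.matrix() is used instead of data.matrix() to show that a matrix settles on a single common type.

```r
# Hypothetical two-row employee table
employees <- data.frame(names = c("Ajay", "Mary"), age = c(30, 41),
                        stringsAsFactors = FALSE)

# Each data frame column keeps its own type
col_types <- sapply(employees, class)
col_types

# as.matrix() picks one common type for every cell, here character
m_all <- as.matrix(employees)
typeof(m_all)

# Converting only the numeric column keeps the numbers as numbers
m_age <- as.matrix(employees["age"])
typeof(m_age)
```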

R Data Structures

There are 4 basic data structures in R. These are distinct from the basic data types (numeric, integer, character, logical, complex).

1. Vector – Sequence of elements of the same basic data type.

# These are some examples of vectors. 
# A numeric vector showing temperatures in Chicago of a particular week.
> temp = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4)
> temp
[1] 12.4 13.5 15.6 20.0 21.5 13.6 12.4
# You cannot have different data types in a vector. 
# Since 12.4 is now defined as a string, all of the numbers have been
# converted to strings automatically.
> temp = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, "12.4")
> temp
[1] "12.4" "13.5" "15.6" "20"   "21.5" "13.6" "12.4"

2. Matrix – If a Vector is 1D (1-dimensional), a matrix is 2D

# For example, if the vector temp_month contains the temperatures in 
# Chicago for an entire month
> temp_month
 [1] 12.4 13.5 15.6 20.0 21.5 13.6 12.4 13.2 12.6 17.6
[11] 12.8 19.4 14.3 16.7 17.2 14.5 16.7 19.2 21.5 14.0
[21] 19.4 19.3 17.5 18.5 20.1 22.5 23.5 34.1
# this can be folded into a 4x7 matrix as below
> month_matrix = matrix(temp_month, nrow=4, ncol=7, byrow=TRUE)
> month_matrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 12.4 13.5 15.6 20.0 21.5 13.6 12.4
[2,] 13.2 12.6 17.6 12.8 19.4 14.3 16.7
[3,] 17.2 14.5 16.7 19.2 21.5 14.0 19.4
[4,] 19.3 17.5 18.5 20.1 22.5 23.5 34.1

3. List – Like a vector, but not limited to “same data type”

# For example, different data types like a person's age (numeric), 
# name(string), city(string),zip code (numeric) can be stored in a list. 
> person = list(30, "Ajay", "San Francisco", 94000 )
> person
[[1]]
[1] 30
[[2]]
[1] "Ajay"
[[3]]
[1] "San Francisco"
[[4]]
[1] 94000

4. Data Frame – Like a Matrix, but not limited to “same data type”

# A data frame is usually created by reading from external data
# using functions like read.csv() or read.table().
# Imagine 4 vectors
> age = c(30,32,21,60)
> names = c("Ajay","Adam","Mary","Aishu")
> cities = c("San Francisco","New York","Sunnyvale","San Jose")
> zip = c(94000,40101,94010,94001)

# We can combine the vectors into a data frame like so
> persons = data.frame(age,names,cities,zip)
> persons
  age names        cities   zip
1  30  Ajay San Francisco 94000
2  32  Adam      New York 40101
3  21  Mary     Sunnyvale 94010
4  60 Aishu      San Jose 94001
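
Lists and data frames are easier to work with when their elements are named. The following sketch, with made-up values, shows name-based access and confirms that a data frame is itself a list of equal-length columns.

```r
# Named list: each element keeps its own type and is reachable by name
person <- list(age = 30, name = "Ajay", city = "San Francisco", zip = 94000)
person$city
next_age <- person[["age"]] + 1

# A data frame is a list of equal-length column vectors
persons <- data.frame(age = c(30, 32), names = c("Ajay", "Adam"),
                      stringsAsFactors = FALSE)
is.list(persons)
persons$names[2]
dim(persons)

# A matrix is a vector with a dim attribute, so every cell shares one type
m <- matrix(1:6, nrow = 2)
typeof(m)
```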

How to subset a vector in R

Subsetting a vector in R is simple. You just have to understand how to address the individual elements of the vector. The same principle applies to subsetting a data frame, except that addressing is easier for a vector because it has only one dimension.

Being able to address the elements of a vector is the key to subsetting it. Consider the following vector, state.abb, which is part of base R.

There are 50 states in this vector, and subsetting them is just a matter of addressing them.

> # state.abb vector that comes with base R
> state.abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME"
[20] "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA"
[39] "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"

The output above lists all 50 state abbreviations. The bracketed number at the start of each line is the index of the first element on that line.

Subsetting by range

> # First 5 states (by index range)
> state.abb[1:5]
[1] "AL" "AK" "AZ" "AR" "CA"

Subsetting with negative indices and range

> # First 5 states (by negative index range)
> state.abb[-6:-50]
[1] "AL" "AK" "AZ" "AR" "CA"
> 
> # The other way is fine too
> state.abb[-50:-6]
[1] "AL" "AK" "AZ" "AR" "CA"
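
A few related idioms, as a short sketch using the same state.abb vector from base R. Wrapping the range in parentheses avoids having to think about operator precedence, and positive and negative indices cannot be mixed in one call.

```r
# state.abb ships with base R (50 state abbreviations)
first_five <- state.abb[-(6:50)]   # parentheses make the negative range explicit
first_five

# head() is a common shortcut for the first n elements
identical(first_five, head(state.abb, 5))

# Dropping a single element leaves the other 49
length(state.abb[-1])

# Mixing positive and negative indices is an error
mixed <- try(state.abb[c(-1, 2)], silent = TRUE)
inherits(mixed, "try-error")
```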

Subsetting with specific indices

Get the 5th, 10th and 15th states

> # Get the 5th, 10th and 15th states
> state.abb[ c(5, 10, 15)]
[1] "CA" "GA" "IA"

Subsetting with names

There is another vector, state.name, that lists all the state names. Create a new vector us_states with the state names as values and the state abbreviations as names.

> # set names to the states vector
> us_states = setNames(state.name, state.abb)
> us_states
              AL               AK               AZ               AR               CA 
       "Alabama"         "Alaska"        "Arizona"       "Arkansas"     "California" 
              CO               CT               DE               FL               GA 
      "Colorado"    "Connecticut"       "Delaware"        "Florida"        "Georgia" 
              HI               ID               IL               IN               IA 
        "Hawaii"          "Idaho"       "Illinois"        "Indiana"           "Iowa" 
              KS               KY               LA               ME               MD 
        "Kansas"       "Kentucky"      "Louisiana"          "Maine"       "Maryland" 
              MA               MI               MN               MS               MO 
 "Massachusetts"       "Michigan"      "Minnesota"    "Mississippi"       "Missouri" 
              MT               NE               NV               NH               NJ 
       "Montana"       "Nebraska"         "Nevada"  "New Hampshire"     "New Jersey" 
              NM               NY               NC               ND               OH 
    "New Mexico"       "New York" "North Carolina"   "North Dakota"           "Ohio" 
              OK               OR               PA               RI               SC 
      "Oklahoma"         "Oregon"   "Pennsylvania"   "Rhode Island" "South Carolina" 
              SD               TN               TX               UT               VT 
  "South Dakota"      "Tennessee"          "Texas"           "Utah"        "Vermont" 
              VA               WA               WV               WI               WY 
      "Virginia"     "Washington"  "West Virginia"      "Wisconsin"        "Wyoming" 

Now, let’s get the state names for the abbreviations AZ, CO and DE.

> # Get the states with abbreviation AZ, CO and DE
> us_states[c("AZ","CO","DE")]
        AZ         CO         DE 
 "Arizona" "Colorado" "Delaware" 

Subsetting with filter conditions

Old Faithful is a geyser in the US that erupts hot water periodically. The waiting time between eruptions and the eruption duration are documented in the built-in data frame “faithful”.

> waiting = faithful$waiting
> waiting
  [1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74 83 55 76 78 79 73
 [32] 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59 75 90 54 80 54 83 71 64 77 81 59 84
 [63] 48 82 60 92 78 78 65 73 82 56 79 71 62 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50
 [94] 78 63 72 84 75 51 82 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56
[125] 88 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53 77 77 65 81 71
[156] 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81 81 73 50 85 74 55 77 83 83 51 78
[187] 84 46 83 55 81 57 76 84 77 81 87 77 51 78 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53
[218] 94 55 76 50 82 54 75 78 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82
[249] 67 74 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74

Let’s pick out all the wait times > 90 minutes

> # Get all the wait times > 90 minutes
> waiting[ waiting > 90 ]
[1] 92 96 93 93 91 94

Simple enough, right? The comparison waiting > 90 produces a logical vector, which is then used as the index into waiting. Only the elements at positions where that logical vector is TRUE are returned. You can combine as many logical tests as you like.
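
Combining tests can be sketched as follows, using the same built-in faithful data. The thresholds are arbitrary values chosen for illustration.

```r
waiting <- faithful$waiting   # built-in Old Faithful data set

# Combine tests with & (and) and | (or); both work element by element
long_waits <- waiting[waiting > 90 & waiting < 95]
long_waits

extremes <- waiting[waiting < 45 | waiting > 94]
extremes

# which() returns the positions of the matches instead of the values
which(waiting > 94)

# sum() on a logical vector counts the TRUE values
sum(waiting > 90)
```

Use the single-character operators & and | here. The doubled forms && and || are meant for single TRUE/FALSE values, as in if() conditions, and are not suitable for building an index vector.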