Gabor Data Analysis: Chapter 1 Stuck in a Code: How to Set Directory

Dear Stackoverflow Community,

I am teaching myself R and data analysis using the textbook "Békés, Gábor. Data Analysis for Business, Economics, and Policy" (https://gabors-data-analysis.com/), and I am stuck on the code below, which uses the hotels-vienna dataset (https://osf.io/y6jvb).

I have some experience with R, but my basics are weak, so I am relearning from the very beginning and would really appreciate step-by-step guidance on how to get the code below working.

PRACTICAL QUESTION OF THE TEXTBOOK: Take the hotels-vienna dataset used in this chapter and use your computer to pick samples of size 25, 50, and 200. Calculate the simple average of hotel price in each sample and compare them to those in the entire dataset. Repeat this exercise three times and record the results. Comment on how the average varies across samples of different sizes.

DATASET : https://osf.io/y6jvb

CODE

library(tidyverse)

# set working directory
# option A: open material as project
# option B: set working directory for da_case_studies

#example: setwd("C:/Users/bekes.gabor/Documents/github/da_case_studies/")

#set data dir, load theme and functions

setwd("C:/Users/sha/Desktop/R/intro/data/da_case_studies/")

source("ch00-tech-prep/theme_bg.R")
source("ch00-tech-prep/da_helper_functions.R")

I don't know how to define data_dir or where to get set-data-directory.R (the textbook's instructions for setting up the computer are at https://gabors-data-analysis.com/howto-r/).
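A minimal sketch of what set-data-directory.R could contain, assuming (as the comment "data_dir must be first defined" in the script suggests) that the file only has to define data_dir. The folder path is a hypothetical placeholder, not a path taken from the textbook.

```r
# Hypothetical contents of set-data-directory.R: it only has to define data_dir.
# The path is a placeholder; replace it with the folder holding your copy of the data.
data_dir <- "C:/Users/your-name/Documents/da_data_repo"

# The case-study script then builds the input folder from data_dir:
data_in <- paste(data_dir, "hotels-vienna", "clean/", sep = "/")

# Check that the file is where the script expects it before calling read.csv()
csv_path <- paste0(data_in, "hotels-vienna.csv")
if (!file.exists(csv_path)) {
  message("Not found: ", csv_path, " (check data_dir)")
}
```

If the file is not found, the data can also be read straight from the OSF download link used later in the script, in which case data_dir is not needed for this exercise.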

Data used:

source("set-data-directory.R") # data_dir must be defined first
data_in <- paste(data_dir,"hotels-vienna","clean/", sep = "/")

use_case_dir <- "ch01-hotels-data-collect/"
data_out <- use_case_dir
output <- paste0(use_case_dir,"output/")
create_output_if_doesnt_exist(output)

# load in clean and tidy data and create workfile

df <-  read.csv(paste0(data_in,"hotels-vienna.csv"))

# or from the website

df <- read_csv("https://osf.io/y6jvb/download")


# First look

df <- df%>%
  select(hotel_id, accommodation_type, country, city, city_actual, neighbourhood, center1label, distance,center2label, distance_alter, stars, rating, rating_count, ratingta, ratingta_count, year, month,weekend, holiday, nnights, price, scarce_room, offer, offer_cat)

summary(df)
glimpse(df)

# export list

df <- subset(df, select = c(hotel_id, accommodation_type, country, city, city_actual, center1label, distance, stars, rating, price)) 
write.csv(df[1:5,], paste0(output, "hotel_listobs.csv"), row.names = F)

Sample of the dataset using dput():

dput(head(df[, c(1:10)]))

structure(list(hotel_id = c(21894L, 21897L, 21901L, 21902L, 21903L, 
21904L), accommodation_type = c("Apartment", "Hotel", "Hotel", 
"Hotel", "Hotel", "Apartment"), country = c("Austria", "Austria", 
"Austria", "Austria", "Austria", "Austria"), city = c("Vienna", 
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), city_actual = c("Vienna", 
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), center1label = c("City centre", 
"City centre", "City centre", "City centre", "City centre", "City centre"
), distance = c(2.7, 1.7, 1.4, 1.7, 1.2, 0.9), stars = c(4, 4, 
4, 3, 4, 5), rating = c(4.4, 3.9, 3.7, 4, 3.9, 4.8), price = c(81L, 
81L, 85L, 83L, 82L, 229L)), row.names = c(NA, 6L), class = "data.frame")

What I've tried:

setwd("C:/Users/sha03/Desktop/R/intro/data/da_case_studies/")

source("theme_bg.R")
source("da_helper_functions.R")

read.csv('C:/Users/sha03/Desktop/R/intro/data/da_case_studies/hotels-vienna.csv')

summary(df)
glimpse(df)

I can't seem to reproduce the result I am supposed to get, which is shown here: https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch01-hotels-data-collect/ch01-hotels-data-collect.ipynb
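One likely contributor to the mismatch is that the read.csv() call in the attempt never assigns its result, so df is not the hotel data when summary(df) runs. The sketch below illustrates this with a tiny CSV rebuilt from three rows of the dput() sample; the temporary file and the name hotels_head are illustrative assumptions, not part of the textbook code.

```r
# Rebuild a tiny CSV from three rows of the dput() sample so the example
# runs anywhere; the temp-file path is purely illustrative.
hotels_head <- data.frame(
  hotel_id = c(21894L, 21897L, 21901L),
  accommodation_type = c("Apartment", "Hotel", "Hotel"),
  price = c(81L, 81L, 85L)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(hotels_head, csv_path, row.names = FALSE)

# Without an assignment, read.csv() only prints the data, and `df` still
# refers to stats::df (the F-distribution density function).
is.function(df)   # TRUE in a fresh session

# With the assignment, df is the data frame that summary() and glimpse() expect.
df <- read.csv(csv_path)
is.data.frame(df)
summary(df$price)
```

The same applies to the real file: assigning the result of read.csv() to df is what makes the later summary(df) and glimpse(df) calls describe the hotels.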

There is 1 solution below

Answered by jpsmith

To calculate the average price for samples of size 25, 50, and 200 and for the full dataset, you can use sample() to draw random row indices and then select the price column for those rows. Remember to set a seed before drawing random samples so that the results are reproducible.

You can do it individually:

set.seed(123)
n25 <- mean(df[sample(seq_len(nrow(df)), 25), "price"])
n50 <- mean(df[sample(seq_len(nrow(df)), 50), "price"])
n200 <- mean(df[sample(seq_len(nrow(df)), 200), "price"])
nall <- mean(df[, "price"])

# > n25
# [1] 139.88
# > n50
# [1] 145.86
# > n200
# [1] 119.655
# > nall
# [1] 131.3668

Or all in one go:

set.seed(123)
n <- c(25, 50, 200, nrow(df))

setNames(
  vapply(n, \(x) mean(df[sample(seq_len(nrow(df)), x), "price"]), numeric(1)),
  paste0("n = ", n)
)

#   n = 25   n = 50  n = 200  n = 428 
# 139.8800 145.8600 119.6550 131.3668 

If you need to repeat this three times, you can wrap it in lapply, which returns a list with one element per repetition (n is the vector defined above):

set.seed(123)

lapply(1:3, \(y) setNames(
  vapply(n, \(x) mean(df[sample(seq_len(nrow(df)), x), "price"]), numeric(1)),
  paste0("n = ", n)
))

# [[1]]
#   n = 25   n = 50  n = 200  n = 428 
# 139.8800 145.8600 119.6550 131.3668 
# 
# [[2]]
#   n = 25   n = 50  n = 200  n = 428 
# 140.1200 113.6000 132.2300 131.3668 
# 
# [[3]]
#   n = 25   n = 50  n = 200  n = 428 
# 132.6800 137.9800 121.4600 131.3668
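
To comment on how the average varies with sample size, it may help to collect the repetitions in a matrix rather than a list. The following is only a sketch: it uses 428 made-up prices as a stand-in for the real data (so the numbers will not match the output above), and the name means is an illustrative choice.

```r
# Stand-in data: 428 made-up prices, NOT the real hotels-vienna values.
set.seed(123)
df <- data.frame(price = round(rlnorm(428, meanlog = 4.7, sdlog = 0.5)))

n <- c(25, 50, 200, nrow(df))

# One row per repetition, one column per sample size
means <- t(replicate(3, vapply(
  n,
  function(size) mean(df[sample(seq_len(nrow(df)), size), "price"]),
  numeric(1)
)))
colnames(means) <- paste0("n = ", n)
means

# Spread of the sample mean across the three repetitions, by sample size
apply(means, 2, function(col) diff(range(col)))
```

The last column never changes, because sampling every row without replacement only reorders the data; the variation of interest is in the columns for the smaller sample sizes.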