Dear Stack Overflow community,
I am trying to teach myself R and data analysis using the textbook "Békés, Gábor. Data Analysis for Business, Economics, and Policy" (https://gabors-data-analysis.com/), and I am stuck on the code below, which uses the Hotels Vienna dataset (https://osf.io/y6jvb).
I have some experience with R, but my basics are weak, so I am relearning from the very beginning and would appreciate step-by-step guidance on how to work through the code below.
PRACTICAL QUESTION OF THE TEXTBOOK: Take the hotels-vienna dataset used in this chapter and use your computer to pick samples of size 25, 50, and 200. Calculate the simple average of hotel price in each sample and compare them to those in the entire dataset. Repeat this exercise three times and record the results. Comment on how the average varies across samples of different sizes.
DATASET : https://osf.io/y6jvb
CODE
library(tidyverse)
# set working directory
# option A: open material as project
# option B: set working directory for da_case_studies
#example: setwd("C:/Users/bekes.gabor/Documents/github/da_case_studies/")
#set data dir, load theme and functions
setwd("C:/Users/sha/Desktop/R/intro/data/da_case_studies/")
source("ch00-tech-prep/theme_bg.R")
source("ch00-tech-prep/da_helper_functions.R")
I don't know how to define data_dir or where to get set-data-directory.R (the textbook's setup instructions are at https://gabors-data-analysis.com/howto-r/).
Data used:
source("set-data-directory.R") # data_dir must be first defined
data_in <- paste(data_dir,"hotels-vienna","clean/", sep = "/")
use_case_dir <- "ch01-hotels-data-collect/"
data_out <- use_case_dir
output <- paste0(use_case_dir,"output/")
create_output_if_doesnt_exist(output)
# load in clean and tidy data and create workfile
df <- read.csv(paste0(data_in,"hotels-vienna.csv"))
# or from the website
df <- read_csv("https://osf.io/y6jvb/download")
# First look
df <- df %>%
  select(hotel_id, accommodation_type, country, city, city_actual, neighbourhood, center1label, distance, center2label, distance_alter, stars, rating, rating_count, ratingta, ratingta_count, year, month, weekend, holiday, nnights, price, scarce_room, offer, offer_cat)
summary(df)
glimpse(df)
# export list
df <- subset(df, select = c(hotel_id, accommodation_type, country, city, city_actual, center1label, distance, stars, rating, price))
write.csv(df[1:5,], paste0(output, "hotel_listobs.csv"), row.names = F)
Sample of the dataset, from dput():
dput(head(df[, c(1:10)]))
structure(list(hotel_id = c(21894L, 21897L, 21901L, 21902L, 21903L,
21904L), accommodation_type = c("Apartment", "Hotel", "Hotel",
"Hotel", "Hotel", "Apartment"), country = c("Austria", "Austria",
"Austria", "Austria", "Austria", "Austria"), city = c("Vienna",
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), city_actual = c("Vienna",
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), center1label = c("City centre",
"City centre", "City centre", "City centre", "City centre", "City centre"
), distance = c(2.7, 1.7, 1.4, 1.7, 1.2, 0.9), stars = c(4, 4,
4, 3, 4, 5), rating = c(4.4, 3.9, 3.7, 4, 3.9, 4.8), price = c(81L,
81L, 85L, 83L, 82L, 229L)), row.names = c(NA, 6L), class = "data.frame")
What I've tried:
setwd("C:/Users/sha03/Desktop/R/intro/data/da_case_studies/")
source("theme_bg.R")
source("da_helper_functions.R")
read.csv('C:/Users/sha03/Desktop/R/intro/data/da_case_studies/hotels-vienna.csv')
summary(df)
glimpse(df)
I can't seem to get the answer I am supposed to get, which is shown here: https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch01-hotels-data-collect/ch01-hotels-data-collect.ipynb
To calculate the average of samples of size 25, 50, 200 and the full dataset, you can use sample to index the rows and then take the price column. Remember to set a seed when dealing with random samples so that results are always reproducible.
You can do it individually:
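A minimal sketch of the one-at-a-time version, assuming df is the hotels data frame with a price column. The stand-in data frame below is made up so that the snippet runs on its own; with the real data, load df with read_csv as in the question instead. The seed value 123 is arbitrary.

```r
# Stand-in data (hypothetical prices) so the sketch runs on its own.
# With the real data use: df <- read_csv("https://osf.io/y6jvb/download")
set.seed(1)
df <- data.frame(price = round(runif(400, min = 50, max = 400)))

set.seed(123)  # fix the seed so the same rows are drawn on every run

# sample(nrow(df), n) draws n distinct row numbers; use them to index price
mean_25  <- mean(df$price[sample(nrow(df), 25)])
mean_50  <- mean(df$price[sample(nrow(df), 50)])
mean_200 <- mean(df$price[sample(nrow(df), 200)])
mean_all <- mean(df$price)  # full dataset, for comparison

c(n25 = mean_25, n50 = mean_50, n200 = mean_200, all = mean_all)
```

The smaller samples will usually land further from mean_all than the sample of 200, which is the pattern the exercise asks you to comment on.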
Or all in one go:
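A sketch of the all-in-one version using sapply over a vector of sample sizes. The names sizes and means are illustrative, and the stand-in df is again made up so the snippet is self-contained; swap in the real data frame when running it on the hotels data.

```r
# Stand-in data (hypothetical prices) so the sketch runs on its own.
# With the real data use: df <- read_csv("https://osf.io/y6jvb/download")
set.seed(1)
df <- data.frame(price = round(runif(400, min = 50, max = 400)))

# The last "sample" is every row, i.e. the full dataset
sizes <- c(25, 50, 200, nrow(df))

set.seed(123)  # reproducible draws
means <- sapply(sizes, function(n) mean(df$price[sample(nrow(df), n)]))
names(means) <- sizes  # label each average with its sample size
means
```

Because sampling nrow(df) rows without replacement just reorders the data, the last element equals the mean of the whole price column.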
If you need to repeat this three times, you can use lapply, which outputs a list:
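A sketch of the repeated version, under the same assumptions as before (df has a price column; the stand-in data is made up; sizes and reps are illustrative names). The seed is set once before lapply, so the three repetitions differ from each other but the whole run is reproducible.

```r
# Stand-in data (hypothetical prices) so the sketch runs on its own.
# With the real data use: df <- read_csv("https://osf.io/y6jvb/download")
set.seed(1)
df <- data.frame(price = round(runif(400, min = 50, max = 400)))

sizes <- c(25, 50, 200, nrow(df))

set.seed(123)  # set once: repetitions differ, but the run is reproducible
reps <- lapply(1:3, function(i) {
  # one repetition: the average price for each sample size
  m <- sapply(sizes, function(n) mean(df$price[sample(nrow(df), n)]))
  names(m) <- sizes
  m
})
reps

# Optional: stack the list into a 3 x 4 matrix (rows = repetitions)
results <- do.call(rbind, reps)
results
```

Reading down each column of results shows how much the average moves between repetitions for a given sample size.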