An R library for interacting with the Google Cloud Storage JSON API (api docs).
Google Cloud Storage charges you for storage (prices here).
You can use your own Google Project with a credit card added to create buckets, where the charges will apply. This can be done in the Google API Console.
You will first need to create a Google Cloud Project and make sure the Cloud Storage API is turned on; it is on by default for new projects.
The recommended way to configure the package is gcs_setup(), which will help you create and download an authentication JSON key and set your default bucket.
library(googleCloudStorageR)
gcs_setup()
#ℹ ==Welcome to googleCloudStorageR v0.6.0 setup==
#This wizard will scan your system for setup options and help you with any that are missing.
#Hit 0 or ESC to cancel.
#
#1: Create and download JSON service account key
#2: Setup auto-authentication (JSON service account key)
#3: Setup default bucket
#
#Selection: |
It uses googleAuthR::gar_setup_menu() to create the wizard. You will need owner access to the project you are using.
After each menu option has completed, restart R and rerun gcs_setup() to continue to the next step.
Upon successful set-up you should see a confirmation message. gcs_setup() works through the steps detailed below.
The instructions below are for when you visit the Google API console (https://console.developers.google.com/apis/).
By default, all cloudyr packages look for authentication credentials in environment variables. You can also use environment variables to specify a default bucket and enable auto-authentication upon attaching the library. For example:
Sys.setenv("GCS_DEFAULT_BUCKET" = "my-default-bucket",
"GCS_AUTH_FILE" = "/fullpath/to/service-auth.json")
These can alternatively be set on the command line or via an Renviron.site or .Renviron file (https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html). For example, in your .Renviron:
GCS_AUTH_FILE="/fullpath/to/service-auth.json"
GCS_DEFAULT_BUCKET=my-default-bucket
The best method for authentication is to use your own Google Cloud Project. You can specify the location of a service account JSON file taken from your Google Project:
Sys.setenv("GCS_AUTH_FILE" = "/fullpath/to/auth.json")
This file will then be used for authentication via gcs_auth() when you load the library:
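A minimal sketch of that flow, assuming the JSON key path above is valid:
## set before the package is attached, e.g. via .Renviron
Sys.setenv("GCS_AUTH_FILE" = "/fullpath/to/service-auth.json")
## attaching the library auto-authenticates with that key
library(googleCloudStorageR)
## or authenticate explicitly at any time with the same key file
gcs_auth(json_file = "/fullpath/to/service-auth.json")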
You can also seamlessly authenticate as the service account your compute resource (e.g. Cloud Function, AI Notebook, etc.) is running as, by requesting a token with the gargle library and then passing that token to gcs_auth():
## Load googleCloudStorageR and gargle
library(googleCloudStorageR)
library(gargle)
## Fetch token. See: https://developers.google.com/identity/protocols/oauth2/scopes
scope <- c("https://www.googleapis.com/auth/cloud-platform")
token <- token_fetch(scopes = scope)
## Pass your token to gcs_auth
gcs_auth(token = token)
## Perform gcs operations as normal
gcs_list_objects(bucket = "my-bucket")
To avoid specifying the bucket in the functions below, you can set the name of your default bucket via environment variables or via the function gcs_global_bucket(). See the Setting environment variables section for more details.
## set bucket via environment
Sys.setenv("GCS_DEFAULT_BUCKET" = "my-default-bucket")
library(googleCloudStorageR)
## check what the default bucket is
gcs_get_global_bucket()
[1] "my-default-bucket"
## you can also set a default bucket for the session after loading the library
gcs_global_bucket("my-default-bucket-2")
gcs_get_global_bucket()
[1] "my-default-bucket-2"
You can also use a GCS emulator instead of the real Google Cloud Storage. If you provide a STORAGE_EMULATOR_HOST environment variable, the library will direct all API requests to the emulator server. The variable's value should include scheme, host and port (e.g. http://localhost:8080).
When using an emulator, you don't have to provide authentication credentials.
This is generally useful in the context of automated tests. Depending on the emulator implementation, it can even give local filesystem support to applications that were previously hardwired to use GCS.
## start a GCS emulator outside R listening at 127.0.0.1:1234, for example
## set emulator host via environment
Sys.setenv("STORAGE_EMULATOR_HOST" = "http://127.0.0.1:1234")
library(googleCloudStorageR)
proj <- "my-dummy-project-id"
# perform GCS operations normally, like:
gcs_create_bucket("my-bucket", proj)
==Google Cloud Storage Bucket==
Bucket: my-bucket
Location: US-CENTRAL1
Class: STANDARD
Created: 2023-07-22 07:31:28
Updated: 2023-07-22 07:31:28
gcs_list_buckets(proj)
name storageClass location updated
1 my-bucket STANDARD US-CENTRAL1 2023-07-22 07:31:28
Once you have a Google project and created a bucket with an object in it, you can download it as below:
library(googleCloudStorageR)
## get your project name from the API console
proj <- "your-project"
## get bucket info
buckets <- gcs_list_buckets(proj)
bucket <- "your-bucket"
bucket_info <- gcs_get_bucket(bucket)
bucket_info
==Google Cloud Storage Bucket==
Bucket: your-bucket
Project Number: 1123123123
Location: EU
Class: STANDARD
Created: 2016-04-28 11:39:06
Updated: 2016-04-28 11:39:06
Meta-generation: 1
eTag: Cxx=
## get object info in the default bucket
objects <- gcs_list_objects()
## save directly to an R object (warning: don't run out of RAM if it's a big object)
## the download type is guessed into an appropriate R object
parsed_download <- gcs_get_object(objects$name[[1]])
## if you want to do your own parsing, set parseObject to FALSE
## use httr::content() to parse afterwards
raw_download <- gcs_get_object(objects$name[[1]],
                               parseObject = FALSE)
## save directly to a file in your working directory
## parseObject has no effect, it is a httr::content(req, "raw") download
gcs_get_object(objects$name[[1]], saveToDisk = "csv_downloaded.csv")
Objects can be uploaded from files saved to disk, or passed in directly if they are data frames or list type R objects. By default, data frames are converted to CSV via write.csv() and lists to JSON via jsonlite::toJSON().
If you want to use other functions to transform R objects, for example setting row.names = FALSE or using write.csv2(), pass the function through the object_function argument:
## upload a file - type will be guessed from the file extension, or supply type
filename <- "mtcars_upload.csv"
write.csv(mtcars, file = filename)
gcs_upload(filename)
## upload an R data.frame directly - will be converted to csv via write.csv
gcs_upload(mtcars)
## upload an R list - will be converted to json via jsonlite::toJSON
gcs_upload(list(a = 1, b = 3, c = list(d = 2, e = 5)))
## upload an R data.frame directly, with a custom function
## function should have arguments 'input' and 'output'
## safest to supply type too
f <- function(input, output) write.csv(input, row.names = FALSE, file = output)
gcs_upload(mtcars,
           object_function = f,
           type = "text/csv")
Since 2019 you can also set bucket level access permissions on buckets. To upload to those buckets, specify predefinedAcl = "bucketLevel" in gcs_upload().
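A minimal sketch, assuming a bucket with bucket level access enabled (the bucket name is a placeholder):
## object-level ACL settings are ignored on such buckets
gcs_upload(mtcars,
           bucket = "my-bucket-level-access-bucket",
           predefinedAcl = "bucketLevel")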
You can pass metadata with an object via the function gcs_metadata_object().
The name you pass to the metadata object will override the object name if it is also set elsewhere.
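A sketch of attaching custom metadata on upload; the metadata list contents are illustrative:
## the name set here overrides the name derived from the R object
meta <- gcs_metadata_object("mtcars_custom_name.csv",
                            metadata = list(source = "R", rows = nrow(mtcars)))
## upload with the metadata attached
gcs_upload(mtcars, object_metadata = meta)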
If the file/object is small, simple uploads are used. You can modify this limit via the googleCloudStorageR.upload_limit option or gcs_upload_set_limit() - the default is 5000000L, or 5MB (#120).
For files greater than the upload limit, resumable uploads are used. This allows you to upload up to 5TB.
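For example, to raise the threshold for simple uploads (the 10MB value is only an illustration):
## raise the simple upload limit to 10MB for this session
gcs_upload_set_limit(10000000L)
## equivalently, set the option directly
options(googleCloudStorageR.upload_limit = 10000000L)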
If the connection is interrupted during an upload, gcs_upload will retry 3 times. If it still fails, it returns a Retry object that you can use later to resume the upload from where it stopped, via gcs_retry_upload().
## write a big object (a large data.frame already in memory) to a file
big_file <- "big_filename.csv"
write.csv(big_object, file = big_file)
## attempt upload
upload_try <- gcs_upload(big_file)
## if successful, upload_try is an object metadata object
upload_try
==Google Cloud Storage Object==
Name: "big_filename.csv"
Size: 8.5 Gb
Media URL https://www.googleapis.com/download/storage/v1/b/xxxx
Bucket: your-bucket
ID: your-bucket/"big_filename.csv"/xxxx
MD5 Hash: rshao1nxxxxxY68JZQ==
Class: STANDARD
Created: 2016-08-12 17:33:05
Updated: 2016-08-12 17:33:05
Generation: 1471023185977000
Meta Generation: 1
eTag: CKi90xxxxxEAE=
crc32c: j4i1sQ==
## if unsuccessful after 3 retries, upload_try is a Retry object
==Google Cloud Storage Upload Retry Object==
File Location: big_filename.csv
Retry Upload URL: http://xxxx
Created: 2016-08-12 17:33:05
Type: csv
File Size: 8.5 Gb
Upload Byte: 4343
Upload remaining: 8.1 Gb
## you can retry to upload the remaining data using gcs_retry_upload()
try2 <- gcs_retry_upload(upload_try)
You can change who can access objects via gcs_update_object_acl(), setting the role to READER or OWNER for a user, group, domain, project, or for all (or all authenticated) users.
By default you are “OWNER” of all the objects and buckets you upload and create.
## update access of object to READER for all public
gcs_update_object_acl("your-object.csv", entity_type = "allUsers")
## update access of object for user [email protected] to OWNER
gcs_update_object_acl("your-object.csv",
                      entity = "[email protected]",
                      role = "OWNER")
## update access of object for googlegroup users to READER
gcs_update_object_acl("your-object.csv",
                      entity = "[email protected]",
                      entity_type = "group")
## update access of object for all users on your Google Apps domain to OWNER
gcs_update_object_acl("your-object.csv",
                      entity = "yourdomain.com",
                      entity_type = "domain",
                      role = "OWNER")
Delete an object by passing its name (and bucket, if not the default) to gcs_delete_object().
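For example:
## delete an object in the default bucket
gcs_delete_object("your-object.csv")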
Use gcs_get_object_acl() to see the current access for a given entity and entity_type.
## default entity_type is user
acl <- gcs_get_object_acl("your-object.csv",
                          entity = "[email protected]")
acl$role
[1] "OWNER"
## for allUsers and allAuthenticated users, you don't need to supply entity
acl <- gcs_get_object_acl("your-object.csv",
                          entity_type = "allUsers")
acl$role
[1] "READER"
Once a user (or group, or the public) has access, they can reach that object via a download link generated by gcs_download_url().
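For example, for the object used above:
## generate a download link for an object in the default bucket
download_link <- gcs_download_url("your-object.csv")
download_link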
Versions of save.image(), save() and load() are implemented as gcs_save_image(), gcs_save() and gcs_load(). These functions save and load R objects and the global R session to and from the cloud.
## save the current R session including all objects
gcs_save_image()
### wipe environment
rm(list = ls())
## load up environment again
gcs_load()
Save specific objects:
cc <- 3
d <- "test1"
gcs_save("cc","d", file = "gcs_save_test.RData")
## remove the objects saved in cloud from local environment
rm(cc,d)
## load them back in from GCS
gcs_load(file = "gcs_save_test.RData")
cc == 3
[1] TRUE
d == "test1"
[1] TRUE
You can also upload .R code files and source them directly using gcs_source():
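A minimal sketch, assuming an R script in your working directory:
## upload a script to the default bucket
gcs_upload("analysis.R")
## source it directly from Google Cloud Storage
gcs_source("analysis.R")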
The library is also compatible with Shiny authentication flows, so you can create Shiny apps that let users log in and upload their own data.
An example of that is shown below:
library("shiny")
library("googleAuthR")
library("googleCloudStorageR")
## you need to start the Shiny app on port 1221
## as that's what the default googleAuthR project expects for OAuth2 authentication
## options(shiny.port = 1221)
## print(source('shiny_test.R')$value) or push the "Run App" button in RStudio
shinyApp(
  ui = shinyUI(
    fluidPage(
      googleAuthR::googleAuthUI("login"),
      fileInput("picture", "picture"),
      textInput("filename", label = "Name on Google Cloud Storage", value = "myObject"),
      actionButton("submit", "submit"),
      textOutput("meta_file")
    )
  ),
  server = shinyServer(function(input, output, session){

    access_token <- shiny::callModule(googleAuth, "login")

    meta <- eventReactive(input$submit, {

      message("Uploading to Google Cloud Storage")

      # from googleCloudStorageR
      with_shiny(gcs_upload,
                 file = input$picture$datapath,
                 # enter your bucket name here
                 bucket = "gogauth-test",
                 type = input$picture$type,
                 name = input$filename,
                 shiny_access_token = access_token())

    })

    output$meta_file <- renderText({
      req(meta())
      str(meta())
      paste("Uploaded: ", meta()$name)
    })

  })
)
There are various functions to manipulate buckets:
gcs_list_buckets
gcs_get_bucket
gcs_create_bucket
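For example (the project and bucket names are placeholders):
## list buckets in a project
buckets <- gcs_list_buckets("your-project")
## get metadata for one bucket
gcs_get_bucket("your-bucket")
## create a new bucket in a given location
gcs_create_bucket("new-bucket-name", "your-project", location = "EU")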
You can get metadata about an object by passing meta = TRUE to gcs_get_object():
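For example:
## returns the object's metadata rather than downloading its content
gcs_get_object("your-object.csv", meta = TRUE)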
googleCloudStorageR has its own Google project which is used to call the Google Cloud Storage API, but it does not have access to the objects or buckets in your Google Project unless you give the library permission to access your own buckets during the OAuth2 authentication process.
No other user, including the owner of the Google Cloud Storage API project, has access unless you have granted it. However, you may want to switch to using your own Google Project for the API calls (which may or may not be the same project that holds your buckets).
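A sketch of using your own OAuth2 client, assuming you have downloaded a client JSON from your project (the path and scope shown are placeholders to adapt):
library(googleCloudStorageR)
## point googleAuthR at the OAuth client JSON from your own project
googleAuthR::gar_set_client("/path/to/your-oauth-client.json",
                            scopes = "https://www.googleapis.com/auth/devstorage.full_control")
## authenticate interactively against your own project
gcs_auth()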