This is useful to me and not necessarily for outside consumption.

Peer group

Packages to consider in design of googlesheets4:

  • googlesheets: let’s keep the good and fix the bad
  • readr: reference point for UI, col spec, parsing
  • readxl: reference point for parsing, (work)sheets/tabs, cell ranges

Col spec

All 3 packages provide a way for user to specify the output type of each column.

readr has the col spec system that we are most commited to long-term. col_types accepts a string using a single letter shorthand (e.g. "c" for character), as well as a longer form more conducive to passing ancillary info, such as datetime format. If no column type is given, it is guessed.

googlesheets exposes the readr col spec API. All gs_read*() functions literally call either read_csv() or type_convert(). This has downsides, primarily the fact that, by default, googlesheets reads and parses formatted cell contents. The good news is that googlesheets did not introduce its own way to describe col spec.

readxl has its own col spec system, but it is slated to evolve towards readr. col_types is a character vector with one entry per column from these options: “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list”. It lacks some of readr’s output types and control: no integers (Excel itself has no concept of integer, but still), no factors, less variety wrt date/time/datetime. The “list” type is unique to readxl and is equally relevant to Google Sheets.

Parsing

Given a column type, each cell is processed to produce an atom in a vector of the target type. This is a loop over cells (within a loop over columns).

readxl: We infer type for every cell, either in the guessing pass or at parse time. Cell type inference is based on Excel’s declared cell type, the cell data, and user inputs na and trim_ws (e.g., https://github.com/tidyverse/readxl/blob/c02f81c3b9f7c6177ed9d13748377450bff9abe1/src/XlsxCell.h#L56).

The cell type and the column type are both enums:

There is a large switch() statement that fits a cell of type x into a column of type y. This is simple for matching types, such as CELL_TEXT and COL_TEXT. When it’s not, we must coerce. The simplest coercions are when NA is the only sensible value, such as CELL_LOGICAL to COL_DATE. Examples of trickier coercions: CELL_TEXT to COL_LOGICAL (we use R’s rules for this) or CELL_DATE to COL_TEXT (we can’t really do this right now).

readr. I don’t know it well enough to give same summary. Will do if becomes necessary. The googlesheets4 parsing is more likely to be modelled after readxl anyway, because the info available for each cell is more similar. Presumably there is no notion of “cell type” for readr, since it is always a string.

List of the readr parsers (single character shorthand in parens):

Guessing

readxl guesses a col type like so:

  • Determine cell type = cell enum value.
  • Track cumulative max enum value for the first guess_max cells. Bit of hand-waving here because, while it’s clear that Boolean < integer < double, datetimes are just special.
  • The max enum value is the guessed col type.

readr: I haven’t looked at this. I assume it is also a “max col type” approach.

Observation: It makes sense and is very R-like to take the “maximum” type, i.e. the current pushes everything towards character. Is it worth considering a “majority rules” mentality?

Google Sheet cell data from Sheets API v4

Two options:

  • spreadsheets.values collection: simplest for reading/writing spreadsheet values. Would be nice to use this, but …
  • spreadsheets collection: for reading/writing any aspect of the spreadsheet. I must use this, at least for reading. Cell typing and coercion demand that we have the contents in various forms (e.g., as entered, unformatted, and formatted), plus information about the format itself.

In everything below, I’m suppressing the fact that it’s possible to limit the read to specific cells. Not relevant for picking the endpoint.

Example Sheet

I’ve built an example Sheet with numeric data, especially datetimes and datetime formulas, to study what’s available for a cell.

https://docs.google.com/spreadsheets/d/1xTUxWGcFLtDIHoYJ1WsjQuLmpUtBf--8Bcu5lQ302SU/edit#gid=0

spreadsheets.values.get

This can be skipped, because we will not use this endpoint. But I needed to take a look. It is also the easiest way to juxtapose: as entered, unformatted, formatted.

spreadsheets.values.get has three query parameters:

  • majorDimension: whether row or column increments fastest in the reply. Boring. Not explored.
  • valueRenderOption: An enum:
    • FORMATTED_VALUE (default): Values calculated & formatted according to the cell’s formatting (automatic or explicit) and spreadsheet’s locale. You basically get the strings that user sees in the browser. Current googlesheets package works with v3 equivalent of this and processes via readr.
    • UNFORMATTED_VALUE: Values are calculated but not formatted. Except there’s a loophole: iff the cell has a date-ish format and dateTimeRenderOption = "FORMATTED_STRING", then those specific cells are handled as if valueRenderOption = "FORMATTED_VALUE". I don’t see any clever way to exploit this, but just note it.
    • FORMULA: Values are not calculated. You get what was entered. If the cell contains a formula, you get that. For non-formula cells, seems to be equivalent toUNFORMATTED_VALUE. Except for the same loophole noted above for cells bearing a date-ish format when dateTimeRenderOption = "FORMATTED_STRING".
  • dateTimeRenderOption: An enum:
    • Ignored if valueRenderOption = "FORMATTED_VALUE", which is the default.
    • SERIAL_NUMBER (default): Cells formatted as date, time, datetime, or duration are output as doubles in “serial number” format. The whole number portion of the value (left of the decimal) counts the days since December 30th 1899. The fractional portion (right of the decimal) counts the time as a fraction of the day. 1900 is not treated as a leap year, i.e. Sheets does not reproduce the Lotus 1-2-3 bug.
    • FORMATTED_STRING: Date-ish cells are returned as formatted strings, even if valueRenderOption = "UNFORMATTED_VALUE".

hidden chunk here that hits the spreadsheets.values.get endpoint and gets response ready for inspection

Side-by-side comparison of FORMATTED_VALUE vs. UNFORMATTED_VALUE vs. FORMULA. Remember this requires 3 separate API calls, i.e. you can’t get all 3 of these from spreadsheet.values at the same time.

#> # A tibble: 30 x 5
#>    as_entered         format    FORMATTED_VALUE   UNFORMATTED_VAL… FORMULA
#>    <chr>              <chr>     <chr>             <chr>            <chr>  
#>  1 3/10/2018 11:11:11 Date time 3/10/2018 11:11:… 43169.466099537  43169.…
#>  2 =TODAY()           automatic 5/15/2018         43235            =TODAY…
#>  3 =TODAY()           Date      5/15/2018         43235            =TODAY…
#>  4 =TODAY()           Time      12:00:00 AM       43235            =TODAY…
#>  5 =TODAY()           Date time 5/15/2018 0:00:00 43235            =TODAY…
#>  6 =TODAY()           Plain te… 43235             43235            =TODAY…
#>  7 =TODAY()           Number    43,235.00         43235            =TODAY…
#>  8 =TODAY()           "From \"… Tuesday, May 15,… 43235            =TODAY…
#>  9 =now()             automatic 5/15/2018 12:29:… 43235.5204181366 =now() 
#> 10 =now()             Date      5/15/2018         43235.5204181366 =now() 
#> 11 =now()             Time      12:29:24 PM       43235.5204181366 =now() 
#> 12 =now()             Date time 5/15/2018 12:29:… 43235.5204181366 =now() 
#> 13 =now()             Plain te… 43235.52042       43235.5204181366 =now() 
#> 14 =now()             Number    43,235.52         43235.5204181366 =now() 
#> 15 =now()             "From \"… Tuesday, May 15,… 43235.5204181366 =now() 
#> 16 =EDATE(A9, -4)     automatic 1/15/2018         43115            =EDATE…
#> 17 =EDATE(A9, -4)     Date      1/15/2018         43115            =EDATE…
#> 18 =EDATE(A9, -4)     Time      12:00:00 AM       43115            =EDATE…
#> 19 =EDATE(A9, -4)     Date time 1/15/2018 0:00:00 43115            =EDATE…
#> 20 =EDATE(A9, -4)     Plain te… 43115             43115            =EDATE…
#> 21 =EDATE(A9, -4)     Number    43,115.00         43115            =EDATE…
#> 22 =EDATE(A9, -4)     "From \"… Monday, January … 43115            =EDATE…
#> 23 0.015              automatic 0.015             0.015            0.015  
#> 24 0.015              Percent   1.50%             0.015            0.015  
#> 25 1.5%               automatic 1.50%             0.015            0.015  
#> 26 1234567890         automatic 1234567890        1234567890       123456…
#> 27 1234567890         Scientif… 1.23E+09          1234567890       123456…
#> 28 $1.23              automatic $1.23             1.23             1.23   
#> 29 $1.23              Currency… $1                1.23             1.23   
#> 30 1.23               Currency  $1.23             1.23             1.23

Here’s a look at same, but combined with dateTimeRenderOption = "FORMATTED_STRING". Observations:

  • FORMATTED_VALUE is same as above.
  • UNFORMATTED_VALUE is same as FORMATTED_VALUE for cells bearing a date-ish format.
  • FORMULA is same as above for formula cells and numbers entered as numbers. I note FORMULA is different in the case of the datetime that was typed in (first row).
#> # A tibble: 30 x 5
#>    as_entered         format   FORMATTED_VALUE2  UNFORMATTED_VAL… FORMULA2
#>    <chr>              <chr>    <chr>             <chr>            <chr>   
#>  1 3/10/2018 11:11:11 Date ti… 3/10/2018 11:11:… 3/10/2018 11:11… 3/10/20…
#>  2 =TODAY()           automat… 5/15/2018         5/15/2018        =TODAY()
#>  3 =TODAY()           Date     5/15/2018         5/15/2018        =TODAY()
#>  4 =TODAY()           Time     12:00:00 AM       12:00:00 AM      =TODAY()
#>  5 =TODAY()           Date ti… 5/15/2018 0:00:00 5/15/2018 0:00:… =TODAY()
#>  6 =TODAY()           Plain t… 43235             43235            =TODAY()
#>  7 =TODAY()           Number   43,235.00         43235            =TODAY()
#>  8 =TODAY()           "From \… Tuesday, May 15,… Tuesday, May 15… =TODAY()
#>  9 =now()             automat… 5/15/2018 12:29:… 5/15/2018 12:29… =now()  
#> 10 =now()             Date     5/15/2018         5/15/2018        =now()  
#> 11 =now()             Time     12:29:24 PM       12:29:24 PM      =now()  
#> 12 =now()             Date ti… 5/15/2018 12:29:… 5/15/2018 12:29… =now()  
#> 13 =now()             Plain t… 43235.52042       43235.5204181366 =now()  
#> 14 =now()             Number   43,235.52         43235.5204181366 =now()  
#> 15 =now()             "From \… Tuesday, May 15,… Tuesday, May 15… =now()  
#> 16 =EDATE(A9, -4)     automat… 1/15/2018         1/15/2018        =EDATE(…
#> 17 =EDATE(A9, -4)     Date     1/15/2018         1/15/2018        =EDATE(…
#> 18 =EDATE(A9, -4)     Time     12:00:00 AM       12:00:00 AM      =EDATE(…
#> 19 =EDATE(A9, -4)     Date ti… 1/15/2018 0:00:00 1/15/2018 0:00:… =EDATE(…
#> 20 =EDATE(A9, -4)     Plain t… 43115             43115            =EDATE(…
#> 21 =EDATE(A9, -4)     Number   43,115.00         43115            =EDATE(…
#> 22 =EDATE(A9, -4)     "From \… Monday, January … Monday, January… =EDATE(…
#> 23 0.015              automat… 0.015             0.015            0.015   
#> 24 0.015              Percent  1.50%             0.015            0.015   
#> 25 1.5%               automat… 1.50%             0.015            0.015   
#> 26 1234567890         automat… 1234567890        1234567890       1234567…
#> 27 1234567890         Scienti… 1.23E+09          1234567890       1234567…
#> 28 $1.23              automat… $1.23             1.23             1.23    
#> 29 $1.23              Currenc… $1                1.23             1.23    
#> 30 1.23               Currency $1.23             1.23             1.23

spreadsheets.get

spreadsheets.get can return cell data:

  • includeGridData = "TRUE": Put this in the query to request cell data. You will get everything.
  • Field mask. More granular way to specify desired fields in the query. We’ll use this in the long-run, for fine control (see end of this article). I’ve used includeGridData here to see full detail (some of which I am still suppressing).

hidden chunk here that hits the spreadsheets.get endpoint and gets response ready for inspection

If you drill down, the data you get for each cell is an instance of CellData.

Here’s (trimmed down) CellData on cell A2, a literal, explicitly formatted datetime:

Here’s (trimmed down) CellData on cell A3, an automatically formatted datetime formula:

Observations:

  • formattedValue is useful for coercing to character, i.e. Google has already done it and, unlike Excel, exposes it. It is the same as what you get from spreadsheets.values.get when valueRenderOption = "FORMATTED_VALUE".
  • userEnteredValue could be used if we aspire to return formulas. It is basically the same as what you get from spreadsheets.values.get when valueRenderOption = "FORMULA".
  • effectiveValue is our main source of cell data. It is basically the same as what you get from spreadsheets.values.get when valueRenderOption = "UNFORMATTED_VALUE".
  • effectiveValue and effectiveFormat, together, provide the basis for cell typing.
  • effectiveFormat could provide a basis for transmitting datetime format from the Sheet into R, if we had a lexicon or a grammar for the formats. Sheets API docs on Date and Number Formats. TimeFormatStrings, a C library to read and write time format strings (Excel, Stata, Unicode).

Let’s look at these fields for all cells.

userEnteredValue and effectiveValue hold an instance of ExtendedValue, which is always exactly one of the following:

  • numberValue
  • stringValue
  • boolValue
  • formulaValue
  • errorValue, which will be an instance of ErrorValue

effectiveFormat, if it exists, holds an instance of CellFormat. We only care about the sub-component NumberFormat. If it exists, NumberFormat has a type and possibly a pattern. Possible values for type of NumberFormat:

  • TEXT
  • NUMBER
  • PERCENT
  • CURRENCY
  • DATE
  • TIME
  • DATE_TIME
  • SCIENTIFIC

A cell will not have an effectiveFormat if it bears the defaultFormat, which is a property of the enclosing spreadsheet.

A cell with an effectiveFormat does not necessarily have a NumberFormat, indicated by NA above. A cell with a NumberFormat does not necessarily have a pattern, indicated by NA above. That means a default pattern for that type is selected based on the user’s locale. The docs say “user’s locale” and in other places are careful to say “spreadsheet’s locale”. Not sure if this is sloppiness or is significant.

spreadsheets.values.get vs. spreadsheets.get

Nailing down my understanding of the two endpoints.

Claim: cell values returned by spreadsheets.values.get when valueRenderOption = "FORMATTED_VALUE" (default) are same as formattedValue from spreadsheets.get.

TRUE. My use of the now() datetime formula in the example spreadsheet is regrettable here, hence those rows are filtered out.

Claim: cell values returned by spreadsheets.values.get when valueRenderOption = "UNFORMATTED_VALUE" are same as effectiveValue from spreadsheets.get.

MEH. I suspect this is TRUE up to the width/precision of the double-to-string conversion, which inexplicably seems to be different across the two endpoints. It’s too bad that spreadsheets.values.get seems to report greater precision for the literal datetime than spreadsheets.get. It’s hard to understand why the endpoint described as more powerful would have lower precision for any sort of cell.

Claim: cell values returned by spreadsheets.values.get when valueRenderOption = "FORMULA" are same as userEnteredValue from spreadsheets.get.

tibble(
  ss_values_get = val_df$FORMULA,
  ss_get = df$userEnteredValue_value[-1],
  ss_get_type = df$userEnteredValue_name[-1],
  same = ss_values_get == ss_get
  ) %>%
  print(n = Inf)
#> # A tibble: 30 x 4
#>    ss_values_get   ss_get          ss_get_type  same 
#>    <chr>           <chr>           <chr>        <lgl>
#>  1 43169.466099537 43169.466100    numberValue  FALSE
#>  2 =TODAY()        =TODAY()        formulaValue TRUE 
#>  3 =TODAY()        =TODAY()        formulaValue TRUE 
#>  4 =TODAY()        =TODAY()        formulaValue TRUE 
#>  5 =TODAY()        =TODAY()        formulaValue TRUE 
#>  6 =TODAY()        =TODAY()        formulaValue TRUE 
#>  7 =TODAY()        =TODAY()        formulaValue TRUE 
#>  8 =TODAY()        =TODAY()        formulaValue TRUE 
#>  9 =now()          =now()          formulaValue TRUE 
#> 10 =now()          =now()          formulaValue TRUE 
#> 11 =now()          =now()          formulaValue TRUE 
#> 12 =now()          =now()          formulaValue TRUE 
#> 13 =now()          =now()          formulaValue TRUE 
#> 14 =now()          =now()          formulaValue TRUE 
#> 15 =now()          =now()          formulaValue TRUE 
#> 16 =EDATE(A10, -4) =EDATE(A10, -4) formulaValue TRUE 
#> 17 =EDATE(A11, -4) =EDATE(A11, -4) formulaValue TRUE 
#> 18 =EDATE(A12, -4) =EDATE(A12, -4) formulaValue TRUE 
#> 19 =EDATE(A13, -4) =EDATE(A13, -4) formulaValue TRUE 
#> 20 =EDATE(A14, -4) =EDATE(A14, -4) formulaValue TRUE 
#> 21 =EDATE(A15, -4) =EDATE(A15, -4) formulaValue TRUE 
#> 22 =EDATE(A16, -4) =EDATE(A16, -4) formulaValue TRUE 
#> 23 0.015           0.015000        numberValue  FALSE
#> 24 0.015           0.015000        numberValue  FALSE
#> 25 0.015           0.015000        numberValue  FALSE
#> 26 1234567890      1234567890      numberValue  TRUE 
#> 27 1234567890      1234567890      numberValue  TRUE 
#> 28 1.23            1.230000        numberValue  FALSE
#> 29 1.23            1.230000        numberValue  FALSE
#> 30 1.23            1.230000        numberValue  FALSE

MEH. As above, seems to be same content, but with differences in width/precision of doubles.

Field mask and partial response

https://developers.google.com/sheets/api/guides/concepts#partial_responses

It’s clear I will ultimately use spreadsheets.get for Sheet reading, so here I explore how to control the payload via a field mask. This seems pretty close to what I’ll need to read rectangular regions. Example with the range A2:B3.

Spreadsheet-level metadata:

Metadata about the sheets is available here, but my current field mask does not request it. I assume we’ll populate info about the sheets when registering the Sheet, not while reading rectangles.

The reply has a component sheets, holding instances of Sheet, one per sheet given in ranges in the request. We have just one because ranges = "Sheet1!A2:B3". We look at the data component, which contains an instance of GridData.

The upper left corner of the rectangle is identified via startRow and startColumn. Both are zero-based and, apparently, are omitted if the value is zero.

The cell data is found within rowData, one object per row and, within that, values and, within that, one instance of CellData per cell.

Here it is, post-rectangling.

#> # A tibble: 4 x 8
#>   cell  enter_type   enter_value effective_type effective_value
#>   <chr> <chr>        <list>      <chr>          <list>         
#> 1 A2    numberValue  <dbl [1]>   numberValue    <dbl [1]>      
#> 2 A3    stringValue  <chr [1]>   stringValue    <chr [1]>      
#> 3 B2    formulaValue <chr [1]>   numberValue    <int [1]>      
#> 4 B3    stringValue  <chr [1]>   stringValue    <chr [1]>      
#> # ... with 3 more variables: formatted_value <chr>, format_type <chr>,
#> #   format_pattern <chr>
#> Observations: 4
#> Variables: 8
#> $ cell            <chr> "A2", "A3", "B2", "B3"
#> $ enter_type      <chr> "numberValue", "stringValue", "formulaValue", ...
#> $ enter_value     <list> [43169.47, "3/10/2018 11:11:11", "=TODAY()", ...
#> $ effective_type  <chr> "numberValue", "stringValue", "numberValue", "...
#> $ effective_value <list> [43169.47, "3/10/2018 11:11:11", 43235, "=TOD...
#> $ formatted_value <chr> "3/10/2018 11:11:11", "3/10/2018 11:11:11", "5...
#> $ format_type     <chr> "DATE_TIME", NA, "DATE", NA
#> $ format_pattern  <chr> "M/d/yyyy H:mm:ss", NA, NA, NA

Usage and design

Big question: Will googlesheets4 have a notion of a spreadsheet object?

Peer packages

What do peer packages do?

More thoughts

A spreadsheet is more like a database with multiple tables than a csv file. Which suggests it’s not crazy to think of establishing a connection, then reading from and writing to the different tables within it.

readxl and readr are read-only and the target files tend to be local and static. In contrast, googlesheets4 will be read/write and the target file is more volatile. I feel like this supports the use of a persistent object, but I can’t say exactly why.

File id is the only thing that can’t change about a Google Sheet. Even if we augment the file id with other metadata, at request time, there’s no guarantee that the metadata is still accurate.

  • Storing the file name along with the id just seems humane. It will almost always be correct.
  • It’s very nice to have an overview of the worksheets: name, extent. Even readxl will list the worksheet names for you (readxl::excel_sheets()).
  • How much do we rely on the most recent metadata versus dutifully update? For example, when asked to read a specific worksheet, do we just try or do we use current or refreshed metadata to first establish that the worksheet actually exists?

These are related, but distinct, and shouldn’t be conflated:

  • Registering a spreadsheet as a pointer-with-metadata.
    • Example: current googlesheet object.
  • Capturing an exhaustive snapshot of a spreadsheet or a worksheet as an R object that is intelligible and amenable to further manipulation. The linen object is an example. A lower-tech example is the tibble returned by googlesheets::gs_read_cellfeed(). It has one row per cell and columns for, e.g., cell reference, various forms of cell contents, and cell formatting.

googlesheets makes data immediately available in two forms:

  • A data frame with same “shape” as the Sheet. Defaults to formatted cell contents, but gs_read(..., literal = FALSE) requests unformatted contents.
  • A data frame with one row per cell via gs_read_cellfeed(), with cell location data and cell contents in three forms (“literal value” = formatted, “input value” = (sort of) what was typed, “numeric value” = actual numeric value).

In fact, in general gs_read() is a call to gs_read_cellfeed() followed by a call to gs_reshape_cellfeed() (which reshapes and enforces col types). This seems like a good design, i.e. to expose the “one row per cell” form, but offer a nice wrapper that does the reshaping.