# Some examples of my workflow for analysing battery test data with R and tidyverse

## Intro

I received an email from a former colleague the other day, who asked me a question about plotting some data from a battery test via some makeshift R functions I made several years ago (arbintools, as it happens), which got me thinking.

At that time, I was starting to really get into the swing of scripting all my data analysis work, I had written code to quickly import, analyse and plot data off instruments that a lot of people were using, so I polished it up and shared it, trying to flatten out the learning curve for others to try it out for themselves. But since then, I haven't kept arbintools updated, while a lot has changed in the R community, especially around the superb tidyverse ecosystem.

So, I thought it might be a good time to write a post here about what my normal workflow towards importing, analysing and plotting battery data in R would typically look like ("tidy arbintools", as it will turn out). I have a few motivations here. Foremost, I can't recommend strongly enough, especially to PhD students in this or any remotely similar field, to learn how to program, in whatever language you like best or consider most relevant. While I know a number of people are using code I've written to do data analysis like this now, it's always worth the time investment to learn the basics, so you can write code for whatever purpose you need. But also importantly, I'm a big proponent of using scripted data analysis as a record of how data is treated between acquisition and presentation, and including this in publications. I've written before about some reasons why I think this is important. And lastly, my hope is that this might serve as some inspiration as to how useful this can be, or as a source of tips for anyone interested in learning how to do this.

In the interests of keeping explanations brief, I'm sharing code here assuming some familiarity with R on the part of the reader. For learning the basics, and especially the tidyverse system, I can strongly recommend the freely-available book "R for Data Science".

## Where to start

As I've said, I am a big fan of the tidyverse series of add-on packages in R, so the first line in pretty much every R script I'll ever write is:

library(tidyverse)


Now, if I'm going to show some data analysis examples, I need some data. For this I'll use the data from of my recent papers on lithium-sulfur batteries, for which we shared the dataset openly on Zenodo (the paper itself is here). If you're interested, therefore, you can follow the code here and play with the dataset yourself (note though, the full dataset is quite big, at around 415 MB).

# Download the file from Zenodo, and save as ccc.zip into the current working directory

# Now unzip the contents of ccc.zip into a new folder ccc/
unzip(zipfile = "ccc.zip", exdir = "ccc")


For the purposes of this post I'll just work with the data which is in the "3_separators" folder, which contains a long-duration charge/discharge cycling data for six Li-S cells with three different separators. So, the first thing I'll do is set my working directory to that folder.

setwd("ccc/data/3_separators/")


### Write your own importing function

Importing plain text files in R is no problem, and tidyverse has its own version via the readr package. But our data here was collected using, as it happens, an Arbin battery tester, which only spits out its data in Excel format. Thankfully, tidyverse also includes the splendid readxl package for this purpose. This needs to be loaded separately, so:

library(readxl)


Now there are several Excel files in this folder:

grep(".xlsx", list.files(), value = TRUE)

[1] "YC25A.xlsx" "YC26A.xlsx" "YC64A.xlsx" "YC66D.xlsx" "YC66E.xlsx" "YC71B.xlsx"


I'll just choose one to work with for now, which I'll give to an object called filename for convenience later:

filename = "YC64A.xlsx"


Ordinarily, you can just import an Excel sheet using something of the form:

read_excel(path = "path/to/file.xlsx", sheet = "Sheet1")


In this case, because of the particular experiment, we have a lot of data points. And it so happens that if we have more than 65,000 rows in the raw data, then the macro creates a new sheet and carries on in the new sheet (this is related to limitations in the number of rows in old versions of Excel). So to import all the raw data, we have to take this into account. In R, this can be quite straightforward.

excel_sheets() returns a vector of the names of all the sheets in an Excel file. But I'm only interested in the ones that start with "Channel", as these are where the raw data is. I can pass the result of excel_sheets() to the str_subset() function in the stringr package, which I can use to match only the ones that include "Channel". (Incidentally, another tip: learn how to use the pipe, or %>% operator: here's a tutorial.)

excel_sheets(path = filename) %>%
str_subset("Channel")

[1] "Channel_1-031_1" "Channel_1-031_2" "Channel_1-031_3" "Channel_1-031_4" "Channel_1-031_5"
[6] "Channel_1-031_6"


There are six sheets then, and I would like to read in each of them and stitch them together into a single table of data. For this sort of job, I use R's lapply() function, which I use for pretty much any sort of batch task over, rather than for loops. (Another tip: if you're learning R, it is well worth learning how to work with lists, which lapply() is all about. Here's a tutorial.)

So what I can do is take the above list of sheet names, and use lapply() to import them one at a time:

imported.data <- excel_sheets(path = filename) %>%
str_subset("Channel") %>%
lapply(function(sheet) {
read_excel(path = filename, sheet = sheet)
})


What the above code does is go through the Excel sheet names one by one, reads in that sheet from the Excel file, and at the end returns a list of all six tables (or, rather, "tibbles", since this is tidyverse), saved as the object imported.data. But this is not quite what I want — I would like:

• One single table, not a list of six separate tables
• Only the columns I want - say time, cycle number, current, voltage, charge... with shorter names, since the column names are a bit long, like "Discharge_Capacity(Ah)"
• To have this as a function, so I can batch import multiple Excel files with lapply()
• Maybe I would like to be able to normalise charge values by active mass?
• Maybe I would like to add a unique identifier column, which might be useful later?

Skipping ahead a few steps, I can do all of these and write a simple import function which does all of these things:

import_arbin_raw <- function(filename, mass = NULL, ident = NULL) {

excel_sheets(path = filename) %>%
str_subset("Channel") %>%
lapply(function(sheet) {

# for each Excel sheet containing the word "Channel",
# read_excel reads it in and assigns it to the object tbl
tbl <- read_excel(path = filename, sheet = sheet)

# to.select is a list of variables I want from each sheet
# and how they should be renamed during import
to.select <- list(t = sym("Test_Time(s)"),
cyc.n = sym("Cycle_Index"),
I = sym("Current(A)"),
E = sym("Voltage(V)"),
Q.c = sym("Charge_Capacity(Ah)"),
Q.d = sym("Discharge_Capacity(Ah)"))

# Create a new object tbl2 with the selections we
# want from tbl
tbl2 <- tbl %>%
select(!!!to.select)

# If the mass argument is not null, correct capacity values
# (in Ah) to mAh/g (assuming mass is in units of mg)
if(!is.null(mass)) {
tbl2 <- tbl2 %>%
mutate(Q.c = Q.c * 1E6 / mass, Q.d = Q.d * 1E6 / mass)
}

# If the ident argument is not null, add it as a column
if(!is.null(ident)) {
tbl2$ident <- ident } # Return the tbl2 object return(tbl2) }) %>% # now use do.call(rbind, .) to take the list of "tibbles" (or data frames) # returned, and stitch them all together in sequence. do.call(rbind, .) }  Now, if I ever want to import a file like this again, I can do it in one line, with the option to provide the mass and ident information if I wish. imported.data <- import_arbin_raw(filename, mass = 2.9705, ident = "Celgard") imported.data  # A tibble: 385,755 x 7 t cyc.n I E Q.c Q.d ident 1 0.0522 1 0 3.06 0 0 Celgard 2 600. 1 0 3.06 0 0 Celgard 3 1200. 1 0 3.06 0 0 Celgard 4 1800. 1 0 3.06 0 0 Celgard 5 2400. 1 0 3.06 0 0 Celgard 6 3000. 1 0 3.05 0 0 Celgard 7 3600. 1 0 3.05 0 0 Celgard 8 4200. 1 0 3.05 0 0 Celgard 9 4800. 1 0 3.05 0 0 Celgard 10 5400. 1 0 3.04 0 0 Celgard # … with 385,745 more rows  ## Summarising and plotting data Battery test instruments usually export aggregated statistics themselves, but why not do it myself? dplyr's functions for subsetting, sorting and modifying data make this easy. Suppose I want an aggregated table of capacity and coulombic efficiency vs cycle number, which I can assign to an object stats: stats <- imported.data %>% group_by(cyc.n) %>% summarise(Q.c = last(Q.c), Q.d = last(Q.d)) %>% mutate(CE = Q.d / Q.c) stats  # A tibble: 191 x 4 cyc.n Q.c Q.d CE 1 1 1266. 1214. 0.959 2 2 1075. 1054. 0.981 3 3 1046. 1022. 0.977 4 4 1025. 995. 0.971 5 5 1012. 981. 0.970 6 6 1000. 968. 0.967 7 7 990. 958. 0.968 8 8 987. 954. 0.967 9 9 981. 949. 0.967 10 10 975. 942. 0.967 # … with 181 more rows  I very much like ggplot2 for my plotting work, and though I have no idea whether this is considered good practice, I like to pipe (%>%) data objects through various functions to wrangle it into the format I want it straight into ggplot(). In this way, I can go from raw data to a publication-quality plot in only a few lines. As an example, I can filter out only up to 100 cycles and create a very simple plot: stats %>% filter(cyc.n <= 100) %>% ggplot(aes(x = cyc.n, y = Q.d)) + geom_point()  It's an OK start but needs some work - the axes need labelling, for one thing, rescale the y-axis to start at zero (personal preference for this sort of plot), but mostly I'd quite like to modify the theme and maybe add a touch of colour. ggplot2 has a few built-in themes, but of course we'd quite like to make our own, in the form of a function which can be added onto a ggplot() call like the one above. Here's the theme I usually use: theme_Lacey2 <- function(base_size=16, base_family="Helvetica Neue", alt_family = base_family, legend.position = "top", panel.background.fill = "#fafafa") { library(grid) library(ggthemes) (theme_foundation(base_size=base_size, base_family=base_family) + theme(plot.title = element_text(size = rel(1.3), hjust = 0, face = "bold"), plot.subtitle = element_text(face = "italic"), text = element_text(), panel.background = element_rect(fill = panel.background.fill), plot.background = element_rect(fill = "transparent", colour=NA), panel.border = element_rect(colour = "#333333", size = 0.3), axis.title = element_text(size = rel(1), colour="#333333"), axis.title.y = element_text(angle=90, colour="#333333"), axis.text = element_text(size = rel(0.8), family = alt_family), axis.ticks.length=unit(0.15, "cm"), axis.text.x = element_text(margin = margin(0.2, 0, 0.2, 0, "cm"), colour="#333333"), axis.text.y = element_text(margin = margin(0, 0.2, 0, 0.2, "cm"), colour="#333333"), panel.grid.major = element_line(colour="#eaeaea", size = 0.5), panel.grid.minor = element_line(colour="#eaeaea", size = 0.2), legend.key = element_rect(colour = NA), legend.key.size = unit(0.6, "cm"), legend.background = element_blank(), strip.background=element_rect(colour="#eaeaea",fill="#eaeaea"), strip.text = element_text(colour = "#333333", lineheight=0.7), legend.title = element_text(size = rel(0.8), family = alt_family), legend.position = legend.position )) }  So adding in this, along with some properly-formatted axis labels, scaling, and a splash of colour, we can have: stats %>% filter(cyc.n <= 100) %>% ggplot(aes(x = cyc.n, y = Q.d)) + geom_point(size = 2, color = "#0789ce") + scale_y_continuous(limits = c(0, NA)) + labs(x = "cycle number", y = "Q"[discharge]~"/ mAh g"^"-1") + theme_Lacey2()  ## More complicated plots Let's take a voltage profile (potential vs charge) as an example, since this used to take me a good 20 minutes per plot before I learned to script these. In this data, we have one column for the charge on the discharge half-cycle, and one column for the charge on the charge half-cycle. So, the first thought might be to plot one line for each column, filtering out only the cycle numbers we need, and making each cycle a different colour: imported.data %>% filter(cyc.n %in% c(1, 10, 20, 50, 100)) %>% ggplot(aes(y = E)) + geom_path(aes(x = Q.c, color = cyc.n)) + geom_path(aes(x = Q.d, color = cyc.n))  This plot is pretty horrible. A big problem here is that, for example, at the end of the discharge cycle, the Q.d value remains constant during the charge, while the voltage increases again - so at the end of each discharge the plot is drawing a straight vertical line, while it's also drawing the charge profile. Another problem - by default, ggplot interprets cycle number as a continuous variable (i.e., it thinks you could have, say, cycle 1.63 between cycles 1 and 2), so it plots a continuous scale for the colour. And, for this specific case, we have a lot of current interruptions (part of the technique used in this work), which show up as distracting little spikes. So, I would like to do a few things: • Remove those distracting spikes by filtering out parts of the data where current is zero • Fix the 'two charge values' problem, by setting charge capacity to a blank value for discharges, and vice versa • Plot only one geom (or one set of lines) for E vs Q • Scale properly, make colours nice, properly format labels, add nice theme etc... Again, all of these things we can do in one piped command with functions from several tidyverse packages (introducing here pivot_longer() from tidyr, which converts multiple columns into key-value pairs): imported.data %>% # filter out specific cycles, and any data where current is zero filter(cyc.n %in% c(1, 10, 20, 50, 100), I != 0) %>% # if I > 0, then discharge capacity is blank, and vice versa. mutate(Q.c = ifelse(I > 0, Q.c, NA), Q.d = ifelse(I < 0, Q.d, NA)) %>% # select only what's needed for the plot select(cyc.n, E, Q.c, Q.d) %>% # convert to "long" format, i.e., columns of cyc.n, E, Q, mode, where # mode has values of Q.d or Q.c - so there is only one column of Q values pivot_longer(cols = starts_with("Q"), names_to = "mode", values_to = "Q") %>% # now plot ggplot(aes(x = Q, y = E)) + # use factor(cyc.n) so that cyc.n is not interpreted as a continuous variable geom_path(aes(color = factor(cyc.n), group = mode), size = 0.8) + scale_y_continuous(limits = c(1.75, 2.65)) + scale_x_continuous(breaks = seq(0, 1400, 200)) + # change the colour scale scale_color_brewer("cycle number", palette = "Set1") + theme_Lacey2() + labs(x = "Q / mAh g"^-1~"", y = "E / V")  12 lines (plus some comments), and a plot that would look decent in any article. ## Batch importing and plotting Batch processing of repetitive data processing and analysis, like this, is where scripting really saves you time. And when it comes to battery materials science, in almost every publication you'll see plots of, say, capacity vs cycle number for several different materials. What I'll usually do here is have a table of all the information I need. With the import function I wrote earlier, I want a filename, I want an active material mass, and I want an identifying label. Now, the tibble package has a neat function called tribble() (transposed tibble) which is handy for directly writing in small tables in the code, like this: files <- tribble( # file | mass | ident # -----|------|------ ~filename, ~mass, ~ident, "YC64A.xlsx", 2.9705, "Celgard", "YC26A.xlsx", 2.977, "CCC", "YC25A.xlsx", 2.9445, "Cellulose" ) files  # A tibble: 3 x 3 filename mass ident 1 YC64A.xlsx 2.97 Celgard 2 YC26A.xlsx 2.98 CCC 3 YC25A.xlsx 2.94 Cellulose  With this, I can do a batch import with lapply() much like I used for writing the import function before — and this is usually what I'd do since I'm most familiar with it: separators <- lapply(1:nrow(files), function(i) { import_arbin_raw(files$filename[i], mass = files$mass[i], ident = files$ident[i])
}) %>%
do.call(rbind, .)


However, there is also the purrr package in tidyverse, which aims to do a lot of the same things that the apply series of functions do but in a more consistent and readable way. So I could instead write something that does the same thing using the pmap_dfr() function, which goes through a tibble row by row, runs a function for each row, and returns a single tibble bound by rows, which is the same as what do.call(rbind, .) does. So this would look like:

separators <- files %>%
pmap_dfr(function(filename, mass, ident) {
import_arbin_raw(filename, mass, ident)
})


which is pretty neat really. So now, to make those plots we made earlier, I can summarise the data using very nearly the same code I already wrote:

separators_stats <- separators %>%
# Add ident in the group_by line
group_by(ident, cyc.n) %>%
summarise(Q.c = last(Q.c), Q.d = last(Q.d)) %>%
mutate(CE = Q.d / Q.c)


And using some of the same tools I showed earlier, plus some other tricks from my blog archive, I can make some useful plots without too much effort:

separators_stats %>%
filter(cyc.n <= 100) %>%
select(cyc.n, ident, Q.d, CE) %>%
rename(Q[discharge]~"/ mAh g"^-1~"" = Q.d) %>%
pivot_longer(cols = -c("cyc.n", "ident"), names_to = "key", values_to = "value") %>%
ggplot(aes(x = cyc.n, y = value)) +
geom_point(aes(color = ident)) +
scale_color_brewer("", palette = "Set1") +
facet_grid(key ~ ., scales = "free_y",
labeller = label_parsed,
switch = "y") +
theme_Lacey2() +
theme(strip.background = element_blank(),
axis.title.y = element_blank(),
strip.text = element_text(size = rel(1)),
strip.placement = "outside") +
labs(x = "cycle number")


Or, with only minor modifications to the voltage profile code from before, I can do this:

separators %>%
# Filtering and sorting data
filter(cyc.n %in% c(1, 10, 20, 50, 100), I != 0) %>%
mutate(Q.c = ifelse(I > 0, Q.c, NA), Q.d = ifelse(I < 0, Q.d, NA)) %>%
# Include ident in the selection here
select(cyc.n, ident, E, Q.c, Q.d) %>%
pivot_longer(cols = starts_with("Q"), names_to = "mode", values_to = "Q") %>%
# Plot
ggplot(aes(x = Q, y = E)) +
geom_path(aes(color = factor(cyc.n), group = mode), size = 0.8) +
scale_y_continuous(limits = c(1.75, 2.65)) +
scale_x_continuous(breaks = seq(0, 1800, 200)) +
scale_color_brewer("cycle number", palette = "Set1") +
# Facet by ident
facet_grid(ident ~ .) +
theme_Lacey2() +
labs(x = "Q / mAh g"^-1~"", y = "E / V")


Now, this sort of work becomes largely a copy/paste exercise. Let's say I want to plot all six data files, which are in fact two sets of three - one with an electrolyte additive, and one without:

# Table of filenames, masses and identifiers
files2 <- tribble(
# file | mass | ident
~filename, ~mass, ~ident,
"YC64A.xlsx", 2.9705, "Celgard-LiNO3",
"YC26A.xlsx", 2.977, "CCC-LiNO3",
"YC25A.xlsx", 2.9445, "Cellulose-LiNO3",
"YC71B.xlsx", 3.2305, "Celgard-No LiNO3",
"YC66E.xlsx", 3.146, "CCC-No LiNO3",
"YC66D.xlsx", 3.2955, "Cellulose-No LiNO3"
)

# Import the data
separators2 <- files2 %>%
pmap_dfr(function(filename, mass, ident) {
import_arbin_raw(filename, mass, ident) %>%
# Split ident into two new columns
mutate(separator = str_split(ident, "-")[[1]][1],
})

# Sort the data and plot it
separators2 %>%
# Filtering and sorting data
filter(cyc.n %in% c(1, 10, 20, 50, 100), I != 0) %>%
mutate(Q.c = ifelse(I > 0, Q.c, NA), Q.d = ifelse(I < 0, Q.d, NA)) %>%
# Include separator and additive in the selection here
select(cyc.n, separator, additive, E, Q.c, Q.d) %>%
pivot_longer(cols = starts_with("Q"), names_to = "mode", values_to = "Q") %>%
# Plot
ggplot(aes(x = Q, y = E)) +
geom_path(aes(color = factor(cyc.n), group = mode), size = 0.8) +
scale_y_continuous(limits = c(1.75, 2.65)) +
scale_x_continuous(breaks = seq(0, 1800, 200)) +
scale_color_brewer("cycle number", palette = "Set1") +
# Facet by separator and additive

• The tidyverse set of packages, which is pretty much a 'dialect' within R, is pretty great. I think I used every one of the "core" packages (except forcats) in this post in some way or another. I've found it more intuitive to get started with and learn than, for example, Python, if you're on the fence about what to learn...
• Even though ggplot2 has its (often intentional) limits in places, it's pretty powerful overall. I don't think I've used anything else to make a plot for any serious purpose in over 5 years.