Hi! Welcome to my data journalism R cheat sheet for cleaning and wrangling data. You may have seen other R cheat sheets organized by package, and journalists have put out cheat sheets before, like MaryJo Webster’s R cheat sheet. Hat tip to her and her amazing collection of data journalism resources! This is what I use for quick reference to data cleaning functions.
I use this cheat sheet as a roadmap to functions I will likely encounter and use, so I am hoping it can save you some time googling around. Most of the functions listed here are ones I collected while cleaning data at the NICAR data library and for the Accountability Project at the Investigative Reporting Workshop.
Even though I use data cleaning functions on a regular basis, the syntax or function names sometimes get fuzzy. So I have organized the functions I reach for most often by goal in data processing, along with pitfalls to watch for, and I have included useful links to discussions of these functions by other people. Thanks, internet!
I always think the most important thing about functions, and the key to tapping into their strengths, is to understand what you’re dealing with, i.e. data structures. When I can’t write clean code and get it to work, I take out my pen and pad and jot down what I want to achieve and the end products I wish to have. This helps me get a grasp of the objects at hand, break them down into the smallest units possible, and see how those units can form the end products I want. For example, are you trying to modify a column (rewrite it) or add a new column based on an existing one? Columns are vectors made up of individual elements, zipped together into a data frame (that’s probably why length(df) returns the number of columns). If you’re modifying a column, you can assign a new vector to df$col. If you’re adding a column, you’ll probably use mutate() from the dplyr package on the dataframe itself.
There is, of course, more than one way to do it (let’s not skin any cats!). For example, picture a jumbled field of last name, first name. If you wish to separate the names into different columns, you can either use the separate() function to split the column on a certain character, or use a regular expression to capture whatever comes before the comma with str_match(x, "^(.+),")[,2] and whatever comes after with str_match(x, ",(.+)$")[,2]. Pick and choose as you wish!
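To make that concrete, here’s a quick sketch with made-up names (the column and data frame are hypothetical):

```r
library(tidyr)
library(stringr)

# A hypothetical jumbled "last, first" field
df <- data.frame(name = c("Doe, Jane", "Smith, John"))

# Option 1: separate() splits the column on the comma (and any space after it)
df_sep <- separate(df, name, into = c("last", "first"), sep = ",\\s*")

# Option 2: regex capture groups; column 2 of str_match()'s
# result matrix holds the first capture group
last  <- str_match(df$name, "^(.+),")[, 2]
first <- str_match(df$name, ",\\s*(.+)$")[, 2]
```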
This cheat sheet doesn’t cover the nuts and bolts of R, for which I highly recommend Andrew Ba Tran’s amazing tutorial Journalism with R.
Since other people have spent all this time writing functions and packaging them, we use their functions by loading the packages (whose names may take developers a lot of time to come up with) first in the R console or scripts. It’s important to declare the packages you use at the top of your R Markdown or R script file, especially if you need to knit the file (to generate a standalone, readable HTML from the Rmd); otherwise it will throw a bunch of errors. This also assumes that you have the packages installed already. To install a new package, run install.packages("packagename") first or use the Packages pane to manage your packages (by default, the lower right-hand side pane). If a package is not on CRAN yet and can’t be installed with install.packages("packagename"), install it from GitHub with the remotes package, e.g. remotes::install_github("r-lib/remotes"). The double colons let you run a function from a certain package without loading it: {package_name}::{function_name}. (remotes itself is on CRAN, so you can install it with install.packages("remotes").) The tidyverse package solves most of the problems related to data cleaning. It includes all these packages:
library(tidyverse)
tidyverse_packages()
#> [1] "broom" "cli" "crayon" "dplyr" "dbplyr"
#> [6] "forcats" "ggplot2" "haven" "hms" "httr"
#> [11] "jsonlite" "lubridate" "magrittr" "modelr" "purrr"
#> [16] "readr" "readxl" "reprex" "rlang" "rstudioapi"
#> [21] "rvest" "stringr" "tibble" "tidyr" "xml2"
#> [26] "tidyverse"
Functions are just objects. They exist in those packages! So if you want to know more about a function, call it by its name, without parentheses, in the console. For the documentation, type “?” followed by the function name, and the documentation will be laid out in the Help window. Try ?str_split. You can also see popups in the console when you type a function name; pressing F1 (Fn + F1) will likewise open the function’s help in the Help pane.
To see how a function was written and make sense of how it’s executed, simply type the name of the function, like str_split.
Further Readings: Introduction to the R Language Functions, Berkeley Workshop
Here’s some function help. The basic syntax of a function is
f <- function(<arguments>) {
## Do something interesting
}
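For instance, here’s a tiny cleaning function following that template (a hypothetical helper, not from any package):

```r
# Strip a "%" sign from a character vector and convert the result to a number
pct_to_num <- function(x) {
  as.numeric(sub("%", "", x, fixed = TRUE))
}

pct_to_num(c("45%", "7%"))  # 45 7
```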
Trying to write for loops? R for data science has a great chapter on it.
output <- vector("double", ncol(df)) # 1. output
for (i in seq_along(df)) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
The computer reads scripts in a certain order, and it can get confused when you have many arguments. To help organize things, and to make your code less confusing for machines and humans to read, you can bracket the code you wish to execute first with () or {}.
A common usage: you have a numeric variable x = 1, you wish to do a calculation on it, like x + 1, and then use : to generate all the numbers between x + 1 and 15. But x + 1:15 won’t give you the result you want; it returns everything between 2 and 16. That’s because : binds tighter than +, so R builds the vector 1:15 first and then adds x to every element of that vector. You can solve this by writing {x + 1}:15 (or (x + 1):15), so that R knows to treat everything wrapped in the brackets as a whole.
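You can check the precedence rule right in the console:

```r
x <- 1
x + 1:15    # `:` runs first: 1 + c(1, 2, ..., 15), i.e. 2 through 16
(x + 1):15  # brackets run first: the sequence 2 through 15
{x + 1}:15  # curly braces behave the same way here
```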
There are certain operators in R that streamline code execution and save you some typing. For example, . can represent the object being piped into a function. If you wish to concatenate a vector with other strings, you can use the syntax df$col %>% str_c("before", ., "after") to control where the vector lands in the string. Learn more about magrittr’s . placeholder here.
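A minimal sketch of the dot placeholder (the vector contents are made up):

```r
library(magrittr)  # provides %>% and the `.` placeholder
library(stringr)

col <- c("alpha", "beta")
# `.` marks where the piped-in vector goes among the other arguments
out <- col %>% str_c("before-", ., "-after")
out  # "before-alpha-after" "before-beta-after"
```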
Here are some very useful references for understanding data types and data structures.
Andrew’s data import/export tutorial shows you how to import and later export most types of data files. RStudio’s tidyverse cheat sheet also offers a comprehensive view of reading data, and how you can tap into the functionalities of the tibble (a type of data frame) for your tables.
You’re now ready to use functions to solve problems like these:
function | package | syntax | notes | references |
---|---|---|---|---|
Convert row index to a column | tibble | wy <- tibble::rowid_to_column(wy, "id") | | |
print truncated elements in a vertical layout | dplyr | wy %>% filter() %>% glimpse() | | |
function | package | syntax | notes | references |
---|---|---|---|---|
filter out NA rows | dplyr, stats | filter(!is.na(colname)) OR complete.cases(df$col) | filter(!is.na(colname)) has the same effect as filter(is.na(colname) == FALSE) and returns a dataframe. complete.cases() applies directly to one or more vectors and returns a logical vector after testing whether ALL specified columns are complete; applied to a single vector, the result equals !is.na(). To get a dataframe with rows removed wherever the column “col” evaluates to NA, see below. | https://statisticsglobe.com/na-omit-r-example/ |
filter out NA fields in vectors | stats | unique(na.omit(ct["column_name"])) | na.omit() records the removed cases in an "na.action" attribute | https://github.com/irworkshop/accountability_datacleaning/blob/campfin/R_campfin/ct/expends/docs/ct_expends_diary.md |
drop rows for columns containing NA values | tidyr | drop_na(col_name) | called without column arguments, drop_na() drops any row containing an NA in any column! The link shows how to drop NAs for a specific column too. df[complete.cases(df$col),] achieves almost the same as df %>% drop_na(col) | https://stackoverflow.com/questions/26665319/removing-na-in-dplyr-pipe https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame |
turn values into NA | dplyr | na_if(df$col, y) OR wv <- wv %>% mutate(city_clean = case_when(city_clean %in% c("WV", "WEB BASED", "A", "PO BOX", "ANYWHERE USA", "VARIES", "COUNTY") ~ NA_character_, TRUE ~ as.character(city_clean))) | The value to be replaced is not a regex. To replace multiple values with NA, try case_when() | |
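Here’s how those NA helpers behave on a tiny made-up table:

```r
library(dplyr)
library(tidyr)

df <- tibble(city = c("DENVER", "PO BOX", NA), amount = c(10, 20, 30))

kept    <- filter(df, !is.na(city))                  # drops only the NA row
dropped <- drop_na(df, city)                         # same, scoped to one column
blanked <- mutate(df, city = na_if(city, "PO BOX"))  # "PO BOX" becomes NA
```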
function | package | syntax | notes | references |
---|---|---|---|---|
change case | stringr, base | str_to_upper()/toupper() str_to_lower()/tolower() | ||
ignore case | stringr | fixed(‘toyota’,ignore_case=TRUE) | ||
replace matching strings in a dataframe | base, stringr | RETA2016_negative <- mutate_if(as_tibble(reta2016_clean), is.character, str_replace_all, pattern = "J", replacement = "-1") la_lobby <- la_lobby %>% mutate(lobbyist_name_clean = str_remove(LobbyistName1, "MRS.\\s|MR.\\s|MS.\\s|MISS\\s|DR.\\s")) | gsub() is the base R version. When replacing or removing multiple patterns, use the regex "|" alternation; otherwise the function matches the first element of the pattern vector against the first element of the string vector, and so on. Because of that vectorized matching, you can in effect “subtract” the content of one column from another: df <- df %>% mutate(statezip = str_remove(df$citystatezip, df$city)). str_replace_all() will replace all the matches while str_replace() will only replace the first match. | https://community.rstudio.com/t/understanding-the-use-of-str-replace-all-with-multiple-patterns/1849 https://stackoverflow.com/questions/29036960/remove-multiple-patterns-from-text-vector-r |
concatenate (concatenate vectors after converting to character) | base | mutate(ZIP = paste0("0", as.character(ZIP))) %>% mutate(location = paste0(ADDRESS, ",", CITY, ", CT", ZIP)) | paste0() differs from paste() in that paste() takes a default separator of " " (a space) while paste0() uses "" (no space). A way to remember this is that paste0() concatenates with zero space. | http://learn.r-journalism.com/en/mapping/geolocating/geolocating/ |
concatenate | stringr | str_c("Letter", letters, sep = ":") | str_c() concatenates strings element-wise. If you pass in a vector and want to collapse the result into a single string, use collapse = "". One way to remember it is that collapse = "" executes after a vector is returned and collapses that vector into one string. | |
extract the complete match of a string with regex | stringr | str_extract(text, "\\d{5}(?:-\\d{4})?") | remember to double the backslashes when writing the regex as an R string | |
extract part of a string | stringr | str_match(strings, phone) | str_match() returns a character matrix. First column is the complete match, followed by one column for each capture group. str_match_all returns a list of character matrices. Use the [,n] index for the nth capture group | |
View HTML rendering of regular expression match | stringr | str_view(c(“abc”, “def”, “fgh”), “d|e”, match = FALSE) | ||
add a zero to make it (#) digits | base | fipsst <- mutate(fipsst, STATE = sprintf("%03d", STBRDG)) | pads the code to three digits | |
make strings containing executable R code | glue | glue("{col1}, {col2}") | expressions inside the curly braces are evaluated and interpolated into the string | |
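A few of these string helpers in action (the honorifics and ZIP below are invented examples):

```r
library(stringr)

people <- c("MR. JOHN DOE", "MS. JANE ROE")
# remove honorifics with a "|" alternation, as in the table above
str_remove(people, "MR\\.\\s|MS\\.\\s")   # "JOHN DOE" "JANE ROE"

# extract a ZIP: five digits plus an optional +4
str_extract("Hartford CT 06106-1591", "\\d{5}(?:-\\d{4})?")  # "06106-1591"

# pad a numeric code with leading zeros to a fixed width
sprintf("%05d", 378)  # "00378"
```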
Use vector[length(vector)] to access the last element of a vector. R for Data Science has a great chapter on vector and list indexing, with a superb pepper-shaker analogy.
function | package | syntax | notes | references |
---|---|---|---|---|
Extract every nth element of a vector | base | a <- 1:120 b <- a[seq(1, length(a), 6)] | ||
get the position of a column named “B” in a data frame/vector | base | grep("B", colnames(df)) - containing “B”; grep("^B$", colnames(df)) - called exactly B; OR which(colnames(df) == "B") | grep() returns a vector of indices of the character strings that contain the pattern | https://thomasleeper.com/Rcourse/Tutorials/vectorindexing.html |
Get the index of a string in a vector | base | match("CONTRIBUTOR", pa_col_names) | match() returns a vector of the positions of (first) matches of its first argument in its second. | https://stackoverflow.com/questions/27556353/subset-columns-based-on-list-of-column-names-and-bring-the-column-before-it |
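These indexing helpers, on a throwaway data frame:

```r
df <- data.frame(A = 1:3, B = 4:6, C = 7:9)

grep("^B$", colnames(df))    # 2: position of the column called exactly B
which(colnames(df) == "B")   # 2: same result
match("B", colnames(df))     # 2: position of the first match

v <- letters
v[length(v)]                 # "z": the last element
```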
function | package | syntax | notes | references |
---|---|---|---|---|
Take a sequence of vector, matrix or data-frame arguments and combine by columns or rows | base | stations <- cbind(stations, geo) OR bind_cols() | ||
string together two dataframes/vectors vertically by binding rows | dplyr, base | bind_rows() OR rbind(a = 1, b = 1:3) | the two dataframes need the same columns (bind_rows() matches by name); with rbind() on vectors, the longer vector’s length should be a multiple of the shorter’s | |
create a dataframe from vectors | base, dplyr | tibble(), data.frame(), tibble(col1 = c(1,2,3), name = c("A","B","C")) | tibble() never converts strings to factors (with older data.frame() defaults, set stringsAsFactors = FALSE). On the left are the column names, and on the right are the vectors assigned to those columns. Data frames are essentially columns (vectors) bound together. | |
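A quick sketch of building and binding, with toy data:

```r
library(dplyr)

a <- tibble(id = 1:2, name = c("A", "B"))
b <- tibble(id = 3,  name = "C")

stacked <- bind_rows(a, b)                           # 3 rows, matched by name
wider   <- bind_cols(a, tibble(amount = c(10, 20)))  # columns side by side
```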
function | package | syntax | notes | references |
---|---|---|---|---|
select a column whose name is… | dplyr | df$colname, or df[["colname"]] OR pull(df, col) | pull() can help turn a one-column data frame into a vector. To select multiple columns, use df[c("col1", "col2")] or select(df, col1, col2) | |
select a column whose index is n | base, dplyr | df[[n]] OR pull(df, n) | without double brackets, single-bracket indexing slices the tibble into another (one-column) tibble | |
select columns whose names match a pattern | dplyr | my_data %>% select(starts_with("Petal")) | starts_with() is a select helper from the dplyr package that works with select() or with vars() when used with mutate_at()/summarize_at(). See the batch processing section for usage of mutate_at(). | |
change one column name | base | MO_offense <- rename(MO_offense, Off.Age = Offender.Age.at.Time.of.Offense) | ||
define column names; change all column names to lowercase | base | colnames(bridges17) <- names_field_17 (I have the vector needed) OR colnames(bridges17) <- tolower(colnames(bridges17)) | | |
Clean column names that contain spaces and such into lowercase names separated with underscores | janitor | read_csv(df) %>% clean_names() | make_clean_names() operates on character vectors and can be used during data import, e.g. wi_principals <- list.files(principals, pattern = ".xls", full.names = TRUE) %>% map(read_xls, skip = 3, .name_repair = make_clean_names) | https://github.com/sfirke/janitor |
Specify column types when reading | readr | col_types = cols(x = col_double(), y = col_date(format = ""), z = col_character()) | | https://blog.rstudio.com/2015/04/09/readr-0-1-0/ |
change to date | readr, lubridate | parse_usa_date <- function(x, …) { parse_date(x, format = "%m/%d/%Y", …) } OR as_date(x, …) | parse_date() is from readr; as_date() is from lubridate | https://github.com/irworkshop/accountability_datacleaning/blob/campfin/R_campfin/ct/expends/docs/ct_expends_diary.md |
convert column types | base | as.numeric(df$col) as.character(df$col) | | |
convert excel numbers to date | janitor | excel_numeric_to_date(df$col, date_system = “modern”) |
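Putting col_types to work (the CSV content is made up; note how col_character() preserves the leading zero in the id):

```r
library(readr)

# write a tiny throwaway CSV so the example is self-contained
tf <- tempfile(fileext = ".csv")
writeLines("id,amount,date\n001,45.5,01/31/2019", tf)

df <- read_csv(tf, col_types = cols(
  id     = col_character(),               # keep the leading zero
  amount = col_double(),
  date   = col_date(format = "%m/%d/%Y")  # parse US-style dates on import
))
```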
function | package | syntax | notes | references |
---|---|---|---|---|
sort a vector or factor into a descending or ascending order | base | sort(x, decreasing = FALSE, …) | ||
Sort a data frame column in descending order | dplyr | arrange(desc(total_spent)) | arrange() takes a dataframe as its first argument | |
reposition columns by index or column names | base | data <- data[c(“A”, “B”, “C”)] | ||
Get the top n rows of a dataframe | dplyr | df %>% top_n(10) | top_n() picks the top n rows by a ranking variable (the last column by default); for the first n rows as stored, use head(df, n) | |
Get the first and last rows (6 by default) for quick inspection | utils | head()/tail() | | |
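Sorting, on a made-up spending table:

```r
library(dplyr)

df <- tibble(name = c("A", "B", "C"), total_spent = c(5, 50, 20))

ranked <- arrange(df, desc(total_spent))  # B (50), C (20), A (5)
sort(df$total_spent)                      # 5 20 50
head(df, 2)                               # first two rows as stored
```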
function | package | syntax | notes | references |
---|---|---|---|---|
turn wide tables into long tables | tidyr | gather() | Andrew’s tutorial says it all. | https://learn.r-journalism.com/en/wrangling/tidyr_joins/tidyr-joins/ |
turn long tables into wide tables | tidyr | spread() | | https://learn.r-journalism.com/en/wrangling/tidyr_joins/tidyr-joins/ |
separate one column into two | tidyr | separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", …) | | https://rstudio.com/wp-content/uploads/2019/01/Cheatsheets_2019.pdf#page=9 |
unite two columns into one | tidyr | unite() | | https://rstudio.com/wp-content/uploads/2019/01/Cheatsheets_2019.pdf#page=9 |
separate elements in a column into rows, one per element | tidyr | separate_rows() | | |
Make each element of a list-column into its own row | tidyr, stringr | ia_lobby_cl <- ia_lobby_cl %>% mutate(new_lobbyists = str_split(lobbyists, pattern = ",")) %>% unnest_longer(new_lobbyists) | | |
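Reshaping a toy table both ways (tidyr’s newer pivot_longer()/pivot_wider() are drop-in alternatives for gather()/spread()):

```r
library(tidyr)
library(tibble)

wide <- tibble(state = c("CT", "WY"), y2018 = c(1, 2), y2019 = c(3, 4))

long <- gather(wide, key = "year", value = "value", y2018:y2019)  # 4 rows
back <- spread(long, key = year, value = value)                   # wide again
```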
function | package | syntax | notes | references |
---|---|---|---|---|
Get all the unique/distinct values of a column | dplyr, base | unique(df$column) OR distinct(df, column) | | |
count the number of distinct values in a column | dplyr | n_distinct(df$col, na.rm = TRUE) | many functions have an na.rm argument. It’s set to TRUE here as a demo, but in many cases you’d want FALSE, depending on whether you wish to include the NAs. | |
A glimpse of your data | tibble, base | glimpse(df) OR str(df) | glimpse() usually offers a better and more complete printout. | |
find out unique values and frequencies of a vector. | janitor, base | table()/tabyl() | basically gives you a frequency table. See count() for application in a data frame | |
find out unique values and frequencies of a column in a dataframe. | dplyr | count(df, column, sort = T) | When sort = T, will return the list in descending order | |
min, max, mean, quantiles | base | summary() | ||
add a column counting the observations of another column | dplyr | mtcars %>% add_count(cyl) | add_count() is a short-hand for group_by() + add_tally() Also you won’t need mutate() | |
Get the means and sums of rows and columns | base | filter(rowSums(!is.na(wi_lobby)) >= 3) | rowSums()/colSums() and rowMeans()/colMeans() do the math; this particular syntax keeps rows with at least three non-NA values. | |
pivot table | dplyr | pivot_table <- df %>% group_by(column) %>% summarize(mean = mean(another_column), count = n()) | group_by(df, column) %>% summarize(count = n(), mean = mean()) is frequently combined with %>% arrange(desc()) on the new summary column you created. n() achieves a similar effect to df %>% count(column). | |
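The pivot-table pattern, on invented contribution data:

```r
library(dplyr)

df <- tibble(party = c("D", "R", "D"), amount = c(10, 20, 30))

pivot <- df %>%
  group_by(party) %>%
  summarize(count = n(), mean_amt = mean(amount)) %>%
  arrange(desc(count))   # most frequent group first
```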
function | package | syntax | notes | references |
---|---|---|---|---|
geom_point() for geolocating with some groupings | ggplot2 | geom_point(data = stations, aes(x = lon, y = lat, size = staff, color = DESCRIPTION), fill = "white", shape = 1) | shape = 1: circles; shape = 2: triangles; shape = 3: plus; shape = 4: X; shape = 5: diamonds | http://learn.r-journalism.com/en/mapping/geolocating/geolocating/ |
display integers only on axes | ggplot2 | scale_y_continuous(breaks = c(1, 3, 7, 10)) | | https://stackoverflow.com/questions/15622001/how-to-display-only-integer-values-on-an-axis-using-ggplot2 |
x axis labels too long! Auto-wrap the labels | stringr + ggplot2/scales | scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) OR scale_x_discrete(labels = wrap_format(10)) | | https://stackoverflow.com/questions/21878974/auto-wrapping-of-labels-via-labeller-label-wrap-in-ggplot2 |
function | package | syntax | notes | references |
---|---|---|---|---|
List all the files under a directory | fs, base | zip_files <- dir_ls(raw_dir, glob = "*.zip", regexp = "expends.+") OR contrib_files <- list.files(raw_dir, pattern = ".txt", recursive = TRUE, full.names = TRUE) | recursive is very important! It determines whether the search goes deeper into sub-directories. glob is also an important notion: a wildcard aka globbing pattern (e.g. *.csv) passed on to grep() to filter paths. dir_ls() returns named “fs_path” character objects while list.files() returns plain character vectors; their defaults also differ. | https://github.com/irworkshop/accountability_datacleaning/blob/campfin/R_campfin/pa/expends/pa_expends_diary.md |
Make a new directory | here, fs | raw_dir <- here("pa", "contribs", "data", "raw") then dir_create(raw_dir) | dir_create() is powerful when combined with the here package; here() builds the path but doesn’t actually create the directory | https://github.com/r-lib/here |
Construct the path to a file from components in a platform-independent way | base | file.path("testdir2", "testdir3") | That way you don’t have to care about "\\" vs "/" | |
Get the file info | base | file.info("your_file")$mtime | This syntax gives you the last modified time of a file | |
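file.path() in action (the directory names are arbitrary):

```r
# builds the path with the platform's separator, so you never
# hard-code "/" or "\\" yourself
p <- file.path("testdir2", "testdir3")
```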
function | package | syntax | notes | references |
---|---|---|---|---|
Test if an element is in a vector | base | x %in% valid_city | passing in two vectors gives you a logical vector. It is useful for specifying filter conditions. Similar to the IN statement in SQL. | |
Test if an element is not in a vector | campfin | x %out% valid_city | the negation of %in% | |
find out how many elements of a vector are in or out of another vector | campfin | count_in(x, y, na.rm = T) count_out(x, y, na.rm = F) prop_in(x, y, na.rm = T) prop_out(x, y, na.rm = T) | campfin is a package written by my colleague Kiernan Nicholls at IRW, yay! count_in() returns the number, and prop_in() returns the percentage. The package incorporates many data inspection functionalities wrapped in easy functions like these. It also does a lot of heavy lifting for normalizing address, city, state, zip and so on. | |
find out elements in x that are not in y | base, dplyr | setdiff(x, y) OR anti_join(x, y, by = c("col1", "col2")) | setdiff() removes duplicates! Order is important! | |
join two dataframes based on matching column or columns | dplyr | ia_lobby_cl <- ia_lobby_cl %>%left_join(zipcodes, by = c(“zip_norm” = “zip”, “city_norm” = “city”)) | left_join(), right_join(), inner_join(), full_join(). Order is important. if the columns in two dfs have the same name, the by statement can just be by = “state”. | http://www.datasciencemadesimple.com/join-in-r-merge-in-r/ https://learn.r-journalism.com/en/wrangling/tidyr_joins/tidyr-joins/ |
vlookup - look for corresponding values in a separate vector or data frame | qdapTools | lookup(df1$term, df2[c('term','key')]) OR df1$term %l% df2[c('term','key')] | The lookup() / %l% (percent-letter-L-percent) functions have a lot of restrictions, including that the dataframe to be matched against should (by default) have two columns. Watch out for multiple keys in df2 matching a single term in df1. The reassign argument lets you map the results onto whatever vector you wish. The example here is more about reassigning and mapping than simple value lookups; see the documentation for more info. | https://www.rdocumentation.org/packages/qdapTools/versions/1.3.3/topics/lookup |
Join two dataframes based on fuzzy matching | fuzzyjoin | stringdist_inner_join() | | https://cran.r-project.org/web/packages/fuzzyjoin/README.html https://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/ https://github.com/dgrtwo/fuzzyjoin |
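Membership tests and an anti_join() on a toy misspelling:

```r
library(dplyr)

contribs <- tibble(city = c("DENVER", "DNEVER"), amount = c(5, 10))
valid    <- tibble(city = "DENVER")

contribs$city %in% valid$city                   # TRUE FALSE
bad <- anti_join(contribs, valid, by = "city")  # only the misspelled row
```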
function | package | syntax | notes | references |
---|---|---|---|---|
For a set of logical vectors, evaluate whether at least one condition is TRUE | base, purrr | locality_position <- lapply(list, unlist, recursive = T) %>% map(str_detect, "locality") %>% map_lgl(any) | Similar to the “OR” operator. The syntax returns locality_position, i.e. which elements of the list contain the string “locality”. | |
Change columns based on certain conditions (data type) | dplyr | mutate_if(is_character, str_to_upper) | The mutate_if() variants apply a predicate function (a function that returns TRUE or FALSE) to determine the relevant subset of columns. | https://stackoverflow.com/questions/42052078/correct-syntax-for-mutate-if |
Test if characters are part of a string | base | grepl(value,chars, fixed=TRUE) | https://stackoverflow.com/questions/10128617/test-if-characters-are-in-a-string | |
count how many cases satisfy a condition | dplyr, base | sum(pa$STATE != "PA", na.rm = TRUE) OR pa %>% filter(STATE != "PA") %>% count() | basically tallying a logical vector: the first method sums the values that evaluate to TRUE, while the filter() method counts the remaining rows of a dataframe. | |
find elements in vector X that are not in Y | base | x[! (x %in% y)] | Be especially cautious with NAs! The rows evaluated to NAs would be retained! | https://statisticsglobe.com/setdiff-r-function/ https://www.youtube.com/watch?v=8hSYEXIoFO8 |
Change variable strings based on conditions | dplyr | mutate(variable = case_when(condition ~ value, TRUE ~ fallback)) | case_when() maps conditions to set replacement values; when the replacement depends on other variables, an if_else() or ifelse() function is probably the best bet. | http://learn.r-journalism.com/en/mapping/static_maps/static-maps/ |
Modify character strings based on variable conditions | dplyr, base | wy <- wy %>% mutate(city_clean = if_else(condition = match_distance <=2, true = city_swap, false = city_raw)) OR ifelse() |
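case_when() on a made-up city column, echoing the recipes above:

```r
library(dplyr)

df <- tibble(city = c("WV", "CHARLESTON", "WEB BASED"))

cleaned <- mutate(df, city_clean = case_when(
  city %in% c("WV", "WEB BASED") ~ NA_character_,  # junk values become NA
  TRUE ~ city                                      # everything else passes through
))
```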
function | package | syntax | notes | references |
---|---|---|---|---|
Read multiple csvs into one master dataframe | vroom | vroom::vroom(files) | | |
Mutate multiple columns at once | dplyr | aklr <- aklr %>% mutate_at(.vars = vars(ends_with("address")), .funs = list(norm = normal_address), add_abbs = usps_street, na_rep = TRUE) | mutate_at() lets you deal with multiple columns at the same time, very useful when combined with vars(), which has similar semantics to select(). | |
apply a function to multiple elements | purrr | dir_ls(path = raw_dir, glob = "*.csv") %>% map(read_delim, args) | map() returns a list, but you can get typed vectors with map_dbl() or map_chr(). map_depth() is very helpful, especially if you are working with nested lists and wish to flatten them | https://r4ds.had.co.nz/iteration.html#mapping-over-multiple-arguments https://rstudio.com/wp-content/uploads/2019/01/Cheatsheets_2019.pdf#page=14 |
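The map() family in miniature:

```r
library(purrr)

# apply median to every element of a list and collect a double vector
map_dbl(list(1:3, 4:6), median)  # 2 5
```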
Here’s a comprehensive R Studio IDE cheat sheet.
what_it_does | on_the_screen | you_will_type |
---|---|---|
switch between panes | | ctrl + 1 - source; ctrl + 2 - console; ctrl + 3 - help/viewer/plots/packages/files; ctrl + 4 - history/environment |
access the shortcuts cheatsheet in RStudio | | shift + option/alt + K |
pipe | %>% | shift + cmd/CTRL + M |
assignment operator | <- | option/Alt + (-) |
multiline comment | # | CTRL/cmd + SHIFT + C |
run current chunk | shift + cmd/CTRL + Enter | |
run selected lines of code | cmd/ctrl + Enter. This shortcut moves the cursor to the next line. To execute without moving the cursor, press alt/opt + Enter | |
stop current command | esc | |
access a list of previous commands in the console | | CTRL/cmd + uparrow (this also works after you have typed a few words; the shortcut then searches the history) |
move the selected code up/down | alt/opt + uparrow/downarrow | |
rename variables in this scope | cmd/ctrl + alt/opt + shift + M | |
replace with search results | shift + cmd/ctrl + J | |
auto fill arguments | | tab. Putting the cursor after a function name generates a popup of the function documentation. Pressing F1 (for Mac users, Fn + F1) at that point has the same effect as typing ?function in the console, and the documentation shows up in the Help window |
search for file or function | ctrl + . | |
search for tab | >> in the top right corner of the source pane | shift + ctrl+ . |
switch between tabs | ctrl + tab goes forward. to go backward, + shift | |
Unfold/fold outlines (the Rmd structure) | shift + cmd/CTRL + O | |
fold comments | option/Alt + cmd/CTRL+ L. To uncollapse, + shift | |
Jump to chunk (start of line) | cmd/ctrl + shift + option/alt + J | |
collapse all headers | ## Header 2 <––> | option/Alt + cmd/CTRL+ O. To uncollapse, + shift |
Next chunk | cmd/ctrl+pagedown | |
Knit | shift + cmd/ctrl + K |
Knitting relies on the knitr package in the Rmd. Feel free to use the cheat sheet however you want. It is definitely imperfect and not remotely all-encompassing or error-free. Please don’t hesitate to point out any mistakes in the explanations. If there’s any R function you wish to add that could be helpful for data cleaning, please fill out this google form and I’ll add it to this file.
Many thanks to Kiernan Nicholls and Prof. Michael Kearney for teaching me their R skills, and Yan Wu for helping me with the customized CSS of this webpage.