Concordancer function with data.table

1 June 2021

Sometimes in corpus linguistics, it is very helpful to create your own KWIC (keyword in context) displays or to keep track of the words before or after the tokens you are interested in. My corpus data is usually tabular, so working with data.table, a helpful R package, is my preferred approach.

Imagine a spreadsheet of data where every word in a corpus is a row (or check out my Matukar Panau corpus at PARADISEC or ELAR):

DIY Concordancer for Matukar Panau data: initial data, one row per word; ‘w’ is the short label for the word column used in the new column names.

utterance	word	index	speaker	text
Tamat	tamat	1	KG	sb14.eaf
ngalep yabai	ngalep	2	KG	sb14.eaf
ngalep yabai	yabai	3	KG	sb14.eaf
ngalap tamadi main	ngalap	4	BC	sb14.eaf
ngalap tamadi main	tamadi	5	BC	sb14.eaf
ngalap tamadi main	main	6	BC	sb14.eaf
main ngahau	main	7	KA	sb15.eaf
main ngahau	ngahau	8	KA	sb15.eaf

Table 1

It’s fairly straightforward to use the shift() function in data.table to grab the information in the rows before and after each instance. My lagleadR02() function is based on shift(), but lets you customise the names of the new columns and respect breaks between files or between participants. This matters because you may want to avoid calculating probabilities or assigning bigrams for words that do not actually co-occur, such as the last word of one file and the first word of the next. Having a function is helpful if you need several kinds of ‘shifting’ while working with your corpus.
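To see what grouping does, here is a minimal sketch that uses shift() on its own. The data is a cut-down version of Table 1, and the column names prev_all, prev_in_text and next_in_text are only illustrative, not part of the function.

```r
library(data.table)

# Toy data (cut down from Table 1): one row per word, two files
dt <- data.table(
  word = c("tamat", "ngalep", "yabai", "main", "ngahau"),
  text = c("sb14.eaf", "sb14.eaf", "sb14.eaf", "sb15.eaf", "sb15.eaf")
)

# No grouping: the previous word is taken across the file boundary,
# so "main" (first word of sb15.eaf) gets "yabai" from sb14.eaf
dt[, prev_all := shift(word, 1, type = "lag")]

# Grouped by file: the first word of each file gets NA instead
dt[, prev_in_text := shift(word, 1, type = "lag"), by = text]

# type = "lead" looks forward; the last word of each file gets NA
dt[, next_in_text := shift(word, 1, type = "lead"), by = text]

dt[]
```

The only difference between prev_all and prev_in_text is the row where a new file starts, which is exactly the case the grouping arguments are meant to handle.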

DIY Concordancer for Matukar Panau data: after running the concordancer code, where w_pre1 is the first word before the token, w_post2 is the second word after the token, etc., with breaks (groups) for different files/texts

utterance	w_pre3	w_pre2	w_pre1	word	w_post1	w_post2	w_post3	index	speaker	text
Tamat	NA	NA	NA	tamat	ngalep	yabai	ngalap	1	KG	sb14.eaf
ngalep yabai	NA	NA	tamat	ngalep	yabai	ngalap	tamadi	2	KG	sb14.eaf
ngalep yabai	NA	tamat	ngalep	yabai	ngalap	tamadi	main	3	KG	sb14.eaf
ngalap tamadi main	tamat	ngalep	yabai	ngalap	tamadi	main	NA	4	BC	sb14.eaf
ngalap tamadi main	ngalep	yabai	ngalap	tamadi	main	NA	NA	5	BC	sb14.eaf
ngalap tamadi main	yabai	ngalap	tamadi	main	NA	NA	NA	6	BC	sb14.eaf
main ngahau	NA	NA	NA	main	ngahau	NA	NA	7	KA	sb15.eaf
main ngahau	NA	NA	main	ngahau	NA	NA	NA	8	KA	sb15.eaf

Table 2

The same table but with breaks (groupings) for both text and speaker columns

utterance	w_pre3	w_pre2	w_pre1	word	w_post1	w_post2	w_post3	index	speaker	text
Tamat	NA	NA	NA	tamat	ngalep	yabai	NA	1	KG	sb14.eaf
ngalep yabai	NA	NA	tamat	ngalep	yabai	NA	NA	2	KG	sb14.eaf
ngalep yabai	NA	tamat	ngalep	yabai	NA	NA	NA	3	KG	sb14.eaf
ngalap tamadi main	NA	NA	NA	ngalap	tamadi	main	NA	4	BC	sb14.eaf
ngalap tamadi main	NA	NA	ngalap	tamadi	main	NA	NA	5	BC	sb14.eaf
ngalap tamadi main	NA	ngalap	tamadi	main	NA	NA	NA	6	BC	sb14.eaf
main ngahau	NA	NA	NA	main	ngahau	NA	NA	7	KA	sb15.eaf
main ngahau	NA	NA	main	ngahau	NA	NA	NA	8	KA	sb15.eaf

Table 3

Here’s the function:

lagleadR02 <- function(DT, V1, lab, NUM, GRP1 = NULL, GRP2 = NULL) {
  # Arguments: DT = data.table; V1 = name of the column to concordance; lab = prefix for the new column names;
  # NUM = how many words before/after; GRP1, GRP2 = optional by-group columns.
  # NULL means a group does not need to be specified; up to 2 by-group vars are possible with this function.
  require("data.table") # need the data.table package for this to work
  pcols <- as.character(seq(from = NUM, to = 1, by = -1)) # get a sequence of numbers, high to low
  panscols <- paste(lab, pcols, sep = "_pre") # create variable names for preceding words
  fcols <- as.character(seq(from = 1, to = NUM, by = 1)) # get a sequence of numbers, low to high
  fanscols <- paste(lab, fcols, sep = "_post") # create variable names for following words
  if (is.null(GRP1) && is.null(GRP2)) { # if both by-group vars are NULL (unspecified/missing)
    # preceding words; data.table syntax: (names of new columns) := shift(...); get() needed because V1 is a character string
    DT[, (panscols) := shift(get(V1), NUM:1, type = "lag")]
    DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead")] # following words
  } else if (is.null(GRP2)) { # if just one by-group variable
    # adding the by argument; get() needed because GRP1 is a character string
    DT[, (panscols) := shift(get(V1), NUM:1, type = "lag"), by = list(get(GRP1))]
    DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead"), by = list(get(GRP1))]
  } else { # if both group variables are used
    DT[, (panscols) := shift(get(V1), NUM:1, type = "lag"), by = list(get(GRP1), get(GRP2))]
    DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead"), by = list(get(GRP1), get(GRP2))]
  }
  return(DT[]) # the trailing [] makes sure the result prints when the function is called at the console
}

#Data to produce table 1

library(data.table)
corpus <- data.table(
  utterance = c("Tamat", "ngalep yabai", "ngalep yabai", "ngalap tamadi main", "ngalap tamadi main", "ngalap tamadi main", "main ngahau", "main ngahau"),
  word = c("tamat", "ngalep", "yabai", "ngalap", "tamadi", "main", "main", "ngahau"), # lowercase, as in Table 1
  index = 1:8,
  speaker = c("KG", "KG", "KG", "BC", "BC", "BC", "KA", "KA"),
  text = c("sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb15.eaf", "sb15.eaf"),
  stringsAsFactors = FALSE
)
corpus # see data.table

Running the function

lagleadR02(corpus, "word", "w", 3, "text") # produces second table example

lagleadR02(corpus, "word", "w", 3, "text", "speaker") # produces third table example
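
Two things are worth knowing once the context columns exist. First, := changes the table by reference and appends the new columns on the right, so the column order will differ from the tables above unless you rearrange it. Second, the context columns can be used directly for KWIC lines or bigram counts. The following is only a sketch under some assumptions: it rebuilds one word of context with shift() directly so that it runs on its own, it leaves out the utterance column, and the search term "main" and the names kwic, line, bigrams, w1 and w2 are illustrative choices.

```r
library(data.table)

# Toy corpus with the same words as Table 1 (utterance column left out)
corpus <- data.table(
  word    = c("tamat", "ngalep", "yabai", "ngalap", "tamadi", "main", "main", "ngahau"),
  index   = 1:8,
  speaker = c("KG", "KG", "KG", "BC", "BC", "BC", "KA", "KA"),
  text    = c(rep("sb14.eaf", 6), rep("sb15.eaf", 2))
)

# One word of context on either side, without crossing file boundaries
corpus[, w_pre1  := shift(word, 1, type = "lag"),  by = text]
corpus[, w_post1 := shift(word, 1, type = "lead"), by = text]

# Move the context columns next to the keyword; the other columns follow
setcolorder(corpus, c("w_pre1", "word", "w_post1"))

# KWIC-style lines for one search term, with NA context shown as empty
kwic <- corpus[word == "main",
               .(text,
                 line = trimws(paste0(fifelse(is.na(w_pre1), "", w_pre1),
                                      " [", word, "] ",
                                      fifelse(is.na(w_post1), "", w_post1))))]

# Bigram counts; pairs that would straddle a file break have NA and are dropped
bigrams <- corpus[!is.na(w_post1), .N, by = .(w1 = word, w2 = w_post1)]
```

Because the shift was grouped by text, the two adjacent instances of main (the last word of sb14.eaf and the first word of sb15.eaf) are not counted as a bigram.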
