1 June 2021

In corpus linguistics, it is often helpful to create your own keyword-in-context (KWIC) displays or to keep track of the words before or after the tokens you are interested in. My corpus data is usually tabular, so my preferred way of working is with data.table, a helpful R package.

Imagine a spreadsheet of data where every word in a corpus is a row (or check out my Matukar Panau corpus at PARADISEC or ELAR):

DIY Concordancer for Matukar Panau data: initial data where ‘w’ is word.

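| utterance          | word   | index | speaker | text     |
|--------------------|--------|-------|---------|----------|
| Tamat              | tamat  | 1     | KG      | sb14.eaf |
| ngalep yabai       | ngalep | 2     | KG      | sb14.eaf |
| ngalep yabai       | yabai  | 3     | KG      | sb14.eaf |
| ngalap tamadi main | ngalap | 4     | BC      | sb14.eaf |
| ngalap tamadi main | tamadi | 5     | BC      | sb14.eaf |
| ngalap tamadi main | main   | 6     | BC      | sb14.eaf |
| main ngahau        | main   | 7     | KA      | sb15.eaf |
| main ngahau        | ngahau | 8     | KA      | sb15.eaf |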
Table 1

It’s fairly straightforward to use the shift() function in data.table to grab the information in the rows before and after each instance. My lagleadR() function is based on shift(), but it lets you customise the names of the new columns and set breaks between files or participants, so that words are not treated as neighbours across a file or speaker boundary. This matters because you may want to avoid calculating probabilities or assigning bigrams for words that do not actually co-occur. Having a function is helpful if you need several kinds of ‘shifting’ while working with your corpus.
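As a minimal sketch of the mechanism described above (not the full function), the snippet below builds the new column names from a prefix and then fills them with shift(). The toy data and the names dt, n, lab, pre_cols and post_cols are assumptions made for illustration only.

```r
library(data.table)

# Toy data, one word per row (illustrative only)
dt <- data.table(word = c("tamat", "ngalep", "yabai", "ngalap", "tamadi"))

n   <- 2    # words of context on each side (illustrative)
lab <- "w"  # prefix for the new column names (illustrative)

pre_cols  <- paste0(lab, "_pre", n:1)   # "w_pre2" "w_pre1"
post_cols <- paste0(lab, "_post", 1:n)  # "w_post1" "w_post2"

# shift() with a vector of offsets returns a list, one element per offset,
# which := assigns to the matching column names
dt[, (pre_cols)  := shift(word, n:1, type = "lag")]
dt[, (post_cols) := shift(word, 1:n, type = "lead")]

print(dt)
```

Rows near the start or end of the data have no neighbour at some offsets, so shift() fills those cells with NA.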

DIY Concordancer for Matukar Panau data: output of the concordancer code, where w_pre1 is the first word before the token, w_post2 is the second word after the token, etc., with breaks (groups) for different files/texts

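| utterance          | w_pre3 | w_pre2 | w_pre1 | word   | w_post1 | w_post2 | w_post3 | index | speaker | text     |
|--------------------|--------|--------|--------|--------|---------|---------|---------|-------|---------|----------|
| tamat              | NA     | NA     | NA     | tamat  | ngalep  | yabai   | ngalap  | 1     | KG      | sb14.eaf |
| ngalep yabai       | NA     | NA     | tamat  | ngalep | yabai   | ngalap  | tamadi  | 2     | KG      | sb14.eaf |
| ngalep yabai       | NA     | tamat  | ngalep | yabai  | ngalap  | tamadi  | main    | 3     | KG      | sb14.eaf |
| ngalap tamadi main | tamat  | ngalep | yabai  | ngalap | tamadi  | main    | NA      | 4     | BC      | sb14.eaf |
| ngalap tamadi main | ngalep | yabai  | ngalap | tamadi | main    | NA      | NA      | 5     | BC      | sb14.eaf |
| ngalap tamadi main | yabai  | ngalap | tamadi | main   | NA      | NA      | NA      | 6     | BC      | sb14.eaf |
| main ngahau        | NA     | NA     | NA     | main   | ngahau  | NA      | NA      | 7     | KA      | sb15.eaf |
| main ngahau        | NA     | NA     | main   | ngahau | NA      | NA      | NA      | 8     | KA      | sb15.eaf |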
Table 2

The same table but with breaks (groupings) for both text and speaker columns

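| utterance          | w_pre3 | w_pre2 | w_pre1 | word   | w_post1 | w_post2 | w_post3 | index | speaker | text     |
|--------------------|--------|--------|--------|--------|---------|---------|---------|-------|---------|----------|
| tamat              | NA     | NA     | NA     | tamat  | ngalep  | yabai   | NA      | 1     | KG      | sb14.eaf |
| ngalep yabai       | NA     | NA     | tamat  | ngalep | yabai   | NA      | NA      | 2     | KG      | sb14.eaf |
| ngalep yabai       | NA     | tamat  | ngalep | yabai  | NA      | NA      | NA      | 3     | KG      | sb14.eaf |
| ngalap tamadi main | NA     | NA     | NA     | ngalap | tamadi  | main    | NA      | 4     | BC      | sb14.eaf |
| ngalap tamadi main | NA     | NA     | ngalap | tamadi | main    | NA      | NA      | 5     | BC      | sb14.eaf |
| ngalap tamadi main | NA     | ngalap | tamadi | main   | NA      | NA      | NA      | 6     | BC      | sb14.eaf |
| main ngahau        | NA     | NA     | NA     | main   | ngahau  | NA      | NA      | 7     | KA      | sb15.eaf |
| main ngahau        | NA     | NA     | main   | ngahau | NA      | NA      | NA      | 8     | KA      | sb15.eaf |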
Table 3
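To illustrate why the breaks matter for bigrams, here is a rough sketch that compares the following word computed with and without groups, using the same eight tokens as the tables. It calls shift() directly instead of the function, and the names toks, next_any, next_grp, spurious and bigrams are made up for this example.

```r
library(data.table)

# Same eight tokens as in the tables (one word per row)
toks <- data.table(
  word    = c("tamat", "ngalep", "yabai", "ngalap", "tamadi", "main", "main", "ngahau"),
  speaker = c("KG", "KG", "KG", "BC", "BC", "BC", "KA", "KA"),
  text    = c(rep("sb14.eaf", 6), rep("sb15.eaf", 2))
)

# Following word with no breaks vs. breaks at every text/speaker change
toks[, next_any := shift(word, 1, type = "lead")]
toks[, next_grp := shift(word, 1, type = "lead"), by = .(text, speaker)]

# Bigrams that only exist because they straddle a break
spurious <- toks[!is.na(next_any) & is.na(next_grp), .(word, next_any)]
print(spurious)

# Bigram counts restricted to words adjacent within the same text and speaker
bigrams <- toks[!is.na(next_grp), .N, by = .(word, next_grp)]
print(bigrams)
```

Without breaks, the pairs that cross a speaker or file boundary would be counted as bigrams even though the two words were never produced together.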

Here’s the function:

# lagleadR02: add columns holding the NUM words before and after each token.
#   DT    data.table with one token per row (modified by reference)
#   V1    name of the column to concordance, as a string
#   lab   prefix for the new column names
#   NUM   how many words before/after to keep
#   GRP1, GRP2  optional by-group columns, as strings; NULL means not specified
#               (up to two by-group variables are possible with this function)
lagleadR02 <- function(DT, V1, lab, NUM, GRP1 = NULL, GRP2 = NULL) {
  require("data.table") # need the data.table package for this to work
  pcols <- as.character(seq(from = NUM, to = 1, by = -1)) # sequence of numbers, high to low
  panscols <- paste(lab, pcols, sep = "_pre") # variable names for preceding words
  fcols <- as.character(seq(from = 1, to = NUM, by = 1)) # sequence of numbers, low to high
  fanscols <- paste(lab, fcols, sep = "_post") # variable names for following words
  if (is.null(GRP1) && is.null(GRP2)) { # both by-group variables unspecified
    # data.table syntax: (names of new columns) := shift(...);
    # get() is needed because V1 is passed as a character string
    DT[, (panscols) := shift(get(V1), NUM:1, type = "lag")] # preceding words
    DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead")] # following words
  } else {
    if (is.null(GRP2)) { # just one by-group variable
      # get() is needed in by= as well, because GRP1 is a character string
      DT[, (panscols) := shift(get(V1), NUM:1, type = "lag"), by = list(get(GRP1))]
      DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead"), by = list(get(GRP1))]
    } else { # both by-group variables used
      DT[, (panscols) := shift(get(V1), NUM:1, type = "lag"), by = list(get(GRP1), get(GRP2))]
      DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead"), by = list(get(GRP1), get(GRP2))]
    }
  }
  return(DT)
}

#Data to produce table 1

library(data.table)
corpus <- data.table(
  utterance = c("Tamat", "ngalep yabai", "ngalep yabai", "ngalap tamadi main", "ngalap tamadi main", "ngalap tamadi main", "main ngahau", "main ngahau"),
  word = c("Tamat", "ngalep", "yabai", "ngalap", "tamadi", "main", "main", "ngahau"),
  index = 1:8,
  speaker = c("KG", "KG", "KG", "BC", "BC", "BC", "KA", "KA"),
  text = c("sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb15.eaf", "sb15.eaf"),
  stringsAsFactors = FALSE
)
corpus # see data.table

Running the function

lagleadR02(corpus, "word", "w", 3, "text") # produces second table example

lagleadR02(corpus, "word", "w", 3, "text", "speaker") # produces third table example
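Once the context columns exist, they can be pasted together into a simple KWIC-style line. The following is only a sketch under some assumptions: it rebuilds a small stand-in table with shift() so that it runs on its own, and join_context() is a hypothetical helper written for this example, not part of data.table.

```r
library(data.table)

# Small stand-in for the corpus (one word per row, a single text)
kw <- data.table(
  word = c("tamat", "ngalep", "yabai", "ngalap", "tamadi", "main"),
  text = "sb14.eaf"
)

# Two words of context on either side, computed within each text
kw[, c("w_pre2", "w_pre1")   := shift(word, 2:1, type = "lag"),  by = text]
kw[, c("w_post1", "w_post2") := shift(word, 1:2, type = "lead"), by = text]

# Hypothetical helper: join context words row by row, dropping NAs at text edges
join_context <- function(...) {
  m <- do.call(cbind, list(...))
  apply(m, 1, function(r) paste(r[!is.na(r)], collapse = " "))
}

kw[, left  := join_context(w_pre2, w_pre1)]
kw[, right := join_context(w_post1, w_post2)]

# Right-align the left context so the keywords line up in a column
kw[, kwic := paste0(formatC(left, width = 15), "  [", word, "]  ", right)]

print(kw[, .(kwic)])
```

The width of 15 is an arbitrary choice for this toy data; for a real corpus it would need to be at least as wide as the longest left context.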