1 June 2021
Sometimes in corpus linguistics, it is very helpful to create your own KWIC (keyword in context) displays or to keep track of the words before or after the tokens you are interested in. My corpus data is usually tabular, so data.table, a helpful R package, is my preferred way of working.
Imagine a spreadsheet of data where every word in a corpus is a row (or check out my Matukar Panau corpus at PARADISEC or ELAR):
DIY Concordancer for Matukar Panau data: initial data, with one word per row in the word column (abbreviated to ‘w’ in the names of the new columns).
utterance | word | index | speaker | text |
Tamat | tamat | 1 | KG | sb14.eaf |
ngalep yabai | ngalep | 2 | KG | sb14.eaf |
ngalep yabai | yabai | 3 | KG | sb14.eaf |
ngalap tamadi main | ngalap | 4 | BC | sb14.eaf |
ngalap tamadi main | tamadi | 5 | BC | sb14.eaf |
ngalap tamadi main | main | 6 | BC | sb14.eaf |
main ngahau | main | 7 | KA | sb15.eaf |
main ngahau | ngahau | 8 | KA | sb15.eaf |
It’s fairly straightforward to use the shift() function in data.table to grab the information in the rows before and after each instance. My lagleadR() function is based on shift(), but lets you customise the names of the new columns and account for breaks between files or between participants. This matters because you may want to avoid calculating probabilities or assigning bigrams for words that do not actually co-occur, such as the last word of one file and the first word of the next. Having a function is helpful if you need several kinds of ‘shifting’ while working with your corpus.
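To make the lag/lead idea concrete, here is a minimal sketch of shift() on its own, assuming a toy vector of words taken from the table above. The object names (words, prev1, next1, prev_2_1) are made up for illustration.

```r
library(data.table)

# Toy vector of words (illustrative only)
words <- c("tamat", "ngalep", "yabai", "ngalap")

# type = "lag" looks backwards: each position gets the previous word
prev1 <- shift(words, n = 1, type = "lag")

# type = "lead" looks forwards: each position gets the next word
next1 <- shift(words, n = 1, type = "lead")

# Several offsets at once return a list, one element per offset
prev_2_1 <- shift(words, n = 2:1, type = "lag")

prev1  # NA "tamat" "ngalep" "yabai"
next1  # "ngalep" "yabai" "ngalap" NA
```

Positions with nothing before or after them are filled with NA, which is where the NA cells in the concordance tables come from.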
DIY Concordancer for Matukar Panau data: output of the concordancer code, where w_pre1 is the first word before the token, w_post2 is the second word after the token, etc., with breaks (groups) for different files/texts
utterance | w_pre3 | w_pre2 | w_pre1 | word | w_post1 | w_post2 | w_post3 | index | speaker | text |
tamat | NA | NA | NA | tamat | ngalep | yabai | ngalap | 1 | KG | sb14.eaf |
ngalep yabai | NA | NA | tamat | ngalep | yabai | ngalap | tamadi | 2 | KG | sb14.eaf |
ngalep yabai | NA | tamat | ngalep | yabai | ngalap | tamadi | main | 3 | KG | sb14.eaf |
ngalap tamadi main | tamat | ngalep | yabai | ngalap | tamadi | main | NA | 4 | BC | sb14.eaf |
ngalap tamadi main | ngalep | yabai | ngalap | tamadi | main | NA | NA | 5 | BC | sb14.eaf |
ngalap tamadi main | yabai | ngalap | tamadi | main | NA | NA | NA | 6 | BC | sb14.eaf |
main ngahau | NA | NA | NA | main | ngahau | NA | NA | 7 | KA | sb15.eaf |
main ngahau | NA | NA | main | ngahau | NA | NA | NA | 8 | KA | sb15.eaf |
The same table but with breaks (groupings) for both text and speaker columns
utterance | w_pre3 | w_pre2 | w_pre1 | word | w_post1 | w_post2 | w_post3 | index | speaker | text |
tamat | NA | NA | NA | tamat | ngalep | yabai | NA | 1 | KG | sb14.eaf |
ngalep yabai | NA | NA | tamat | ngalep | yabai | NA | NA | 2 | KG | sb14.eaf |
ngalep yabai | NA | tamat | ngalep | yabai | NA | NA | NA | 3 | KG | sb14.eaf |
ngalap tamadi main | NA | NA | NA | ngalap | tamadi | main | NA | 4 | BC | sb14.eaf |
ngalap tamadi main | NA | NA | ngalap | tamadi | main | NA | NA | 5 | BC | sb14.eaf |
ngalap tamadi main | NA | ngalap | tamadi | main | NA | NA | NA | 6 | BC | sb14.eaf |
main ngahau | NA | NA | NA | main | ngahau | NA | NA | 7 | KA | sb15.eaf |
main ngahau | NA | NA | main | ngahau | NA | NA | NA | 8 | KA | sb15.eaf |
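The difference between the two tables above comes down to the by argument in data.table. As a minimal sketch, assuming a cut-down version of the example data (the object toy and the columns prev_all and prev_grp are hypothetical names):

```r
library(data.table)

# Cut-down toy data: two speakers in one file (illustrative only)
toy <- data.table(
  word    = c("tamat", "ngalep", "yabai", "ngalap", "tamadi", "main"),
  speaker = c("KG", "KG", "KG", "BC", "BC", "BC"),
  text    = "sb14.eaf"
)

# No grouping: the previous word is taken across the speaker change
toy[, prev_all := shift(word, 1, type = "lag")]

# Grouped by text and speaker: the first word of each group gets NA
toy[, prev_grp := shift(word, 1, type = "lag"), by = .(text, speaker)]

toy[word == "ngalap", .(prev_all, prev_grp)]  # "yabai" versus NA
```

One thing to keep in mind is that by groups all rows sharing the same values, not just adjacent ones. If KG spoke again later in sb14.eaf, the last word of the first turn would be treated as preceding the first word of the later turn. Grouping on something like rleid(speaker) may be one way to get turn-level breaks, though I have not shown that here.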
Here’s the function:
lagleadR02 <- function(DT, V1, lab, NUM, GRP1 = NULL, GRP2 = NULL) {
  # Arguments: data.table, column to concordance, label for the new column names,
  # how many words before/after, and up to two optional by-group variables
  # (NULL means the grouping variable does not need to be specified)
  require("data.table") # need the data.table package for this to work
  pcols = as.character(seq(from = NUM, to = 1, by = -1)) # get a sequence of numbers, high to low
  panscols = paste(lab, pcols, sep = "_pre") # create variable names for preceding words
  fcols = as.character(seq(from = 1, to = NUM, by = 1)) # get a sequence of numbers, low to high
  fanscols = paste(lab, fcols, sep = "_post") # create variable names for following words
  if (is.null(GRP1) && is.null(GRP2)) { # if both by-group vars are NULL (unspecified/missing)
    # preceding words; get() is needed because V1 is passed as a character string
    DT[, (panscols) := shift(get(V1), NUM:1, type = "lag")]
    DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead")] # following words
  } else {
    if (is.null(GRP2)) { # if just one by-group variable
      # adding the by argument; get() is needed because GRP1 is a character string
      DT[, (panscols) := shift(get(V1), NUM:1, type = "lag"), by = list(get(GRP1))]
      DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead"), by = list(get(GRP1))]
    } else { # if both group variables are used
      DT[, (panscols) := shift(get(V1), NUM:1, type = "lag"), by = list(get(GRP1), get(GRP2))]
      DT[, (fanscols) := shift(get(V1), 1:NUM, type = "lead"), by = list(get(GRP1), get(GRP2))]
    }
  }
  return(DT)
}
#Data to produce table 1
library(data.table)
corpus <- data.table(
  utterance = c("Tamat", "ngalep yabai", "ngalep yabai", "ngalap tamadi main", "ngalap tamadi main", "ngalap tamadi main", "main ngahau", "main ngahau"),
  word = c("Tamat", "ngalep", "yabai", "ngalap", "tamadi", "main", "main", "ngahau"),
  index = 1:8,
  speaker = c("KG", "KG", "KG", "BC", "BC", "BC", "KA", "KA"),
  text = c("sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb14.eaf", "sb15.eaf", "sb15.eaf"),
  stringsAsFactors = FALSE
)
corpus # see data.table
Running the function
lagleadR02(corpus, "word", "w", 3, "text") # produces second table example
lagleadR02(corpus, "word", "w", 3, "text", "speaker") # produces third table example
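Once the context columns exist, they can be pasted together into a KWIC-style line. The following is only a sketch under some assumptions: the context columns are built with shift() directly so that the example runs on its own, they follow the same w_pre/w_post naming that lagleadR02() produces, and the object kwic and the column line are hypothetical names.

```r
library(data.table)

# Toy data; context columns named as lagleadR02() would name them,
# but built with shift() directly so this sketch is self-contained
kwic <- data.table(
  word = c("tamat", "ngalep", "yabai", "ngalap", "tamadi", "main"),
  text = "sb14.eaf"
)
kwic[, c("w_pre2", "w_pre1") := shift(word, 2:1, type = "lag"), by = text]
kwic[, c("w_post1", "w_post2") := shift(word, 1:2, type = "lead"), by = text]

# Replace NA with an empty string so the display lines stay readable
ctx <- c("w_pre2", "w_pre1", "w_post1", "w_post2")
kwic[, (ctx) := lapply(.SD, function(x) replace(x, is.na(x), "")), .SDcols = ctx]

# One KWIC line per token: left context, [keyword], right context
kwic[, line := trimws(paste(w_pre2, w_pre1, paste0("[", word, "]"), w_post1, w_post2))]

kwic[word == "yabai", line]  # "tamat ngalep [yabai] ngalap tamadi"
```

Because := modifies a data.table by reference, working on copy(corpus) is worth considering if the original table should stay untouched.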