Measure English Proficiency


Codes for Measuring English Proficiency by Computing Moving-Average Type-Token Ratio for Lexical Complexity and Verbal Density for Morpho-Syntactic Complexity and by Reading in Pre-coded Accuracy Data

Published by Haerim Hwang


  • This script outputs final proficiency z-scores for English production data based on three measures: (a) morpho-syntactic complexity (verbal density), (b) lexical complexity (Moving-Average Type-Token Ratio), and (c) morphological/syntactic/lexical accuracy (pre-coded by human annotators and then read in by the script).

    • Morpho-syntactic complexity was measured in terms of verbal density by dividing the number of finite verbs plus the number of nonfinite verbs (infinitives, gerunds, and participles) by the total number of T-units (see K.-S. Park, 2014, p. 157).

    • For lexical complexity, the Moving-Average Type-Token Ratio (MATTR; Covington & McFall, 2010) was computed by averaging the type-token ratio over every moving window of 15 consecutive words (a toy sketch of both complexity measures appears at the end of this list).

    • Morphological, syntactic, and lexical accuracy (or errors) were manually coded by two English native speakers. For example, morphological errors included errors in subject–verb agreement (e.g., The girl brush the teeth.), tense agreement (e.g., The bear woke up and say …), and adjectives (e.g., sleep for asleep). Syntactic errors included errors in the use of overt determiners (e.g., So Ø boy argued that the book is too close to her.) and voice (e.g., After that, the boy was waked and was afraid again.). Lexical errors involved non-target-like use of target forms with respect to their meaning or function, such as non-target-like use of lexical items (e.g., … she visited her mom and dad to stay with her.). For the full details about the error coding procedure, see K.-S. Park (2014, pp. 164–168).

    • Note that your pre-coded accuracy data should be saved as a CSV file. Click this file for an example.
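
    • To make the two complexity measures concrete, here is a minimal toy sketch; the verb and T-unit counts and the sample text are invented for illustration, and it assumes Python 3 with the lexicalrichness package installed (the full pipeline is in the Codes section below).

      ## Verbal density = (finite verbs + nonfinite verbs) / T-units
      finite_verbs, nonfinite_verbs, t_units = 4, 2, 3 # hypothetical counts
      verbal_density = (finite_verbs + nonfinite_verbs) / t_units # = 2.0

      ## MATTR: average type-token ratio over a moving window of 15 words
      from lexicalrichness import LexicalRichness
      toy_text = "the boy woke up and saw the bear and the bear saw the boy and ran away"
      print(verbal_density, LexicalRichness(toy_text).mattr(window_size = 15))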

  • Codes

    • Step 1: Import modules

      from __future__ import division # make / and /= perform true division (only relevant when running under Python 2)
      import string # string constants and helpers (e.g., string.punctuation)
      from collections import Counter # create frequency lists
      from operator import itemgetter # for sorting
      import glob # find and open all the files that match the rules which will be set by the user later
      import csv # allow Python to import or export spreadsheets and databases 
      from itertools import groupby # group categories and apply a function to the categories (https://www.kaggle.com/crawford/python-groupby-tutorial)
      from textblob import TextBlob # process text data
      import pandas as pd # analyze data
      from lexicalrichness import LexicalRichness # compute lexical richness (https://pypi.org/project/lexicalrichness/#description)
      import nltk # import the natural language toolkit
      import nltk_tgrep # tgrep2-style pattern searching over NLTK trees
      from nltk.tree import ParentedTree # create a subclass of tree that automatically maintains parent pointers for single-parented tree 
      from nltk.tokenize import sent_tokenize # tokenize text into sentences
      import benepar # constituency parser
      parser = benepar.Parser("benepar_en") # parse English sentences
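      ## Optional (assumption): on a first run, the NLTK models used in Step 3 and the
      ## benepar model loaded above may need a one-time download, e.g.:
      # nltk.download("punkt") # models behind sent_tokenize and word_tokenize
      # nltk.download("averaged_perceptron_tagger") # tagger used by nltk.pos_tag
      # benepar.download("benepar_en") # constituency-parser model for benepar.Parser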
      


    • Step 2: Split your CSV file into multiple text files by participant code (i.e., row[0]); you can download this CSV file as an example.

      for key, rows in groupby(csv.reader(open("proficiency_data.csv", encoding="utf-8-sig", errors="ignore")), 
                                lambda row: row[0]): # group the data based on the first column (i.e., participant)
          with open("data/%s.txt" % key, "w") as output: # under the "data" folder
              for row in rows: # iterate through rows
                  output.write(str(row[1]) + "\t" + str(row[2]) + "\n") # write the second column (utterance) and the third column (accuracy coding), separated by a tab
      
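
    • A note on Step 2: itertools.groupby only groups consecutive rows, so all rows for a participant should appear together in the CSV, and the "data" folder must already exist. If you want to try the split without the linked file, the sketch below builds a toy proficiency_data.csv with the same three-column layout (participant, utterance, accuracy coding); the rows are invented for illustration.

      import os, csv
      os.makedirs("data", exist_ok=True) # create the output folder used in Step 2
      toy_rows = [["L2_01", "The boy woke up and say hello.", "0"], # hypothetical participant, utterance, accuracy coding
                  ["L2_01", "Then the bear ran away.", "1"],
                  ["L2_02", "The girl brush the teeth.", "1"]]
      with open("proficiency_data.csv", "w", newline="", encoding="utf-8-sig") as f:
          csv.writer(f).writerows(toy_rows)
      ## After Step 2, data/L2_01.txt would contain one tab-separated line per utterance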


    • Step 3: Measure (a) the Moving-Average Type-Token Ratio and (b) verbal density

      ## Open text files under the working directory (e.g., L2 adult group)
      filenames = glob.glob("data/*.txt")
              
      output = [] 
      output_pos = []
              
      ## Set a punctuation list
      punct = [",",".","?","'",'"',"!",":",";","(",")","''","``","--"]
              
      for filename in filenames: # iterate through text files in the list
          text = open(filename, "r").read().split("\n") # open the file
              
          utterance_list = [] # create a holder for utterances
          accuracy_coding_list = [] # create a holder for accuracy codings
              
          utt_counter = 0
              
          ## MATTR, Accuracy
          for line in text[:-1]: # iterate through sentences in the text; the last line is blank
              
              line_sep = line.split("\t") # split each line by tab
              utterance_list.append(line_sep[0]) # first part: utterance
              accuracy_coding_list.append(line_sep[1]) # second part: accuracy coding
              utt_counter += 1
              
          ppt_code = filename[5:-4] # get the participant code from the file name ("data/<code>.txt")
              
              
          accuracy_int = [int(x) for x in accuracy_coding_list] # get an array of integers; one for each line of input
              
              
          accuracy = sum(accuracy_int)/len(accuracy_int) # sum of the integers divided by the number of codings
              
          utterances = ' '.join(utterance_list) # make the utterances into text
              
          #output.append([ppt_code, mattr, accuracy]) # append ppt_code, mattr, and accuracy to the "output" list
              
          ## Verbal density
          ## Count the number of verbs
          utterances_no_punct = utterances
          for z in punct: # remove every punctuation mark from the utterances
              utterances_no_punct = utterances_no_punct.replace(z, "")
              
          text_nltk = nltk.word_tokenize(utterances_no_punct) # segment the text into basic units (tokens) such as words
          text_pos = nltk.pos_tag(text_nltk) # assign a Penn Treebank part-of-speech tag to each token (the text must be tokenized first)
             
          verb = []
          verb_counter = 0 
              
          for i, y in enumerate(text_pos):
              #print(i, y)
              if (text_pos[i][1]=='VBD' or text_pos[i][1]=='VBZ') and i+1 < len(text_pos) and (text_pos[i+1][1]=='VBG' or text_pos[i+1][1]=='VBN'):
                  # skip an auxiliary VBD/VBZ that is immediately followed by VBG/VBN so that, e.g., "was sleeping" is counted once
                  #print(text_pos[i], text_pos[i+1])
                  continue
              elif (text_pos[i][1]=='VB') or (text_pos[i][1]=='VBG') or (text_pos[i][1]=='VBN') or (text_pos[i][1]=='VBP') or (text_pos[i][1]=='VBZ') or (text_pos[i][1]=='VBD'):
                  # if the second item in a tuple (pos tag) is one of the verb POS categories,
                  #print(text_pos[i])
                  #verb.append(text_pos[i][0]) # append the tuple to the verb; to check in case
                  verb_counter += 1 # add 1 to the counter    
              
          tunit = len(utterance_list) # number of T-units (one utterance line per T-unit)
              
          verbal_density = verb_counter/tunit
              
          lex = LexicalRichness(utterances) # instantiate a new text object from the participant's utterances (use the use_TextBlob=True argument to use the textblob tokenizer)
              
          mattr = lex.mattr(window_size = 15) # compute MATTR with a moving window of 15 words
              
          output.append([ppt_code, mattr, accuracy, verb_counter, tunit, verbal_density]) # put all info for this participant into the "output" list

      output = sorted(output, key=itemgetter(0)) # sort the "output" list based on the ppt_code
      print(output)
      
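
    • A quick check of the verb-tag logic in Step 3 (the sentence is a made-up example; it assumes the NLTK tokenizer and tagger models are installed):

      import nltk
      sample = nltk.pos_tag(nltk.word_tokenize("The boy was sleeping and woke up to eat"))
      print(sample) # tokens paired with Penn Treebank tags
      ## Tags starting with VB (VB, VBD, VBG, VBN, VBP, VBZ) are counted as verbs in Step 3,
      ## except that an auxiliary VBD/VBZ immediately followed by VBG/VBN is skipped,
      ## so a sequence like "was sleeping" is counted as one verb rather than two.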


    • Step 4: Convert raw scores to z-scores

      ## Convert raw_scores to z_scores
      df = pd.DataFrame(output, columns = ["participant", "MATTR", "accuracy", "verb", "tunit", "verbal density"]) # set headers for the data frame
              
      ## Leave necessary info only
      #df = df[["participant", "MATTR", "accuracy", "verbal density"]]
              
      ## Compute z-scores
      cols = list(df.columns) # get column names
      cols.remove('participant') # remove the "participant" column for computing z-scores
      cols.remove('verb') # remove the "verb" column for computing z-scores
      cols.remove('tunit') # remove the "tunit" column for computing z-scores
      df[cols] # check what columns now you have
              
      for col in cols: #iterate through columns
          col_zscore = col + '_zscore' # create column for z-scores
          df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0) # put zscores
              
      df['proficiency'] = df.iloc[:, df.columns.str.contains('_zscore')].sum(1) # sum z-scores to get a final proficiency score
      df # check the data frame
      
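
    • A self-contained toy illustration of the z-scoring and the final proficiency score in Step 4 (the values are made up; the column names mirror the data frame above):

      import pandas as pd
      toy = pd.DataFrame({"participant": ["L2_01", "L2_02", "L2_03"],
                          "MATTR": [0.70, 0.80, 0.75],
                          "accuracy": [0.60, 0.90, 0.75],
                          "verbal density": [1.5, 2.5, 2.0]})
      for col in ["MATTR", "accuracy", "verbal density"]:
          toy[col + "_zscore"] = (toy[col] - toy[col].mean())/toy[col].std(ddof=0) # z-score with the population SD, as in Step 4
      toy["proficiency"] = toy.filter(like="_zscore").sum(axis=1) # sum the three z-scores
      print(toy[["participant", "proficiency"]])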


    • Step 5: Save the output in CSV

      df.to_csv("output.csv", sep=',', encoding='utf-8')
      



  • When you use this script, please cite:
    Hwang, H. (2020). A contrast between VP-ellipsis and Gapping in English: L1 acquisition, L2 acquisition, and L2 processing (Unpublished doctoral dissertation). University of Hawai'i, Honolulu, HI.


  • References:
    Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics, 17, 94–100.

    Park, K.-S. (2014). Information structure and dative word-order alternations in English and Korean: L1 children, L2 children, and L2 adults (Unpublished doctoral dissertation). University of Hawai‘i at Mānoa, Honolulu, HI.