Build Probability and Surprisal Model for English Data

This script builds language models to compute probability and surprisal based on n-gram counts.
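
As a rough illustration of the idea, the sketch below estimates bigram probabilities from raw counts and converts them to surprisal in bits. It is not the script itself: the function names, the `<s>`/`</s>` padding symbols, the unsmoothed maximum-likelihood estimate, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        # Assumed padding symbols marking sentence start and end.
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def bigram_probability(prev, word, context_counts, bigram_counts):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def surprisal(prob):
    """Surprisal in bits: -log2(P). Unseen events get infinite surprisal."""
    return math.inf if prob == 0 else -math.log2(prob)


# Toy corpus, made up for illustration only.
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
contexts, bigrams = train_bigram_counts(corpus)
p = bigram_probability("the", "dog", contexts, bigrams)
print(p, surprisal(p))
```

Without smoothing, any unseen bigram has probability zero and infinite surprisal, so a real model would usually add a smoothing or back-off step.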

Build Text Classification Models (LSTM, BERT, etc.) for English data

This script builds several machine learning models to classify documents into pre-set categories. We use Google Colab for the script, which allows us to use a GPU instead of a CPU.

Create GloVe Model for Korean data

This script creates a GloVe model to analyze semantic similarities between Korean words. We use Google Colab for the script, which allows us to use a GPU instead of a CPU.

Create Word2Vec Model for English data

This script creates a Word2Vec model to analyze semantic similarities between English words.

Create Word2Vec Model for Korean data

This script creates a Word2Vec model to analyze semantic similarities between Korean words. We use Google Colab for the script, which allows us to use a GPU instead of a CPU.

Create a Korean-to-English Translator

This script creates a Korean-to-English translator using a Hugging Face resource.

Finetune BERT Models for Korean data

This script fine-tunes BERT models for sentiment analysis, summarization, mask prediction, text generation, and question answering. We use Google Colab for the script, which allows us to use a GPU instead of a CPU.

Identify English VP-ellipsis and Gapping Instances

This Python script helps identify VP-ellipsis and Gapping instances in English data using benepar and spaCy.

Measure English Proficiency

This script outputs final proficiency z-scores for the English production data based on three measures: (a) morpho-syntactic complexity (verbal density), (b) lexical complexity (Moving-Average Type-Token Ratio), and (c) morphological/syntactic/lexical accuracy (pre-coded by human an...
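
For the lexical complexity measure, a Moving-Average Type-Token Ratio can be sketched as follows. This is only an illustration under assumptions: the function name `mattr`, the default window of 50 tokens, and the fallback to a plain type-token ratio for short texts are choices made here, not details taken from the script.

```python
def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio over a list of tokens.

    The window size of 50 is an assumed default, not the script's setting.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    if not tokens:
        return 0.0
    # Assumption: texts shorter than the window fall back to a plain TTR.
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    # TTR of every window of consecutive tokens, then the mean of those values.
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Toy example: windows are (a, a), (a, b), (b, b).
toy_score = mattr(["a", "a", "b", "b"], window=2)
print(toy_score)
```

Averaging over fixed-size windows keeps the score comparable across texts of different lengths, which a plain type-token ratio does not.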

Move Files from One Directory to Another

This script moves multiple files that match a certain criterion from one directory to another.
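
A minimal sketch of this kind of move, assuming the criterion is a file-name glob pattern such as `*.txt`. The function name, the pattern, and the demo directory names are hypothetical; the actual script may select files differently.

```python
import shutil
import tempfile
from pathlib import Path


def move_matching_files(src_dir, dst_dir, pattern="*.txt"):
    """Move files in src_dir whose names match pattern into dst_dir."""
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the target if needed
    moved = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():  # skip directories that happen to match
            target = dst / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved


# Demo in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    incoming = Path(tmp) / "incoming"
    incoming.mkdir()
    (incoming / "a.txt").write_text("one")
    (incoming / "b.csv").write_text("two")
    moved = move_matching_files(incoming, Path(tmp) / "texts", "*.txt")
    moved_names = [p.name for p in moved]
    remaining_names = sorted(p.name for p in incoming.iterdir())

print(moved_names, remaining_names)
```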

Split CSV to Multiple Text Files

This script splits a large CSV dataset into multiple text files based on the first column. The name of each file comes from the first column, and the content of each file comes from all the cells in the second column.
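
The split can be sketched roughly as below. Several details are assumptions made for the example rather than facts about the script: the CSV has a header row, rows that share a first-column value are written to the same file with one cell per line, and the first-column values are safe to use as file names.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_csv_to_text_files(csv_path, out_dir, has_header=True):
    """Write one .txt file per distinct first-column value (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    grouped = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the assumed header row
        for row in reader:
            if len(row) < 2:
                continue  # ignore rows without a second column
            grouped[row[0]].append(row[1])
    written = []
    for name, cells in grouped.items():
        target = out / f"{name}.txt"
        target.write_text("\n".join(cells) + "\n", encoding="utf-8")
        written.append(target)
    return written


# Demo with a made-up three-row CSV in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "data.csv"
    csv_file.write_text("id,text\ndoc1,hello\ndoc2,world\ndoc1,again\n", encoding="utf-8")
    files = split_csv_to_text_files(csv_file, Path(tmp) / "out")
    contents = {p.name: p.read_text(encoding="utf-8") for p in files}

print(contents)
```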

Tag English Words for Part-of-Speech

This script marks up each word in a text as corresponding to a particular part of speech.

Topic Modeling for English Data

This script builds a semantic network and classifies documents by topic using the Latent Dirichlet Allocation (LDA) algorithm. We use Google Colab for the script.

Train a GPT-2 Text Generating Model for English Data

This script creates a GPT-2 text generating model. We use Google Colab for the script, which allows us to use a GPU instead of a CPU.

Train a Sentence Generating Model for Korean Data

This script fine-tunes a BERT model to generate sentences using the Naver Sentiment Movie Corpus (NSMC). We use Google Colab for the script, which allows us to use a GPU instead of a CPU.

Use a Pre-trained GloVe Model for English data

This script analyzes semantic similarities between English words by using a pre-trained GloVe model. We use Google Colab for the script, which allows us to use a GPU instead of a CPU.
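
To illustrate what a similarity query over GloVe vectors involves, the sketch below parses lines in the plain-text format used by the published pre-trained GloVe files (a word followed by space-separated floats) and compares words by cosine similarity. The function names are hypothetical, and the two-dimensional toy vectors are made up for the example; they are not real GloVe values, and the script may rely on a library instead of this hand-written version.

```python
import math


def parse_glove_lines(lines):
    """Map each word to its vector from 'word v1 v2 ...' text lines."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration; real files have 50 to 300 dimensions.
toy_lines = ["king 1.0 0.0", "queen 2.0 0.0", "apple 0.0 1.0"]
vectors = parse_glove_lines(toy_lines)
sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])
print(sim_king_queen, sim_king_apple)
```

With a real file, the same parser could be fed an open file handle, since iterating over a text file yields one line at a time.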