Statistical packages/software

NER Pipeline

One of the big projects I worked on at History Lab was Named Entity Recognition and Linking across our millions of documents. I set up a pipeline to train a spaCy model in Python on the particular characteristics of our documents, then created a Knowledge Base to identify specific entities within the documents and link them to their Wikidata IDs. For entities that share a name, I wrote a script that distinguishes between them based on the other entities mentioned in the same document.
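The disambiguation idea can be illustrated with a minimal Python sketch. This is not the code from the repository: the knowledge base contents, the placeholder IDs, and the `disambiguate` function are hypothetical, and the scoring rule (counting shared co-occurring entities) is just one simple way to use document context.

```python
# Hypothetical mini knowledge base. The IDs are placeholders, not real
# Wikidata IDs, and the "related" sets are invented for illustration.
KNOWLEDGE_BASE = {
    "John Smith": [
        {"id": "Q_PLACEHOLDER_1",
         "related": {"Virginia", "Jamestown", "Pocahontas"}},
        {"id": "Q_PLACEHOLDER_2",
         "related": {"Labour Party", "House of Commons", "Tony Blair"}},
    ],
}


def disambiguate(name, doc_entities, kb=KNOWLEDGE_BASE):
    """Pick the candidate whose related entities overlap most with the document.

    Returns the candidate's ID, or None when the name is unknown or when no
    candidate shares any entity with the document.
    """
    candidates = kb.get(name, [])
    if not candidates:
        return None
    # Context = every other entity mentioned in the same document.
    context = set(doc_entities) - {name}
    best = max(candidates, key=lambda c: len(c["related"] & context))
    if not (best["related"] & context):
        return None  # no evidence either way, so decline to link
    return best["id"]


if __name__ == "__main__":
    mentions = ["John Smith", "Tony Blair", "House of Commons"]
    print(disambiguate("John Smith", mentions))
```

Returning None when there is no overlap keeps the sketch from guessing; a real pipeline might instead fall back on a prior such as entity popularity.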

The repository with the different scripts is available on GitHub.

HLSTM

At History Lab, we used the Structural Topic Model package for R to run topic modeling on all our collections. I wrote some functions and created an R package to make it easier to run the analysis across all our corpora.

Note that these functions are largely wrappers for functions already in the STM package.

Causal Mediation Analysis

Mediation for Stata estimates the role of particular causal mechanisms that mediate a relationship between treatment and outcome variables. The command calculates causal mediation effects and direct effects for models with continuous or binary dependent variables using the methods presented in Imai et al. (2010). It also performs sensitivity analyses for the mediation effects, which are necessary because the mediating variable is not randomly assigned. Our package replaces earlier approaches such as the “Baron-Kenny” method and the “Sobel test”. For the case of continuous mediator and outcome variables, it produces results identical to these earlier methods, but now placed within a causal inference framework and accompanied by sensitivity analyses for the key identification assumption. For models with binary mediators or outcomes, the package implements a correct calculation of mediation effects that takes into account the use of non-linear models such as probit.
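For the continuous case, the link to the older product-of-coefficients approach can be written out. The notation below is the standard linear structural equation setup with no treatment-mediator interaction; the symbols are chosen here for illustration and are not taken from the package documentation.

```latex
\begin{align}
  M_i &= \alpha_2 + \beta_2 T_i + \varepsilon_{i2} \\
  Y_i &= \alpha_3 + \beta_3 T_i + \gamma M_i + \varepsilon_{i3} \\[4pt]
  \text{average causal mediation effect} &= \beta_2 \gamma \\
  \text{average direct effect} &= \beta_3 \\
  \text{total effect} &= \beta_2 \gamma + \beta_3 \\[4pt]
  \rho &= \operatorname{Corr}(\varepsilon_{i2}, \varepsilon_{i3})
\end{align}
```

Here T is the treatment, M the mediator, and Y the outcome. The product of coefficients only has a causal interpretation if the two error terms are uncorrelated (ρ = 0), which is the assumption the sensitivity analysis varies.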

The package is available from SSC and can be installed in Stata by typing ssc install mediation.

Stata utilities

I have written several utility commands for Stata. While I was at Princeton helping political scientists merge different datasets, I got tired of trying to keep track of different country coding schemes. So I wrote ccode to translate between them: IMF, World Bank, Correlates of War, Banks Cross National Time Series, and country name. I wrote ctyfind to look up a country name from one of the classification codes, or vice versa. For scholars who use Dropbox, I wrote dropbox, which looks for the Dropbox directory on a computer. Because different collaborators keep Dropbox in different locations, the command was designed to make it easier to share do-files.
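The idea behind a country-code translator can be sketched in Python as a lookup in a crosswalk table. This is only an illustration: it does not reflect ccode's actual syntax or data, the `translate` function and column names are hypothetical, and the three sample rows are hand-entered, so they should be checked against the official code lists before any real use. Banks Cross National Time Series codes are left out.

```python
# Illustrative crosswalk: one row per country, one column per coding scheme.
# "wb" = World Bank code, "cow" = Correlates of War number, "imf" = IMF code.
CROSSWALK = [
    {"name": "United States", "wb": "USA", "cow": 2, "imf": 111},
    {"name": "Canada", "wb": "CAN", "cow": 20, "imf": 156},
    {"name": "United Kingdom", "wb": "GBR", "cow": 200, "imf": 112},
]

SCHEMES = {"name", "wb", "cow", "imf"}


def translate(value, source, target, table=CROSSWALK):
    """Translate a country code from one scheme to another.

    Returns None when the value is not found in the source scheme, so that
    unmatched countries can be flagged before merging datasets.
    """
    if source not in SCHEMES or target not in SCHEMES:
        raise ValueError(f"unknown scheme: {source!r} or {target!r}")
    for row in table:
        if row[source] == value:
            return row[target]
    return None


if __name__ == "__main__":
    print(translate(2, "cow", "wb"))
    print(translate("CAN", "wb", "imf"))
```

Keeping every scheme as a column of a single table means any pair of schemes can be translated in one step, rather than maintaining a separate mapping for each pair.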

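To install any of these commands, put the files in the subdirectory of your PLUS directory that matches the first letter of the command, for example ado/plus/c for ccode and ctyfind, and ado/plus/d for dropbox. You can find the PLUS location by typing sysdir in Stata. You might have to create the letter subdirectory yourself.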