• June - July
    • Efficiently store 100+ GB of data
    • Handle JavaScript-based navigation
    • Additional libraries used:
      • Selenium with PhantomJS
      • zipfile, tqdm
• May - June
    • Prioritize low-hanging fruit
    • Python libraries used:
      • BeautifulSoup4 with requests
      • pandas, os, sys

Official CJARS Site

Project Overview

        The Criminal Justice Administrative Records System (CJARS) is a new data infrastructure initiative at the University of Michigan in cooperation with partners at the U.S. Census Bureau. As a long-term project, CJARS is working to create a nationally integrated repository of data that tracks each individual arrestee through every criminal episode, from arrest through discharge, with the explicit goal of supporting public administration and program evaluation. In turn, CJARS will compile aggregated statistical reports for the participating agencies, providing a more holistic context for understanding both the precursors to involvement in the criminal justice system and the implications that reach beyond criminal justice data.

        In effect, CJARS' partnership with the U.S. Census Bureau allows criminal justice records to be merged with external non-criminal datasets including, but not limited to, information on arrestees' employment history, earnings, use of government programs, and family status. However, the coordination between these two institutions does not mean that the statistical reports prepared by CJARS will contain private information on individuals. Instead, by relying on probabilistic record linkage to achieve identity resolution and episode resolution, CJARS compiles anonymized statistical reports that account for the compositional differences between demographic groups in comparable terms.

        I am very thankful for the opportunity to have worked for CJARS as their Research Programmer since May 2017, and I am truly grateful for the incredible learning experiences that my three primary responsibilities have provided:

1) To develop and maintain a system of web crawlers that retrieve data on defendants and offenders from publicly available criminal justice databases.

2) To create a physical archive of the raw data, comprising the original HTML pages and tabular data in CSV format.

3) To implement an automated system for annually retrieving and updating CJARS datasets.
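
As a rough illustration of the second responsibility, the sketch below stores raw HTML pages in a compressed ZIP archive and serializes parsed records to CSV. It is a minimal sketch rather than the CJARS implementation: the function names, the record IDs, and the field names are all hypothetical, and only the Python standard library is used.

```python
import csv
import io
import zipfile


def archive_pages(zip_target, pages):
    """Write raw HTML pages into a compressed ZIP archive.

    `zip_target` is a file path or a binary file-like object.
    `pages` maps a (hypothetical) record ID to its raw HTML string.
    Mode "w" starts a fresh archive, so this assumes one archive per batch.
    """
    with zipfile.ZipFile(zip_target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for record_id, html in pages.items():
            # writestr encodes str data as UTF-8 before compressing it.
            archive.writestr(f"{record_id}.html", html)


def compile_csv(rows, fieldnames):
    """Serialize parsed records (a list of dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the raw pages compressed alongside the compiled CSV means the tabular data can be rebuilt from the original HTML if a parsing rule changes later.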


As of the end of July:

  • HTML Crawler
    • Completed 3 agencies
    • 5 more crawlers deployed
  • CSV Compiler
    • Finished with 2 agencies
    • Remaining compilers on standby

Moving Forward

        Although BeautifulSoup was sufficient at first for handling simple HTML pages with a sequential URL for each offender, I was soon challenged with building crawlers for more complicated sites that relied on JavaScript. As a result, I began learning the browser automation framework Selenium. When I return to CJARS this upcoming Fall, I will begin migrating the existing Selenium crawlers to the more powerful Scrapy web crawling framework. Lastly, once I am sufficiently comfortable working with Scrapy, I hope to develop a Django web application for automating data retrieval and updates.
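
The simple case of one sequential URL per offender can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the actual crawler: the URL template, the ID range, and the assumption that the data sits in `<td>` cells are all hypothetical, and the standard library's `html.parser` stands in for BeautifulSoup so that the example has no third-party dependencies. The `fetch` argument is any callable that returns a page's HTML, or `None` when the page is missing.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell on a page."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def crawl_sequential(url_template, first_id, last_id, fetch):
    """Visit one page per ID and return {id: [cell text, ...]}.

    IDs whose page is missing (fetch returns None) are skipped.
    """
    results = {}
    for record_id in range(first_id, last_id + 1):
        html = fetch(url_template.format(id=record_id))
        if html is None:
            continue
        parser = TableCellParser()
        parser.feed(html)
        parser.close()
        results[record_id] = [cell.strip() for cell in parser.cells]
    return results
```

In practice `fetch` could wrap an HTTP client such as `requests`, while a dictionary's `get` method is enough for offline testing. Pages that build their content with JavaScript would not work with this approach, which is where a browser automation tool comes in.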