Algorithm Access for All: Information Processing Democratization via GitHub

Dr. Richard M. Crowley

rcrowley@smu.edu.sg / @prof_rmc / https://rmc.link/
Slides @ rmc.link/GitHub

Research question and background

Increasingly digitized information

Digitizing information has drastically increased the volume of information available to any one person in the market

Left image is from https://www.loc.gov/item/2016871394/

Increasing availability of Algorithms

  • Algorithmic access and processing is increasing over time
  • Technology barriers can be a limitation (Blankespoor et al. 2014; Gomez 2024)
    • Certain knowledge and resources required to implement

Why examine open source in the context of capital markets?

  • Open source software lowers the resources and knowledge needed to implement an algorithm
    • These algorithms can help with any part of the Blankespoor et al. (2020) framework
      • Awareness: Knowing that the data exists
      • Acquisition: Gather the data and extract information
      • Integration: Analyze the implications of the information

Left figure is from Cao et al. 2023 RFS

What we do

We examine the democratization of information processing technology, in the context of financial markets

  1. Does this democratization lead to increased usage of data shortly after release? And does it broaden the audience?
  2. What are the market consequences of this democratization?

Data and methods

  1. We leverage a unique data set of metadata on all GitHub projects through early 2021, covering 190 million projects
  2. We use a hybrid human + ML approach to identify all information processing projects for SEC filings
    • We find nearly 900 unique projects
  3. We hand code certain features of these projects (what they do, the data they access, author information, AI usage)

Empirics

  1. Examine the relationship between GitHub projects and machine downloads of SEC filings shortly after release
  2. Examine how the user base of SEC filings changes, especially among residential and smaller institution users
  3. Examine market movement: trading activity, liquidity, and stock price formation

Results

1 additional GitHub project leads to 4.3% more machine downloads

  • Identification: GitHub projects handling specific filing types \(\rightarrow\) downloads by filing type
  • Cross sectional evidence: Driven by high quality projects
  • Robust for various windows; speed tests; different samples; author groups
  • Staggered DiD approach: No pre-trend

Mechanism: Democratization of information processing technology

  • More users access filings after GitHub projects released
    • Concentrated among residential users and small/medium sized financial institutions

Market effects

Increased retail and algorithmic trade, improved liquidity and price formation; robust to a 2-stage approach

Contribution

  1. Introduce the concept of democratization of information processing technology
    • The spread of information processing technology by individuals
      • Decreases the cost for others to use such methods
    • Parallels “democratization of information” or “information liberalization”
  2. Add to the literature on investor processing constraints
    • Prior literature: Small investors face information disadvantages
    • Prior literature: Broad dissemination and intermediaries can help
    • We show that open source technologies can also help
      • These technologies enable some investors to access data themselves
  3. Broader set of technology relevant to the FinTech literature
    • Granular evidence on the types of algorithms available for information processing
    • Relevant to regulators as they discuss the impact of algorithms in financial markets

Background

Provision versus Processing Technology

Democratization of…

  • Information Provision
    • Message boards (Antweiler and Frank 2004; Das and Chen 2007)
    • Social media (Bartov et al. 2018; 2023)
    • Seeking Alpha (Chen et al. 2014; Farrell et al. 2022)
  • Information Processing Technology
    • Open source software
    • Other software
      • AI: ChatGPT as an increase in AI access (Chang et al. 2023)

Key differences

  • Similar flavor: Investors/others aiding investors to lower processing costs
  • Underlying theory is quite different
    • Democratization of info. provision: Provide important information in a transformed manner that is easier to process
    • Democratization of info. processing: Provide tools to aid investors in gathering the original disseminated information

GitHub

  • Launched publicly in April 2008
  • Allows developers to store, collaborate on, and disseminate their work
    • Repositories of code, scripts, and computer programs
  • Over 100 million developers
  • Over 280 million projects
  • Over 200k academic papers cite GitHub (Silver 2018)

What does a GitHub project look like?

  1. There is a description of the project under “About”
  2. The project has received multiple stars and forks
  3. The source code is accessible in the main panel
  4. A variety of other statistics on the project are also available

Anecdotal evidence

There is discussion online of users seeking tools to access SEC filings

Methods and GitHub project data

Human input + ML for classification

Work flow from King, Lam and Roberts 2017 AJPS (KLR)

  • Key idea is to iterate between human and machine classification
    • Gives a more refined classification than a dictionary
    • Avoids the large Type I error that classifiers have on sparse classification problems
    • Many possible ML classifiers to choose from
      • We use an ensemble of Naïve Bayes + logistic regression
  • Preprocessing steps
    1. Restricting to English language projects (using spacy-langdetect)
    2. Lemmatizing words (using spacy)
    3. Constructing both unigram and bigram sets for project descriptions
      • This is what we feed to the KLR algorithm
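
The third preprocessing step can be sketched as follows. This is a minimal illustration using only the standard library: the `tokenize` helper is a hypothetical stand-in for the spaCy lemmatization step (it lowercases and splits, but does not lemmatize), and the sample description is made up.

```python
import re


def tokenize(description: str) -> list[str]:
    """Lowercase a description and split it into word-like tokens.

    Hypothetical stand-in for the spaCy step: no lemmatization is applied.
    Hyphenated terms such as "10-k" are kept as a single token.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", description.lower())


def unigrams_and_bigrams(tokens: list[str]) -> set[str]:
    """Return the set of unigrams plus all adjacent-token bigrams."""
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams


# Made-up project description, for illustration only
features = unigrams_and_bigrams(tokenize("Download 10-K filings from SEC EDGAR"))
```

Keeping bigrams alongside unigrams lets a phrase like "sec edgar" act as its own feature, separate from the ambiguous single word "sec".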

Workflow: Applying KLR

  1. Construct an initial dictionary (“SEC”, “EDGAR”, “10-K”, “10-Q”, “8-K”)
  2. Pull all projects matching these keywords
  3. Hand classify into 2 sets: correct (“reference set”) and incorrect (remove from sample)
  4. Ask the ML algorithm to classify all remaining projects based on words and bigrams
  5. ML algorithm backs out new dictionary from classification
  6. We refine the list, removing terms that are too noisy
  7. Pull all remaining projects matching the new dictionary
  8. Have RAs classify these projects by hand
  • We additionally add other, less commonly targeted SEC filings as a second dictionary
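
The loop above can be sketched on toy data. This is an illustrative simplification, not the KLR algorithm itself: in place of the Naïve Bayes + logistic regression ensemble, it ranks terms by a smoothed log-odds score (a swapped-in technique), and the projects, the hand label, and the helper names `pull` and `propose_terms` are all hypothetical.

```python
import math
from collections import Counter

SEED = {"sec", "edgar", "10-k", "10-q", "8-k"}  # step 1: initial dictionary

# Hypothetical projects, each reduced to a set of description terms
projects = {
    "p1": {"sec", "filing", "parser"},
    "p2": {"edgar", "filing", "downloader"},
    "p3": {"10-k", "filing", "extractor"},
    "p4": {"8-k", "filing", "alerts"},
    "p5": {"sec", "football", "scores"},  # keyword match, wrong topic
    "p6": {"xbrl", "filing", "scraper"},
    "p7": {"football", "scores", "tracker"},
    "p8": {"image", "classifier"},
}


def pull(candidates, dictionary):
    """Steps 2 and 7: ids of projects whose terms overlap the dictionary."""
    return sorted(pid for pid, terms in candidates.items() if terms & dictionary)


def propose_terms(reference, others, exclude, top_n=1):
    """Stand-in for steps 4-5: rank terms by the smoothed log-odds of
    appearing in reference documents versus the remaining documents."""
    ref_df = Counter(t for doc in reference for t in doc)
    oth_df = Counter(t for doc in others for t in doc)
    scores = {}
    for term, count in ref_df.items():
        if term in exclude:
            continue
        p_ref = (count + 1) / (len(reference) + 2)
        p_oth = (oth_df[term] + 1) / (len(others) + 2)
        scores[term] = math.log(p_ref / p_oth)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


matched = pull(projects, SEED)                      # step 2
hand_rejected = {"p5"}                              # step 3: hypothetical human label
reference_ids = [p for p in matched if p not in hand_rejected]
remaining = {p: t for p, t in projects.items() if p not in matched}
new_terms = propose_terms(                          # steps 4-5 (stand-in)
    [projects[p] for p in reference_ids], list(remaining.values()), exclude=SEED
)
newly_flagged = pull(remaining, set(new_terms))     # step 7, sent to RAs in step 8
```

Here the human pass removes the false positive `p5`, the machine pass surfaces "filing" as a new term, and the second pull flags `p6`, which the seed dictionary alone would have missed.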

Total flagged projects: 7,430. After hand filtering and sample cut: 831.

What filings do the projects cover?

  • 557 are generic projects related to SEC filings
  • 207 target one or more filing types of interest
    • 168 target Form 10-K
    • 45 target Form 10-Q
    • 17 target Form 8-K
    • 13 or fewer target each of: 3, 4, 5, SC 13D, S-1

Projects over time

Steady increase in projects over time

Package authors

Most projects are by CS or academic sources

Use cases over time

All stages of the Blankespoor et al. (2020) framework are present

Sample projects

What types of libraries are used? (Python projects only)

There is a wide variety of underlying technology that is being disseminated

New or old technology?

Most projects use relatively old underlying technology, though some push technological boundaries each year

Empirical setup

Data and sample

  • EDGAR filing data
    • Downloads and IP addresses from the SEC EDGAR server logs on days [0, +1]
      • Machine downloads following Lee et al. (2015) and Cao et al. (2023)
      • Human downloads following Ryans (2017)
    • Size, fog, sentiment from WRDS SEC analytics
  • Observations need to have SEC EDGAR log data: March 2003 through June 2017
    • We exclude “lost or damaged” periods (see Loughran and McDonald 2017)
  • Match to WRDS SEC analytics, Compustat, and CRSP
    • CRSP share code 10 or 11; main exchanges only, $5 or higher share price
  • Filing types on GitHub by the end of 2016: 10-K, 10-Q, 13-F, 3, 4, 5, 8-K, or D
  • Additional filing types on GitHub by 2020 as controls: Schedule 13D and S-1

Final sample: 2,634,693 filings

Empirical strategy

  1. We compute the measure \(GitHub~Projects\) at the filing type level
    • E.g., the # of projects for processing Form 10-K as of the prior month
    • Identification comes from having 8 filing types with differently timed shocks
  2. We include form and year-month fixed effects
    • We additionally include firm fixed effects
  3. We cluster standard errors by firm and year-month
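
A minimal sketch of how such a filing-type-level count could be built, assuming the "as of the prior month" cutoff from the example above. The project records, dates, and the function name `github_projects` are hypothetical.

```python
from datetime import date

# Hypothetical project records: (creation date, set of targeted form types)
projects = [
    (date(2012, 3, 10), {"10-K"}),
    (date(2013, 6, 15), {"10-K", "10-Q"}),
    (date(2014, 3, 2), {"8-K"}),
]


def github_projects(form_type: str, filing_date: date) -> int:
    """Count projects targeting form_type that were created before the
    first day of the filing's month (i.e., as of the prior month)."""
    cutoff = filing_date.replace(day=1)
    return sum(
        1 for created, forms in projects if form_type in forms and created < cutoff
    )


count_june = github_projects("10-K", date(2013, 6, 20))  # second project not yet counted
count_july = github_projects("10-K", date(2013, 7, 1))   # second project now counted
```

Because each form type accumulates projects on its own schedule, the measure varies across form types within the same month, which is the variation the fixed effects leave in place.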

How to interpret this

Our regressions effectively measure the effect of increases in project counts across years within each form type, where the increase is not a common shock across all form types.

Univariate statistics

GitHub projects and machine downloads

Regression specification

\[ \begin{aligned} \log\left(\mathbb{E}\left(Machine~Downloads_{i,j,t} \mid IVs\right)\right) = {} & \alpha + \beta GitHub~Projects_{j,t} + \gamma Controls_{i,j,t} \\ & + \lambda_{firm} + \lambda_{form} + \lambda_{year,month} + \epsilon_{i,j,t} \end{aligned} \]

  • As machine downloads is a count, we use Poisson regression (Cohn et al. 2022 JFE)
    • Implement it as a PPML regression (Correia et al. 2020 SJ)
      • Poisson adjusted to handle sparsity and HDFE
  • Dependent variable: # of machine downloads of filing \(j\)
  • Independent variable: # of GitHub projects for filing \(j\)’s form type at end of prior year
  • Controls:
    • File measures: size, fog, LM sentiment
    • Human downloads
    • Time-varying firm characteristics: firm size, ROA, leverage, inst. own., growth, idiosyncratic volatility, segments, Tobin’s Q, analyst coverage, turnover, past return

Downloads (T3A & B)

| | Machine Downloads | Machine Downloads | Quality: Star | Quality: Fork |
|---|---|---|---|---|
| GitHub Projects | 0.047*** | 0.042*** | | |
| High quality | | | 0.125*** | 0.125*** |
| Low quality | | | -0.028 | 0.008 |
| Filing, Firm controls | No | Yes | Yes | Yes |
| Firm, Form, YM FEs | Yes | Yes | Yes | Yes |
| Pseudo R-sq | 0.792 | 0.799 | 0.799 | 0.799 |

More GitHub projects \(\Rightarrow\) more SEC filing downloads

  • The effect is robust to a large battery of controls
  • Variation in GitHub Projects across year and filing type aids identification
  • Higher quality projects drive the impact

Effect size is equivalent to a 4.3% increase in downloads
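
One reading consistent with the reported numbers, assuming the 4.3% figure comes from the coefficient in the specification with controls: in a Poisson model with a log link, one additional project multiplies expected downloads by \(\exp(\hat{\beta})\), so

\[ \exp(\hat{\beta}) - 1 = \exp(0.042) - 1 \approx 0.043 \]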

Robustness: Sample cuts

| | Non-Financial authors | Excluding 10-K | After GitHub | After first project |
|---|---|---|---|---|
| GitHub Projects | 0.065*** | 0.065*** | 0.039*** | 0.035** |
| Filing, Firm controls | Yes | Yes | Yes | Yes |
| Firm, Form, YM FEs | Yes | Yes | Yes | Yes |
| Pseudo R-sq | 0.797 | 0.805 | 0.691 | 0.550 |

Robust to various other samples

  • The effect isn’t caused by the presence of financial authors
  • Not driven by any specific form type (can drop any)
  • Not driven by control sample structure

Other robustness tests (untabulated)

  • Robust to replacing Form FE with Form-month FE (rules out seasonality in attention to form types)
  • Robust to replacing Firm FE with Firm-month FE (rules out seasonality in firms’ disclosure attention)
  • Robust to dropping any class projects

Staggered Diff-in-Diff Analysis

No pre-trend: Helps to rule out reverse causality

Machine acquisition speed

| | Machine Downloads: 15 min | Machine Downloads: 1 hour | Machine Downloads: 1 day | Download speed: 15 min/day | Download speed: 1 hour/day |
|---|---|---|---|---|---|
| GitHub Projects | 0.215*** | 0.206*** | 0.039*** | 0.008*** | 0.009*** |
| Filing, Firm controls | Yes | Yes | Yes | Yes | Yes |
| Firm, Form, YM FEs | Yes | Yes | Yes | Yes | Yes |
| Pseudo R-sq | 0.864 | 0.861 | 0.813 | | |
| Adj. R-sq | | | | 0.673 | 0.752 |

The effect is quite fast, with much higher magnitude in the first hour

Mechanism

Democratization as the mechanism: Set up

For this test, we directly examine the number of users machine-downloading each filing:

\[ \begin{aligned} \log\left(\mathbb{E}\left(Unique~IP_{i,j,t} \mid IVs\right)\right) = {} & \alpha + \beta GitHub~Projects_{j,t} + \gamma Controls_{i,j,t} \\ & + \lambda_{firm} + \lambda_{form} + \lambda_{year,month} + \epsilon_{i,j,t} \end{aligned} \]

  • DV is count of unique users \(\Rightarrow\) PPML regression
  • We split IP counts by various characteristics:
    • Telecom companies
      • Likely to be residential IP addresses \(\Rightarrow\) retail investor oriented
    • Financial institutions
      • Institutional investors
      • Split on small vs. large: Varying processing constraints

Democratization as the mechanism: Results

| | Unique IP: All | Unique IP: Residential | Unique IP: Institutions | Unique IP: Small Institutions | Unique IP: Large Institutions |
|---|---|---|---|---|---|
| GitHub Projects | 0.035*** | 0.047*** | 0.013*** | 0.013** | 0.008 |
| Filing, Firm controls | Yes | Yes | Yes | Yes | Yes |
| Firm, Form, YM FEs | Yes | Yes | Yes | Yes | Yes |
| Pseudo R-sq | 0.848 | 0.712 | 0.469 | 0.542 | 0.533 |

Who appears to be influenced by GitHub projects?

  • Investors who likely have processing constraints change behavior following GitHub projects
    • Residential users
    • Small institutions
  • Investors with lower constraints do not respond to GitHub projects
    • Large financial institutions

Market implications

Regression specification

\[ \begin{aligned} Outcome_{i,j,t} = {} & \alpha + \beta GitHub~Projects_{j,t} + \gamma Controls_{i,j,t} \\ & + \lambda_{firm} + \lambda_{form} + \lambda_{year,month} + \epsilon_{i,j,t} \end{aligned} \]

  • Use a standard HDFE linear regression structure
  • Dependent variables:
    • Trading activity: Volume, abnormal retail volume, abnormal odd lot volume
    • Liquidity: Abnormal spread, abnormal depth
    • Price formation: |CAR|, IPT
  • Additional specification: 2 stage model using predicted machine downloads
    • We use the linear predictor from the Poisson model to avoid scaling issues with the linear regression
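
As a sketch of the distinction, using notation that is ours rather than the paper's (\(\hat{\eta}\) for the linear predictor): the fitted Poisson model gives

\[ \hat{\eta}_{i,j,t} = \log\left(\widehat{\mathbb{E}}\left(Machine~Downloads_{i,j,t} \mid IVs\right)\right), \qquad \widehat{Machine~Downloads}_{i,j,t} = \exp\left(\hat{\eta}_{i,j,t}\right) \]

Under this reading, the second stage would use \(\hat{\eta}_{i,j,t}\), which is on a log scale, as the regressor rather than the exponentiated count, which is heavily right-skewed.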

Trading activity and liquidity

| | Abnormal Volume | Abnormal Retail | Abnormal OddLots | Abnormal Spread | Abnormal Depth |
|---|---|---|---|---|---|
| GitHub Projects | 0.010*** | 0.007*** | 0.016*** | 0.038 | 0.065*** |
| Filing, Firm controls | Yes | Yes | Yes | Yes | Yes |
| Firm, Form, YM FEs | Yes | Yes | Yes | Yes | Yes |
| Adj. R-sq | 0.093 | 0.078 | 0.137 | 0.195 | 0.092 |

Association between GitHub projects and trading activity

  • Higher retail trading, consistent with democratization result
  • Higher algorithmic trading
  • Increase in volume and depth

Price formation

| | \|CAR\| [0,+1] | IPT [0,+5] |
|---|---|---|
| GitHub Projects | 0.014*** | 0.004*** |
| Filing, Firm controls | Yes | Yes |
| Firm, Form, YM FEs | Yes | Yes |
| Adj. R-sq | 0.209 | 0.015 |

Association between GitHub projects and market reaction

  • Market reaction to SEC filings increases following the release of GitHub projects

Two stage model

| | Abnormal Volume | Abnormal Retail | Abnormal OddLots | Abnormal Spread | Abnormal Depth | \|CAR\| | IPT |
|---|---|---|---|---|---|---|---|
| Predicted Machine Downloads | 1.913*** | 1.483*** | 1.286*** | -0.348** | 1.053*** | 1.710*** | 0.494*** |
| Filing, Firm controls | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Firm, Form, YM FEs | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Adj. R-sq | 0.093 | 0.078 | 0.137 | 0.195 | 0.092 | 0.212 | 0.016 |

Downloads from GitHub projects lead to the market effects

Additional analysis

Human downloads

| | Human Downloads | Human Downloads |
|---|---|---|
| GitHub Projects | -0.008*** | -0.010*** |
| Filing, Firm controls | No | Yes |
| Firm, Form, YM FEs | Yes | Yes |
| Pseudo R-sq | 0.642 | 0.656 |

GitHub project usage as a substitute for human downloading

After GitHub projects are released, the total number of human downloads drops

Conclusion

Findings and takeaways

  • GitHub projects on specific SEC filing types are followed by more machine downloads of those filing types in a short window after filing release
    • Driven by high quality projects
    • Not driven by specific authors, and no pre-trend
  • Mechanism: Democratization of information processing technology
    • Affected users are largely retail or smaller financial institutions
  • The market appears to be affected by the provision of this technology
    • More retail and algorithmic trading
    • Improved liquidity and depth; faster price formation

Takeaways

  • Democratization of information processing technology leads to greater information acquisition
    • This can benefit the market and help level the playing field
    • A distinct mechanism from information provision
  • Our paper speaks to the benefits of open source software in an age where open source is hotly debated in the AI tech sector

Thanks!


Dr. Richard M. Crowley
rcrowley@smu.edu.sg
@prof_rmc
rmc.link/

Packages used for these slides

  • dplyr
  • ggplot2
  • kableExtra
  • knitr
  • quarto
    • material-icons
  • revealjs
  • scales

Other prior literature

Framing GitHub in the prior literature

Information dissemination

  • The SEC EDGAR system was piloted in 1993
    • Trades become more informative (Gao and Huang 2020 RFS); liquidity improves and cost of capital falls (Goldstein et al. 2023 JAR)
  • Use by constituents:
    • Institutional investors (Chen et al. 2020)
    • IRS (Bozanic et al. 2017)
    • Financial analysts (Gibbons et al. 2021)
    • Hedge funds (Crane et al. 2023)
  • Less use by retail investors (Loughran and McDonald 2017)

Information access

  • Online trading from 1995 onward (Barber and Odean 2001 JEP)
  • Internet access as a “democratization of finance” (Hvide et al. 2022 Working)
  • 3G data facilitates trading (Brown et al. 2015 JAR)
    • But may be a distraction (Brown et al. 2022 Working)

Democratization of information provision

  • Information content of:
    • Message boards (Antweiler and Frank 2004 JF; Das and Chen 2007 MS)
    • Seeking Alpha (Chen et al. 2014 RFS; Farrell et al. 2022 JFE)
    • Twitter (Bartov et al. 2018 TAR; Bartov et al. 2023 MS)

Key differences

  • Similar flavor to democratization of information processing: Investors/others aiding investors to lower processing costs
  • Underlying theory is quite different
    • Democratization of info. provision: Provide important information in a transformed manner that is easier to process
    • Democratization of info. processing: Provide tools to aid investors in gathering the original disseminated information

Download methods and other stats

Machine downloads

Following Cao et al. 2023 RFS:

  1. If an IP address downloads filings from >50 firms in 1 day, label all downloads from that IP address on that day as machine downloads (Originally from Lee et al. 2015 JFE)
  2. If a download request is labeled as coming from a web crawler in the SEC log file, label that download as a machine download
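
The two rules can be sketched as below. This is a minimal illustration, not the authors' code: the row layout (keys `ip`, `day`, `firm`, `crawler`) and the function name are hypothetical, and the toy log is made up.

```python
from collections import defaultdict


def label_machine_downloads(log_rows, firm_threshold=50):
    """Return one boolean per row, True where the row is a machine download.

    Rule 1: the IP downloaded filings from more than firm_threshold firms that day.
    Rule 2: the log flags the request as coming from a web crawler.
    """
    firms_by_ip_day = defaultdict(set)
    for row in log_rows:
        firms_by_ip_day[(row["ip"], row["day"])].add(row["firm"])
    return [
        row["crawler"] or len(firms_by_ip_day[(row["ip"], row["day"])]) > firm_threshold
        for row in log_rows
    ]


# Toy log: IP "a" touches 51 firms, IP "b" touches 50, IP "c" is a flagged crawler
rows = (
    [{"ip": "a", "day": "2016-01-04", "firm": f"F{i}", "crawler": False} for i in range(51)]
    + [{"ip": "b", "day": "2016-01-04", "firm": f"F{i}", "crawler": False} for i in range(50)]
    + [{"ip": "c", "day": "2016-01-04", "firm": "F0", "crawler": True}]
)
labels = label_machine_downloads(rows)
```

Note that rule 1 labels every download from a qualifying IP-day, including the first 50, because the label is applied after the daily firm count is known.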

Human downloads

Following Ryans 2017 Working

  1. Start with all downloads
  2. Identify the IP address of each download
    • If the IP address accessed >25 filings in 1 minute: drop
    • If the IP address accessed 3 or more companies in 1 minute: drop
    • If the IP address accessed 500+ filings in 1 day: drop
  3. Tag downloads from the remaining IP addresses as human downloads
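
A minimal sketch of these screens, under assumptions the slide leaves open: "in 1 minute" is implemented as a calendar-minute bucket, each log row counts as one filing access, and an IP that fails any screen is dropped entirely. The row layout and function name are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def human_download_ips(rows):
    """Return the set of IPs that survive all three screens.

    Each row is a hypothetical (ip, timestamp, firm) tuple, one per filing access.
    """
    per_minute = Counter()
    firms_per_minute = defaultdict(set)
    per_day = Counter()
    for ip, ts, firm in rows:
        minute = ts.replace(second=0, microsecond=0)
        per_minute[(ip, minute)] += 1
        firms_per_minute[(ip, minute)].add(firm)
        per_day[(ip, ts.date())] += 1
    # Screen 1: more than 25 filings in one minute
    dropped = {ip for (ip, _), n in per_minute.items() if n > 25}
    # Screen 2: 3 or more companies in one minute
    dropped |= {ip for (ip, _), firms in firms_per_minute.items() if len(firms) >= 3}
    # Screen 3: 500 or more filings in one day
    dropped |= {ip for (ip, _), n in per_day.items() if n >= 500}
    return {ip for ip, _, _ in rows} - dropped


# Toy log: "h" browses slowly, "m" hits 3 firms in a minute, "q" makes 26 requests in a minute
rows = (
    [("h", datetime(2016, 1, 4, 9, 0, 5), "AAA"), ("h", datetime(2016, 1, 4, 9, 5, 0), "AAA")]
    + [("m", datetime(2016, 1, 4, 9, 0, s), f) for s, f in enumerate(["AAA", "BBB", "CCC"])]
    + [("q", datetime(2016, 1, 4, 10, 0, s), "AAA") for s in range(26)]
)
human_ips = human_download_ips(rows)
```

Downloads from the IPs in `human_ips` would then be tagged as human downloads in step 3.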

First projects

Main sample: 30 Jan 2012

Including generic projects: 25 Feb 2011

Most starred projects

Main sample: 76 stars

Including generic projects: 340 stars

Full tables

Tables 1 through 12