Algorithm Access for All: Information Processing Democratization via GitHub

Dr. Richard M. Crowley

With Stephanie Cheng, Pengkai Lin, and Yuan Zhao

rcrowley@smu.edu.sg / @prof_rmc / https://rmc.link/
Slides @ rmc.link/GitHub

Research question and background

Increasingly digitized information

Digitizing information has drastically increased the volume of information available to any one person in the market

Left image is from https://www.loc.gov/item/2016871394/

Increasing availability of Algorithms

Algorithmic access and processing are increasing over time
Technology barriers can be a limitation (Blankespoor et al. 2014; Gomez 2024)
- Certain knowledge and resources required to implement

Why examine open source in the context of capital markets?

Open source software lowers the resources and knowledge needed to implement an algorithm
- These algorithms can help with any part of the Blankespoor et al. (2020) framework
  - Awareness: Knowing that the data exists
  - Acquisition: Gather the data and extract information
  - Integration: Analyze the implications of the information

Left figure is from Cao et al. 2023 RFS

What we do

We examine the democratization of information processing technology, in the context of financial markets

Does this democratization lead to increased usage of data shortly after release? And does it broaden the audience?
What are the market consequences of this democratization?

Data and methods

We leverage a unique data set of metadata of all GitHub projects through early 2021 covering 190 million projects
We use a hybrid human + ML approach to identify all information processing projects for SEC filings
- We find nearly 900 unique projects
We hand code certain features of these projects (what they do, the data they access, author information, AI usage)

Empirics

Examine the relationship between GitHub projects and machine downloads of SEC filings shortly after release
Examine how the user base of SEC filings changes, especially among residential and smaller institution users
Examine market movement: trading activity, liquidity, and stock price formation

Results

1 additional GitHub project leads to 4.3% more machine downloads

Identification: GitHub projects handling specific filing types $\rightarrow$ downloads by filing type
Cross sectional evidence: Driven by high quality projects
Robust for various windows; speed tests; different samples; author groups
Staggered DiD approach: No pre-trend

Mechanism: Democratization of information processing technology

More users access filings after GitHub projects released
- Concentrated among residential users and small/medium sized financial institutions

Market effects

Increased retail and algorithmic trade, improved liquidity and price formation; robust to a 2-stage approach

Contribution

Introduce the concept of democratization of information processing technology
- The spread of information processing technology by individuals
  - Decreases the cost for others to use such methods
- Parallels “democratization of information” or “information liberalization”
Add to the literature on investor processing constraints
- Prior literature: Small investors face information disadvantages¹
- Prior literature: Broad dissemination and intermediaries can help²
- We show that open source technologies can also help
  - These technologies enable some investors to access data themselves
Broader set of technology relevant to the FinTech literature
- Granular evidence on the types of algorithms available for information processing
- Relevant to regulators as they discuss the impact of algorithms in financial markets

Background

Provision versus Processing Technology

Democratization of…

Information Provision
- Message boards (Antweiler and Frank 2004; Das and Chen 2007)
- Social media (Bartov et al. 2018; 2023)
- Seeking Alpha (Chen et al. 2014; Farrell et al. 2022)
Information Processing Technology
- Open source software
- Other software
  - AI: ChatGPT as an increase in AI access (Chang et al. 2023)

Key differences

Similar flavor: Investors/others aiding investors to lower processing costs
Underlying theory is quite different
- Democratization of info. provision: Provide important information in a transformed manner that is easier to process
- Democratization of info. processing: Provide tools to aid investors in gathering the original disseminated information

GitHub

Launched publicly in April 2008
Allows developers to store, collaborate on, and disseminate their work
- Repositories of code, scripts, and computer programs
Over 100 million developers
Over 280 million projects
Over 200k academic papers cite GitHub (Silver 2018)

What does a GitHub project look like?

There is a description of the project under “About”
The project has received multiple stars and forks
The source code is accessible in the main panel
A variety of other statistics on the project are also available

Anecdotal evidence

There is discussion online of users seeking tools to access SEC filings

Methods and GitHub project data

Human input + ML for classification

Workflow from King, Lam and Roberts 2017 AJPS (KLR)

Key idea is to iterate between human and machine classification
- Gives a more refined classification than a dictionary
- Avoids the large Type I error classifiers have with sparse classification problems
- Many possible ML classifiers to choose
  - We use an ensemble of Naïve Bayes + logistic regression
Preprocessing steps
1. Restricting to English language projects (using spacy-langdetect)
2. Lemmatizing words (using spacy)
3. Constructing both unigram and bigram sets for project descriptions
  - This is what we feed to the KLR algorithm
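A minimal sketch of the last preprocessing step, building unigram and bigram sets from a project description. It is an illustration under stated assumptions, not the authors' pipeline: the language filter and lemmatization (done with spacy-langdetect and spacy in the slides) are omitted, and the function names `tokenize` and `unigrams_and_bigrams` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a description and split it into tokens.

    Hyphenated tokens such as '10-k' are kept whole. This stands in for
    the spacy-based lemmatization, which is NOT reproduced here.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def unigrams_and_bigrams(text):
    """Return the set of unigrams and adjacent-token bigrams for one description."""
    tokens = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


# Hypothetical project description, for illustration only.
features = unigrams_and_bigrams("Download 10-K filings from SEC EDGAR")
```

Sets are used because the sketch only records whether a term appears in a description, not how often.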

Workflow: Applying KLR

Construct an initial dictionary (“SEC”, “EDGAR”, “10-K”, “10-Q”, “8-K”)
Pull all projects matching these keywords
Hand classify into 2 sets: correct (“reference set”) and incorrect (remove from sample)
Ask the ML algorithm to classify all remaining projects based on words and bigrams
ML algorithm backs out new dictionary from classification
We refine the list, removing terms that are too noisy
Pull all remaining projects matching the new dictionary
Have RAs classify these projects by hand
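One pass of this loop can be sketched as follows. This is a simplified stand-in, not the KLR algorithm or the Naïve Bayes + logistic regression ensemble from the slides: candidate terms are ranked by a smoothed log-odds score between the hand-coded reference set and the unflagged projects. The project descriptions, labels, and function names are hypothetical.

```python
import math
from collections import Counter

# Initial dictionary from the slides, lowercased.
SEED_TERMS = {"sec", "edgar", "10-k", "10-q", "8-k"}


def tokens(text):
    """Whitespace tokenization of a lowercased description."""
    return text.lower().split()


def matches(text, dictionary):
    """True if any token of the description is in the dictionary."""
    return any(tok in dictionary for tok in tokens(text))


def suggest_terms(reference, others, top_n=3):
    """Rank non-seed words by smoothed log-odds of appearing in the
    reference set versus the other descriptions (a stand-in for the
    dictionary the ML step backs out)."""
    ref_counts = Counter(t for d in reference for t in set(tokens(d)))
    oth_counts = Counter(t for d in others for t in set(tokens(d)))

    def score(tok):
        p_ref = (ref_counts[tok] + 1) / (len(reference) + 2)
        p_oth = (oth_counts[tok] + 1) / (len(others) + 2)
        return math.log(p_ref / p_oth)

    candidates = [t for t in ref_counts if t not in SEED_TERMS]
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(candidates, key=lambda t: (-score(t), t))[:top_n]


# Hypothetical descriptions, for illustration only.
projects = [
    "parse sec edgar filings",
    "edgar 10-k annual report scraper",
    "sec football scores",
    "annual report text mining",
    "football scores tracker",
    "weather app",
]
flagged = [p for p in projects if matches(p, SEED_TERMS)]
reference = flagged[:2]  # pretend hand coding kept the first two, dropped the third
others = [p for p in projects if p not in flagged]
new_terms = suggest_terms(reference, others)
```

In the workflow above, the suggested terms would then be pruned by hand before pulling further projects for manual classification.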

We additionally add other, less commonly targeted SEC filings as a second dictionary

Total flagged projects: 7,430. After hand filtering and sample cut: 831.

What filings do the projects cover?

557 are generic projects related to SEC filings
207 target one or more filing types of interest
- 168 target Form 10-K
- 45 target Form 10-Q
- 17 target Form 8-K
- 13 or fewer target each of: 3, 4, 5, SC 13D, S-1

Projects over time

Steady increase in projects over time

Package authors

Most projects are by CS or academic sources

Use cases over time

All stages of the Blankespoor et al. (2020) framework are present

Sample projects

What type of libraries are used? (Python projects only)

There is a wide variety of underlying technology that is being disseminated

New or old technology?

Most projects use relatively old underlying technology, though some push technological boundaries each year

Empirical setup

Data and sample

EDGAR filing data
- Downloads and IP addresses from the SEC EDGAR server logs on days [0, +1]
  - Machine downloads following Lee et al. (2015) and Cao et al. (2023)
  - Human downloads following Ryans (2017)
- Size, fog, sentiment from WRDS SEC analytics
Observations need to have SEC EDGAR log data: March 2003 through June 2017
- We exclude “lost or damaged” periods (see Loughran and McDonald 2017)
Match to WRDS SEC analytics, Compustat, and CRSP
- CRSP share code 10 or 11; main exchanges only, $5 or higher share price
Filing types on GitHub by the end of 2016: 10-K, 10-Q, 13-F, 3, 4, 5, 8-K, or D
Additional filing types on GitHub by 2020 as controls: Schedule 13D and S-1

Final sample: 2,634,693 filings

Empirical strategy

We compute the measure $GitHub~Projects$ at the filing type level
- E.g., the # of projects for processing Form 10-K as of the prior month
- Identification comes from having 8 filing types with differently timed shocks
We include form and year-month fixed effects
- We additionally include firm fixed effects
We cluster standard errors by firm and year-month
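The construction of the $GitHub~Projects$ measure can be sketched as a cumulative count per form type. This is an illustration under assumptions: the project list is made up, the function name `github_projects` is hypothetical, and "as of the prior month" is read as "created before the first day of the filing's month".

```python
from datetime import date

# Hypothetical (form type, project creation date) pairs; not data from the study.
PROJECTS = [
    ("10-K", date(2012, 2, 10)),
    ("10-K", date(2013, 6, 15)),
    ("8-K", date(2014, 3, 2)),
]


def github_projects(form_type, filing_date, projects=PROJECTS):
    """Count projects for this form type created before the filing's month began,
    i.e. the cumulative project count as of the end of the prior month."""
    month_start = filing_date.replace(day=1)
    return sum(
        1
        for form, created in projects
        if form == form_type and created < month_start
    )
```

Because each form type accumulates projects on its own schedule, the measure varies across form types within the same month, which is the variation the fixed-effects design relies on.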

How to interpret this

Our regressions effectively measure the effect of increases in project counts across years within each form type, where the increase is not a common shock across all form types.

Univariate statistics

GitHub projects and machine downloads

Regression specification

\[ \text{log}\left(\mathbb{E}\left(Machine~Downloads_{i,j,t} \mid IVs\right)\right) = \alpha + \beta GitHub~Projects_{j,t}\\ \qquad\qquad\qquad\qquad\quad+ \gamma Controls_{i, j, t} +\lambda_{firm} + \lambda_{form} + \lambda_{year,month} + \epsilon_{i,j,t} \]

As machine downloads is a count, we use Poisson regression (Cohn et al. 2022 JFE)
- Implement it as a PPML regression (Correia et al. 2020 SJ)
  - Poisson adjusted to handle sparsity and HDFE
Dependent variable: # of machine downloads of filing $j$
Independent variable: # of GitHub projects for filing $j$’s form type at end of prior year
Controls:
- File measures: size, fog, LM sentiment
- Human downloads
- Time-varying firm characteristics: firm size, ROA, leverage, inst. own., growth, idiosyncratic volatility, segments, Tobin’s Q, analyst coverage, turnover, past return

Downloads (T3A & B)

	Machine Downloads	Machine Downloads	Quality: Star	Quality: Fork
GitHub Projects	0.047***	0.042***
High quality			0.125***	0.125***
Low quality			-0.028	0.008
Filing, Firm controls	No	Yes	Yes	Yes
Firm, Form, YM FEs	Yes	Yes	Yes	Yes
Pseudo R-sq	0.792	0.799	0.799	0.799

More GitHub projects $\Rightarrow$ more SEC filing downloads

The effect is robust to a large battery of controls
Variation in GitHub Projects across year and filing type aids identification
Higher quality projects drive the impact

Effect size is equivalent to a 4.3% increase in downloads
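As a check on the arithmetic: under the log link of a Poisson model, a one-unit increase in a regressor multiplies the expected count by $e^{\beta}$. The sketch below applies this to the reported coefficients; the 4.3% figure appears to correspond to the 0.042 estimate with controls. The helper name `poisson_pct_change` is hypothetical.

```python
import math


def poisson_pct_change(beta, delta=1.0):
    """Percent change in the expected count implied by a Poisson (log-link)
    coefficient `beta` for a `delta`-unit increase in the regressor."""
    return (math.exp(beta * delta) - 1.0) * 100.0


with_controls = round(poisson_pct_change(0.042), 1)     # 4.3
without_controls = round(poisson_pct_change(0.047), 1)  # 4.8
```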

Robustness: Sample cuts

	Non-Financial authors	Excluding 10-K	After GitHub	After first project
GitHub Projects	0.065***	0.065***	0.039***	0.035**
Filing, Firm controls	Yes	Yes	Yes	Yes
Firm, Form, YM FEs	Yes	Yes	Yes	Yes
Pseudo R-sq	0.797	0.805	0.691	0.550

Robust to various other samples

The effect isn’t caused by the presence of financial authors
Not driven by any specific form type (can drop any)
Not driven by control sample structure

Other robustness tests (untabulated)

Robust to replacing Form FE with Form-month FE (rules out seasonality in attention to form types)
Robust to replacing Firm FE with Firm-month FE (rules out seasonality in firms’ disclosure attention)
Robust to dropping any class projects

Staggered Diff-in-Diff Analysis

No pre-trend: Helps to rule out reverse causality

Machine acquisition speed

	Machine Downloads: 15 min	Machine Downloads: 1 hour	Machine Downloads: 1 day	Download speed: 15 min/day	Download speed: 1 hour/day
GitHub Projects	0.215***	0.206***	0.039***	0.008***	0.009***
Filing, Firm controls	Yes	Yes	Yes	Yes	Yes
Firm, Form, YM FEs	Yes	Yes	Yes	Yes	Yes
Pseudo R-sq	0.864	0.861	0.813
Adj. R-sq				0.673	0.752

The effect is quite fast, with much higher magnitude in the first hour

Mechanism

Democratization as the mechanism: Set up

For this test, we directly examine the number of unique users machine-downloading each filing:

\[ \text{log}\left(\mathbb{E}\left(Unique~IP_{i,j,t} \mid IVs\right)\right) = \alpha + \beta GitHub~Projects_{j,t} + \gamma Controls_{i, j, t} \\\qquad\qquad\qquad\quad+ \lambda_{firm} + \lambda_{form} + \lambda_{year,month} + \epsilon_{i,j,t} \]

DV is count of unique users $\Rightarrow$ PPML regression
We split IP counts by various characteristics:
- Telecom companies
  - Likely to be residential IP addresses $\Rightarrow$ retail investor oriented
- Financial institutions
  - Institutional investors
  - Split on small vs. large: Varying processing constraints

Democratization as the mechanism: Results

	Unique IP: All	Unique IP: Residential	Unique IP: Institutions	Unique IP: Small Institutions	Unique IP: Large Institutions
GitHub Projects	0.035***	0.047***	0.013***	0.013**	0.008
Filing, Firm controls	Yes	Yes	Yes	Yes	Yes
Firm, Form, YM FEs	Yes	Yes	Yes	Yes	Yes
Pseudo R-sq	0.848	0.712	0.469	0.542	0.533

Who appears to be influenced by GitHub projects?

Investors who likely have processing constraints change behavior following GitHub projects
- Residential users
- Small institutions
Investors with lower constraints do not respond to GitHub projects
- Large financial institutions

Market implications

Regression specification

\[ Outcome_{i,j,t} = \alpha + \beta GitHub~Projects_{j,t}\\ \qquad\qquad\qquad\qquad\quad+ \gamma Controls_{i, j, t} +\lambda_{firm} + \lambda_{form} + \lambda_{year,month} + \epsilon_{i,j,t} \]

Use a standard HDFE linear regression structure
Dependent variables:
- Trading activity: Volume, abnormal retail volume, abnormal odd lot volume
- Liquidity: Abnormal spread, abnormal depth
- Price formation: |CAR|, IPT
Additional specification: 2 stage model using predicted machine downloads
- We use the linear predictor from the Poisson model to avoid scaling issues with the linear regression

Trading activity and liquidity

	Abnormal Volume	Abnormal Retail	Abnormal OddLots	Abnormal Spread	Abnormal Depth
GitHub Projects	0.010***	0.007***	0.016***	0.038	0.065***
Filing, Firm controls	Yes	Yes	Yes	Yes	Yes
Firm, Form, YM FEs	Yes	Yes	Yes	Yes	Yes
Adj. R-sq	0.093	0.078	0.137	0.195	0.092

Association between GitHub projects and trading activity

Higher retail trading, consistent with democratization result
Higher algorithmic trading
Increase in volume and depth

Price formation

	|CAR| [0,+1]	IPT [0,+5]
GitHub Projects	0.014***	0.004***
Filing, Firm controls	Yes	Yes
Firm, Form, YM FEs	Yes	Yes
Adj. R-sq	0.209	0.015

Association between GitHub projects and market reaction

Market reaction to SEC filings increases following the release of GitHub projects

Two stage model

	Abnormal Volume	Abnormal Retail	Abnormal OddLots	Abnormal Spread	Abnormal Depth	|CAR|	IPT
Predicted Machine Downloads	1.913***	1.483***	1.286***	-0.348**	1.053***	1.710***	0.494***
Filing, Firm controls	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Firm, Form, YM FEs	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Adj. R-sq	0.093	0.078	0.137	0.195	0.092	0.212	0.016

Downloads from GitHub projects lead to the market effects

Additional analysis

Human downloads

	Human Downloads	Human Downloads
GitHub Projects	-0.008***	-0.010***
Filing, Firm controls	No	Yes
Firm, Form, YM FEs	Yes	Yes
Pseudo R-sq	0.642	0.656

GitHub project usage as a substitute for human downloading

After GitHub projects release, the total number of human downloads drops

Conclusion

Findings and takeaways

GitHub projects on specific SEC filing types are followed by more machine downloads of those filing types in a short window after filing release
- Driven by high quality projects
- Not driven by specific authors, and no pre-trend
Mechanism: Democratization of information processing technology
- Affected users are largely retail or smaller financial institutions
The market appears to be affected by the provision of this technology
- More retail and algorithmic trading
- Improved liquidity and depth; faster price formation

Takeaways

Democratization of information processing technology leads to greater information acquisition
- This can benefit the market and help level the playing field
- A distinct mechanism from information provision
Our paper speaks to the benefits of open source software at a time when open source is hotly debated in the AI tech sector

Thanks!

Dr. Richard M. Crowley
rcrowley@smu.edu.sg
@prof_rmc
rmc.link/

Packages used for these slides

dplyr
ggplot2
kableExtra
knitr
quarto
- material-icons
revealjs
scales

Other Prior literature

Framing GitHub in the prior literature

Information dissemination

The SEC EDGAR system was piloted in 1993
- Trades become more informative (Gao and Huang 2020 RFS), improved liquidity and lower CoC (Goldstein et al. 2023 JAR)
Use by constituents:
- Institutional investors (Chen et al. 2020)
- IRS (Bozanic et al. 2017)
- Financial analysts (Gibbons et al. 2021)
- Hedge funds (Crane et al. 2023)
Less use by retail investors (LM 2017)

Information access

Online trading from 1995 onward (Barber and Odean 2001 JEP)
Internet access as a “democratization of finance” (Hvide et al. 2022 Working)
3G data facilitates trading (Brown et al. 2015 JAR)
- But may be a distraction (Brown et al. 2022 Working)

Democratization of information provision

Information content of:
- Message boards (Antweiler and Frank 2004 JF; Das and Chen 2007 MS)
- Seeking Alpha (Chen et al. 2014 RFS; Farrell et al. 2022 JFE)
- Twitter (Bartov et al. 2018 TAR; Bartov et al. 2023 MS)

Key differences

Similar flavor to democratization of information processing: Investors/others aiding investors to lower processing costs
Underlying theory is quite different
- Democratization of info. provision: Provide important information in a transformed manner that is easier to process
- Democratization of info. processing: Provide tools to aid investors in gathering the original disseminated information

Download methods and other stats

Machine downloads

Following Cao et al. 2023 RFS:

If an IP address downloads filings from >50 firms in 1 day, label all downloads from that IP address on that day as machine downloads (Originally from Lee et al. 2015 JFE)
If a download request is labeled as coming from a web crawler in the SEC log file, label that download as a machine download
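The two rules above can be sketched as a single pass over the log. This is an illustration, not the authors' code: the row layout (dicts with keys `ip`, `date`, `firm`, `crawler`) and the function name are assumptions, and the actual SEC log file uses its own field names.

```python
from collections import defaultdict


def label_machine_downloads(log_rows, firm_threshold=50):
    """Return one boolean per row: True if the download is labeled a machine download.

    A row is a machine download if it is flagged as a crawler request, or if its
    IP address downloaded filings from more than `firm_threshold` distinct firms
    on that day. Row keys ('ip', 'date', 'firm', 'crawler') are hypothetical.
    """
    firms_by_ip_day = defaultdict(set)
    for row in log_rows:
        firms_by_ip_day[(row["ip"], row["date"])].add(row["firm"])
    return [
        bool(row["crawler"])
        or len(firms_by_ip_day[(row["ip"], row["date"])]) > firm_threshold
        for row in log_rows
    ]
```

Note that the firm-count rule labels every download from a qualifying IP-day, including downloads made before the threshold was crossed.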

Human downloads

Following Ryans 2017 Working

Start with all downloads
Identify the IP address of each download
- If the IP address accessed >25 filings in 1 minute: drop
- If the IP address accessed 3 or more companies in 1 minute: drop
- If the IP address accessed 500+ filings in 1 day: drop
Tag downloads from the remaining IP addresses as human downloads
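A sketch of these screens follows, under several assumptions that are not stated in the slides: each log row is one filing access, "1 minute" means the same clock minute, "1 day" means the same calendar day, and an IP address that trips any screen once is dropped entirely. The row keys (`ip`, `time`, `firm`) and the function name `human_ips` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def human_ips(log_rows):
    """Return the set of IP addresses that survive all three screens.

    Each row is a dict with hypothetical keys 'ip', 'time' (a datetime),
    and 'firm'; one row is treated as one filing access.
    """
    per_minute = defaultdict(list)  # (ip, clock minute) -> firms accessed
    per_day = defaultdict(int)      # (ip, calendar day) -> filings accessed
    for row in log_rows:
        minute = row["time"].replace(second=0, microsecond=0)
        per_minute[(row["ip"], minute)].append(row["firm"])
        per_day[(row["ip"], row["time"].date())] += 1

    dropped = set()
    for (ip, _), firms in per_minute.items():
        # More than 25 filings, or 3 or more companies, within one minute.
        if len(firms) > 25 or len(set(firms)) >= 3:
            dropped.add(ip)
    for (ip, _), n_filings in per_day.items():
        # 500 or more filings within one day.
        if n_filings >= 500:
            dropped.add(ip)
    return {row["ip"] for row in log_rows} - dropped
```

Downloads whose IP address is in the returned set would then be tagged as human downloads.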

First projects

Main sample: 30 Jan 2012

Including generic projects: 25 Feb 2011

Most starred projects

Main sample: 76 stars

Including generic projects: 340 stars

Full tables

Table 1

Table 2

Table 3

Table 4

Table 5

Table 6

Table 7

Table 8

Table 9

Table 10

Table 11

Table 12
