We created a website (link above) to showcase our final project for the FIN 377 (Data Science for Finance) course at Lehigh University. The website was built with streamlit.io, which we used to create a dashboard for our visual analysis.
For our final project we decided to explore how the supply chain network has changed and reformed following the pandemic, based on customer/supplier contracts from the Compustat Customer Supplier dataset. The main goals of this project were to compare the number of contracts before and after the pandemic and to track the growth of technology companies within the supply chain network.
The question we posed at the start of the project was: how has the supply chain industry changed following the pandemic? One of our initial hypotheses was that the number of technology-based companies would increase in 2022. The other hypothesis was that the number of contracts involving healthcare/pharmaceutical companies would also increase.
The COVID-19 pandemic has had a significant impact on the global economy and disrupted supply chains across various industries. As a result, we conducted an analysis of the financial factors affecting several seller companies. We focused on companies that are in the S&P 500 as of 2022 and used data from financial reports in 2019 and 2022 to gain insights into how the pandemic has affected their financial performance.
To create and run our file we used several imports: pandas, numpy, os, and seaborn. In addition to those general Python imports, we also used insufficient_but_starting_eda from eda.py, which is located in the community codebook.
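A minimal sketch of these imports, assuming eda.py sits next to the notebook (the import path reflects our project layout, not a published package):
import os
import numpy as np
import pandas as pd
import seaborn as sns
from eda import insufficient_but_starting_eda  # community codebook helper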
We used three different datasets to create one final dataset to be displayed visually through analysis on our dashboard.
S&P 500: We scraped data from Wikipedia to create this dataset, which we used to filter for S&P 500 companies.
Compustat Customer Supplier: This dataset contained raw data, provided by Dr. Bowen, showing filings between customers and suppliers from 2019 to 2022.
Accounting 2018-2022: This dataset was also provided by Dr. Bowen and comprises the accounting variables we requested.
comp = pd.read_csv('inputs/cust_supply_2019_2022.csv')
sp500 = pd.read_csv('inputs/sp500_2022.csv')
acct_raw = pd.read_csv('inputs/acct_data.csv')
S&P 500: There were 503 firms in 2022.
Compustat Customer Supplier: We first ran EDA on the raw Compustat data. We found:
These variables describe the seller firm in a row of data:
These variables describe the customer of the firm and say something about the relationship type:
We found that even though more than 50% of the geographic area code (gareac) and geographic area type (gareat) values were missing, these fields were not needed for our analysis, so we dropped them. We were also not concerned about segment type (stype) because we used GICS Sector to describe the seller. Another variable, customer sales (salecs), had 12.4% missing observations. We saw that when salecs was NaN, the buyer was also "Not Reported"; therefore we dropped the missing observations.
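A short sketch of the kind of missingness check behind these numbers (column names as they appear in the Compustat file; the exact percentages come from our EDA run):
# percent of missing observations per column
comp[['gareac', 'gareat', 'stype', 'salecs']].isna().mean() * 100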
Based on our EDA, we first dropped all observations where the customer name was not reported; these also corresponded to sellers not in the S&P 500. Next, we dropped any observations with no reported sales.
For this dataset we treat firms that filed in 2019 as providing information for the 2019 fiscal year. We know this can lead to some inaccuracies when firms do not file in 2019 for the 2019 fiscal year: if a firm files in January of 2020, our analysis assigns that filing to fiscal year 2020, when in reality the data most likely corresponds to the 2019 fiscal year. Any mention of gvkey or firm is in reference to the seller.
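To make the caveat concrete, a tiny illustration of how a filing date maps to a year under this rule (the date is hypothetical):
# a January 2020 filing is assigned to 2020, even if it covers fiscal year 2019
pd.to_datetime('2020-01-15').year  # -> 2020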
comp2 = comp  # start from the raw Compustat data
comp2 = comp2[comp2['cnms'] != 'Not Reported']  # drop rows where the customer name is unreported
comp3 = comp2.dropna(subset=['salecs'])  # drop rows with missing customer sales
Accounting 2018-2022: We provided Dr. Bowen with a list of unique gvkeys along with a request for data for the following accounting variables:
We then performed EDA on this new dataset and found that:
Based on our EDA, we decided to use ni as the variable to represent net income, as it had the fewest missing values. We used the same logic to choose capx over capxv. In turn we were able to drop the other variables that represented net income (acominc, oibdp, and ib), as well as capxv. After dropping those variables we had a dataset ready to extend: we planned to create variables showing the growth (return) in each accounting variable between 2019 and 2022 for each firm before merging with the S&P 500 and Compustat datasets.
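As an illustration of the growth variables we had in mind, net-income growth for a single firm works out like this (the numbers are made up; the actual loop appears later):
ni_2019, ni_2022 = 4141.0, 9752.0  # hypothetical net income values, $ millions
ni_growth = (ni_2022 - ni_2019) / ni_2019 * 100  # percent growth, 2019 to 2022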
We merged the cleaned Compustat data (comp3) with sp500 on CIK to create merged. This final dataset gave us 385 unique firms.
comp3 = comp3.rename(columns={'cik': 'CIK'})  # match the column name used in the S&P 500 data
merged = comp3.merge(sp500, on='CIK', how='inner')
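A quick sanity check on that firm count (gvkey identifies the seller; 385 is from our run):
merged['gvkey'].nunique()  # -> 385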
Next, we dropped the filings that were not in 2019 or 2022: we collected the indices of filings dated between 01/01/2020 and 12/31/2021 and dropped them. This left us with 355 unique firms.
merged['date'] = pd.to_datetime(merged['srcdate'])  # parse the filing date
dates = merged.sort_values(by='srcdate')  # sorted view (not used below)
start_date = '2020-01-01'
end_date = '2021-12-31'
filtered_df = merged.query('@start_date <= date <= @end_date')  # filings in the excluded window
filtered_indices = filtered_df.index
filtered_out_df = merged.drop(filtered_indices)  # keep only the 2019 and 2022 filings
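An equivalent, slightly more compact way to express the same date filter, for comparison (a sketch using a negated between() mask):
keep_mask = ~merged['date'].between('2020-01-01', '2021-12-31')  # True for 2019 and 2022 filings
filtered_out_df = merged[keep_mask]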
We created a column fyear to simplify later merging.
filtered_out_df['fyear'] = pd.to_datetime(filtered_out_df['srcdate']).dt.year
Lastly, we filtered out any firms that did not have filings in both 2019 and 2022. We also noticed that there was no 2019 Compustat data for gvkey = 25313, so we dropped this firm. This left us with 88 unique firms, a number we could use to check our final merge.
# keep only firms whose most recent filing year is 2022
filtered = filtered_out_df.groupby('gvkey').filter(lambda x: x['fyear'].max() == 2022)
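Note that this snippet only requires a 2022 filing; a stricter version that also demands a 2019 filing and drops gvkey 25313 explicitly might look like the sketch below (not the code we ran):
# require filings in both years, then drop the firm with no 2019 Compustat data
both_years = filtered_out_df.groupby('gvkey').filter(lambda x: {2019, 2022} <= set(x['fyear']))
both_years = both_years[both_years['gvkey'] != 25313]
both_years['gvkey'].nunique()  # expected: 88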
Using gvkey to Obtain Accounting Data: With the filtered_out_df dataset we were able to extract the list of unique gvkey values to a .csv file, which we sent to Dr. Bowen to obtain the accounting data.
listkeys = pd.DataFrame(filtered_out_df['gvkey'].unique())  # one row per unique seller
listkeys.to_csv('inputs/listkeys.csv', index=False)
With the cleaned accounting data we were able to merge listkeys and acct_raw on gvkey. This gave us 354 unique firms.
listkeys2 = listkeys.rename(columns={0: 'gvkey'})  # name the lone column so the merge key matches
merged_acct_raw_keys = pd.merge(listkeys2, acct_raw, how='inner', on='gvkey')
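Since listkeys2 should contain each gvkey exactly once, the merge can optionally be validated (a safeguard we did not include in the original code):
merged_acct_raw_keys = pd.merge(listkeys2, acct_raw, how='inner', on='gvkey',
                                validate='one_to_many')  # raises if listkeys2 has duplicate gvkeys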
The next step was to take that new merged DataFrame and filter it to rows where fyear was 2019 or 2022. We then dropped the variables we deemed unnecessary during data cleaning to keep our dataset concise.
acct_df = merged_acct_raw_keys.query('fyear == 2019 or fyear == 2022')
acct_df = acct_df[['gvkey', 'fyear', 'ap', 'at', 'capx', 'cogs', 'epsfx', 'gp', 'invt', 'ni', 'rect', 'sale']]
Before merging with the filtered dataset, we created variables for each accounting variable (ap, capx, ni, etc.) to show the growth between 2019 and 2022.
# walk the rows in firm order: remember each firm's 2019 row, then compute
# percent growth when its 2022 row appears
prev_row = acct_df.iloc[0]
for index, row in acct_df.iloc[1:].iterrows():
    if row['fyear'] == 2019.0:
        prev_row = row
    elif row['fyear'] == 2022.0:
        calc_row = (row - prev_row) / prev_row * 100  # percent growth for every column
        calc_row.rename({k: f"calc_{k}" for k in calc_row.index}, inplace=True)
        acct_df.loc[index, calc_row.index] = calc_row
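For comparison, a vectorized alternative to the loop (a sketch; it assumes exactly one 2019 row and one 2022 row per gvkey):
# percent change within each firm across its two fiscal years
num_cols = acct_df.columns.drop(['gvkey', 'fyear'])
growth = (acct_df.sort_values(['gvkey', 'fyear'])
                 .groupby('gvkey')[num_cols]
                 .pct_change() * 100)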
For our final dataset we merged elements from the Compustat/S&P 500 dataset with the accounting dataset. We took filtered and kept our desired columns: gvkey, fyear, conm, Symbol, and CIK. We then dropped duplicates so that we could match firms in the accounting dataset. The final merge left us with 88 unique firms (the same number we found earlier in filtered).
cleaned_comp = filtered[['gvkey', 'fyear', 'conm', 'Symbol', 'CIK']]  # keep identifying columns only
cleaned_comp = cleaned_comp.drop_duplicates()  # one row per firm-year
accounting = pd.merge(cleaned_comp, acct_df, how='left', on=['gvkey', 'fyear'], indicator=True)
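Because we passed indicator=True, the resulting _merge column lets us confirm how the firms matched:
accounting['_merge'].value_counts()  # 'both' = matched; 'left_only' = missing accounting data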
We then used to_csv to save the final datasets for the dashboard.
# accounting data
accounting.to_csv("outputs/accounting_final.csv", index=False)
# compustat data
filtered.to_csv("outputs/compustat_final.csv", index=False)
For the full analysis please check out our dashboard here.
The Python packages used were pandas, numpy, plotly.express, altair, and streamlit.
Our final results include 88 companies. We dropped companies in the real estate, utilities, and energy sectors because of the lack of data from their 2019 and 2022 financial reports. This is a possible limitation of our analysis; however, we feel the sample size is still strong enough to serve as a representation of how the supply chain was impacted by the COVID-19 pandemic.
Looking at our results, we see that except for the IT and healthcare industries, most industries experienced a decrease in sales. Take these two companies for instance. NVIDIA Corporation, an IT company, saw a significant increase in demand for computer-related products, including GPUs, due to the shift toward remote work, distance learning, and increased usage of video streaming services. It has also been successful in expanding its reach into new markets, such as data centers, autonomous vehicles, and AI[1]. NVIDIA saw a 147.06% increase in net sales from 2019 to 2022. CATALENT INC, a healthcare company, provided advanced delivery technologies, development, and manufacturing services for drugs, biologics, and consumer health products[2]. In particular, it collaborated with several pharmaceutical companies to support the development and production of vaccines and therapies. From 2019 to 2022, CATALENT saw a 91.74% increase in net sales.
Overall, despite the challenging business environment following the COVID-19 pandemic, the IT industry was able to maintain its performance and even improve its sales. This underscores the resilience of the industry and the importance of digitalization in the current business landscape. We can see from our visualization that the IT industry increased in sales by about half a million dollars from 2019 to 2022. These findings support the need for companies to adapt to the changing business landscape by embracing digitalization and other innovative strategies to remain competitive in today's economy.
Kyra Grodman: BS in Finance and Business Information Systems
Jersey Krupp: MS in Financial Engineering
Qiyu Yang: MS in Financial Engineering
The snippets of code are from our Analysis GitHub Repository.