Tutorial on textual analysis of firm disclosures, such as earnings conference calls
Here's the code discussed in the tutorial
Unstructured data, such as the language used in corporate disclosures, news and social media, legal filings, investor discussions, and executive interviews, provides a large reservoir of additional business intelligence and novel insights into the operations of a broad range of public and private firms and organizations. It can be analyzed with some basic programming skills. The techniques used to parse and analyze such data and extract these insights are collectively called Natural Language Processing (NLP). I'll use Wu (2023) as an example to walk you through the textual analysis of one such data source: the transcripts of firms' quarterly earnings conference calls.
The data source for the transcript texts is the S&P Machine Readable Transcripts database (formerly known as Capital IQ Transcripts), which is available from S&P by subscription as a data feed. Here are the specific steps to process the data:
1. Once you have subscribed to the data feed, download all transcripts from 2008 onwards. Each transcript is simply a long text string with the transcript ID as the file name. Load all transcripts into a pandas dataframe with the texts in one column (text) and the transcript ID in a second column (transcriptid), and save it as a Python pickle object (.pkl). The sample_transcripts.pkl file in the transcript_processing folder contains excerpts from two sample transcripts and shows the correct structure of the pickle object. Saving the data as a flat .csv or .json file also works.
Download the sample transcripts here
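One way to assemble the pickle file described in step 1 is sketched below. It assumes, hypothetically, that each transcript has been saved as a plain-text file named `<transcriptid>.txt` in a single folder. The folder name, the `.txt` extension, the UTF-8 encoding, and the helper name `build_transcript_frame` are illustrative assumptions and not part of the S&P data feed.

```python
from pathlib import Path

import pandas as pd


def build_transcript_frame(folder):
    """Read every <transcriptid>.txt file in `folder` into a two-column dataframe.

    Assumption: one transcript per file, file name = transcript ID.
    """
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        rows.append({
            "text": path.read_text(encoding="utf-8"),
            "transcriptid": path.stem,  # file name without the extension (a string)
        })
    return pd.DataFrame(rows, columns=["text", "transcriptid"])


# Example usage (both paths are placeholders):
# transcripts = build_transcript_frame("raw_transcripts")
# transcripts.to_pickle("transcripts.pkl")
```

Note that the IDs come out as strings in this sketch; convert them if the files you merge with later store the transcript ID as an integer.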
2. Load the pickle object into Python and first perform a few simple preprocessing tasks:
-- Remove all numbers. Convert all letters to lowercase. Then tokenize each transcript into a list of words (“tokens”). Here's the code:
#### some preliminaries
import pandas as pd
from nltk.tokenize import RegexpTokenizer
# import multiprocessing as mp  # only needed for the optional parallel version

tokenizer=RegexpTokenizer(r'\w+')  # a token is any run of letters, digits, or underscores

def remove_num(text):
    # drop every numeric character and keep everything else
    num_removed="".join([k for k in text if not k.isnumeric()])
    return num_removed

#### detach component text data for counting
transcripts=pd.read_pickle('/andydiwu/sites/data/sample_transcripts.pkl')
thead=transcripts['transcriptid'].reset_index(drop=True)
ttext=transcripts['text']

#### Process and tokenize the main texts; Count total tokens
t_lower=[x.lower() for x in ttext]
t_nonum=[remove_num(x) for x in t_lower]
t_tokenized=[tokenizer.tokenize(x) for x in t_nonum]
As you can see from the code above, you can use many available tokenizers. This example uses the regex tokenizer from the NLTK package (pip install nltk) with the pattern \w+, which treats every run of letters, digits, and underscores as a token and therefore splits the texts on whitespace and punctuation. Note that this will take a long time on the full set of transcripts if you do not use cloud computing. The code below includes simple parallelization steps (in comments, and requiring import multiprocessing as mp) to speed up execution.
#pool = mp.Pool(mp.cpu_count()-1)
#t_nonum = pool.map(remove_num, [x for x in t_lower])
#t_tokenized = pool.map(tokenizer.tokenize, [x for x in t_nonum])
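The commented lines above can be turned into a small runnable helper along the following lines. This is a minimal sketch, not the tutorial's own code: the names `preprocess_one` and `preprocess_all` and the `chunksize` value are illustrative assumptions, and the standard-library `re` module stands in for NLTK here. To my understanding, `RegexpTokenizer(r'\w+')` returns the matches of the same pattern, so the tokens should be identical, but treat that equivalence as an assumption.

```python
import multiprocessing as mp
import re

WORD_PATTERN = re.compile(r"\w+")  # same pattern as the tutorial's tokenizer


def preprocess_one(text):
    """Lowercase, strip numeric characters, and tokenize one transcript."""
    lowered = text.lower()
    no_numbers = "".join(ch for ch in lowered if not ch.isnumeric())
    return WORD_PATTERN.findall(no_numbers)


def preprocess_all(texts, processes=None):
    """Apply preprocess_one to every transcript using a pool of worker processes."""
    if processes is None:
        processes = max(1, mp.cpu_count() - 1)  # leave one core free
    with mp.Pool(processes) as pool:
        # chunksize batches transcripts so workers are not fed one string at a time
        return pool.map(preprocess_one, texts, chunksize=64)
```

Call `preprocess_all(list(ttext))` from inside an `if __name__ == "__main__":` block so that it also works on platforms that start worker processes by spawning a fresh interpreter. As a check on the preprocessing itself, `preprocess_one("Revenue grew 12% in Q3, driven by our suppliers.")` returns `['revenue', 'grew', 'in', 'q', 'driven', 'by', 'our', 'suppliers']`; note that removing numbers turns "q3" into "q".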
3. Count the total number of words in each transcript. Be sure to save the total count as a pandas dataframe and merge it back with the transcript ids. These total word counts are used to scale the word counts to create the final measure.
transcript_wordcount=[len(x) for x in t_tokenized]
transcript_total=pd.concat([thead,pd.DataFrame(transcript_wordcount, columns=['totalwordcount'])], axis=1)
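The reason the transcript IDs are taken with reset_index(drop=True) before being concatenated with the counts is that pd.concat with axis=1 aligns rows by index label, not by position. The toy example below illustrates this with made-up IDs and counts; none of the values come from the actual data.

```python
import pandas as pd

# Toy frame whose index is not 0..n-1, as can happen after filtering rows
transcripts = pd.DataFrame(
    {"transcriptid": [501, 502], "text": ["supply risk", "no issues this quarter"]},
    index=[7, 9],
)
counts = pd.DataFrame({"totalwordcount": [2, 4]})  # default index 0, 1

# Without resetting the index, concat aligns on labels and pads with NaN
misaligned = pd.concat([transcripts["transcriptid"], counts], axis=1)

# Resetting the index first gives one row per transcript
thead = transcripts["transcriptid"].reset_index(drop=True)
aligned = pd.concat([thead, counts], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned version each transcript ID ends up on a different row from its word count, which would silently break the later merge.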
4. Count the instances of supplier words within a distance of 10 words of any instance of a risk word. I recommend doing this with word groups, as follows:
First, convert each tokenized transcript into rolling groups of 21 consecutive words, from the beginning to the end.
Then, identify the middle word within each 21-word group. This word has exactly 10 words before it and 10 words after it. For each middle word i, use a script that flags i ONLY if (1) i is a supplier word and (2) the 21-word group contains at least one risk word. Then count these flagged instances. Note that the first and last 10 words of a transcript never appear as a middle word, so supplier words in those positions are not counted.
Here's the code that achieves this:
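############ Define the necessary functions
window_length=10 # length to search before/after each appearing supplier word for risk words
samplesupplierwords=['supply','supplier','suppliers','component']
sampleriskwords=['risk','risks','uncertainty','disruptions','shortages']

def getwordgroup(k):
    # rolling groups of 2*window_length+1 consecutive tokens (21 with the default)
    wg=list(zip(*[k[i:] for i in range(2*window_length+1)]))
    return wg

def riskcount(initem):
    # initem is one flagged middle word; return a 0/1 indicator for each supplier word.
    # Exact matching is used because str.count matches substrings, so 'suppliers' would also be counted as 'supplier'.
    rct=[int(initem==x) for x in samplesupplierwords]
    return rct

def supplierriskcount(ingroup):
    middleword=[window[window_length] for window in ingroup] # identifies the middle word within the 21-word grouping
    # keep a middle word only if it is a supplier word and its group contains at least one risk word
    k=[i for i, j in zip(middleword, ingroup) if i in samplesupplierwords and any(ele in j for ele in sampleriskwords)]
    r=[riskcount(x) for x in k]
    # total flagged instances per supplier word
    rpd=pd.DataFrame(r, columns=samplesupplierwords).sum(axis=0).values.tolist()
    return rpd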
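To see what the word-group rule does at the edges of a transcript, the sketch below applies the same rule by token position instead of by materialized word groups. It is a variant for illustration, not the method used in Wu (2023): the function names are hypothetical, and the first function deliberately differs from the code above by clipping the search window at the start and end of the text, so it can return slightly higher counts.

```python
def count_supplier_near_risk(tokens, supplier_words, risk_words, window=10):
    """Count supplier-word tokens with at least one risk word within `window` tokens."""
    supplier_words = set(supplier_words)
    risk_words = set(risk_words)
    hits = 0
    for pos, token in enumerate(tokens):
        if token not in supplier_words:
            continue
        # the slice is clipped at the ends of the text, so edge tokens are still checked
        neighborhood = tokens[max(0, pos - window): pos + window + 1]
        if any(word in risk_words for word in neighborhood):
            hits += 1
    return hits


def count_with_full_windows(tokens, supplier_words, risk_words, window=10):
    """Same rule, restricted to tokens with `window` tokens on both sides."""
    supplier_words = set(supplier_words)
    risk_words = set(risk_words)
    hits = 0
    for pos in range(window, len(tokens) - window):
        if tokens[pos] in supplier_words:
            neighborhood = tokens[pos - window: pos + window + 1]
            if any(word in risk_words for word in neighborhood):
                hits += 1
    return hits


supplier_words = ['supply', 'supplier', 'suppliers', 'component']
risk_words = ['risk', 'risks', 'uncertainty', 'disruptions', 'shortages']
tokens = "our supplier flagged a risk of delays and the component market is stable".split()

clipped = count_supplier_near_risk(tokens, supplier_words, risk_words, window=3)
full_only = count_with_full_windows(tokens, supplier_words, risk_words, window=3)
print(clipped, full_only)
```

With a window of 3, "supplier" sits only one token from the start, so the full-window rule skips it while the clipped rule counts it; "component" has no risk word nearby under either rule. For transcripts of several thousand words, the difference between the two rules affects at most the first and last 10 tokens.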
5. Save the count output as a pandas dataframe and left-merge it with the total word counts by transcriptid. You can then save the risk scores and merge back into the main transcript file. Here's the code that achieves this:
#### Compute scores from the counted instances
tokenized_texts=t_tokenized
wordgroups = [getwordgroup(k) for k in tokenized_texts]
sccount=[supplierriskcount(x) for x in wordgroups]
rawcount_sum=pd.DataFrame(sccount).sum(axis=1)
transcript_ct=pd.concat([thead,rawcount_sum], axis=1)
transcript_ct.columns=['transcriptid','sccount']
#### Now scale by total length of the transcripts
scscore=transcript_ct.merge(transcript_total, on='transcriptid', how='left')
scscore['score']=scscore['sccount']/(scscore['totalwordcount'])
scscore.drop(columns=['sccount', 'totalwordcount']).to_csv('/andydiwu/sites/data/sample_transcripts_scores.csv', index=False)
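The scaling step relies on each transcriptid appearing exactly once in both dataframes; a duplicated ID would silently duplicate rows in the merged output. A minimal sketch of one way to guard against that uses the `validate` argument of pandas' `merge`. The IDs and counts below are made up for illustration.

```python
import pandas as pd

counts = pd.DataFrame({"transcriptid": [501, 502, 503], "sccount": [3, 0, 5]})
totals = pd.DataFrame({"transcriptid": [501, 502, 503],
                       "totalwordcount": [6000, 4500, 10000]})

# validate="one_to_one" raises pandas.errors.MergeError if a transcriptid
# is duplicated on either side of the merge
scores = counts.merge(totals, on="transcriptid", how="left", validate="one_to_one")

# score = flagged supplier-word instances per word of transcript
scores["score"] = scores["sccount"] / scores["totalwordcount"]
print(scores)
```

Because the raw scores are small fractions, some users may prefer to rescale them (for example, per 1,000 words) before using them in regressions; that is a presentation choice and does not change the ranking of transcripts.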
There, you have computed a set of risk scores from these transcripts by counting supply-chain-related words within a 10-word distance of risk-related words and scaling the counts by transcript length. These scores can then be used as outcome or independent variables in empirical analyses.
Datasets used in other published and working papers can be downloaded here:
10-K word grouping data based on Loughran & McDonald (2011): Positive Words; Negative Words