harin/title-length-citation-causal

Introduction

Recently, I was working on a project for an Exploratory Data Analysis and Visualization class, where we chose to analyze trends and relationships among computer science papers from the DBLP database. One thing we found is a negative correlation between title length and the number of citations. Since I recently took a Causal Inference class, this is a good opportunity to apply what I learned and see whether a shorter title in fact causes more citations.

Note that the data was taken from https://aminer.org/citation.

import pandas as pd
import numpy as np
from causality.estimation.parametric import PropensityScoreMatching
import statsmodels.api as sm
from graphviz import Digraph
%matplotlib inline
df = pd.read_pickle('dblp.pkl')
df['title_len_words'] = df.title.apply(lambda x: len(x.split(' ')))
df.sample(100000).plot('title_len_words', 'n_citation', kind='scatter');

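[Figure: scatter plot of title_len_words (x-axis) against n_citation (y-axis) for a random sample of 100,000 papers]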
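One caveat about the word count above: `x.split(' ')` splits on every single space, so doubled or trailing spaces in a title produce empty strings that are counted as words. The sketch below compares it with `str.split()` called without arguments, which splits on runs of whitespace. The helper name `count_words` and the sample title are made up for illustration; whether the DBLP titles actually contain such spacing has not been checked here.

```python
def count_words(title):
    """Count whitespace-separated words, ignoring repeated, leading or trailing spaces."""
    return len(title.split())

# Hypothetical title with a doubled space in two places and a trailing space.
title = "Deep  learning for  graphs "

naive = len(title.split(' '))   # also counts the empty strings between adjacent spaces
robust = count_words(title)     # counts only the actual words

print(naive, robust)
```

If the two counts differ for a noticeable share of titles, `title_len_words` would be inflated for exactly those titles, which is worth ruling out before interpreting the scatter plot.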

To investigate whether title length causes the number of citations to change, we first need a causal graph that we believe represents the system. One of the simplest graphs is one in which title length affects the number of citations, but both are confounded by the author. This is represented by the graph below:

g = Digraph()
g.node('D', 'Title length')
g.node('Z', 'Author')
g.node('Y', 'Citations')
g.edges(['DY', 'ZD', 'ZY'])
g

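[Figure: causal graph with edges Title length → Citations, Author → Title length, and Author → Citations]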
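To see why the Author node matters, here is a small simulation of the graph above. It is a sketch on synthetic data, not the DBLP data: the author-level trait `z`, the coefficients, and the helper names `slope` and `demean_by_group` are all assumptions chosen for illustration. Title length has no effect on citations in this simulation, yet the raw regression slope comes out clearly negative because the author drives both variables. Removing each author's mean from both variables brings the slope back to roughly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_authors, papers_per_author = 500, 20

# Hypothetical author-level trait Z that shortens titles and raises citations.
z = rng.normal(size=n_authors)
author = np.repeat(np.arange(n_authors), papers_per_author)

# D depends on the author; Y depends on the author only (true effect of D on Y is 0).
title_len = 10 - 2 * z[author] + rng.normal(size=author.size)
citations = 20 + 5 * z[author] + rng.normal(size=author.size)

def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))

def demean_by_group(values, groups, n_groups):
    """Subtract each group's mean from the values belonging to that group."""
    sums = np.bincount(groups, weights=values, minlength=n_groups)
    counts = np.bincount(groups, minlength=n_groups)
    return values - (sums / counts)[groups]

naive = slope(title_len, citations)
adjusted = slope(demean_by_group(title_len, author, n_authors),
                 demean_by_group(citations, author, n_authors))

print(f"naive slope: {naive:.2f}, within-author slope: {adjusted:.2f}")
```

The point is only that a negative correlation like the one in the scatter plot is compatible with no causal effect at all when the graph has an Author → Title length and an Author → Citations edge.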

df.shape
(3079007, 8)

There are two problems. First, our causal variable "Title length" is not binary, which departs from the setting I learned in class, so the result might not be correct. Second, "Author" is categorical, so we have to convert it to dummy variables, which limits how many authors we can include in the analysis.
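On the second problem, one way around the dummy-variable limit is the within transformation: subtracting each author's mean from both variables and regressing the demeaned values gives the same slope as a regression with one dummy per author (the Frisch-Waugh-Lovell result), without ever building the dummy matrix. The sketch below checks this on a synthetic table; the data frame `toy`, its 50 authors, and the coefficients are assumptions made up for the example, and this is not what the notebook itself runs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows = 2000

# Synthetic stand-in: 50 hypothetical authors, true title-length slope of -0.3.
toy = pd.DataFrame({
    "author": rng.integers(0, 50, size=n_rows),
    "title_len_words": rng.integers(3, 25, size=n_rows).astype(float),
})
toy["n_citation"] = 0.5 * toy["author"] - 0.3 * toy["title_len_words"] + rng.normal(size=n_rows)

# (1) Explicit dummies: one column per author plus the title length (last column).
X = pd.concat([pd.get_dummies(toy["author"], dtype=float), toy[["title_len_words"]]], axis=1)
coefs = np.linalg.lstsq(X.to_numpy(), toy["n_citation"].to_numpy(), rcond=None)[0]
beta_dummies = float(coefs[-1])

# (2) Within transformation: remove author means, then a one-variable regression.
by_author = toy.groupby("author")
d = toy["title_len_words"] - by_author["title_len_words"].transform("mean")
y = toy["n_citation"] - by_author["n_citation"].transform("mean")
beta_within = float((d * y).sum() / (d ** 2).sum())

print(beta_dummies, beta_within)
```

Only the point estimate is compared here. Standard errors from the demeaned regression need a degrees-of-freedom correction for the estimated author means before they match the dummy-variable regression.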

Data Wrangling

df.head()
| | authors | id | n_citation | references | title | venue | year | title_len_words |
|---|---|---|---|---|---|---|---|---|
| 0 | [Makoto Satoh, Ryo Muramatsu, Mizue Kayama, Ka... | 00127ee2-cb05-48ce-bc49-9de556b93346 | 0 | [51c7e02e-f5ed-431a-8cf5-f761f266d4be, 69b625b... | Preliminary Design of a Network Protocol Learn... | international conference on human-computer int... | 2013 | 26 |
| 1 | [Gareth Beale, Graeme Earl] | 001c58d3-26ad-46b3-ab3a-c1e557d16821 | 50 | [10482dd3-4642-4193-842f-85f3b70fcf65, 3133714... | A methodology for the physically accurate visu... | visual analytics science and technology | 2011 | 11 |
| 2 | [Altaf Hossain, Faisal Zaman, Mohammed Nasser,... | 001c8744-73c4-4b04-9364-22d31a10dbf1 | 50 | [2d84c0f2-e656-4ce7-b018-90eda1c132fe, a083a1b... | Comparison of GARCH, Neural Network and Suppor... | pattern recognition and machine intelligence | 2009 | 14 |
| 3 | [Jea-Bum Park, Byungmok Kim, Jian Shen, Sun-Yo... | 00338203-9eb3-40c5-9f31-cbac73a519ec | 0 | [8c78e4b0-632b-4293-b491-85b1976675e6, 9cdc54f... | Development of Remote Monitoring and Control D... | | 2011 | 16 |
| 4 | [Giovanna Guerrini, Isabella Merlo] | 0040b022-1472-4f70-a753-74832df65266 | 2 | NaN | Reasonig about Set-Oriented Methods in Object ... | | 1998 | 7 |
df.shape
(3079007, 8)
is_list = df.authors.apply(lambda x: isinstance(x, list))
df[~is_list]
| | authors | id | n_citation | references | title | venue | year | title_len_words |
|---|---|---|---|---|---|---|---|---|
| 1105528 | NaN | 634390c1-d4cf-4192-b55f-8ade3af72a08 | 0 | [11e73009-08b3-4361-bd8d-89312b7cc7fa, 866260c... | Elastogram estimation using adaptive-length Sa... | biomedical engineering and informatics | 2011 | 7 |
| 1174531 | NaN | 736b86b6-715c-4b3b-8d38-d6cd03709bb2 | 1 | NaN | Pose Invariant Face Recognition by Face Synthe... | british machine vision conference | 2000 | 7 |
| 2594452 | NaN | 9c4cf6a4-3d7a-4892-9acd-dc30336c73f1 | 1 | [1a6ecea3-bad3-4139-8c15-9a35247b8be4, 93cffd7... | An efficient intra-mode decision method for HEVC | Signal, Image and Video Processing | 2016 | 7 |
| 2881547 | NaN | 81d297e4-0b71-4f14-81e6-7fe51abd1d31 | 3 | NaN | Durchgängiges modellbasiertes Engineering von ... | Automatisierungstechnik | 2016 | 5 |
# remove unused columns and remove rows without authors
df = df[['authors', 'n_citation', 'title_len_words']] 
df = df[is_list]
df.head()
| | authors | n_citation | title_len_words |
|---|---|---|---|
| 0 | [Makoto Satoh, Ryo Muramatsu, Mizue Kayama, Ka... | 0 | 26 |
| 1 | [Gareth Beale, Graeme Earl] | 50 | 11 |
| 2 | [Altaf Hossain, Faisal Zaman, Mohammed Nasser,... | 50 | 14 |
| 3 | [Jea-Bum Park, Byungmok Kim, Jian Shen, Sun-Yo... | 0 | 16 |
| 4 | [Giovanna Guerrini, Isabella Merlo] | 2 | 7 |
df['num_authors'] = df.authors.apply(len)
df.num_authors.sum()
9476165
# unroll the authors column: one (author, paper index) pair per author of each paper.
# df already contains only rows with an author list, so no extra mask is needed, and
# keeping the original index labels lets the join further down line up with df.
rows = []
for paper_index, authors in df.authors.items():
    for author in authors:
        rows.append([author, paper_index])
author_df = pd.DataFrame(rows)
author_df.head()
| | 0 | 1 |
|---|---|---|
| 0 | Makoto Satoh | 0 |
| 1 | Ryo Muramatsu | 0 |
| 2 | Mizue Kayama | 0 |
| 3 | Kazunori Itoh | 0 |
| 4 | Masami Hashimoto | 0 |
author_df.columns = ['author', 'index']
author_df = author_df.set_index('index') 
author_df.head()
| index | author |
|---|---|
| 0 | Makoto Satoh |
| 0 | Ryo Muramatsu |
| 0 | Mizue Kayama |
| 0 | Kazunori Itoh |
| 0 | Masami Hashimoto |
df = author_df.join(df)[['author', 'title_len_words', 'n_citation', 'num_authors']].reset_index()
df.head()
| | index | author | title_len_words | n_citation | num_authors |
|---|---|---|---|---|---|
| 0 | 0 | Makoto Satoh | 26.0 | 0.0 | 8.0 |
| 1 | 0 | Ryo Muramatsu | 26.0 | 0.0 | 8.0 |
| 2 | 0 | Mizue Kayama | 26.0 | 0.0 | 8.0 |
| 3 | 0 | Kazunori Itoh | 26.0 | 0.0 | 8.0 |
| 4 | 0 | Masami Hashimoto | 26.0 | 0.0 | 8.0 |
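The unroll-and-join steps above can also be sketched with `DataFrame.explode`, which repeats each row once per list element and keeps the row's index label. This assumes pandas 0.25 or later, which may be newer than the version that produced the outputs shown here. The three-paper table `papers` and its author names are placeholders invented for the example.

```python
import pandas as pd

# Placeholder data: three papers with made-up author names.
papers = pd.DataFrame({
    "authors": [["A. Author", "B. Author"], ["A. Author"], ["C. Author", "B. Author", "D. Author"]],
    "n_citation": [50, 0, 2],
    "title_len_words": [11, 26, 7],
})
papers["num_authors"] = papers["authors"].apply(len)

# One row per (paper, author); the paper's index label is carried along,
# so no separate join back to the paper table is needed.
long = (papers.explode("authors")
              .rename(columns={"authors": "author"})
              .rename_axis("index")
              .reset_index())

print(long)
```

Because every exploded row keeps the label of the paper it came from, the paper-level columns cannot drift out of alignment with the author column.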
# Convert author to integers
df.author = df.author.astype('category')
df.author = df.author.cat.rename_categories(np.arange(len(df.author.cat.categories)))
df.head()
| | index | author | title_len_words | n_citation | num_authors |
|---|---|---|---|---|---|
| 0 | 0 | 967189 | 26.0 | 0.0 | 8.0 |
| 1 | 0 | 1344877 | 26.0 | 0.0 | 8.0 |
| 2 | 0 | 1091158 | 26.0 | 0.0 | 8.0 |
| 3 | 0 | 831977 | 26.0 | 0.0 | 8.0 |
| 4 | 0 | 1021921 | 26.0 | 0.0 | 8.0 |
paper_count = df[['author', 'n_citation']].groupby('author').count()
# filter out authors with only 1 paper; not sure this is necessary because we are not doing any matching, but
# intuitively, it is probably better to have authors whose papers span several title lengths to regress on
valid_authors = paper_count[paper_count.n_citation > 1]
# keep only authors with more than one paper
df = df[df.author.isin(valid_authors.index)]
# remove rows without data
df = df.dropna()
df.shape
(8577232, 5)
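The same "more than one paper" filter can be sketched without the intermediate `paper_count` table by broadcasting each author's row count back onto the rows with `groupby(...).transform('size')`. The small table `long` below is placeholder data. One difference worth noting: `size` counts every row, whereas the `count()` used above skips rows where `n_citation` is missing.

```python
import pandas as pd

# Placeholder author-paper table: author "b" has a single paper.
long = pd.DataFrame({
    "author": ["a", "a", "b", "c", "c", "c"],
    "n_citation": [50, 0, 2, 1, 3, 0],
})

# Number of rows (papers) per author, repeated on each of that author's rows.
papers_per_author = long.groupby("author")["n_citation"].transform("size")

# Keep only the rows of authors with more than one paper.
multi = long[papers_per_author > 1]

print(multi)
```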
df.author.cat.categories
Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            1766536, 1766537, 1766538, 1766539, 1766540, 1766541, 1766542,
            1766543, 1766544, 1766545],
           dtype='int64', length=1766546)

Still over a million authors, far too many to turn into dummy variables. Let's run the regression on a random sample of authors instead.

def regress(n):
    # sample n authors that actually appear in df (unused category labels are skipped)
    author_sample = df.author.drop_duplicates().sample(n)
    # copy so the column assignments below do not write to a slice of df
    df_sample = df[df.author.isin(author_sample)].copy()
    # re-encode the sampled authors as consecutive integers
    df_sample.author = df_sample.author.astype(int).astype('category')
    df_sample.author = df_sample.author.cat.rename_categories(np.arange(len(df_sample.author.cat.categories)))

    # one dummy per author plus the title length; the full set of author dummies
    # already spans the intercept, so no separate constant is added
    X = pd.concat([pd.get_dummies(df_sample.author, prefix='author', dtype=float), df_sample['title_len_words']], axis=1)
    y = df_sample['n_citation']
    model = sm.OLS(y, X)

    fit = model.fit()
    return fit
regress(100).conf_int().loc['title_len_words']
0   -2.130604
1    0.512479
Name: title_len_words, dtype: float64
regress(1000).conf_int().loc['title_len_words']
0   -3.208329
1    0.215463
Name: title_len_words, dtype: float64

Conclusion

If OLS is a valid way to remove confounding between "Title length" and "Number of Citations", then this experiment has not shown that a shorter title results in more citations: the 95% confidence interval for the title-length coefficient includes zero in both runs above, so the negative estimate may still be due to chance.

More importantly, we first need to verify whether OLS is in fact a good estimator of the causal effect in the presence of a non-binary causal state and a categorical confounder. Furthermore, we are using only a subset of the data here because of the limit imposed by the categorical variable. The samples we ran give conflicting results: some reject the null hypothesis, while others do not.
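One way to make the sample-to-sample disagreement visible is to repeat the estimate over several seeded author samples and compare the intervals. The sketch below does this on synthetic data with a known slope; the table `data`, its 2,000 authors, the noise level, and the helper `within_author_ci` are assumptions for illustration, not the DBLP data or the `regress` function above. The interval uses author-demeaned variables and a normal critical value of 1.96, which approximates the t-based interval from statsmodels when the samples are large.

```python
import numpy as np
import pandas as pd

def within_author_ci(data, z=1.96):
    """Approximate 95% CI for the title-length slope after removing author means."""
    by_author = data.groupby("author")
    d = data["title_len_words"] - by_author["title_len_words"].transform("mean")
    y = data["n_citation"] - by_author["n_citation"].transform("mean")
    beta = (d * y).sum() / (d ** 2).sum()
    resid = y - beta * d
    # Degrees of freedom: rows minus one mean per author minus the slope.
    dof = len(data) - data["author"].nunique() - 1
    se = np.sqrt((resid ** 2).sum() / dof / (d ** 2).sum())
    return float(beta - z * se), float(beta + z * se)

# Synthetic stand-in: 2,000 hypothetical authors, true slope -0.2, noisy citations.
rng = np.random.default_rng(42)
n_rows = 40000
author = rng.integers(0, 2000, size=n_rows)
title_len = rng.integers(3, 25, size=n_rows).astype(float)
citations = 0.01 * author - 0.2 * title_len + rng.normal(scale=20, size=n_rows)
data = pd.DataFrame({"author": author, "title_len_words": title_len, "n_citation": citations})

# Repeat the estimate on five seeded samples of 100 authors each.
authors = pd.Series(data["author"].unique())
for seed in range(5):
    sample = authors.sample(100, random_state=seed)
    lo, hi = within_author_ci(data[data["author"].isin(sample)])
    print(f"seed={seed}: [{lo:.3f}, {hi:.3f}] excludes zero: {not (lo <= 0 <= hi)}")

# The same estimate on the full synthetic table, for comparison.
full_lo, full_hi = within_author_ci(data)
print(f"all authors: [{full_lo:.3f}, {full_hi:.3f}]")
```

With a small true effect and noisy citation counts, samples of 100 authors can land on either side of significance, while the interval from all authors is much narrower. Fixing the seed at least makes each run reproducible, so a conflicting result can be re-examined rather than lost.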

About

Causal Analysis Experiment of Academic Paper's Title length on Number of Citation
