Getting citation data from OpenAlex by DOI

The code in this notebook is a simplified version of material from a coding class at the University of Copenhagen Library. It shows how to retrieve total citations per paper and citations per year for a set of DOIs.

In this example, we hard-code the DOIs. Ideally, they would come from another source, e.g. a research information system or a database.

Step 1. Prepare the system and ready the DOIs

In this example, I picked five random papers from my reading library and listed their DOIs in a hard-coded variable.

In [26]:
import requests
import matplotlib.pyplot as plt

dois = ['10.1371/journal.pone.0073381',
        # ... add the remaining DOIs of the list here
       ]

# Example of how to read DOIs from a raw text file, one DOI per line. Uncomment (remove #) to use
#with open("doi.txt") as file:
#  dois = file.readlines()
#  dois = [doi.rstrip() for doi in dois]
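DOIs read from a text file or exported from a research information system often arrive with blank lines, resolver prefixes or mixed case. A minimal sketch of a clean-up step follows. The helper name clean_dois and the list of prefixes are assumptions for illustration and are not part of the class material. Lower-casing is safe because DOIs are case-insensitive.

```python
def clean_dois(lines):
    """Return unique, lower-cased bare DOIs in first-seen order."""
    # Prefixes that commonly precede a DOI (assumed list, extend as needed)
    prefixes = ("https://doi.org/", "http://doi.org/", "doi:")
    seen = set()
    cleaned = []
    for line in lines:
        doi = line.strip().lower()
        for prefix in prefixes:
            if doi.startswith(prefix):
                doi = doi[len(prefix):]
                break
        # Skip blank lines and DOIs that were already collected
        if doi and doi not in seen:
            seen.add(doi)
            cleaned.append(doi)
    return cleaned


raw = ["10.1371/journal.pone.0073381\n",
       "\n",
       "https://doi.org/10.1371/JOURNAL.PONE.0073381\n"]
print(clean_dois(raw))
```

Both non-blank lines refer to the same paper, so the sketch returns a single DOI.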

Step 2. Get citation data per DOI

We are interested in citations per paper (saved in cites) and citations per year (total for all papers, saved in cites_by_year). We use the single-entity retrieval method of OpenAlex, with the DOI as the ID.

If a DOI does not exist in OpenAlex, the request returns a 404 response, which we could use to report missing coverage more precisely. For this simple example, however, we just wrap the lookup in try-except and report the number of errors, e.
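The 404-based reporting mentioned above could look roughly like the sketch below. The function name split_by_coverage is hypothetical, and the URL form (the works endpoint with a "doi:" prefix on the ID) is an assumption about how the lookup is addressed. The HTTP getter is passed in as a parameter, so the logic can be tried without network access.

```python
def split_by_coverage(dois, get):
    """Fetch each DOI with `get` and separate found works from 404s."""
    found = {}
    missing = []
    for doi in dois:
        # Assumed URL form: OpenAlex single-work lookup with a "doi:" ID prefix
        response = get("https://api.openalex.org/works/doi:" + doi)
        if response.status_code == 404:
            missing.append(doi)      # DOI is not covered by OpenAlex
            continue
        response.raise_for_status()  # surface any other HTTP error
        found[doi] = response.json()
    return found, missing
```

With the imports of this notebook, it would be called as found, missing = split_by_coverage(dois, requests.get), after which missing lists exactly the DOIs that OpenAlex does not know.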

In [27]:
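cites_by_year = {}  # year -> citations summed over all papers
cites = []          # total citations per paper
e = 0               # number of DOIs that could not be processed

for doi in dois:
    try:
        # Single-entity lookup with the DOI as ID (assumed URL form: /works/doi:<DOI>)
        response = requests.get("https://api.openalex.org/works/doi:" + doi)
        result = response.json()
        cites.append(result["cited_by_count"])
        cbys = result["counts_by_year"]
        for cby in cbys:
            y = cby["year"]
            c = cby["cited_by_count"]
            if y in cites_by_year:
                cites_by_year[y] = cites_by_year[y] + c
            else:
                cites_by_year[y] = c
    except Exception:
        # Covers unknown DOIs (404), network problems and unexpected payloads
        e = e + 1
print("DOI's with error: " + str(e))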
DOI's with error: 0
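The per-year accumulation in the loop above can also be written with collections.Counter, which removes the if/else branch. A small offline sketch follows. The function name sum_by_year is hypothetical, and the records are made-up values in the shape of the counts_by_year field, not OpenAlex data.

```python
from collections import Counter


def sum_by_year(works):
    """Sum cited_by_count per year over a list of work records."""
    totals = Counter()
    for work in works:
        for entry in work["counts_by_year"]:
            # Counter treats a missing year as 0, so no membership test is needed
            totals[entry["year"]] += entry["cited_by_count"]
    return dict(totals)


# Illustrative records only
sample_works = [
    {"counts_by_year": [{"year": 2021, "cited_by_count": 4},
                        {"year": 2020, "cited_by_count": 2}]},
    {"counts_by_year": [{"year": 2021, "cited_by_count": 1}]},
]
print(sum_by_year(sample_works))
```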

A quick check of the results, to see whether they make sense:

In [28]:
print(cites)
print(cites_by_year)
[52, 39, 116, 23, 37]
{2022: 1, 2021: 19, 2020: 12, 2019: 25, 2018: 21, 2017: 20, 2016: 35, 2015: 23, 2014: 16, 2013: 22, 2012: 11}

Looks like everything works as intended. We could end here, but:

Step 3. A quick visualization of the results

First, citations per year:

In [29]:
# Note: matplotlib 3.6 and later name this style 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
cby = dict(sorted(cites_by_year.items()))
x = list(cby.keys())
y = list(cby.values())

plt.plot(x, y, '-o', color='#276FBF');
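A line plot connects neighbouring points, so a year that is absent from cites_by_year (for instance because no citations were recorded for it) is skipped rather than drawn as zero. If that matters, the gaps can be filled before plotting. A minimal sketch follows, where fill_missing_years is a hypothetical helper and the input values are illustrative.

```python
def fill_missing_years(counts):
    """Return a year -> count dict with absent years set to 0."""
    if not counts:
        return {}
    first, last = min(counts), max(counts)
    # Walk the full year range so every year between first and last is present
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


print(fill_missing_years({2019: 25, 2022: 1}))
```

The result is already ordered by year, so its keys and values can be passed to plt.plot directly.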

And now citations per paper, ranked by total citations:

In [31]:
cpp = sorted(cites, reverse = True)
x = list(range(1,len(cpp)+1))
plt.bar(x, cpp)
<BarContainer object of 5 artists>
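Sorting cites on its own drops the link between a count and its paper. The sketch below keeps the DOI attached to each count. It assumes that the DOI list and the count list are aligned and of equal length. The function name rank_papers is hypothetical, and the DOIs and counts in the example are placeholders.

```python
def rank_papers(dois, cites):
    """Return (doi, citations) pairs, most cited first."""
    pairs = zip(dois, cites)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


example_dois = ["10.1000/a", "10.1000/b", "10.1000/c"]  # placeholder DOIs
example_cites = [5, 12, 7]                               # placeholder counts

for rank, (doi, count) in enumerate(rank_papers(example_dois, example_cites), start=1):
    print(rank, doi, count)
```

The DOIs from the ranked pairs could then serve as tick labels on the bar chart instead of bare rank numbers.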