Machine learning can accurately predict a scientist’s gender based on citation data alone

Collective effect: Gender differences in citation networks may be due to a “rich get richer” effect where better-known researchers get more credit. (Courtesy: Shutterstock/aelitta)

Women and men have such different citation patterns that it is possible to accurately predict a scientist’s gender from such data alone. That’s the finding of a new study that investigates how men and women cite – and are cited by – their communities (Proc. Natl. Acad. Sci. 119 e2206070119).

Led by network scientist Kristina Lerman from the University of Southern California, the authors studied 766 members of the US National Academy of Sciences (NAS), 120 of whom were women. They matched the scholars to their profiles on Microsoft Academic Graph, which contains metadata on over 150 million academic publications.

After identifying the scientists’ genders by checking the pronouns used in each individual’s biography, the researchers created an “ego citation network” for each scientist. This contained “directional links”, indicating which other scientists – represented by nodes – the individual had cited, and which scientists had cited them.

It is well known that female scientists receive fewer citations than their male counterparts, but the new study reveals that women reciprocate a significantly higher fraction of citations than men do. A woman’s network also has more “connectedness”, suggesting that women tend to work in more closely knit research communities.
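To make the idea of reciprocity concrete, here is a minimal sketch of how it could be computed for one scientist's ego network. It assumes that reciprocity means the fraction of scientists the individual cites who also cite them back; the study may define the measure differently. The names `reciprocity` and `edges`, and the toy data, are hypothetical.

```python
def reciprocity(ego, citations):
    """Fraction of the scientists cited by `ego` who also cite `ego` back.

    `citations` is a list of directional links (citing, cited).
    This definition is an assumption made for illustration only.
    """
    cites_out = {cited for citing, cited in citations if citing == ego}
    cited_by = {citing for citing, cited in citations if cited == ego}
    if not cites_out:
        return 0.0
    return len(cites_out & cited_by) / len(cites_out)


# Toy directional links: A cites B and C, B and D cite A.
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("D", "A")]

print(reciprocity("A", edges))  # 0.5: only B returns A's citation
```

In this toy network, A cites two colleagues and one of them cites A in return, so half of A's citations are reciprocated.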

The study also found that women have fewer peers – though these tend to be highly productive colleagues – and that women have a greater proportion of female scientists in their networks.

Rich get richer

The researchers then trained a machine learning algorithm on a randomly selected 75% of the data. Using the other 25% to test the system, they found that the algorithm can accurately predict a scientist’s gender based on the citation networks – correctly doing so about 80% of the time.
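The train-and-test procedure can be sketched roughly as follows. This is not the researchers' code: the function names, the random seed and the placeholder scientist IDs are all assumptions, and the classifier itself is left out. The sketch only shows a random 75/25 split and how a "correct about 80% of the time" figure is calculated from predictions.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=0):
    """Randomly split `items` into a training part and a held-out test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Placeholder IDs standing in for the 766 scientists in the study.
scientist_ids = list(range(766))
train_ids, test_ids = split_train_test(scientist_ids)

print(len(train_ids), len(test_ids))  # 574 192
```

A model would be fitted on the networks of the scientists in `train_ids` only, and `accuracy` would then be applied to its predictions for those in `test_ids`, which it has never seen.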

The citation networks showed few significant differences based on the prestige of an author’s affiliated institution, although NAS membership is highly skewed towards more prestigious institutes. The researchers also found that women are under-represented across all seven fields they looked at. Just 8% of NAS physicists were women – the lowest proportion of all the fields studied.

Lerman thinks the gender differences in citation networks could come down to two factors. “There is a preference by both genders to cite men, and preferential attachment – or the ‘rich get richer’ effect – is the well-known mechanism of rewards in science, where the already better-known researchers get more credit,” she says. “We are now working on a manuscript that shows how a large gender disparity can emerge from these components.”
