Incident 135: University of Texas at Austin’s Algorithm to Evaluate Graduate Applications, GRADE, Allegedly Exacerbated Existing Inequality for Marginalized Applicants, Prompting Tool Suspension

Description: The University of Texas at Austin Department of Computer Science's use of an assistive algorithm to assess PhD applicants raised concerns about exacerbating historical inequalities faced by marginalized groups, prompting its suspension.

Suggested citation format

Hall, Patrick. (2012-12-01) Incident Number 135. in McGregor, S. (ed.) Artificial Intelligence Incident Database. Responsible AI Collaborative.

Incident Stats

Incident ID
135
Report Count
2
Incident Date
2012-12-01
Editors
Sean McGregor, Khoa Lam

Incident Reports

A university announced it had ditched its machine-learning tool, used to filter thousands of PhD applications, right as the software's creators were giving a talk about the code, which was drawing public criticism.

The GRADE algorithm was developed by a pair of academics at the University of Texas at Austin, and it was used from 2013 to this year to assess those applying for a PhD at the US college's respected computer-science department. The software was trained using the details of previously accepted students, the idea being to teach the system to identify people the school would favor, and to highlight them to staff who would make the final call on the applications. It's likely the program picked up biases against applicants of certain backgrounds excluded from that historical data.

Hopefuls were assigned a score from zero to five by the code, and those with high scores were pushed forward to university staff by GRADE. The software, according to its creators in a paper describing the technology, “reduced the number of full reviews required per applicant by 71 percent and, by a conservative estimate, cut the total time spent reviewing files by at least 74 percent.” That means poorly scored applicants were given less attention by staff.

The compsci department has now distanced itself from the GRADE algorithm, first saying the code had the potential to pick up unfair biases, and later saying it was difficult to maintain. “The University of Texas at Austin’s Department of Computer Science stopped using the graduate admissions evaluator (GRADE) in early 2020,” a spokesperson told The Register in a statement on Monday.

“The system was used to organize graduate admissions in the Department of Computer Science between the 2013 and 2019 academic years. Researchers developed the statistical system in response to a high volume of applicants for graduate programs in the department. It was never used to make decisions to admit or reject prospective students, as at least one person in the department directly evaluates applicants at each stage of the review process.

“Changes in the data and software environment made the system increasingly difficult to maintain, and its use was discontinued. The graduate school works with graduate programs and faculty members across campus to promote efficient and effective holistic application reviews.”

The decision earlier this year to stop using GRADE to screen computer-science PhD candidates, however, was only announced by the department on Twitter last week after plasma physicist Yasmeen Musthafa drew attention to potential flaws in the statistical machine-learning software. Musthafa tweeted their widely shared criticism on November 30, the day before the creators of GRADE were due to give a presentation about their code at a virtual event arranged by the University of Maryland’s Department of Physics. On the day of the lecture, UT Austin tweeted it had abandoned the software:

TXCS is deeply committed to addressing the lack of diversity in our field. We are aware of the potential to encode bias into ML-based systems like GRADE, which is why we have phased out our reliance on GRADE and are no longer using it as part of our graduate admissions process.

— Computer Science at UT Austin (@UTCompSci) December 1, 2020

In fact, this damage-limitation move was made when GRADE's designers – Austin Waters and Risto Miikkulainen – were still presenting their work on the software to colleagues via Zoom. Although the presentation is not generally available, technical details and the effects of GRADE were shared in the form of a paper published in AI Magazine in 2014.

GRADE is trained on various features to rank applicants, including their GPA, the universities they previously attended, letters of recommendation, area of research interest, and the faculty advisor they wish to study under. The algorithm compares this information against PhD students the department has previously accepted to predict whether an applicant is likely to be granted a place. GRADE is designed to weed out weaker prospective applicants so that the university spends less time considering every application in full. In other words, it acts as a screening process, helping the department focus on the applicants who seem most promising.
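For a sense of how such a screener operates, here is a minimal sketch in Python of a logistic-regression classifier of the kind the developers describe, with the predicted admit probability rescaled to a zero-to-five score. The feature names, data, and score mapping are illustrative assumptions, not UT Austin's actual code.

```python
# A minimal, hypothetical sketch of a GRADE-style screener: a logistic
# regression fit to past committee decisions, whose predicted admit
# probability is mapped onto a zero-to-five score. The features, data,
# and score mapping here are illustrative assumptions, not UT's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per past applicant: GPA, institution tier (2 = "elite",
# 1 = "good", 0 = "other"), and a crude count of strong words in the
# recommendation letters; y holds the committee's historical decision.
X_train = np.array([
    [3.9, 2, 5],
    [3.1, 0, 1],
    [3.7, 1, 3],
    [2.8, 0, 0],
])
y_train = np.array([1, 0, 1, 0])  # 1 = admitted, 0 = rejected

model = LogisticRegression().fit(X_train, y_train)

def grade_score(features):
    """Rescale the predicted probability of admission to a 0-5 score."""
    prob_admit = model.predict_proba(np.array([features]))[0, 1]
    return round(5 * prob_admit, 2)

# Reviewers would then work through applicants in descending score order,
# spending full reviews on borderline cases rather than on every file.
print(grade_score([3.5, 1, 2]))
```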

“While every application is still looked at by a human reviewer," the 2014 paper noted, "GRADE makes the review process much more efficient. This is for two reasons. First, GRADE reduces the total number of full application reviews the committee must perform. Using the system’s predictions, reviewers can quickly identify a large number of weak candidates who will likely be rejected and a smaller number of exceptionally strong candidates."

UT Austin's computer-science department is ranked in the top ten of its ilk, and thousands of students fight for a place in its graduate programs.

When The Register asked if applicants were explicitly told that their applications were screened by an algorithm, and that the university stored their data to retrain and improve its system for the following year, UT Austin declined to answer the question. GRADE does not appear to have been rolled out for other departments nor at other universities.

Professor Miikkulainen, who helped invent the GRADE algorithm, said the tool was not biased against race or gender.

“To the degree we could measure bias, we found that the process did not add biases,” he told The Register. "Back in 2013, bias was not yet a mainstream topic in AI, and there were few techniques available, but our choice of learning method created an opportunity: The logistic regression model learns to assign weights on features according to how important they are in decision making.

“We did a separate experiment where we included gender and ethnic origin, and found that GRADE assigned zero weights to them – in other words, these features had no predictive power, ie: reviewers had not used them in making decisions. So to the extent it was possible to measure then, GRADE was unbiased in those respects.”
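For illustration only, a check of the kind Miikkulainen describes might look like the sketch below: refit the model with gender and ethnic origin included and read off the learned weights. The data and feature names are hypothetical, and this is not the developers' code.

```python
# A hypothetical sketch of the weight-inspection check described above:
# include protected attributes as features, then examine the coefficients
# the logistic regression assigns to them. Data and names are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["gpa", "institution_tier", "strong_letter_words",
                 "gender", "ethnic_origin"]
X = np.array([
    [3.9, 2, 5, 1, 0],
    [3.1, 0, 1, 0, 1],
    [3.7, 1, 3, 0, 0],
    [2.8, 0, 0, 1, 1],
])
y = np.array([1, 0, 1, 0])  # past committee decisions

model = LogisticRegression().fit(X, y)
for name, weight in zip(feature_names, model.coef_[0]):
    print(f"{name:>20s}: {weight:+.4f}")

# In the developers' experiment the weights on gender and ethnic origin
# came out at zero, i.e. the recorded decisions did not depend on them
# directly; a zero weight does not rule out bias entering via correlated
# features such as institution names or letter wording.
```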

Nevertheless, the university has pledged to stop using GRADE in its graduate admissions process over fears that it could be biased, and that opinion is echoed by other academics.

“I was listening in the talk, and it is true that during the talk the UT Austin compsci department tweeted that because of concerns of fairness, they would no longer be using it,” Steve Rolston, a physics professor at the University of Maryland, told The Register.

Concerned about GRADE’s potential to damage a student’s application, he sent an email out assuring students that the system would not be rolled out at Maryland. “According to the speakers, the point of GRADE was to replicate the decisions of their admissions committee, and in fact it was trained on previous admissions committee data," he said.

"While it is possible that it was successful at that specific task, it would simply be replicating any biases that existed in the committees decisions, let alone the fact that [machine-learning] algorithms do not really give one any guidance on how they are classifying things. When they used GRADE, its results were always checked by a human, but I would be concerned that if you are told the algorithm rated someone low, it would inevitably color your opinion and was thus not necessarily a good check on the system.

“While [machine learning] is fine to do image classification for example, I think it is very dangerous to use it for things such as hiring or admissions. When we are admitting someone to the graduate program we are evaluating their potential to be successful, given a limited set of input data, much of which is subjective, [for example] letters of recommendation. There is no quantitative process to make such identifications, so an algorithm is unlikely to be helpful.”

Miikkulainen confirmed to El Reg that UT Austin has no plans to deploy another machine-learning algorithm to process applications in the future.

Uni revealed it killed off its PhD-applicant screening AI – just as its inventors gave a lecture about the tech

U of Texas at Austin has stopped using a machine-learning system to evaluate applicants for its Ph.D. in computer science. Critics say the system exacerbates existing inequality in the field.

In 2013, the University of Texas at Austin’s computer science department began using a machine-learning system called GRADE to help make decisions about who gets into its Ph.D. program -- and who doesn’t. This year, the department abandoned it.

Before the announcement, which the department released in the form of a tweet reply, few had even heard of the program. Now, its critics -- concerned about diversity, equity and fairness in admissions -- say it should never have been used in the first place.

“Humans code these systems. Humans are encoding their own biases into these algorithms,” said Yasmeen Musthafa, a Ph.D. student in plasma physics at the University of California, Irvine, who rang alarm bells about the system on Twitter. “What would UT Austin CS department have looked like without GRADE? We’ll never know.”

GRADE (which stands for GRaduate ADmissions Evaluator) was created by a UT faculty member and UT graduate student in computer science, originally to help the graduate admissions committee in the department save time. GRADE predicts how likely the admissions committee is to approve an applicant and expresses that prediction as a numerical score out of five. The system also explains what factors most impacted its decision.

The UT researchers who made GRADE trained it on a database of past admissions decisions. The system uses patterns from those decisions to calculate its scores for candidates.

For example, letters of recommendation containing the words “best,” “award,” “research” or “Ph.D.” are predictive of admission -- and can lead to a higher score -- while letters containing the words “good,” “class,” “programming” or “technology” are predictive of rejection. A higher grade point average means an applicant is more likely to be accepted, as does the name of an elite college or university on the résumé. Within the system, institutions were encoded into the categories “elite,” “good” and “other,” based on a survey of UT computer science faculty.
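To make that encoding concrete, the following is a rough sketch, under assumptions, of how letters might be reduced to word counts and institutions to tier codes; only the word lists and tier labels come from the reporting above, and everything else is hypothetical.

```python
# A hypothetical sketch of the feature encoding described above: letters
# of recommendation reduced to counts of specific words, and institutions
# bucketed into "elite", "good" and "other" tiers. Not UT Austin's code.
from sklearn.feature_extraction.text import CountVectorizer

# Words the reporting lists as predictive of admission or of rejection.
vocabulary = ["best", "award", "research", "phd",
              "good", "class", "programming", "technology"]
vectorizer = CountVectorizer(vocabulary=vocabulary)

letters = [
    "She won a best-paper award for her research and is ready for a PhD.",
    "He did well in class and enjoys programming and technology.",
]
letter_features = vectorizer.transform(letters).toarray()

# Institution tiers as reported from the faculty survey, encoded as integers.
institution_tier = {"elite": 2, "good": 1, "other": 0}

print(letter_features)            # per-letter word counts fed to the model
print(institution_tier["elite"])  # numeric tier code for an applicant's school
```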

Every application GRADE scored during the seven years it was in use was still reviewed by at least one human committee member, UT Austin has said, but sometimes only one. Before GRADE, faculty members made multiple review passes over the pool. The system saved the committee time, according to its developers, by allowing faculty to focus on applicants on the cusp of admission or rejection and review applicants in descending order of quality.

For what it’s worth, GRADE did appear to successfully save the committee time. In the 2012 and 2013 application seasons, developers said in a paper about their work, it reduced the number of full reviews per candidate by 71 percent and cut the total time reviewing files by 74 percent. (One full review typically takes 10 to 30 minutes.) Between the years 2000 and 2012, applications to the computer science Ph.D. program grew from about 250 to nearly 650, though the number of faculty able to review those applications remained mostly constant. In the years since 2012, the number of applications has reached over 1,200.
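As a back-of-the-envelope check on what those figures imply at today's volumes, the arithmetic below applies the reported 74 percent reduction and the midpoint of the 10-to-30-minute review time to 1,200 applications; extrapolating the 2012-2013 percentages forward is an assumption.

```python
# Back-of-the-envelope arithmetic using the figures reported above;
# applying the 2012-2013 percentages to today's volume is an assumption.
applications = 1200              # recent annual application volume
minutes_per_full_review = 20     # midpoint of the reported 10-30 minutes
time_reduction = 0.74            # reported cut in total review time

baseline_hours = applications * minutes_per_full_review / 60
hours_saved = baseline_hours * time_reduction

print(f"Full review of every file: ~{baseline_hours:.0f} hours")
print(f"Hours saved at a 74% cut:  ~{hours_saved:.0f} hours")
# Roughly 400 hours of baseline review, about 296 of them saved.
```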

The university’s use of the technology escaped attention for a number of years, until this month, when the physics department at the University of Maryland at College Park held a colloquium talk with the two creators of GRADE.

The talk gained attention on Twitter as graduate students accused GRADE’s creators of further disadvantaging underrepresented groups in the computer science admissions process.

“We put letters of recommendation in to try to lift people up who have maybe not great GPAs. We put a personal statement in the graduate application process to try to give marginalized folks a chance to have their voice heard,” said Musthafa, who is also a member of the Physics and Astronomy Anti-Racism Coalition. “The worst part about GRADE is that it throws that out completely.”

Advocates have long been concerned about the potential for human biases to be baked into or exacerbated by machine-learning algorithms. Algorithms are trained on data. When it comes to people, what those data look like is a result of historical inequity. Preferences for one type of person over another are often the result of conscious or unconscious bias.

That hasn’t stopped institutions from using machine-learning systems in hiring, policing and prison sentencing for a number of years now, often to great controversy.

“Every process is going to make some mistakes. The question is, where are those mistakes likely to be made and who is likely to suffer as a result of them?” said Manish Raghavan, a computer science Ph.D. candidate at Cornell University who has researched and written about bias in algorithms. “Likely those from underrepresented groups or people who don’t have the resources to be attending elite institutions.”

Though many women and people who are Black and Latinx have had successful careers in computer science, those groups are underrepresented in the field at large. In 2017, whites, Asians and nonresident aliens received 84 percent of degrees awarded for computer science in the United States.

At UT, nearly 80 percent of undergraduates in computer science in 2017 were men.

Raghavan said he was surprised that there appeared to be no effort to audit the impacts of GRADE, such as how scores differ across demographic groups.
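The kind of audit Raghavan describes need not be elaborate; a minimal sketch, on entirely invented data, is shown below. No comparable audit of GRADE's actual scores appears to have been run.

```python
# A hypothetical sketch of a disparity audit: compare the distribution of
# screening scores across demographic groups. The data here are invented;
# no comparable audit of GRADE's real scores appears to have been done.
import statistics
from collections import defaultdict

scored_applicants = [
    {"group": "A", "score": 4.1}, {"group": "A", "score": 3.8},
    {"group": "B", "score": 2.9}, {"group": "B", "score": 3.2},
]

scores_by_group = defaultdict(list)
for applicant in scored_applicants:
    scores_by_group[applicant["group"]].append(applicant["score"])

for group, scores in sorted(scores_by_group.items()):
    mean = statistics.mean(scores)
    print(f"group {group}: mean score {mean:.2f} over {len(scores)} applicants")

# A persistent gap between groups would warrant investigation, though
# similar means alone would not establish that the scores are unbiased.
```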

GRADE’s creators have said that the system is only programmed to replicate what the admissions committee was doing prior to 2013, not to make better decisions than humans could. The system isn’t programmed to use race or gender to make its predictions, they’ve said. In fact, when given those features as options to help make its predictions, it chooses to give them zero weight. GRADE’s creators have said this is evidence that the committee’s decisions are gender and race neutral.

Detractors have countered this, arguing that race and gender can be encoded into other features of the application that the system uses. Women’s colleges and historically Black universities may be undervalued by the algorithm, they’ve said. Letters of recommendation are known to reflect gender bias, as recommenders are more likely to describe female students as “caring” rather than “assertive” or “trailblazing.”

In the Maryland talk, faculty raised their own concerns. What a committee is looking for might change each year. Letters of recommendation and personal statements should be thoughtfully considered, not turned into a bag of words, they said.

“I’m kind of shocked you did this experiment on your students,” Steve Rolston, chair of the physics department at Maryland, said during the talk. “You seem to have built a model that builds in whatever bias your committee had in 2013 and you’ve been using it ever since.”

In an interview, Rolston said graduate admissions can certainly be a challenge. His department receives over 800 graduate applications per year, which takes a good deal of time for faculty to evaluate. But, he said, his department would never use a tool like this.

“If I ask you to do a classifier of images and you’re looking for dogs, I can check afterwards that, yes, it did correctly identify dogs,” he said. “But when I’m asking for decisions about people, whether it's graduate admissions, or hiring or prison sentencing, there’s no obvious correct answer. You train it, but you don’t know what the result is really telling you.”

Rolston said having at least one faculty member review each application was not a convincing safeguard.

“If I give you a file and say, ‘Well, the algorithm said this person shouldn’t be accepted,’ that will inevitably bias the way you look at it,” he said.

UT Austin has said GRADE was used to organize admissions decisions, rather than make them.

"It was never used to make decisions to admit or reject prospective students, as at least one faculty member directly evaluates applicants at each stage of the review process," a spokesperson for the Graduate School said via email.

Despite the criticism around diversity and equity, UT Austin has said GRADE is being phased out because it is too difficult to maintain.

“Changes in the data and software environment made the system increasingly difficult to maintain, and its use was discontinued,” the spokesperson said via email. “The Graduate School works with graduate programs and faculty members across campus to promote holistic application review and reduce bias in admissions decisions.”

For Musthafa, the fact that GRADE may be gone for good does not impact the existing inequity in graduate admissions.

“The entire system is steeped in racism, sexism and ableism,” they said. “How many years of POC computer science students got denied [because of this]?”

Addressing that inequity -- as well as the competitiveness that led to the creation of GRADE -- may mean expanding committees, paying people for their time and giving Black and Latinx graduate students a voice in those decisions, they said. But automation cannot be part of that decision making.

“If we automate this to any extent, it’s just going to lock people out of academia,” Musthafa said. “The racism of today is being immortalized in the algorithms of tomorrow.”

The Death and Life of an Admissions Algorithm
