NSF #2246174 CRII Grant, National Science Foundation, $169,989, April 2023 to March 2026
6 publications · $170K NSF funding · 3 dialect groups · 10+ AI models tested

When a news event breaks, thousands of people tweet about it simultaneously — in African American English, Hispanic-aligned Language, White-aligned English, and everything in between. If you ask an AI to summarize those tweets, it should represent all of those voices. It doesn't. Summarization models consistently over-represent some dialect groups and under-represent others, even when every group contributes equally to the input.

This research program asks why that happens and how to fix it. Over six papers, the lab built new datasets, exposed specific mechanisms of bias (like input ordering effects), developed algorithms that achieve equal representation without losing summary quality, and uncovered a deeper problem in how LLMs use context at all.

The Problem, Visualized
Input: 90 Tweets
(30 per group, equal)
White-aligned English
African American English
Hispanic-aligned Language
AI Summarizer
Output: 6 sentence summary
⚠ Ordered input: White-aligned 4 · African American 1 · Hispanic 1 → ΔFair = 0.23
✓ Shuffled input: White-aligned 2 · African American 2 · Hispanic 2 → ΔFair = 0.04
Same AI model, same tweets, only the order changes. Grouping tweets by dialect causes the model to over-represent whichever group appears first. Shuffling the input largely eliminates the bias.
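The ΔFair values above come from the papers' own fairness metric, whose exact definition is not reproduced here. As a rough illustration of the idea, one simple proxy is the largest deviation of any group's share of summary sentences from an equal share (this sketch is not the paper's formula and does not reproduce the numbers above):

```python
def delta_fair(sentences_per_group):
    """Largest deviation of any group's share of summary sentences
    from an equal share. An illustrative proxy only -- the papers'
    fairness metric may be defined differently."""
    total = sum(sentences_per_group)
    target = 1 / len(sentences_per_group)
    return max(abs(count / total - target) for count in sentences_per_group)

# Grouped input: one dialect dominates the 6-sentence summary.
ordered = delta_fair([4, 1, 1])
# Shuffled input: all three dialects represented equally.
shuffled = delta_fair([2, 2, 2])  # → 0.0
```

Under this proxy, the balanced summary scores exactly 0 and any skew pushes the score up, mirroring the ordered-vs-shuffled gap shown in the figure.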
"An important yet overlooked dimension of quality is the breadth of diverse perspectives that a summary encapsulates." (NSF Award Abstract #2246174)
Research Story and Publications

Six papers over four years, each building on the last.

2022
Foundation
Analyzing the Dialect Diversity in Multi-Document Summaries
COLING 2022, Olabisi, Hudson, Jetter & Agrawal
The lab built DivSumm, the first summarization dataset designed around linguistic diversity: 2,250 tweets across African American English, Hispanic-aligned Language, and White-aligned Language on 25 topics, with human-written reference summaries.
Key finding: Existing models showed measurable bias toward White-aligned English even when all three groups contributed equally to the input.
2022
Measurement
Assessing Inter-Metric Correlation for Multi-Document Summarization Evaluation
GEM Workshop @ EMNLP 2022, Ridenour, Agrawal & Olabisi
How do you measure summarization quality when metrics disagree with each other? This study tested 16 evaluation metrics across 5 datasets and found that in multi-document summarization, metrics frequently contradict each other, especially across different domains like news, tweets, and peer reviews.
Key finding: Reference-based and reference-free metrics almost never agree in MDS. Any trustworthy evaluation must use metrics from both families, a finding that shaped the evaluation methodology in every subsequent paper in this program.
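As a sketch of how inter-metric (dis)agreement can be quantified, one standard choice is a rank correlation between two metrics' scores over the same candidate summaries; the paper's exact correlation statistic may differ from the Kendall tau shown here:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two metrics' scores over the
    same set of candidate summaries. +1 means the metrics rank the
    summaries identically, -1 means they rank them in reverse."""
    pairs = list(combinations(range(len(scores_a)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        diff = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if diff > 0:
            concordant += 1
        elif diff < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)
```

A tau near zero or below between, say, a reference-based metric and a reference-free one would mean they disagree on which summaries are better, which is the pattern the study reports for MDS.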
2023
NSF Grant
NSF Awards $169,989 to Advance Socially Diverse Summarization
NSF Award #2246174, CRII: Robust Intelligence Program
The NSF awarded Dr. Agrawal a CRII grant (a program for early-career researchers) to fund three years of work on benchmarks and algorithms for socially diverse summarization.
Program goal: Build the datasets and algorithms needed so that multi-document summarization is fair by design.
2024
Discovery
Understanding Position Bias Effects on Fairness in Social Multi-Document Summarization
VarDial Workshop @ NAACL 2024, Olabisi & Agrawal
The order you feed tweets to an AI changes whose voice ends up in the summary. When tweets from different dialect groups are grouped together rather than shuffled, the model draws up to 3× more content from whichever group appears first. Standard quality metrics don't flag this at all.
Key finding: Quality and fairness can move in completely opposite directions. A fluent, high-scoring summary can be deeply unfair, and no standard metric will flag it. Shuffling the input largely eliminates the effect.
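The mitigation described above, shuffling so that no dialect group consistently sits first in the model's context, can be sketched in a few lines (the prompt wording below is illustrative, not taken from the paper):

```python
import random

def build_shuffled_prompt(tweets_by_group, seed=0):
    """Flatten tweets from all dialect groups and shuffle them, so no
    group is consistently first in the context window. A seeded RNG
    keeps the ordering reproducible across runs."""
    all_tweets = [t for tweets in tweets_by_group.values() for t in tweets]
    random.Random(seed).shuffle(all_tweets)
    body = "\n".join(f"- {t}" for t in all_tweets)
    return f"Summarize the following posts in 6 sentences:\n{body}"
```

The only change relative to a grouped-by-dialect prompt is the shuffle; per the finding above, that single change is enough to largely remove the position-driven skew.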
2025
Solution
Fair Summarization: Bridging Quality and Diversity in Extractive Summaries
C3NLP Workshop @ NAACL 2025, Bagheri Nezhad, Bandyapadhyay & Agrawal
Two new methods: FairExtract uses mathematical fairlet decomposition to guarantee equal group representation through clustering. FairGPT guides GPT-3.5 with structured prompts and validates that outputs are balanced. Both achieve fairness scores of F=1.0 while maintaining competitive summary quality.
Key finding: Achieving F=1.0 (equal representation) is possible without sacrificing quality, but only with methods specifically designed for both goals.
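FairExtract's fairlet decomposition is more involved than can be shown here. The toy sketch below only illustrates the underlying idea of selecting an equal number of sentences per group; the lexical relevance score is a made-up stand-in for the embedding-based clustering the actual method uses:

```python
from collections import Counter

def balanced_extract(docs_by_group, k_per_group=2):
    """Pick the same number of sentences from every dialect group,
    ranked by a toy relevance score (overlap with corpus-wide word
    frequencies). Equal per-group counts give F=1.0 by construction."""
    corpus_vocab = Counter(
        word.lower()
        for sents in docs_by_group.values()
        for sent in sents
        for word in sent.split()
    )

    def score(sent):
        words = sent.lower().split()
        return sum(corpus_vocab[w] for w in words) / max(len(words), 1)

    summary = []
    for group, sents in docs_by_group.items():
        summary.extend(sorted(sents, key=score, reverse=True)[:k_per_group])
    return summary
```

Because every group contributes exactly k_per_group sentences, equal representation is guaranteed; the open problem the paper addresses is doing this while keeping quality competitive, which this sketch does not attempt to solve.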
2025
Application
Towards Personalized Explanations for Health Simulations: A Mixed-Methods Framework for Stakeholder-Centric Summarization
AAAI Symposium Series 2025, Giabbanelli & Agrawal
Health simulation models produce complex outputs that need to make sense to very different audiences — a doctor needs different details than a patient or a policymaker. This interdisciplinary paper (with Old Dominion University) proposes a framework for using LLMs to tailor summaries to each audience.
Significance: Takes the lab's summarization work into health informatics, where how you communicate findings can directly affect medical decisions.
2025
Grounding
"Lost-in-the-Later": Framework for Quantifying Contextual Grounding in Large Language Models
ICDMW (IEEE) 2025, Tao, Hiatt, Seetharaman & Agrawal
The position bias work raised a bigger question: do LLMs actually use the context you give them? This paper introduces CoPE, a framework that measures how much of a model's response comes from the provided context versus its own internal memory. Tested across six models and three languages using a new dataset (MultiWikiAtomic, 15,000 atomic sentences).
Key findings: LLMs cap contextual grounding at roughly 70%, consistently deprioritizing information that appears later in a context, a phenomenon named lost-in-the-later. Chain-of-Thought prompting makes this worse, not better. A targeted CK prompt strategy increases grounding by ~8 points and reduces hallucination, confirmed in a multi-document summarization case study using QMSum and DivSumm.
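A crude lexical stand-in for the quartile-level measurement CoPE performs (the real framework works over atomic sentences, not the word-overlap heuristic assumed here) might look like:

```python
def quartile_recall(context_sentences, response):
    """Split the context into four quartiles and report, for each, the
    fraction of its sentences whose words appear in the response.
    A rough lexical proxy for position-wise contextual grounding."""
    resp_words = set(response.lower().split())
    n = len(context_sentences)
    quartiles = [context_sentences[i * n // 4:(i + 1) * n // 4] for i in range(4)]
    recalls = []
    for quartile in quartiles:
        if not quartile:
            recalls.append(0.0)
            continue
        hits = sum(1 for s in quartile if set(s.lower().split()) & resp_words)
        recalls.append(hits / len(quartile))
    return recalls
```

A lost-in-the-later pattern shows up in this proxy as a first-quartile recall well above the fourth-quartile recall, the roughly 2:1 ratio the paper reports.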
Six Key Findings
01
Quality and fairness are independent dimensions. An AI can produce a fluent, coherent, high-scoring summary that is deeply unfair to certain groups. Standard metrics do not capture this. Dedicated fairness metrics are required.
02
Input order is a major, hidden source of bias. Grouping dialect communities together in the input, rather than shuffling, causes models to draw up to 3× more content from whichever group appears first. This is consistent across 10 different AI models.
03
Equal representation is achievable without sacrificing quality. Both FairExtract and FairGPT achieve F=1.0, equal representation across groups, while keeping quality competitive with state-of-the-art models. The quality-fairness tradeoff is not inevitable when methods are designed for both goals.
04
LLMs systematically neglect the later parts of their own context. Across all models and languages tested, recall from the first quartile of a context is roughly double that of the last quartile, the lost-in-the-later effect. This is not a long-context failure: it appears in inputs as short as 10 sentences and persists even when sentence order is randomized, pointing to a pretraining-level bias toward earlier tokens.
05
Chain-of-Thought prompting reduces contextual grounding rather than increasing it. CoT outputs are over 50% shorter than standard responses and recall less from later context segments, exacerbating the lost-in-the-later effect. Reasoning models (GPT-o3, Qwen 3 235B) show the lowest contextual grounding of all, plateauing around 50-55 CK regardless of how much context is provided.
06
A targeted CK prompt is the most effective fix. Instructing models to use only the provided context and draw evenly from all parts raises CK scores by ~8 points and produces significantly more uniform recall across quartiles. Applied to multi-document summarization (QMSum and DivSumm), the CK prompt improves NLI-based alignment with reference summaries by 4 to 11 points with no loss in coherence or fluency.
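The paper's exact CK prompt wording is not reproduced here; a hypothetical paraphrase of the strategy as described above (use only the provided context, draw evenly from all of its parts) could be templated like this:

```python
# Illustrative paraphrase of a CK-style prompt, not the paper's exact text.
CK_STYLE_PROMPT = (
    "Answer using ONLY the information in the context below. "
    "Draw evenly from ALL parts of the context, including the later "
    "sections, and do not add facts from outside it.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

def make_ck_prompt(context, question):
    """Fill the CK-style template with a context passage and a question."""
    return CK_STYLE_PROMPT.format(context=context, question=question)
```

The two constraints in the instruction map directly onto the two failure modes reported: "only the provided context" targets hallucination from parametric memory, and "evenly from all parts" targets lost-in-the-later.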
Broader Impact
Public Dataset & Tools
DivSumm, FairExtract, and FairGPT are all publicly available. Other researchers can use them to test their own summarization systems for dialect bias.
Healthcare Communication
The health simulations work shows how the same underlying research applies when a doctor and a patient need different versions of the same information.
Student Training
The NSF grant funds graduate and undergraduate researchers at Portland State working on language technology and social equity.
Cross-disciplinary Reach
The collaboration with Old Dominion University's health simulation lab takes this work outside of NLP and into public health and policy modeling.
Evidence & Primary Sources

For detailed experimental results, data tables, and figures, refer to the original papers below.

2022 COLING
Olabisi, Hudson, Jetter & Agrawal
Introduces the DivSumm dataset and documents measurable bias toward White-aligned English across summarization models.
2022 GEM Workshop @ EMNLP
Ridenour, Agrawal & Olabisi
Tests 16 evaluation metrics across 5 datasets and finds that reference-based and reference-free metrics rarely agree in MDS.
2024 VarDial Workshop @ NAACL
Olabisi & Agrawal
Reveals that input ordering dramatically shifts whose voice gets represented, with fairness data across seven models and multiple orderings.
2025 C3NLP Workshop @ NAACL
Bagheri Nezhad, Bandyapadhyay & Agrawal
Presents FairExtract and FairGPT, both achieving perfect fairness (F=1.0) with competitive quality scores.
2025 AAAI Symposium Series
Giabbanelli & Agrawal
Extends the research into health informatics, proposing a framework for tailoring simulation summaries to different stakeholder groups.
2025 ICDMW (IEEE)
Tao, Hiatt, Seetharaman & Agrawal
Introduces the CoPE framework and MultiWikiAtomic dataset, measuring how LLMs deprioritize later context across six models and three languages.