NSF #2246174 CRII Grant, National Science Foundation, $169,989, April 2023 to March 2026
6 publications · $170K NSF funding · 3 dialect groups · 10+ AI models tested

When a news event breaks, thousands of people tweet about it simultaneously — in African American English, Hispanic-aligned Language, White-aligned English, and everything in between. If you ask an AI to summarize those tweets, it should represent all of those voices. It doesn't. Summarization models consistently over-represent some dialect groups and under-represent others, even when every group contributes equally to the input.

This research program asks why that happens and how to fix it. Over six papers, the lab built new datasets, exposed specific mechanisms of bias (like input ordering effects), developed algorithms that achieve equal representation without losing summary quality, and uncovered a deeper problem in how LLMs use context at all.

The Problem, Visualized
Input: 90 Tweets
(30 per group, equal)
White-aligned English
African American English
Hispanic-aligned Language
AI Summarizer
Output: 6 sentence summary
⚠ Ordered input: White-aligned 4 · African American 1 · Hispanic 1 → ΔFair = 0.23
✓ Shuffled input: White-aligned 2 · African American 2 · Hispanic 2 → ΔFair = 0.04
Same AI model, same tweets, only the order changes. Grouping tweets by dialect causes the model to over-represent whichever group appears first. Shuffling the input largely eliminates the bias.
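The ΔFair values above come from the papers' own fairness metric, whose exact definition is not reproduced here. As a rough illustration of the idea, one simple proxy is the largest deviation of any group's share of summary sentences from an equal share (this sketch is not the paper's formula and does not reproduce the numbers above):

```python
def delta_fair(sentences_per_group):
    """Largest deviation of any group's share of summary sentences
    from an equal share. An illustrative proxy only -- the papers'
    fairness metric may be defined differently."""
    total = sum(sentences_per_group)
    target = 1 / len(sentences_per_group)
    return max(abs(count / total - target) for count in sentences_per_group)

# Grouped input: one dialect dominates the 6-sentence summary.
ordered = delta_fair([4, 1, 1])
# Shuffled input: all three dialects represented equally.
shuffled = delta_fair([2, 2, 2])  # → 0.0
```

Under this proxy, the balanced summary scores exactly 0 and any skew pushes the score up, mirroring the ordered-vs-shuffled gap shown in the figure.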
"An important yet overlooked dimension of quality is the breadth of diverse perspectives that a summary encapsulates." (NSF Award Abstract #2246174)
Research Story and Publications

Six papers over four years, each building on the last.

2022
Foundation
Analyzing the Dialect Diversity in Multi-Document Summaries
COLING 2022, Olabisi, Hudson, Jetter & Agrawal
The lab built DivSumm, the first summarization dataset designed around linguistic diversity: 2,250 tweets across African American English, Hispanic-aligned Language, and White-aligned Language on 25 topics, with human-written reference summaries.
Key finding: Existing models showed measurable bias toward White-aligned English even when all three groups contributed equally to the input.
2022
Measurement
Assessing Inter-Metric Correlation for Multi-Document Summarization Evaluation
GEM Workshop @ EMNLP 2022, Ridenour, Agrawal & Olabisi
How do you measure summarization quality when metrics disagree with each other? This study tested 16 evaluation metrics across 5 datasets and found that in multi-document summarization, metrics frequently contradict each other, especially across different domains like news, tweets, and peer reviews.
Key finding: Reference-based and reference-free metrics almost never agree in MDS. Any trustworthy evaluation must use metrics from both families, a finding that shaped the evaluation methodology in every subsequent paper in this program.
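As a sketch of how inter-metric (dis)agreement can be quantified, one standard choice is a rank correlation between two metrics' scores over the same candidate summaries; the paper's exact correlation statistic may differ from the Kendall tau shown here:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two metrics' scores over the
    same set of candidate summaries. +1 means the metrics rank the
    summaries identically, -1 means they rank them in reverse."""
    pairs = list(combinations(range(len(scores_a)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        diff = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if diff > 0:
            concordant += 1
        elif diff < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)
```

A tau near zero or below between, say, a reference-based metric and a reference-free one would mean they disagree on which summaries are better, which is the pattern the study reports for MDS.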
2023
NSF Grant
NSF Awards $169,989 to Advance Socially Diverse Summarization
NSF Award #2246174, CRII: Robust Intelligence Program
The NSF awarded Dr. Agrawal a CRII grant (a program for early-career researchers) to fund three years of work on benchmarks and algorithms for socially diverse summarization.
Program goal: Build the datasets and algorithms needed so that multi-document summarization is fair by design.
2024
Discovery
Understanding Position Bias Effects on Fairness in Social Multi-Document Summarization
VarDial Workshop @ NAACL 2024, Olabisi & Agrawal
The order you feed tweets to an AI changes whose voice ends up in the summary. When tweets from different dialect groups are grouped together rather than shuffled, the model draws up to 3× more content from whichever group appears first. Standard quality metrics don't flag this at all.
Key finding: Quality and fairness can move in completely opposite directions. A fluent, high-scoring summary can be deeply unfair, and no standard metric will flag it. Shuffling the input largely eliminates the effect.
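The mitigation described above, shuffling so that no dialect group consistently sits first in the model's context, can be sketched in a few lines (the prompt wording below is illustrative, not taken from the paper):

```python
import random

def build_shuffled_prompt(tweets_by_group, seed=0):
    """Flatten tweets from all dialect groups and shuffle them, so no
    group is consistently first in the context window. A seeded RNG
    keeps the ordering reproducible across runs."""
    all_tweets = [t for tweets in tweets_by_group.values() for t in tweets]
    random.Random(seed).shuffle(all_tweets)
    body = "\n".join(f"- {t}" for t in all_tweets)
    return f"Summarize the following posts in 6 sentences:\n{body}"
```

The only change relative to a grouped-by-dialect prompt is the shuffle; per the finding above, that single change is enough to largely remove the position-driven skew.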
2025
Solution
Fair Summarization: Bridging Quality and Diversity in Extractive Summaries
C3NLP Workshop @ NAACL 2025, Bagheri Nezhad, Bandyapadhyay & Agrawal
Two new methods: FairExtract uses mathematical fairlet decomposition to guarantee equal group representation through clustering. FairGPT guides GPT-3.5 with structured prompts and validates that outputs are balanced. Both achieve fairness scores of F=1.0 while maintaining competitive summary quality.
Key finding: Achieving F=1.0 (equal representation) is possible without sacrificing quality, but only with methods specifically designed for both goals.
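FairExtract's fairlet decomposition is more involved than can be shown here. The toy sketch below only illustrates the underlying idea of selecting an equal number of sentences per group; the lexical relevance score is a made-up stand-in for the embedding-based clustering the actual method uses:

```python
from collections import Counter

def balanced_extract(docs_by_group, k_per_group=2):
    """Pick the same number of sentences from every dialect group,
    ranked by a toy relevance score (overlap with corpus-wide word
    frequencies). Equal per-group counts give F=1.0 by construction."""
    corpus_vocab = Counter(
        word.lower()
        for sents in docs_by_group.values()
        for sent in sents
        for word in sent.split()
    )

    def score(sent):
        words = sent.lower().split()
        return sum(corpus_vocab[w] for w in words) / max(len(words), 1)

    summary = []
    for group, sents in docs_by_group.items():
        summary.extend(sorted(sents, key=score, reverse=True)[:k_per_group])
    return summary
```

Because every group contributes exactly k_per_group sentences, equal representation is guaranteed; the open problem the paper addresses is doing this while keeping quality competitive, which this sketch does not attempt to solve.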
2025
Application
Towards Personalized Explanations for Health Simulations: A Mixed-Methods Framework for Stakeholder-Centric Summarization
AAAI Symposium Series 2025, Giabbanelli & Agrawal
Health simulation models produce complex outputs that need to make sense to very different audiences — a doctor needs different details than a patient or a policymaker. This interdisciplinary paper (with Old Dominion University) proposes a framework for using LLMs to tailor summaries to each audience.
Significance: Takes the lab's summarization work into health informatics, where how you communicate findings can directly affect medical decisions.
2025
Grounding
"Lost-in-the-Later": Framework for Quantifying Contextual Grounding in Large Language Models
ICDMW (IEEE) 2025, Tao, Hiatt, Seetharaman & Agrawal
The position bias work raised a bigger question: do LLMs actually use the context you give them? This paper introduces CoPE, a framework that measures how much of a model's response comes from the provided context versus its own internal memory. Tested across six models and three languages using a new dataset (MultiWikiAtomic, 15,000 atomic sentences).
Key findings: LLMs cap contextual grounding at roughly 70%, consistently deprioritizing information that appears later in a context, a phenomenon named lost-in-the-later. Chain-of-Thought prompting makes this worse, not better. A targeted CK prompt strategy increases grounding by ~8 points and reduces hallucination, confirmed in a multi-document summarization case study using QMSum and DivSumm.
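A crude lexical stand-in for the quartile-level measurement CoPE performs (the real framework works over atomic sentences, not the word-overlap heuristic assumed here) might look like:

```python
def quartile_recall(context_sentences, response):
    """Split the context into four quartiles and report, for each, the
    fraction of its sentences whose words appear in the response.
    A rough lexical proxy for position-wise contextual grounding."""
    resp_words = set(response.lower().split())
    n = len(context_sentences)
    quartiles = [context_sentences[i * n // 4:(i + 1) * n // 4] for i in range(4)]
    recalls = []
    for quartile in quartiles:
        if not quartile:
            recalls.append(0.0)
            continue
        hits = sum(1 for s in quartile if set(s.lower().split()) & resp_words)
        recalls.append(hits / len(quartile))
    return recalls
```

A lost-in-the-later pattern shows up in this proxy as a first-quartile recall well above the fourth-quartile recall, the roughly 2:1 ratio the paper reports.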
Six Key Findings
01
Quality and fairness are independent dimensions. An AI can produce a fluent, coherent, high-scoring summary that is deeply unfair to certain groups. Standard metrics do not capture this. Dedicated fairness metrics are required.
02
Input order is a major, hidden source of bias. Grouping dialect communities together in the input, rather than shuffling, causes models to draw up to 3× more content from whichever group appears first. This is consistent across 10 different AI models.
03
Equal representation is achievable without sacrificing quality. Both FairExtract and FairGPT achieve F=1.0, equal representation across groups, while keeping quality competitive with state-of-the-art models. The quality-fairness tradeoff is not inevitable when methods are designed for both goals.
04
LLMs systematically neglect the later parts of their own context. Across all models and languages tested, recall from the first quartile of a context is roughly double that of the last quartile, the lost-in-the-later effect. This is not a long-context failure: it appears in inputs as short as 10 sentences and persists even when sentence order is randomized, pointing to a pretraining-level bias toward earlier tokens.
05
Chain-of-Thought prompting reduces contextual grounding rather than increasing it. CoT outputs are over 50% shorter than standard responses and recall less from later context segments, exacerbating the lost-in-the-later effect. Reasoning models (GPT-o3, Qwen 3 235B) show the lowest contextual grounding of all, plateauing around 50-55 CK regardless of how much context is provided.
06
A targeted CK prompt is the most effective fix. Instructing models to use only the provided context and draw evenly from all parts raises CK scores by ~8 points and produces significantly more uniform recall across quartiles. Applied to multi-document summarization (QMSum and DivSumm), the CK prompt improves NLI-based alignment with reference summaries by 4 to 11 points with no loss in coherence or fluency.
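The paper's exact CK prompt wording is not reproduced here; a hypothetical paraphrase of the strategy as described above (use only the provided context, draw evenly from all of its parts) could be templated like this:

```python
# Illustrative paraphrase of a CK-style prompt, not the paper's exact text.
CK_STYLE_PROMPT = (
    "Answer using ONLY the information in the context below. "
    "Draw evenly from ALL parts of the context, including the later "
    "sections, and do not add facts from outside it.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

def make_ck_prompt(context, question):
    """Fill the CK-style template with a context passage and a question."""
    return CK_STYLE_PROMPT.format(context=context, question=question)
```

The two constraints in the instruction map directly onto the two failure modes reported: "only the provided context" targets hallucination from parametric memory, and "evenly from all parts" targets lost-in-the-later.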
Broader Impact
Public Dataset & Tools
DivSumm, FairExtract, and FairGPT are all publicly available. Other researchers can use them to test their own summarization systems for dialect bias.
Healthcare Communication
The health simulations work shows how the same underlying research applies when a doctor and a patient need different versions of the same information.
Student Training
The NSF grant funds graduate and undergraduate researchers at Portland State working on language technology and social equity.
Cross-disciplinary Reach
The collaboration with Old Dominion University's health simulation lab takes this work outside of NLP and into public health and policy modeling.
Evidence & Primary Sources

For detailed experimental results, data tables, and figures, refer to the original papers below.

2022 COLING
Olabisi, Hudson, Jetter & Agrawal
Introduces the DivSumm dataset and documents measurable bias toward White-aligned English across summarization models.
2022 GEM Workshop @ EMNLP
Ridenour, Agrawal & Olabisi
Tests 16 evaluation metrics across 5 datasets and finds that reference-based and reference-free metrics rarely agree in MDS.
2024 VarDial Workshop @ NAACL
Olabisi & Agrawal
Reveals that input ordering dramatically shifts whose voice gets represented, with fairness data across seven models and multiple orderings.
2025 C3NLP Workshop @ NAACL
Bagheri Nezhad, Bandyapadhyay & Agrawal
Presents FairExtract and FairGPT, both achieving perfect fairness (F=1.0) with competitive quality scores.
2025 AAAI Symposium Series
Giabbanelli & Agrawal
Extends the research into health informatics, proposing a framework for tailoring simulation summaries to different stakeholder groups.
2025 ICDMW (IEEE)
Tao, Hiatt, Seetharaman & Agrawal
Introduces the CoPE framework and MultiWikiAtomic dataset, measuring how LLMs deprioritize later context across six models and three languages.