WED@NICO SEMINAR: Asma Ghandeharioun, Google DeepMind "Model Interpretability: from Illusions to Opportunities"
Northwestern Institute on Complex Systems (NICO)
12:00 PM // Lower Level, Chambers Hall
Details
Speaker:
Asma Ghandeharioun, Senior Research Scientist, People + AI Research Team, Google DeepMind
Title:
Model Interpretability: from Illusions to Opportunities
Abstract:
While the capabilities of today’s large language models (LLMs) are reaching—and even surpassing—what was once thought impossible, concerns remain regarding their misalignment, such as generating misinformation or harmful text, which continues to be an open area of research. Understanding LLMs’ internal representations can help explain their behavior, verify their alignment with human values, and mitigate instances where they produce errors. In this talk, I begin by challenging common misconceptions about the connections between LLMs' hidden representations and their downstream behavior, highlighting several “interpretability illusions.” For example, I demonstrate that, counterintuitively, localizing and editing facts within an LLM’s hidden representations can be disconnected; model failure and success in the wild cannot necessarily be predicted based on a relatively faithful proxy at training time; and even within the same architecture, representation similarity is not always indicative of prediction similarity.
Next, I introduce Patchscopes, a new framework that leverages the model itself to explain its internal representations in natural language. I’ll show how it can be used to answer a wide range of questions about an LLM's computation. I also demonstrate that many prior interpretability methods—based on projecting representations into the vocabulary space and intervening in LLM computation—can be viewed as instances of this framework. Furthermore, several of their shortcomings, such as difficulty inspecting early layers or lack of expressivity, can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes opens up new possibilities, such as using a more capable model to explain the representations of a smaller model and multihop reasoning error correction.
Finally, I discuss a few failure cases in today’s most capable LLMs and show how Patchscopes can shed light on their mechanics and suggest mitigation strategies. For example, we observe that safety-tuned models may still divulge harmful information, and whether they do so often depends significantly on who they are talking to, which we refer to as the user persona. Using Patchscopes, we show that harmful content can persist in hidden representations and can be easily extracted. Additionally, we demonstrate that certain user personas can induce the model to form more charitable interpretations of otherwise dangerous queries.
Speaker Bio:
Asma Ghandeharioun, Ph.D., is a senior research scientist with the People + AI Research team at Google DeepMind. She works on aligning AI with human values through better understanding [1] and controlling (language) models [2], uniquely by demystifying their inner workings [3] and correcting collective misconceptions along the way [4, 5]. While her current research is mostly focused on machine learning interpretability, her previous work spans conversational AI, affective computing, and, more broadly, human-centered AI. She holds a doctorate and master’s degree from MIT and a bachelor’s degree from the Sharif University of Technology. She has been trained as a computer scientist/engineer and has research experience at MIT, Google Research, Microsoft Research, and École Polytechnique Fédérale de Lausanne (EPFL), to name a few.
Her work has been published in premier peer-reviewed machine learning venues such as ICLR, NeurIPS, ICML, EMNLP, AAAI, ACII, and AISTATS. She has received awards at NeurIPS and her work has been featured in Wired, Wall Street Journal, and New Scientist.
Location:
In person: Chambers Hall, 600 Foster Street, Lower Level
Remote option: https://northwestern.zoom.us/j/91475935376
Passcode: NICO24
About the Speaker Series:
Wednesdays@NICO is a vibrant weekly seminar series focusing broadly on the topics of complex systems, data science and network science. It brings together attendees ranging from graduate students to senior faculty who span all of the schools across Northwestern, from applied math to sociology to biology and every discipline in between. Please visit: https://bit.ly/WedatNICO for information on future speakers.
Time
Wednesday, October 9, 2024 at 12:00 PM - 1:00 PM
WED@NICO SEMINAR: Lightning Talks with NU Scholars!
Northwestern Institute on Complex Systems (NICO)
12:00 PM // Lower Level, Chambers Hall
Details
May 20th Speakers:
Yulin Yu, Postdoctoral Fellow, Kellogg School of Management
Feihong Xu, PhD Candidate, McCormick School of Engineering
Maalvika Bhat, PhD Student, School of Communication
Rochana Chaturvedi, Postdoctoral Fellow, Kellogg School of Management
NICO Lightning talks are open to Northwestern graduate students, postdoctoral fellows, and visiting scholars! If you are interested in signing up for a future session, please fill out this short survey.
Talk Titles and Abstracts:
Yulin Yu
Postdoctoral Fellow
Kellogg School of Management &
Northwestern Institute on Complex Systems
Human–AI Creative Pathways: How People and Machines Differ in Creative Strategy
Generative AI offers the promise of amplifying creativity by recombining knowledge at a scale far beyond human capacity, yet humans still hold key advantages in flexibility and contextual reasoning. To understand how each achieves novelty, we analyzed more than 5,000 responses to the Divergent Association Task from both humans and AI systems using network-based methods. We find that while individual humans use fewer and simpler conceptual categories than machines, the collective diversity of human ideas is substantially higher. Human creative pathways tend to follow a one-directional but highly unpredictable trajectory, whereas AI systems rely on repetitive, back-and-forth exploration patterns. Finally, both humans and machines show anchoring effects—early ideas shape later responses—but in opposite ways: humans anchor low, while machines anchor high.
Feihong Xu
PhD Candidate
Engineering Sciences & Applied Mathematics
McCormick School of Engineering
A Well-Calibrated Model Similarity Measure for Arbitrary Neural Networks
Deep learning approaches have transformed biological and biomedical image analysis, but model opacity and fragility remain major obstacles to trustworthy use. One barrier is the lack of a well-calibrated measure of similarity across arbitrary neural networks trained with different architectures, checkpoints, random initializations, and training strategies. Existing notions of model similarity span functional and representational domains, often rely on heuristic assumptions, and are susceptible to spurious signals introduced by probing samples, making principled cross-model meta-analysis difficult. Here, we clarify prevailing notions of deep neural network similarity and benchmark their robustness under extensive out-of-distribution perturbations. We then introduce the Ahmad RV coefficient on chain weight matrices (wARV), a theoretically grounded weight-space similarity measure that combines chain-normalized weights with the RV coefficient. wARV is sample-agnostic, symmetric, computationally efficient, and better calibrated than current measures. Across benchmarks varying random initialization, training checkpoint, architecture, and training strategy, wARV more faithfully tracks functional similarity while avoiding confounding effects from probing data. Applying wARV to deep neural network models on both generic and medical image classification tasks, we uncover substantial learning heterogeneity and instability even among models with similar predictive performance.
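For readers unfamiliar with the RV coefficient mentioned in the abstract, the following is a minimal sketch of the classical RV coefficient between two matrices that share a row dimension. It is not the speaker's wARV measure: the chain normalization of weights is not shown, and the function names (`rv_coefficient`, `center_columns`, `gram`, `frob_inner`) and the toy data are assumptions made only for illustration.

```python
import math


def center_columns(m):
    """Subtract each column's mean, as in the classical RV definition."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def gram(m):
    """Return the n x n matrix of row-by-row inner products (M M^T)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


def frob_inner(a, b):
    """Frobenius inner product; equals trace(A B) for symmetric A and B."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rv_coefficient(x, y):
    """Classical RV coefficient of two matrices with the same number of rows.

    RV = tr(Sx Sy) / sqrt(tr(Sx Sx) * tr(Sy Sy)), where Sx = X X^T and
    Sy = Y Y^T are computed from column-centered X and Y.
    """
    if not x or len(x) != len(y):
        raise ValueError("x and y must be non-empty with equal row counts")
    sx = gram(center_columns(x))
    sy = gram(center_columns(y))
    den = math.sqrt(frob_inner(sx, sx) * frob_inner(sy, sy))
    if den == 0.0:
        raise ValueError("RV is undefined when a matrix has constant columns")
    return frob_inner(sx, sy) / den


# Hypothetical toy matrices (4 rows, 2 columns each).
a = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
rotated = [[-row[1], row[0]] for row in a]  # columns rotated by 90 degrees
other = [[0.5, 1.0], [1.0, 4.0], [2.0, 0.0], [0.0, 3.0]]

print(rv_coefficient(a, rotated))  # close to 1.0: RV ignores rotations
print(rv_coefficient(a, other))    # a value between 0 and 1
```

The classical coefficient is symmetric in its arguments and unchanged by rotating or uniformly rescaling either matrix, which is consistent with the symmetry property the abstract claims for wARV; how wARV builds its inputs from network weights is left to the talk.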
Maalvika Bhat
PhD Student
Technology and Social Behavior
School of Communication &
McCormick School of Engineering
Scholars See Clickbait as a Greater Threat to Science Than to Their Own Work
As scientific research competes for attention in a media landscape driven by sensationalism, the risks of misrepresentation grow. This study examines whether academics, while widely recognizing clickbait as a threat to science broadly, tend to downplay its relevance to their own work. Surveying 5,603 U.S.-based researchers, we find a consistent perception gap between systemic and personal risk, one that varies by career stage and disciplinary context. Early-career scholars show a pronounced version of this asymmetry: they express heightened concern about clickbait’s harms to science while rating its relevance to their own work as comparatively lower, a pattern that leaves them most exposed at a stage when reputational stakes are highest.
Rochana Chaturvedi
Postdoctoral Fellow
Kellogg School of Management &
Northwestern Institute on Complex Systems
Who Gets the Callback? Generative Artificial Intelligence and Gender Bias
Large language models are increasingly embedded in hiring workflows, raising concerns about their potential to amplify societal biases. Yet how these biases manifest within and across occupations, and the role of model 'personality' in shaping them, remain unexplored. We introduce a three-part attribution framework applied to 332,044 real-world job ads, measuring gender-based callback bias, associations of skills and traits with gendered stereotypes in LLMs, and the effect of simulated recruiter personas. We find that LLMs systematically favor men, especially in higher-wage roles, with their decisions tracking traditional gendered language cues in job postings. Notably, assigning a low-agreeableness persona reduces model bias, implicating sycophancy as a mechanism reinforcing societal stereotypes; at the same time, controversial personas trigger internal guardrails leading to more cautious and less-biased outputs. These findings highlight how alignment choices in AI-driven hiring systems shape bias, with important implications for fairness and diversity.
Location:
In person: Chambers Hall, 600 Foster Street, Lower Level
Remote option: https://northwestern.zoom.us/j/98031689779
About the Speaker Series:
Wednesdays@NICO is a vibrant weekly seminar series focusing broadly on the topics of complex systems, networks, and artificial intelligence. It brings together attendees ranging from graduate students to senior faculty who span all of the schools across Northwestern, from applied math to sociology to biology and every discipline in between. Please visit: https://bit.ly/WedatNICO for information on future speakers.
Time
Wednesday, May 20, 2026 at 12:00 PM - 1:00 PM