What Navy Federal found in its synthetic data pilot

Active-duty service members are busy. They rarely have free time, and it’s rarer still they want to spend that free time answering surveys.

That’s a problem for Kathleen Myers, assistant vice president of research strategy at Navy Federal Credit Union. The financial institution exclusively serves the military, veterans and their families, and it needs member feedback to know how to best meet their needs.

For research teams analyzing hard-to-reach constituents, synthetic data – artificial intelligence-generated data meant to mimic the real world – is an attractive alternative.

Research teams are already using synthetic data, and interest is only growing. Qualtrics found that 41% of market researchers are using synthetic data to supplement or replace human respondents, and 62% of respondents say they intend to use it.

“Active-duty people are really, really hard to find,” Myers said. “The thought of being able – once you're certain that the models behind it are kind of tuned appropriately – to use that for quick feedback or feedback at greater scale, I think is a huge opportunity for us.”

The credit union partnered with Qualtrics on a pilot project to understand how synthetic data performs compared to third-party human panels.

“Our intention was just to really test and learn more than walk away with some solid [key performance indicators] that we were going to use for decision-making,” Myers said. “This really was about innovation.”

Synthetic data has a plethora of benefits – it can save time, reduce bias and cost less. But whether the technology is ready for prime time is still a matter of debate among experts.

For Myers, it’s a matter of what synthetic data is being used for.

“It's just being judicious as to where you use it,” Myers said. “We saw that the synths are very good at making rational judgments, answering very functional-oriented questions. When it gets into emotion, then it gets a little trickier.”

What Navy Federal and Qualtrics found

Navy Federal and Qualtrics used synthetic data to test a general population’s attitudes toward trust and financial services for a potential credit card package. While the financial service provider sees potential benefits in using synthetic data for its military population, it wanted to take a broader approach for its initial test.

The pilot set out to determine the level of similarity between traditional and synthetic responses, areas of variance, the time and cost savings, and whether – and how often – the same decisions would be reached.

Navy Federal provided a survey, which it reviewed and revised with Qualtrics. Qualtrics then fielded the survey to 501 traditional third-party panelists and used it to model 498 synthetic responses.

Across all questions, responses appeared consistent between human and synthetic. Mean scores between the two groups were within 0.25%.

But there were differences: Synthetic responses tended to over-index on efficiency or financial benefits and to underestimate the need for empathy.

On the other hand, synthetic data can reduce human bias, Qualtrics found. Likert-scale questions – those that ask to what degree you agree on a scale – tend to suffer from an acquiescence bias in which humans are more likely to agree with a statement whether or not they actually agree.

People are also susceptible to social desirability bias, which makes them less likely to admit to risky behaviors. The synthetic data offered more differentiated responses.

Chart: Humans say they are reticent to apply for credit cards, fearing rejection. Percentage that agree with the following statement: "I'm reluctant to apply for a (an additional) card because I might get rejected."

In the pilot, synths and humans were asked to what degree they agree with the statement “I'm reluctant to apply for a (an additional) card because I might get rejected.” Such a question would likely suffer from both acquiescence bias and social desirability bias. Navy Federal saw a 20-percentage-point spread in answers, with 41% of human respondents agreeing and only 21% of synthetic responses agreeing.
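A gap of that size can be checked against sampling noise with a standard two-proportion z-test. The sketch below is illustrative only and is not part of the pilot's reported analysis. It assumes that all 501 human panelists and all 498 synthetic responses answered this item and that the two samples are independent; the function name is hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z(p1, n1, p2, n2):
    """Return (difference, z, two-sided p-value) for two independent proportions."""
    # Pooled proportion under the null hypothesis that both groups agree equally often
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = p1 - p2
    z = diff / se
    # Two-sided tail probability of the standard normal distribution
    p_value = erfc(abs(z) / sqrt(2))
    return diff, z, p_value


# Agreement rates from the article. Assumption: the full samples
# (501 human, 498 synthetic) answered this question.
diff, z, p = two_proportion_z(0.41, 501, 0.21, 498)
print(f"spread: {diff * 100:.0f} percentage points, z = {z:.2f}, p = {p:.2g}")
```

Under those assumptions the spread is far larger than sampling noise alone would explain. The test says nothing about which group is closer to the truth.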

With synthetic data, “there's less bias in terms of the social satisfying,” Myers said. “They're likely to be more honest in their answers, there's no feelings of shame or embarrassment, so you may get some very candid answers if you have a sensitive subject.”

Challenges of synthetic data

Humans are not always logical. They’re emotional and impulse-driven. And that’s where synthetic responses don’t always keep up.

While synths might be good at reducing bias, they are less accurate at mirroring human emotions and at capturing when those emotions affect action.

Take the pilot survey’s responses on trust in a variety of industries. The synthetic responses were far more likely to express trust in every industry, while humans distinguished more between industries.

Consumer concerns were another area of wide divergence. Synths were far more likely to select cybercrime as a top consumer concern: 24% of synthetic responses selected it, compared to just 6% of human respondents. The synthetic responses, however, more closely mirror research from Pew that shows about one in five adults lost money from an online scam or attack.

Chart: Synthetics provide logical responses but underestimate emotions. Human and synthetic responses on “the top three things that keeps you up at night.”

Synthetic responses favored efficiency or financial benefits but underestimated emotions, Myers said.

“We’ve observed that synthetic data will offer more rational responses to questions we pose, navigating functional trade-offs logically and effectively,” Myers said.

Andy Pierce, a member of Bain & Co.’s customer strategy and marketing practice, has seen the same.

“One thing that we found with the [large language models] is that they're highly rational, so if you, for instance, trained up an LLM and you asked it to make some sort of trade-offs with price versus features, you got very linear relationships between price and quality, or price and volume,” said Pierce, who also serves as global lead for value proposition innovation and design at Bain.

The LLMs follow the principle that more is better or that cheaper is better. But LLMs can learn.

“Through really smart prompt engineering, as well as introducing other kinds of data, we actually have learned how to train up the LLMs to become, frankly, more empathetic and more and more understanding that things like brands, which are highly emotional and sort of irrational in human behavior, can be reflected in an LLM in the synthetic world,” Pierce said.

There are also risks of acting on synthetic data without proper guardrails.

Forrester, for example, expects at least two major scandals this year from firms acting on AI-led customer research, as overstretched customer-experience teams rely on such applications as synthetic audiences and AI-moderated interviews. The analyst firm urges leaders to look to skilled research to pressure-test AI-generated insights.

Deciding when to use synthetic data

With interest in synthetic data growing, brands need to decide how and when to use it. Experts point to roadblocks that synthetic responses can overcome: hard-to-reach audiences, survey fatigue, and expensive, time-consuming surveys.

“I think it's great for if there's a low-incidence audience, so if there's a group that's very hard to find the needle in the haystack,” Myers said.

It’s also helpful for reducing survey fatigue – synthetics don’t get tired answering long surveys.

For speed and scale, synthetic responses are an advantage, but at some point in that process, it’s necessary to include “humans for the depth and the nuance,” Myers said.

In short, get “breadth from the synthetics, but depth from the human,” Myers said.

Human panels are also preferable for questions that evoke human emotion.

“At this point in time, if we know an area of inquiry will involve emotion and self-identity, maybe it’s brand-related, involving loyalty or aesthetic preferences, human respondents may be preferred,” Myers said.

The speed of synthetic research is one of its biggest attractions. Where a traditional panel takes five days, the synthetic data was collected and scrubbed in four hours, according to Qualtrics.

“You might do this because you're trying to pursue rapid time to market, so in some cases, data that could take months to collect through more traditional survey methods could be attained in days or in hours if you use the synthetic sources or a synthetic focus group,” said Lizzy Foo Kune, an analyst at Gartner.

Gartner’s research with a pharma company found that a traditional survey might have cost around $90,000. The synthetic version cost only around $20,000, with “the same types of results,” Kune said.
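The savings quoted here are easier to compare as ratios. The short sketch below works through the arithmetic. It is a rough illustration, not a figure reported by Gartner or Qualtrics. The cost and time numbers come from different sources and different studies, the dollar amounts are approximate, and treating “five days” as 120 elapsed hours is an assumption.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


# Gartner's pharma example: roughly $90,000 traditional vs. roughly $20,000 synthetic
cost_cut = percent_reduction(90_000, 20_000)

# Qualtrics' timing for the pilot: five days vs. four hours.
# Assumption: "five days" is counted as 5 * 24 elapsed hours.
speedup = (5 * 24) / 4

print(f"cost reduction: about {cost_cut:.0f}%")
print(f"turnaround: about {speedup:.0f}x faster")
```

If the five days were business days of eight hours each, the speedup would be 10x rather than 30x, so the ratio depends heavily on how the time is counted.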

As tempting as it may be to replace human surveys with synthetic data, Kune and other analysts advise against it.

“We tell clients that synthetic data should complement and not replace real-world data at this point, because it's still in its early days,” Kune said.

Moving forward, Myers already sees potential use cases. When the marketing team recently approached the Navy Federal research team asking for consumer feedback on 20 value statements, she saw synthetic research as a possibility.

“Can we test these 20 messaging statements first through synthetic?” Myers said. “For prioritization exercises, that's where I'm most excited about this – to quickly sort through large lists for the first level of feedback.”

The other use case Navy Federal identified is for pre-design, or “pre-pre-launch,” Myers said.

“So as you’re working at the design of that questionnaire, am I asking the right questions? Do they mean what I think they mean? Are there parts that are confusing?” she said. “You can get that kind of feedback by leveraging synthetics first before you turn it loose on the humans.”