Why Does Synthetic Data Appeal to State and Local Agencies?
Synthetic data is generated via machine learning, using generative AI to build a data set based on real-world data. It is not a copy of the original records; instead, it is designed to closely mirror the original data's patterns, correlations and statistical properties.
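To make the idea concrete, here is a minimal Python sketch of generating synthetic records that preserve a statistical relationship found in real data. It is an illustration only, not the method used by any agency mentioned in this article: a simple two-column normal model stands in for generative AI, the "real" records are made up, and the column names (attendance, scores) and function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def fit(xs, ys):
    """Summarize two columns by their means, standard deviations and correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return mx, my, sx, sy, pearson(xs, ys)


def sample(params, n, rng):
    """Draw n synthetic (x, y) pairs from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, r = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 and z2 this way gives x and y the fitted correlation r
        y = my + sy * (r * z1 + math.sqrt(1 - r * r) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)
# Made-up stand-in for "real" records: an attendance rate and a test score that depends on it
attendance = [rng.uniform(0.6, 1.0) for _ in range(2000)]
scores = [50 + 40 * a + rng.gauss(0, 5) for a in attendance]

params = fit(attendance, scores)
synthetic = sample(params, 2000, rng)
syn_att = [row[0] for row in synthetic]
syn_score = [row[1] for row in synthetic]

real_r = pearson(attendance, scores)
synthetic_r = pearson(syn_att, syn_score)
print(f"real correlation: {real_r:.2f}, synthetic correlation: {synthetic_r:.2f}")
```

In this sketch, none of the synthetic rows is taken from the made-up source table, yet the link between the two columns carries over. Production tools model many more columns and far richer relationships.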
For state and local governments, synthetic data generation solves a number of problems.
At the Maryland Longitudinal Data System Center, which has been experimenting with the creation of synthetic data sets, Executive Director Ross Goldstein says there is value in using synthetic data to help protect private or sensitive information.
The center created a proof of concept using synthetic data to train AI on educational statistics without having to access actual student information. This holds promise for other areas as well, such as training AI in support of citizen services.
“Synthetic data could be a useful tool to provide access to data, without the risk of disclosure or misuse of the actual data,” Goldstein says.
This in turn could accelerate AI adoption, says Kalyan Veeramachaneni, principal research scientist in MIT’s Laboratory for Information and Decision Systems.
In AI development, “state and local governments often will end up hiring third-party software consulting firms” to help develop and test applications, he says. Agencies can safely give these partners access to synthetic data “because it is not connected to any real person. It's not connected to any particular real data.”
Synthetic data could also help state and local agencies to train AI in situations where real-world data is sparse or hard to come by. It can help replace outdated information or fill in gaps where information is lacking, “relieving the burden of obtaining real-world data,” according to Gartner research.
“A lot of state and local governments will ask folks to volunteer their data to be used for AI-model development purposes,” Veeramachaneni says. “Maybe a hundred of us volunteered our data, but most of us did not. Synthetic data allows you to augment that data so that AI algorithms can extract patterns.”
How Do State and Local Governments Use Synthetic Data?
Recent projects show how agencies are beginning to put synthetic data to work.
At the Maryland Longitudinal Data System Center, for example, researchers from several universities “collaborated on a project to determine the feasibility of creating synthetic data from linked longitudinal education data,” Goldstein says.
“The researchers were able to create three synthetic data sets and show that they accurately represented the real data and did not create a risk of disclosing any personally identifiable information about Maryland students or workers,” he says. The results could help policymakers and other stakeholders gain insights needed to elevate educational outcomes.
In another recent example, the Urban Institute has partnered with the Allegheny County Department of Human Services and the Western Pennsylvania Regional Data Center to pilot synthetic data generation at the local level. The goal is to improve care coordination and drive operational improvements across a range of social services.
Local government agencies can make wide use of synthetic data, Veeramachaneni says. In managing the electric grid, for example, AI trained on synthetic data could help predict outages.
“When you give it enough training data — where you have a past event that happened at a certain point of time at a transformer, for example — and you have all the data preceding that event, the AI model will automatically learn what kind of patterns led to that event,” he says. “Once you can create synthetic data and provide it alongside real data, it can help create more accurate models.”
Overall, synthetic data “could be used by state and local governments to train AI in a variety of applications and services such as urban planning, public safety, emergency management, pandemic prevention and air quality monitoring,” says IEEE Fellow Houbing Herbert Song.
What Are the Types of Synthetic Data?
Synthetic data can take a number of forms.
Amazon Web Services, for example, describes two main types of synthetic data: partial and full. Partial synthetic data replaces a small part of a real data set with synthetic values and can be used to protect sensitive information within that larger set. Full synthetic data, by comparison, contains no real-world data. Such data can be used when there’s insufficient data available to train AI accurately.
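The difference between the two types can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not AWS's implementation: the permit table is made up, the function names are hypothetical, and the "generator" for the fully synthetic set uses only basic statistics (observed category frequencies and a normal fit for fees).

```python
import random
import statistics

# Made-up stand-in for a real permit table; no actual records are used here
real_permits = [
    {"applicant": "Jane Roe", "permit_type": "building", "fee": 250.0},
    {"applicant": "John Doe", "permit_type": "electrical", "fee": 120.0},
    {"applicant": "Ann Poe", "permit_type": "building", "fee": 310.0},
    {"applicant": "Sam Low", "permit_type": "plumbing", "fee": 95.0},
]


def partially_synthetic(rows):
    """Replace only the sensitive field with a synthetic value; keep the rest as is."""
    return [
        {**row, "applicant": f"Applicant-{i:04d}"}
        for i, row in enumerate(rows, start=1)
    ]


def fully_synthetic(rows, n, rng):
    """Generate every field from simple statistics of the real table."""
    types = [row["permit_type"] for row in rows]
    fees = [row["fee"] for row in rows]
    mean_fee, sd_fee = statistics.mean(fees), statistics.stdev(fees)
    out = []
    for i in range(1, n + 1):
        out.append({
            "applicant": f"Applicant-{i:04d}",
            # Drawing from the observed list follows the observed type frequencies
            "permit_type": rng.choice(types),
            # Fees are drawn from a normal fit and floored at zero
            "fee": round(max(0.0, rng.gauss(mean_fee, sd_fee)), 2),
        })
    return out


rng = random.Random(1)
partial = partially_synthetic(real_permits)
full = fully_synthetic(real_permits, 6, rng)
```

Here `partial` keeps the real permit types and fees but swaps out applicant names, while `full` contains only generated rows and can hold more rows than the source table. Sampling each column independently, as this sketch does, drops relationships between columns that a real generator would aim to keep.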
Synthetic data types also can be defined by use case.
“You have synthetic language data, where you learn from a large language corpus and can generate English sentences,” Veeramachaneni says. “There's also synthetic media data: images, audio, video.”
“The third type of synthetic data is tabular data, which is what a lot of state and local agencies have. Examples of these range from time-stamped voltage data on a power line, occupancy data in different residential or commercial complexes, or data on permits that were given out,” he says.
“That tabular data is really complex, because you have many, many different data sources, different data tables connected in numerous ways, and these interconnections are where all the patterns are,” he adds. “In synthetic data, we can replicate all of those properties and patterns.”
What Is the Impact of Synthetic Data in a Data Management Strategy?
New forms of data will inevitably impact the ways in which state and local agencies handle and store their information resources.
“Synthetic data is transforming the way data is managed, just as the internet has transformed the way data is transmitted,” Song says.
As part of their data management strategies, some IT teams are creating synthetic data platforms, “platforms that allow them to create a database that has synthetic data in it,” Veeramachaneni says.
Primarily, the aim is to clearly identify and track synthetic data in order to differentiate it from real-world data.
“Synthetic data looks like real data,” Veeramachaneni says. “We need to mark it, so that people will know which one is a real database and which one is synthetic data. When people do analysis or use it for downstream applications, they need to know that they are accessing synthetic data, not the real data.”
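One simple way to carry that kind of label is to attach it to the data set itself, so that downstream code can check it. The Python sketch below is a hypothetical illustration, not a description of any particular synthetic data platform; the `DataSet` class, its fields and the helper functions are all assumed names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """A table plus a provenance label that travels with it."""
    name: str
    rows: tuple
    is_synthetic: bool
    source: str = ""  # e.g. the name of the real data set a generator was fit on


def require_real(dataset):
    """Guard for analyses that must not run on synthetic records."""
    if dataset.is_synthetic:
        raise ValueError(f"{dataset.name} is synthetic; real data is required here")
    return dataset


def describe(dataset):
    """Human-readable label to show wherever the data is displayed or exported."""
    kind = "SYNTHETIC" if dataset.is_synthetic else "REAL"
    return f"[{kind}] {dataset.name} ({len(dataset.rows)} rows)"


# Made-up example tables
real = DataSet("permits_2023", ({"fee": 250.0},), is_synthetic=False)
synthetic = DataSet(
    "permits_2023_synth", ({"fee": 241.7},), is_synthetic=True, source="permits_2023"
)

print(describe(real))
print(describe(synthetic))
```

Because the flag is part of the object rather than a naming convention, an analysis that needs real records can call `require_real` and fail loudly instead of silently running on generated data.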
With robust data management strategies in place, agencies will be able to make full use of synthetic data. They’ll be able to augment real-world data sets where data is insufficient while protecting constituent and employee privacy as they train AI models in support of improved constituent services and operational efficiency.