To state and local managers, nothing is better than a good supply of facts when it comes to making decisions. Not viral moments, flashy anecdotes or widely held opinions, but something grounded solely in data that offers a defensible way to choose among competing paths. A recent paper from the Brookings Institution’s Hamilton Project titled “Fact-Based Policy: How Do State and Local Governments Accomplish It?” suggests using data lakes to improve the policymaking process.
The report’s definition of a data lake is closer to that of a data warehouse. The institution suggests that, to make fact-based policy, a data lake must have processed data and derived tables (summary info); documentation, common coding and codebooks for information, such as location; automated processes for loading new data; and management buy-in to ensure the data lake is maintained so it is useful across a broad set of tasks.
From an IT perspective, building a data lake is much more about the process and documentation than finding the right database or storage technology. Unlike many other application-focused projects, a data lake isn’t something you develop and then put into operation; it is a resource requiring continuous development, care and feeding. Thus, the IT commitment to data lake operation and maintenance is much higher than a typical application would see — and it’s important that IT managers put these expenses on the table so there are no surprises.
Data lakes are based on technology all IT managers already know: databases. But a data lake is not a standard database project, and the normal tools and techniques generally don’t apply. Although a data lake is indeed a big database, there are four ways in which building and maintaining a data lake differs significantly from other types of databases that an organization may be using.
1. Use Data Lakes' Flexibility to Make Specific Queries
Performance and highly structured queries are not particularly important in a data lake. Transactions and queries won’t be made in real time or in response to an immediate request. Storing data in a data lake also will not be a real-time operation; more likely, batches will be added on a daily or monthly basis. For a data lake, the important performance metric is the ability to support ad hoc queries and iterative data retrievals.
Data lakes can be used by policy leaders and data scientists who want to understand a problem or answer a specific question; they won’t be looking for a single record to pop out in response to their queries. Data lake databases can be large and have many tables, but IT teams developing data lakes should put their emphasis on user-friendly query languages, graphical query systems and other tools that facilitate getting data out of the lake, rather than optimizing performance.
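The iterative, ad hoc pattern described above can be sketched in a few lines. This is a minimal illustration, not a prescribed design: the table, columns and figures are invented, and an in-memory SQLite database stands in for a real data lake store.

```python
import sqlite3

# Hypothetical example: a tiny in-memory table standing in for one
# derived table in the lake (names and numbers are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE program_outcomes (
        county TEXT, year INTEGER, participants INTEGER, completions INTEGER
    )
""")
conn.executemany(
    "INSERT INTO program_outcomes VALUES (?, ?, ?, ?)",
    [("Adams", 2018, 120, 90), ("Adams", 2019, 140, 112),
     ("Baker", 2018, 80, 40), ("Baker", 2019, 95, 57)],
)

# An analyst's first ad hoc question: completion rate by county.
rates = conn.execute("""
    SELECT county,
           ROUND(1.0 * SUM(completions) / SUM(participants), 2) AS rate
    FROM program_outcomes
    GROUP BY county
    ORDER BY rate DESC
""").fetchall()

# The refined follow-up question reuses the same data: year-by-year
# rates for the lowest-performing county -- an iterative retrieval,
# not a single-record lookup.
detail = conn.execute("""
    SELECT year, ROUND(1.0 * completions / participants, 2)
    FROM program_outcomes
    WHERE county = ?
    ORDER BY year
""", (rates[-1][0],)).fetchall()
```

Note that neither query cares about millisecond latency; what matters is that the second question could be asked without any schema change or application work.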
2. Add Access Controls to Areas of the Data Lake
A data lake will ideally have information contributed by many different groups, and may even serve as a cross-agency and cross-departmental resource. A data lake will probably contain sensitive, private information, so security and accessibility controls should be considered early on. Because the goal of the data lake is to support decision-making, applying anonymization to any personally identifiable information during data loading is a very good way to reduce the likelihood of problems down the line.
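One common way to apply that load-time step is to replace identifiers with stable, non-reversible tokens (strictly speaking, pseudonymization). The sketch below assumes a per-deployment secret salt and invented field names; it is an illustration of the idea, not a vetted privacy control.

```python
import hashlib

# Assumption: each deployment keeps its own secret salt so tokens
# cannot be recomputed by outsiders. The value below is a placeholder.
SALT = b"replace-with-a-secret-per-deployment-value"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable, non-reversible token."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

def load_record(record: dict, pii_fields: set) -> dict:
    """Copy a record, tokenizing PII fields before it enters the lake."""
    return {k: pseudonymize(v) if k in pii_fields else v
            for k, v in record.items()}

# Illustrative record; raw name and SSN never land in the lake.
raw = {"name": "Jane Doe", "ssn": "123-45-6789",
       "county": "Adams", "year": 2019}
clean = load_record(raw, pii_fields={"name", "ssn"})
```

Because the same input always yields the same token, records about one person can still be joined across sources, while the analytic columns (`county`, `year`) pass through untouched.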
Security can be a difficult balance, but the basics of securing a data lake are not much different from any other enterprise database, and normal IT policies and procedures can almost always be applied. Users should have individual (rather than group) usernames, queries should be logged, and the different areas of the data lake should be controlled by group or individual access controls. If nonanonymized personal information is present, the data lake must not be a back door for a data breach.
[Chart: The number of states using cost-benefit analysis in at least one policy area. Source: The Hamilton Project, “Fact-Based Policy: How Do State and Local Governments Accomplish It?” January 2019]
3. Put in Place Detailed Documentation to Track and Catalog Data
Unlike most databases, the data lake is not intended to support a specific application, which means that a tight binding between the application and the database won’t be in place. The result is that institutional knowledge about information placed in the data lake can be easily lost if it is not captured in exceptionally good documentation.
Data lakes may have automated processes to add data, but the information will come from a wide variety of sources and applications and move down different paths. If the metadata about each data lake source is not carefully maintained, the data lake can quickly turn into a data swamp, housing enormous piles of data that have no clear provenance. Current trends in agile development eschew documentation in favor of getting software running quickly, but this approach cannot be allowed to infect the data lake. It does no good to load data into the data lake using an agile approach if no one can say what the data means. IT managers may have to wade deep into quality control if teams are not accustomed to this level of documentation.
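The provenance discipline described above can be made concrete by having the automated load process write a catalog entry alongside every table it touches. The entry fields below are an assumption, not a standard; the point is that metadata is captured at load time, not reconstructed later from memory.

```python
import datetime

# In a real lake this would be a catalog table, not a Python list.
CATALOG = []

def register_load(table, source_system, description, columns):
    """Record where a table came from and what its columns mean."""
    entry = {
        "table": table,
        "source_system": source_system,
        "loaded_at": datetime.datetime.now(
            datetime.timezone.utc).isoformat(),
        "description": description,
        "columns": columns,  # column -> plain-language meaning and units
    }
    CATALOG.append(entry)
    return entry

# Hypothetical source and schema, for illustration only.
register_load(
    table="transit_ridership",
    source_system="fare-collection monthly export",
    description="Monthly boardings by route, all modes",
    columns={"route_id": "agency route identifier",
             "month": "calendar month, YYYY-MM",
             "boardings": "total boardings, unlinked trips"},
)
```

Anyone querying `transit_ridership` years later can ask the catalog which system produced the numbers and what a “boarding” actually counts, which is exactly the institutional knowledge that otherwise evaporates.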
4. Invest in IT to Support Large-Scale Databases
Most databases have a clear growth rate and, hopefully, a good data retirement and archiving policy. To ensure that things run smoothly and correctly, database administrators use this information to size storage systems, servers and everything else. In contrast, data lakes don’t share these predictable characteristics.
A data lake may have tables that are never archived or retired, just because the long-term trend information in the data may be exactly what someone is looking for. And because data lakes get their data from various sources, how much data the database will eventually hold may be difficult to predict. To handle these growth issues, IT managers developing and supporting a data lake should investigate technologies that support very large-scale databases at reasonable costs.
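A back-of-the-envelope projection shows why never-retired tables make steady-state sizing misleading. The sources, batch sizes and growth rates below are invented for illustration; the takeaway is that capacity planning has to model accumulation over years, not a fixed footprint.

```python
# Hypothetical sources: (monthly batch size in GB, annual growth rate
# of that batch size). None of these tables is ever retired.
SOURCES = {
    "tax_records": (5.0, 0.03),
    "transit_ridership": (0.5, 0.10),
    "program_outcomes": (0.2, 0.25),
}

def projected_size_gb(years: int) -> float:
    """Total storage after N years, with batches growing each year."""
    total = 0.0
    for monthly_gb, growth in SOURCES.values():
        for year in range(years):
            total += 12 * monthly_gb * (1 + growth) ** year
    return round(total, 1)

projections = {horizon: projected_size_gb(horizon) for horizon in (1, 3, 5)}
```

Even modest per-source growth compounds once nothing is deleted, which is why the paragraph above recommends evaluating large-scale storage economics up front.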
Illustrations by Rob Dobi