1. Use Data Lakes' Flexibility to Make Specific Queries
Raw performance and highly structured queries are not particularly important in a data lake. Transactions and queries are not made in real time or in response to an immediate request. Loading data into a data lake is not a real-time operation either; more likely, batches will be added on a daily or monthly basis. For a data lake, the metric that matters is the ability to support ad hoc queries and iterative data retrievals.
Data lakes can be used by policy leaders and data scientists who want to understand a problem or answer a specific question; they won’t be looking for a single record to pop out in response to their queries. Data lake databases can be large and have many tables, but IT teams developing data lakes should put their emphasis on user-friendly query languages, graphical query systems and other tools that make it easy to get data out of the lake, rather than on optimizing performance.
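As a rough illustration of batch loading followed by iterative, ad hoc querying, the sketch below uses Python's built-in sqlite3 module as a stand-in for whatever query engine a real data lake would expose. The table name, columns, and sample rows are hypothetical and exist only for the example.

```python
import sqlite3

# In-memory database standing in for the lake's query engine (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clinic_visits (region TEXT, month TEXT, n_visits INTEGER)"
)


def load_batch(conn, rows):
    """Append one periodic batch; the 'with' block commits it as a unit."""
    with conn:
        conn.executemany("INSERT INTO clinic_visits VALUES (?, ?, ?)", rows)


def totals_by_region(conn):
    """First pass: a broad aggregate to see where to look next."""
    return conn.execute(
        "SELECT region, SUM(n_visits) FROM clinic_visits "
        "GROUP BY region ORDER BY region"
    ).fetchall()


def monthly_trend(conn, region):
    """Follow-up pass: drill into one region found in the first pass."""
    return conn.execute(
        "SELECT month, n_visits FROM clinic_visits "
        "WHERE region = ? ORDER BY month",
        (region,),
    ).fetchall()


# Hypothetical monthly batch.
load_batch(conn, [
    ("north", "2024-01", 120),
    ("south", "2024-01", 80),
    ("north", "2024-02", 150),
])
```

An analyst might call totals_by_region(conn) first, then refine the question with monthly_trend(conn, "north"). Neither query needs to return in real time; what matters is that each follow-up question is easy to ask.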
2. Add Access Controls to Areas of the Data Lake
A data lake will ideally have information contributed by many different groups, and may even serve as a cross-agency and cross-departmental resource. A data lake will probably contain sensitive, private information, so security and access controls should be considered early on. Because the goal of the data lake is to support decision-making, anonymizing any personally identifiable information during data loading is a very good way to reduce the likelihood of problems down the line.
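One way to handle personal data at load time is sketched below. Strictly speaking this is keyed pseudonymization rather than full anonymization: each sensitive value is replaced with an HMAC-SHA256 token, so records about the same person can still be linked without the raw value entering the lake. The field names in PII_FIELDS and the function names are hypothetical, and a real deployment would need to manage the secret key outside the lake.

```python
import hashlib
import hmac

# Hypothetical set of sensitive field names; a real schema will differ.
PII_FIELDS = {"name", "email", "ssn"}


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest of the value as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_record(record: dict, key: bytes) -> dict:
    """Replace PII fields with tokens; leave all other fields unchanged."""
    return {
        field: pseudonymize(str(value), key) if field in PII_FIELDS else value
        for field, value in record.items()
    }


def scrub_batch(records, key: bytes) -> list:
    """Scrub every record in a batch before it is written to the lake."""
    return [scrub_record(record, key) for record in records]
```

Because the same key always yields the same token for the same input, analysts can still count or join on a person without seeing who that person is. Fields with few possible values may still be guessable in combination, so this reduces risk rather than removing it.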
Security can be a difficult balance, but the basics of securing a data lake are not much different from any other enterprise database, and normal IT policies and procedures can almost always be applied. Users should have individual (rather than group) usernames, queries should be logged, and the different areas of the data lake should be controlled by group or individual access controls. If nonanonymized personal information is present, the data lake must not be a back door for a data breach.
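The three basics above (individual usernames, logged queries, and per-area access controls) can be sketched in a few lines. This is a minimal illustration, not a substitute for the access controls built into a real platform; the area names, group names, AREA_ACL mapping, and the execute callback are all hypothetical.

```python
import logging
from dataclasses import dataclass, field

audit_log = logging.getLogger("lake.audit")


@dataclass(frozen=True)
class User:
    """An individual account; group membership is attached to the person."""
    username: str
    groups: frozenset = field(default_factory=frozenset)


# Hypothetical access lists: each area names the groups and users allowed in.
AREA_ACL = {
    "public_health": {"groups": {"analysts"}, "users": set()},
    "case_files_raw": {"groups": set(), "users": {"jdoe"}},
}


def can_read(user: User, area: str) -> bool:
    """Allow access only if the area exists and lists the user or a group."""
    acl = AREA_ACL.get(area)
    if acl is None:
        return False  # unknown areas are denied by default
    return user.username in acl["users"] or bool(user.groups & acl["groups"])


def run_query(user: User, area: str, sql: str, execute):
    """Log every attempt under the individual username, then enforce the ACL."""
    allowed = can_read(user, area)
    audit_log.info(
        "user=%s area=%s allowed=%s query=%r", user.username, area, allowed, sql
    )
    if not allowed:
        raise PermissionError(f"{user.username} may not read {area}")
    return execute(sql)
```

Logging before the permission check means denied attempts are recorded too, which is often the more interesting part of an audit trail.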