Computer and Technology: Unstructured Information Management

Companies large and small create an impressive amount of data, including email messages, documents and presentations. Most of that data is unstructured and resides primarily on corporate file servers and on employee desktop and notebook computers. Industry analysts estimate that this unstructured data accounts for 80% of all corporate information, and expect it to grow 50% or more each year.

Unstructured Information is Unmanaged Information
Unstructured data is typically unmanaged. The file system on which this information resides typically is not monitored and the content is practically invisible to employees, auditors or corporate compliance officers. In an effort to provide a greater degree of visibility, control and management of this information to meet compliance reporting requirements, companies have implemented one or more technologies, each of which has advantages and disadvantages:

Enterprise search: An enterprise search engine is an effective way to index and find documents that contain certain terms. Most are easy to implement and require only a modicum of regular maintenance. Unfortunately, most enterprise search engines are tuned to find all the documents that may contain a particular term, rather than a specific document that may be required by an auditor. It is left to the user to winnow through all the returned documents to find what they need, which can be a time-consuming and costly exercise. Additionally, most search engines offer little ability to manage the documents they index.

Enterprise Content Management: ECM systems can effectively manage many types of content and can provide access and version control, both of which are effective aspects of information management. However, ECM systems also tend to be very expensive to set up and maintain. These systems typically require an organization to purchase server and user licenses, implement policies and processes for using the system, and train its users. Because of these costs, companies often limit their ECM implementations to specific areas of their business or types of data, such as documents that pertain to finance. According to many analyst organizations, ECM systems are being used to manage approximately five percent of today's corporate information.

File Backup: Many companies attempt to solve the problem of document retention by creating regular backups of all the data on the network. These backups are saved to tapes, which are then stored offsite for disaster recovery purposes. Backing up all data regardless of its business value is an inefficient use of time and resources, increases the cost of tape storage and decreases the likelihood of rapid single file recovery, which is the most-used aspect of file backup.

Doing nothing: This is the "solution" that many companies choose for handling unstructured information. Unfortunately, the prevailing thought among many has been that unstructured information is insignificant and therefore does not require management. After all, most of this information ranges from personal files to draft documents or one of dozens of copies of sales presentations, the majority of which aren't worth the cost required to manage them.

While most files aren't worth managing, the risk comes from the small number of files that do matter. For instance, your Sarbanes-Oxley policy and procedure manual, which took valuable internal resources, a consulting firm, and many months to create, has likely been copied from the content management system specially created for finance-related documents. The next time you update that manual with critical information, you have fulfilled one aspect of the act by tracking and recording those changes in your records management system. However, what about the dozens of copies that may have spread across the network on shared file servers? How can you be certain those copies are deleted or updated to keep people from following old procedures or controls? If you aren't doing anything to manage that data, you are leaving your company exposed and vulnerable.

Recognizing Valuable Information
Addressing these issues is key to an effective solution for Sarbanes-Oxley or any information governance initiative. Obviously doing nothing is not the answer. At the same time, it would be cost-prohibitive to manage all files as though they were critical business records. Therefore, the ability to specify which data is critical and worthy of this level of management is a crucial first step. If you are aware of the data's value, you can make educated decisions as to the disposition of important data and create an appropriate retention policy.

Determining the value of data depends on effective information visibility and control.

Information Visibility: The first aspect of recognizing valuable data requires that it be visible. While your compliance office may have access to all corporate information across the network, the sheer amount of data necessitates the use of technology to find and manage the appropriate documents.

Information Control: To effectively manage and control unstructured information, you need a solution that allows you to copy, move, delete or tag documents with custom metadata; i.e., information about the document. Even better, the solution should provide an integrated policy engine that can be customized with your company's information governance regulations. For instance, you might create a policy mandating that any document on the employee network that contains a customer account number must be 1) tagged with custom metadata of "Customer," and 2) moved to a secured server or file archive system.
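The customer-account policy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the account-number format (`ACCT-` plus eight digits), the `Document` record, the `apply_customer_policy` function and the `/secure/archive` destination are all hypothetical, and the "move" is simulated by updating the record's path rather than touching a real file system.

```python
import re
from dataclasses import dataclass, field

# Hypothetical account-number format: "ACCT-" followed by eight digits.
ACCOUNT_PATTERN = re.compile(r"\bACCT-\d{8}\b")


@dataclass
class Document:
    """Minimal stand-in for a cataloged file: location, extracted text, tags."""
    path: str
    text: str
    tags: set = field(default_factory=set)


def apply_customer_policy(doc, secure_root="/secure/archive"):
    """Tag the document "Customer" and relocate it if it holds an account number.

    The move is simulated by rewriting doc.path; a real system would also
    move the underlying file and preserve its access controls.
    """
    if ACCOUNT_PATTERN.search(doc.text):
        doc.tags.add("Customer")                 # 1) tag with custom metadata
        filename = doc.path.rsplit("/", 1)[-1]
        doc.path = f"{secure_root}/{filename}"   # 2) move to the secured location
    return doc
```

Documents that do not match the pattern are returned unchanged, so the rule can be run repeatedly over the same catalog without side effects.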

Data classification is an important aspect of information visibility and control. Several products have emerged or expanded into this space, to offer an all-embracing solution for complying with Sarbanes-Oxley and other regulations. By implementing one of these data classification systems, documents on your network can be located, opened and tagged according to the content found within each document. A typical classification workflow might look something like this:

1. Catalog: The system scans the file systems, finding and collecting file metadata from hundreds of file types.

2. Classify: Opening each document, the system classifies data according to file attributes and keywords or word patterns, and tags with custom metadata according to pre-set policies.

3. Search: The system allows users to find desirable information based on a combination of metadata and full document text, utilizing standard Windows and UNIX access control lists.

4. Report: The system should allow appropriate users to create and access summary or detailed reporting functionality.

5. Act: Finally, the system should integrate actions, such as tagging files with custom metadata, setting retention and monitoring policies, and offering move, copy and delete functionality, again based upon an access control list.
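The first four steps of this workflow can be sketched in a few lines of Python. This is a simplified illustration, not how any particular product works: the `POLICY` keyword table and the function names are hypothetical, only plain-text content is inspected, and access control lists and the Act step are left out.

```python
from pathlib import Path

# Hypothetical keyword-to-tag policy; a real system would load this from
# configuration and support word patterns as well as literal keywords.
POLICY = {"sarbanes-oxley": "SOX", "quarterly results": "Finance"}


def catalog(root):
    """Step 1: walk the file system and collect basic file metadata."""
    entries = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            stat = path.stat()
            entries.append({"path": path, "size": stat.st_size,
                            "modified": stat.st_mtime, "tags": set()})
    return entries


def classify(entries):
    """Step 2: open each file and tag it when a policy keyword appears."""
    for entry in entries:
        text = entry["path"].read_text(errors="ignore").lower()
        for keyword, tag in POLICY.items():
            if keyword in text:
                entry["tags"].add(tag)
    return entries


def search(entries, tag):
    """Step 3: return the cataloged files that carry a given tag."""
    return [entry for entry in entries if tag in entry["tags"]]


def report(entries):
    """Step 4: summarize how many files carry each tag."""
    counts = {}
    for entry in entries:
        for tag in entry["tags"]:
            counts[tag] = counts.get(tag, 0) + 1
    return counts
```

A call such as `report(classify(catalog("/srv/shared")))` (the path is an example) would return a tag-to-count summary for that directory tree.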

By contrast, an enterprise search engine provides an efficient method to find content that contains the search term you need. But then what? If you wanted to copy, move, delete or perhaps tag the document with customized metadata, you would have to do so manually.

Data Retention, Availability and Recovery
Retention is another aspect of corporate information that cannot be overlooked. While many companies elect to back up all data on a weekly or monthly basis, the cost of time and resources increases as the amount of data grows. Knowing what is in your data, by making information visible, by tagging with metadata and by controlling access, allows you to intelligently create a retention policy that moves or backs up only the data needed to comply with your corporate information governance policy or government regulation.

Most organizations use a backup solution that periodically copies data to tape or disk drives. An organization may back up its mission-critical data every night and all of its data every week. It may store the backup tapes for up to six months to guard against accidental deletions, send tape copies offsite as a safeguard against disaster and retain backup tapes long-term to meet regulatory requirements.

Lacking the means to gauge the value of the data, companies often take the safe route and back up all of it. Not only is this approach ineffective, it indicates inefficient data management and creates a potential risk. Data that is retained without any requirement to keep it can be used against a company in the event of a lawsuit or regulatory compliance issue. In this respect, backing up data in its entirety creates a liability.

Corporations can meet regulatory data retention requirements, cut backup and recovery costs and manage risk by introducing file archiving into the mix.

A file archiving system uses data classification to determine the content's value, then moves or copies files according to that value. File archiving systems can find and retrieve files based on their content. Any number of parameters can be used, including author, date, and customized tags such as "SEC 17a-4" or "Sarbanes-Oxley."

This naturally leads us to the tiering of storage services. Backup and file archiving are natural places to start for providing tiered storage services, based upon the value of the data in your network.

As an example, consider a company that has 10 terabytes (TB) of data on production file servers. In the past, the company may have backed up critical files onto disk storage and then backed up all files onto tape once a week. The company catalogued the tapes, kept them for three months and then cycled them back through the process. New government regulations mandate that all data related to quarterly financial results must be kept for five years. Unfortunately, the company has no way to differentiate among the disparate types of data on its network. The company is forced to retain all of the data for five years, expanding the amount retained from 10 TB to 2.5 petabytes (PB). As data amounts double annually, so will the amount that must be stored. The company will find itself devoting more and more time and resources to data backup.
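The text does not show how the 2.5 PB figure is derived. One reading that reproduces it, offered here only as an assumption, is that every weekly full backup of the 10 TB data set is kept for the full five years, counting roughly 50 backups per year and ignoring data growth:

```python
# Assumptions (not stated in the text): one full backup per week, about 50
# backups counted per year, every backup retained for five years, no growth.
production_tb = 10
backups_per_year = 50
retention_years = 5

retained_tb = production_tb * backups_per_year * retention_years
retained_pb = retained_tb / 1000  # decimal units: 1 PB = 1000 TB

print(retained_tb, retained_pb)  # 2500 2.5
```

With 52 backups per year the total would be 2.6 PB, so the figure is best read as an approximation; any annual growth in the production data would push it higher.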

To solve this problem, let us assume that the company implemented a data classification system. By discovering the value of its unstructured information and tagging according to the value, the company copied 500 GB of financial reporting data to WORM storage for long-term retention and moved 7 TB to tiered storage, which is backed up to tape every three months. The data in three-month storage would total 42 TB, compared with the 2.5 PB that would have been required if the data had not been archived. With tiered storage, the company significantly reduced backup time and resources, shrank the cost of production file storage and increased its IT service levels by freeing up personnel and data for other tasks.

Tiering your data storage services allows you to put SOX controls only around the data that pertains to your financial information and lock down the appropriate data on compliance-specific storage boxes.

Proving Compliance

The old adage is true: the best defense is a good offense. In the case of Sarbanes-Oxley compliance, the best offense is to create and implement provable policies. Having a data classification system allows you to produce standard reports that show duplicate copies of applicable documents, that show who has accessed the file within a specific time period, and that monitor implementation of your information governance policies. With reporting functionality available in a dashboard implementation, you can think of your system as a burglar alarm: a deterrent to potential wrongdoing and a way to prove that you're actively checking for compliance-related issues.
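A duplicate-copy report of the kind mentioned above is commonly built by grouping files on a hash of their content. The sketch below assumes that approach; the function name `find_duplicates` is hypothetical, and the text does not say how any particular product detects duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root):
    """Group files under root by SHA-256 content hash.

    Returns a list of groups, each a sorted list of paths whose contents are
    byte-for-byte identical. Files are read fully into memory, which is fine
    for a sketch but would need chunked reads for very large files.
    """
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Exact hashing only catches identical copies. An edited or re-saved copy of a policy manual would hash differently, so a report like this is a starting point rather than a complete inventory of stale copies.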

Best Practices
Implementing one of today's data classification systems should be an integral part of your Sarbanes-Oxley best practices. Setting information governance policies fulfills a basic requirement. Active management of your unstructured data will find, tag and move content according to your corporate policies, lowering the risk that information will "fall through the cracks" and potentially protecting you from breaking the law. Creating a tiered storage system will allow you to set retention policies according to the value of the content, saving money and reducing risks. And proving compliance, or at least showing that you're attempting to comply, is sometimes the best way to meet and exceed current and future government regulations, not only around financial systems but around employee and customer privacy as well.

Reducing Risk and Lowering Costs
In the end, visibility into and control of your unstructured information reduce risks (compliance violations, litigation exposure, untimely responses, and privacy and security breaches) and lower costs through streamlined storage operations, improved service levels and automated policy-driven data management.


Friday, 17 March 2017

Unstructured Information Management - What You Don't Know Can Hurt You!
