Enterprise Search Implementation, Step 4: Database Cleanup (Hiding What You Shouldn’t Be Able To Find)

30 Nov

Guest Post By John Gillies, Director of Practice Support at Cassels Brock

This is the fourth in a series of posts about the process of choosing and implementing an enterprise search engine in a law firm. The first addressed Establishing the Business Requirements and the second looked at Picking the Right Search Engine. The last one looked at the Proof of Concept stage, which is where you put your selected engine to the test and ensure that it performs as expected in your environment. Assuming that it passed those tests and the decision has been made to proceed, the next hurdle is cleaning up the databases you will be indexing.

As part of your strategic planning, you will have decided which databases those are. The primary advantage of indexing two or more databases is that users see, in a single set of results, content that they would otherwise have to search for separately in each database. The main disadvantage is that, in mixing apples and oranges (as it were), you will have to ensure that the results are displayed in a way that users can understand and use. In the initial roll-out of enterprise search at our firm, for example, we opted to index only the documents in the document management system (DMS). We did this so that we could start with focused content, train users on using the tool for that content, and then slowly build the available content.

Among the databases commonly indexed for enterprise search are those in the accounting system, the DMS, the library catalogue, the KM/precedent repository, relevant content on the firm intranet, and the legal updates on your firm website. While indexing the content in the last four items on this list should be fairly straightforward, indexing the accounting system and the DMS poses its own challenges.

Accounting System

Indexing the accounting system requires you to make policy decisions as to who will be able to see what content. For example, can all users search the financial data? Only certain users? All the accounting data or only certain segments? Furthermore, from a usability perspective, while the search engine offers the ability to deliver all the content that corresponds to the search criteria, you may wish to narrow the financial data indexed so as not to overwhelm the user.

One issue to address is whether to index time entry narratives. Those narratives may provide very relevant information, particularly when identifying internal expertise. The question is whether the firm wishes to expose this information to all users. This is one area where the solution is not all or nothing. You may choose, for example, to index this data and use the results for determining relevance, without displaying the actual content.
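The "index but do not display" option can be pictured with a minimal, engine-agnostic sketch. Everything here is an assumption for illustration: `TimeEntry` and `search_expertise` are hypothetical names, not part of any search product, and a real engine would handle this through its own field-level settings. The point is only that the narrative text drives the ranking while the result exposes nothing but the lawyer's name and a hit count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeEntry:
    lawyer: str
    matter: str
    narrative: str  # searchable, but never returned to the user


def search_expertise(entries, query):
    """Rank lawyers by how many of their narratives mention the query.

    Narratives contribute to relevance only; the result exposes
    lawyer names and hit counts, not the narrative text itself.
    """
    needle = query.lower()
    counts = {}
    for entry in entries:
        if needle in entry.narrative.lower():
            counts[entry.lawyer] = counts.get(entry.lawyer, 0) + 1
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data.
entries = [
    TimeEntry("A. Lawyer", "1001", "Drafted franchise disclosure document"),
    TimeEntry("A. Lawyer", "1002", "Reviewed franchise agreement"),
    TimeEntry("B. Lawyer", "2001", "Call with client re franchise renewal"),
    TimeEntry("C. Lawyer", "3001", "Prepared closing agenda"),
]
results = search_expertise(entries, "franchise")
```

A user searching for "franchise" would learn who has done the most relevant work, without ever seeing what was written in the time entries.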

Document Management System

You will have several concerns with indexing your DMS content. First and foremost you will need to ensure that your confidentiality screens effectively deal with relevant content. This works both ways. In other words, those behind a screen need to be able to find content that they are entitled to view, and those outside the screen need to be blocked from seeing any of that content.
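The "both ways" test can be expressed as a small filter applied to results before they are shown. This is only a sketch under stated assumptions: the function `visible_documents`, the `screens` mapping, and the user IDs are all hypothetical, and in practice the search engine would read these permissions from the DMS rather than from a dictionary.

```python
def visible_documents(documents, user, screens):
    """Return only the documents the user may see.

    `screens` maps a matter number to the set of users inside that
    confidentiality screen. A matter with no screen is open to everyone.
    """
    allowed = []
    for doc in documents:
        members = screens.get(doc["matter"])
        if members is None or user in members:
            allowed.append(doc)
    return allowed


# Hypothetical sample data: matter 1001 is screened, matter 2002 is not.
documents = [
    {"id": "D1", "matter": "1001", "title": "Share purchase agreement"},
    {"id": "D2", "matter": "2002", "title": "Lease summary"},
]
screens = {"1001": {"asmith", "bjones"}}

inside = visible_documents(documents, "asmith", screens)   # behind the screen
outside = visible_documents(documents, "cwong", screens)   # outside the screen
```

When you test your screens, check both lists: the user behind the screen should find the screened document, and the user outside it should not see it at all.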

Dealing With “Sensitive” Documents

It is, however, the problem of “sensitive” documents in the DMS that will prove to be the most vexing. “Sensitive content” may include, for example, confidential memos from firm committees, memos regarding partner allocations and associate compensation, performance reviews, and so forth.

(You may wish to review the PowerPoint slides from the ILTA 11 presentation entitled Managing Risks Associated with Enterprise Search, given by a panel composed of Lisa Kellar Gianakos, the Director of Knowledge Management at Pillsbury Winthrop, Rizwan Khan, the Vice-President of Customer Service at Autonomy, and me.)

Typically, in the process of implementing enterprise search, firms discover that sensitive content that should not, for whatever reason, be public has in fact been filed in a publicly accessible part of the DMS. Until that point, that content had not really been available because, realistically, users would have been unable to find it (colloquially referred to as “security through obscurity”). With the advent of better search, that approach is no longer possible.

One way to start finding and securing this content is to draw up a list of “dirty words”. You may wish to begin by referring to the terms on List A that formed part of our ILTA presentation (which are also reproduced as an appendix at the end of this article).

This slide from our presentation shows the most frequently recurring “dirty words” as a tag cloud:

[Image: Dirty Word Tag Cloud]

You will, however, need to exercise discretion when reviewing the results that a search for these terms returns. For example, while it might seem logical to search for curse words, they are frequently used in e-mails and other documents that are sent to the firm and in court transcripts, so you should not set up an absolute rule to exclude these terms.

Consider searching for some or all of the following:

  1. Terms related to the payment of personal income taxes (e.g., where a lawyer has saved to the DMS letters related to the amount and/or payment of personal income taxes).
  2. Wills and related documents such as “last will and testament”, “living will”, and related terms, such as “life support”. Do the same relating to family law matters, like “divorce”, “separation”, “alimony”, “cohabitation”, etc. (The exact terms will depend on the terms used in your jurisdiction.) Note, however, that if your firm has an estates or a family law practice, a number of these terms may legitimately form part of client files. If firm members have used the services of either the estates or family law group, ensure those files are protected.
  3. Names of firm committees such as “executive committee”, “management committee”, etc. Confidential e-mails to and between committee members are not infrequently filed in publicly accessible locations.
  4. Terms like “cottage”, “country house”, or whatever people may call their secondary residence.
  5. Documents filed under the personal matter numbers of firm members (if you have such numbers), although there may be relevant public material there such as conference papers, articles, publications, etc.

Check with your Finance and HR departments to find out what terms they would search for. Also, seek suggestions from your pilot group, since they may well come up with terms that your implementation team will not have thought of.
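A first pass over the terms above could be sketched along the following lines. This is a hypothetical illustration, not a feature of any DMS or search engine: the document fields (`public`, `matter_type`, `text`), the function `build_review_queue`, and the short term list are all assumptions. Note that the sketch only builds a queue for a person to review. It does not hide anything automatically, because (as with estates or family law files) a hit may be entirely legitimate.

```python
import re

# A few sample terms; your own list will be longer.
DIRTY_TERMS = [
    "partner compensation",
    "performance review",
    "last will and testament",
]


def build_review_queue(documents, terms):
    """List publicly accessible documents that mention any watch term.

    Matching is case-insensitive and on whole words. Each queue entry
    carries the matter type so the reviewer can judge whether the hit
    is a legitimate client document.
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    queue = []
    for doc in documents:
        if not doc["public"]:
            continue  # already secured, nothing to review
        hits = sorted(t for t, p in patterns.items() if p.search(doc["text"]))
        if hits:
            queue.append(
                {"id": doc["id"], "matter_type": doc["matter_type"], "terms": hits}
            )
    return queue


# Hypothetical sample data.
documents = [
    {"id": "D1", "public": True, "matter_type": "firm administration",
     "text": "Memo re Partner Compensation for the coming year"},
    {"id": "D2", "public": True, "matter_type": "estates",
     "text": "Draft Last Will and Testament for client"},
    {"id": "D3", "public": False, "matter_type": "firm administration",
     "text": "Performance review notes"},
    {"id": "D4", "public": True, "matter_type": "corporate",
     "text": "Closing agenda"},
]
queue = build_review_queue(documents, DIRTY_TERMS)
```

In this sample, D1 almost certainly needs to be secured, whereas D2 is probably a client file from the estates group that belongs where it is. That judgment is left to the reviewer.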

This is perhaps a good opportunity to determine whether any of your internal policies (for example, on confidentiality screens) or external policies (for example, relating to the protection of personal information) need to be updated or whether more internal training is needed.

Understand, as well, that this process should be iterative. Even after you are confident that you have plugged the leaks in the dike, you should continue to do different searches to ensure that you have stopped as much as you can. Consider setting up a reminder system to test these issues post roll-out.
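One way to make the repeated checks routine is to keep a list of "canary" searches that should return nothing to an ordinary user, and re-run it on a schedule. The sketch below assumes a hypothetical `search` callable that runs a query as an unprivileged user and returns a list of hits; `fake_search` and `PUBLIC_INDEX` are stand-ins so the example runs on its own, and are not any product's API.

```python
def failing_canaries(search, canary_queries):
    """Return the canary queries that still surface results.

    `search` is any callable that runs a query as an ordinary,
    unprivileged user and returns a list of hits.
    """
    return [query for query in canary_queries if search(query)]


# Stand-in for the real engine: a tiny in-memory index of public documents.
PUBLIC_INDEX = {
    "partner compensation": ["memo-2011-04.doc"],
    "bonus allocation": [],
}


def fake_search(query):
    return PUBLIC_INDEX.get(query.lower(), [])


leaks = failing_canaries(
    fake_search, ["Partner compensation", "Bonus allocation", "Resignation"]
)
```

Any query that comes back in `leaks` points to content that is still publicly findable and needs to be secured in the DMS.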

Particularly in the first few months after launch, you will want to review reports of the search terms that users have been using, in part to get a sense of what user behaviour actually is (as opposed to what you’ve assumed it will be!) but also to determine whether users are using terms you had not thought of that might turn up other sensitive documents.
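If your search engine can export its query log, a short script can surface the terms worth a closer look. The sketch below assumes the log is simply a list of query strings and that `known_terms` is your existing "dirty word" list; the function name `unreviewed_queries` is hypothetical.

```python
from collections import Counter


def unreviewed_queries(query_log, known_terms, top_n=10):
    """Return the most frequent user queries not already on the watch list.

    Queries are compared case-insensitively, ignoring surrounding spaces.
    """
    known = {term.lower() for term in known_terms}
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    fresh = [(q, n) for q, n in counts.most_common() if q not in known]
    return fresh[:top_n]


# Hypothetical sample log.
query_log = [
    "Bonus Pool",
    "bonus pool",
    "partner compensation",
    "severance",
    "bonus pool ",
]
to_review = unreviewed_queries(query_log, ["partner compensation"])
```

Each term that turns up this way can then be run through the same review as the terms on your original list.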

When setting expectations as to implementation, you should be aware that your testing for “sensitive” documents may end up being the most time-intensive portion of your project. Depending on your variables (primarily, the number and size of the repositories you will be indexing), you will want to devote several months to ensuring that you are satisfied with the results that users will be seeing. You will want to avoid any unnecessary bumps at the outset, since they can sour users’ first impression of the search engine that you will have spent so much time preparing!

When you are satisfied on this point, you are now ready for pilot testing, which is the topic I will treat in my next article.

Appendix: “Dirty word” list

  • Associate reviews
  • Bonus allocation
  • Bonus decision
  • Bonus structure
  • Charitable contributions
  • Charitable donations
  • Department budget
  • Direct deposit
  • Discretionary bonus program
  • Equity partner
  • Operations committee/Executive committee
  • Partner admission
  • Partner compensation
  • Partner remuneration
  • Partnership admission
  • Partnership issues
  • Partnership meeting
  • Performance review
  • Promote/promotion
  • Resignation
  • Staff bonus
  • Termination letter/letter of termination



2 Responses to “Enterprise Search Implementation, Step 4: Database Cleanup (Hiding What You Shouldn’t Be Able To Find)”

  1. Michael Mills November 30, 2011 at 11:59 pm #

    A terrific post on a critical task. Getting it wrong can delay, or even kill a search project, and leave bruises on the participants.

    Other suggestions:

    – Search for the names of members of the management, finance, administration, personnel and recruiting committees.

    – Identify some matters that you know are, or should be, confidential, e.g., a current M&A deal or a matter subject to an ethical wall. Search for client names, key deal descriptors and project codenames.

    – Recruit the firm’s reference librarians to help. They are the most skilled searchers.

    – If the firm has a matter type schema, ask the DMS administrator for lists of matters by type that are NOT confidential.

    – Don’t hide data in the search engine. Fix security in the source system.


  2. Are Search Algorithms Neutral? – Enterprise Search & Discovery: Systems Thinking - April 24, 2017

    […] remove/hide results deemed undesirable, inappropriate or not useful, using negative filters of ‘dirty words’. For example not showing results where the word ‘conference’ is mentioned. It would be an […]
