Bringing open government to courts

Harlan Yu on how "privacy by obscurity" in court records is changing.

As court records increasingly become digitized, unexpected consequences will result from that evolution. It’s critical to be thinking through the authentication, cost and privacy issues before we get there.

Harlan Yu, a Princeton computer scientist, worked with a team to create online tools that enable free and open access to court records, work that highlights the need for more awareness of these issues. My interview with Yu this summer was a reminder that the state of open government is both further advanced and more muddled than the public realizes. As with so many issues, it’s not necessarily about the technology itself. Effective policy will be founded upon understanding the ways that people can and will interact with platforms. Although applying open government principles to public access for court documents may seem a little dry to the general public, the ramifications of digital records being published online mean the issue deserves more sunlight. A condensed version of our interview follows.

Your open government work has focused on improving public access to court records in the PACER system. PACER stands for “Public Access to Court Electronic Records” but the reality of public access is more complicated. What’s the history of your involvement with this aspect of open government?

Back in February of last year, Steve Schultze, who was at the time at the Berkman Center, was giving a round of talks about access to court materials on PACER. He came to CITP in February to give a talk with one of his colleagues. I had never heard of PACER before, but I went to Steve’s talk and learned about how the federal government provides these documents that form the basis of our common law. I was appalled that these public records were essentially being sold to the public, to the detriment of our democracy.

What did you propose to Schultze to fix this situation?

We thought there was a way that you could automatically allow PACER users to share documents that were legitimately purchased from the PACER system. Because these are public records, once a legitimate user pays for a document, they should be able to share it on their blog, send it to their friend, post it online, or do whatever they want with it. We decided to venture out and build a [Firefox] plug-in called RECAP that essentially automatically crowdsources the purchase of PACER documents.

Who else was involved in building RECAP?

We worked with the Internet Archive and with Carl Malamud at public.resource.org. We built a system where users could download the RECAP plug-in and install it. While they use PACER, any time they purchase a docket or a PDF, whether it’s a brief, an opinion or any motion, it automatically gets uploaded into our central repository in the background.

The quid pro quo is that, as you’re using the RECAP plug-in, if we already have a document that has been uploaded by another user, that gets shown to you in PACER to say, “Hey, we already have a copy. Instead of purchasing another copy for $0.08 or whatever it’ll cost you, just get it from us for free.”
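As an editorial illustration, the share-on-purchase flow described above can be sketched in a few lines of Python. This is not RECAP’s actual code: the `SharedArchive` class, the `fetch_document` function and the `fake_pacer_purchase` stand-in are all hypothetical names, and the only figure taken from the interview is the $0.08 per-page fee.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

PRICE_PER_PAGE_CENTS = 8  # the $0.08 per-page fee quoted in the interview


@dataclass
class SharedArchive:
    """Hypothetical stand-in for the central repository of shared documents."""
    documents: Dict[str, bytes] = field(default_factory=dict)

    def get(self, doc_id: str) -> Optional[bytes]:
        return self.documents.get(doc_id)

    def upload(self, doc_id: str, content: bytes) -> None:
        self.documents[doc_id] = content


def fetch_document(
    doc_id: str,
    pages: int,
    archive: SharedArchive,
    purchase: Callable[[str], bytes],
) -> Tuple[bytes, int]:
    """Return (content, cost in cents), preferring the free shared copy."""
    cached = archive.get(doc_id)
    if cached is not None:
        # Another user already paid for this document: serve it for free.
        return cached, 0
    # Otherwise pay for it once, then share it with everyone else.
    content = purchase(doc_id)
    archive.upload(doc_id, content)
    return content, pages * PRICE_PER_PAGE_CENTS


def fake_pacer_purchase(doc_id: str) -> bytes:
    """Placeholder for a real paid download (assumption for illustration)."""
    return f"contents of {doc_id}".encode()


archive = SharedArchive()
_, first_cost = fetch_document("example-brief", 10, archive, fake_pacer_purchase)
_, second_cost = fetch_document("example-brief", 10, archive, fake_pacer_purchase)
print(first_cost, second_cost)
```

In this sketch the first request for a 10-page document costs 80 cents and the second costs nothing, which is the whole point of crowdsourcing the purchase.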

We now have about 2.2 million PACER documents in our system, which is actually a small fraction of the total number of documents in the PACER system. The PACER administrative office claims that there are about 500 million documents in PACER, with 5 million being added every month. So 2.2 million is actually a pretty small number of documents, by percentage.
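To put those figures in proportion, here is a minimal sketch of the arithmetic. The numbers are the approximate ones quoted above; the variable names are purely illustrative.

```python
# Figures as quoted in the interview (approximate).
recap_documents = 2_200_000    # documents in the RECAP repository
pacer_documents = 500_000_000  # documents the PACER office says it holds
added_per_month = 5_000_000    # new PACER documents added each month

# Share of PACER that RECAP currently holds, as a percentage.
coverage_percent = recap_documents / pacer_documents * 100

# How many months of new filings the whole RECAP collection equals.
months_of_new_filings = recap_documents / added_per_month

print(f"RECAP coverage: {coverage_percent:.2f}% of PACER")
print(f"Equivalent to {months_of_new_filings:.2f} months of new filings")
```

On these numbers the repository holds under half a percent of PACER, and less than half a month’s worth of newly added documents.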

We think that we have a lot of the most commonly accessed documents. For the court cases that have high visibility, those are the ones that people access over and over. So we don’t have a lot of “long tail,” but we have a lot of the ones that are most commonly used.

Are there privacy and security considerations here? Why does the concept of “practical security” matter to open government?

We’d like to make all of these documents freely available to the public. We’ve found a couple of different barriers to offering free and open public access. The biggest one is definitely privacy. When an attorney files a brief [in federal courts], they need to ensure that sensitive information is redacted. Whether it’s a Social Security number, the name of a minor, bank account numbers, all of these things need to be redacted before the public filing, so when they put it on PACER, it can’t be mined for this private information. In the past, the courts themselves haven’t been very vigilant in making sure their own rules were properly applied. That’s mainly because of “practical obscurity.” These documents were behind this paywall, or you had to go to the courts to actually get a copy. The documents weren’t just freely available on Google. The worry about privacy was not as significant, because even if there were a Social Security number, it wouldn’t be widely distributed. People didn’t care so much about the privacy implications.

So a condition of “privacy by obscurity” persisted?

Exactly. The information’s out there publicly in the public record, but it’s practically obscure from public view. So now we have a lot of these PDF documents, but there are actually a number of these documents that have private information, like Social Security numbers, the names of minors or names of informants. Just going out and publishing these documents on Google isn’t necessarily the best and most moral thing to do.

I think one of the consequences of RECAP, Carl’s work and our work in trying to get these documents online is the realization that eventually all of these documents will be made public. The courts need to be a lot more serious about applying their own rules in their own courts to protect the privacy of citizens. The main problem is that in the past, even though these records weren’t available publicly and made freely available, there were already entities in the courtrooms essentially mining this information. For example, in bankruptcy cases, there were already data aggregators looking through court records every day, finding Social Security numbers, and adding this information into people’s dossiers, but out of the view of the public. Bringing this privacy issue to the forefront, even if these documents aren’t yet publicly available, will make a big impact on protecting the privacy of citizens who are involved in court cases.

As court records become more public, what will that mean for citizens?

If somebody sues you, and it’s a claim that eventually is unfounded, that might end up in some dossier and the information may be incorrect. With these 2.2 million documents, we try to make them as publicly accessible as possible without harming the privacy of citizens. Last month, we came out with the RECAP Archive, which is essentially a search interface for our database of documents. We now allow users to search full text across just the metadata associated with the case. You can search across all the documents we have for case title, case number or judge. If there’s a summary of the documents, you can search over all of the metadata on the docket. We haven’t enabled full text search of the actual PDF or of the brief yet because that’s where a lot of the PII is going to be found.
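The design choice described above, searching docket metadata while leaving document bodies out of the index, can be illustrated with a small sketch. It is an assumption-laden toy, not the RECAP Archive’s implementation: the `DocketEntry` fields, the `SEARCHABLE_FIELDS` tuple and the sample record (including its deliberately invalid Social Security number) are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocketEntry:
    case_title: str
    case_number: str
    judge: str
    summary: str
    body_text: str  # full document text; deliberately never searched


# Only these (hypothetical) metadata fields are exposed to search.
SEARCHABLE_FIELDS = ("case_title", "case_number", "judge", "summary")


def search_metadata(entries: List[DocketEntry], query: str) -> List[DocketEntry]:
    """Case-insensitive substring match over metadata fields only."""
    needle = query.lower()
    return [
        entry
        for entry in entries
        if any(needle in getattr(entry, name).lower() for name in SEARCHABLE_FIELDS)
    ]


entries = [
    DocketEntry(
        case_title="Doe v. Example Corp.",
        case_number="1:10-cv-00001",
        judge="Judge Smith",
        summary="Motion to dismiss",
        body_text="Plaintiff's SSN is 000-12-3456.",  # made-up, invalid SSN
    ),
]

print(len(search_metadata(entries, "smith")))        # matches the judge field
print(len(search_metadata(entries, "000-12-3456")))  # body text is not searched
```

A query for the judge’s name finds the case, while a query for the number buried in the filing finds nothing, because the body was never made searchable.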

What about the cost of making court records available? Is there a rationale for charging for access?

The other issue with PACER, and it’s hard to ignore, is cost. The reason why the courts charge money for these public domain documents is that Congress authorized them to. In the 2002 E-Government Act, Congress essentially said that they’re allowed to charge fees to recoup the cost of running this public access system, but only to the extent necessary to recoup these costs. The courts determined at the time that that should be $0.07 a page and eventually upped that access rate to $0.08 per page. But if you look at their budgeting documents, we’ve found that they actually charge a lot more than the expense necessary to provide these documents. My colleague, Steve Schultze, has done a ton of work digging into the federal judiciary budget. We found that about $21 million every year looks like it’s being spent directly on running the PACER systems. That includes networking, running servers, or anything else that goes directly to providing public access through PACER. Their revenue in 2010 is projected to be, I believe, $94 million. So there’s a $73 million difference this year in the amount of money that they’re collecting versus the amount of money that they’re spending on public access. That $73 million difference is thrown into this thing called the Judiciary Information Technology Fund or the JIT Fund.
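The gap between fees collected and money spent is simple arithmetic, sketched here using only the approximate figures quoted above. The variable names are illustrative, and the revenue figure is a projection that Yu himself hedges.

```python
# Figures as quoted in the interview, in millions of dollars.
pacer_operating_cost = 21    # estimated direct annual spending on PACER
projected_revenue_2010 = 94  # projected PACER fee revenue for 2010

# Money collected beyond what public access appears to cost.
surplus = projected_revenue_2010 - pacer_operating_cost

# How many times over the fees cover the estimated cost.
revenue_to_cost_ratio = projected_revenue_2010 / pacer_operating_cost

print(f"Surplus: ${surplus} million")
print(f"Revenue is about {revenue_to_cost_ratio:.1f} times the cost")
```

On these estimates, fee revenue is roughly four and a half times the direct cost of running the system.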

The JIT Fund is being used on other court technology projects, like flat screen monitors, telecommunications, embeddable microphones in court benches. I’m not opposed to these projects being funded and more technologies in courtrooms, but these projects are being funded at the expense of public access to the law, including the ability for researchers and others interested in our judicial process to access and study how the judicial process works, which I think is highly detrimental to society.

You’ve offered a thorough walkthrough of many of the issues that were raised at the Law.gov workshop earlier this year. What is the next step in opening up the court system in a way that the American people can find utility from those efforts?

I think the ball is essentially in Congress’ court, so to speak. The courts need to work together with Congress to find the right appropriation structure such that PACER is funded not by user fees but can be supported by general appropriations. Only in that case could the courts take down that user paywall and allow all of these documents to be freely available and accessible. It’s important to look at exactly how much money Congress needs to appropriate to the courts to actually run the system. I think $21 million isn’t necessarily the right number, even though that’s how much they spend today, for a couple of reasons.

Carl has done a bunch of FOIA requests to all of the individual executive agencies and found, for example, that DOJ pays the judiciary $4 million every year to access cases. That’s probably true for a lot of the other agencies or for Congress. They pay the courts to access PACER. So a lot of that money is already coming from general appropriations, where taxpayer money goes to DOJ and $4 million of it is then paid out to the courts.

If Congress were able to redirect that money, the courts would get it directly and that would go a long way in making up this $21 million. In addition, the amount of money to run the payment infrastructure, to keep track of user accounts, to process bills, to send out letters, to collect the fees, I’m sure probably costs a couple million dollars, too. If you take down the paywall, that whole system doesn’t even need to be run.

From a policy perspective, I think it’s important for Congress and the courts to look into how much taxpayer money is already being spent on running PACER and then directly appropriate that money, along with however much more is necessary on top of that if there’s a shortfall, to fund the system. Once enough funding is available, then you can take down the paywall and keep the system running.

There are privacy issues that we need to deal with. Certainly in bankruptcy cases, there’s a lot more private information that’s left un-redacted; in the regular district courts and appellate courts, probably a bit less. But there are definitely issues that we need to talk about.

What are you focusing on in your doctoral work at Princeton?

On the open government front, I’ve been looking into a variety of topics in privacy and authentication of court records. I think that’s extremely important, especially as the focus is on publishing raw data and third-party reuse of data, in terms of re-displaying government data through third parties and intermediaries. It’s also important that governments start to focus on the authentication of government records.

By authentication, I mean actual cryptographic digital signatures that third parties can use to verify that whatever dataset they downloaded, whether it’s from the government directly or from another third party, is actually authentic and that numbers within the data haven’t been perturbed or modified, either maliciously or accidentally. I think those are two issues that definitely will be increasingly important in the open government world.
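To make the idea of a verifiable signature concrete, here is a deliberately toy sketch of signing and verifying a dataset with textbook RSA over a SHA-256 digest, using only the Python standard library. Everything in it is an assumption for illustration: the key is built from two publicly known Mersenne primes, so it offers no security at all, and a real deployment would use a vetted cryptographic library with proper padding and key management. The sample dataset is invented.

```python
import hashlib

# TOY PARAMETERS, NOT SECURE: both primes are publicly known Mersenne primes,
# chosen only so the example runs without any third-party library.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def digest_as_int(data: bytes) -> int:
    """SHA-256 digest of the data, interpreted as a big-endian integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def sign(data: bytes) -> int:
    """Publisher side: sign the digest using the private exponent."""
    return pow(digest_as_int(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Third-party side: check the signature using only public N and E."""
    return pow(signature, E, N) == digest_as_int(data)


dataset = b"case_number,judge\n1:10-cv-00001,Smith\n"  # invented sample data
signature = sign(dataset)

tampered = dataset.replace(b"Smith", b"Jones")
print(verify(dataset, signature))   # the untouched copy checks out
print(verify(tampered, signature))  # a modified copy does not
```

The property that matters for third-party reuse is that anyone holding the public values can check a copy obtained from any intermediary, while only the holder of the private key can produce a signature that passes.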

What will your talk on “Government Data and the Invisible Hand” at the Gov 2.0 Summit examine?

When we try to do open government, government tries to look at the data that they have and tries to publish it. Then they get to a certain technological limit, where an important dataset that they want to publish is in a paper file or is in a digital record but not in any machine-parsable way. Or records are available in some machine-parsable way, but there are privacy problems. When we talk about open government and innovation, I think a lot of people have been focusing on user-facing innovation, where the data has been published and the public goes out and takes that and makes user-facing interfaces.

There’s also back end innovation, where tools that enable government to better build this platform and sharpen this platform make the front-end innovation possible. These things include better redaction tools for privacy that make it more efficient for government to find private information in their public records. Or tools that help government source data at its creation in machine-readable formats, rather than doing it the same old way and then having some very complex and leaky process for converting Word documents or other non-parsable documents into machine-parsable formats. I think there’s a lot of innovation that needs to happen in the tools that government can use to better provide the open platform itself.
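As a small illustration of the kind of redaction aid mentioned above, the sketch below flags strings shaped like Social Security numbers so a human can review them. It is a minimal example under stated assumptions, not a complete tool: the pattern only catches the common NNN-NN-NNNN layout, it will miss other formats and other kinds of private information (names of minors, account numbers), and the function names and sample text are invented.

```python
import re
from typing import List, Tuple

# Matches only the common NNN-NN-NNNN layout; other layouts are not detected.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_candidates(text: str) -> List[Tuple[int, str]]:
    """Return (character offset, matched text) for each SSN-shaped string."""
    return [(match.start(), match.group()) for match in SSN_PATTERN.finditer(text)]


def redact_ssn_candidates(text: str) -> str:
    """Replace each SSN-shaped string with a fixed placeholder."""
    return SSN_PATTERN.sub("XXX-XX-XXXX", text)


# Invented sample text; the number is deliberately not a valid SSN.
sample = "Debtor John Doe, SSN 000-12-3456, filed on 2010-08-01."

print(find_ssn_candidates(sample))
print(redact_ssn_candidates(sample))
```

Note that the filing date in the sample is left alone because it does not fit the pattern, which is the sort of false-positive avoidance a practical tool would need to handle far more carefully.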
