Health data in the raw
By Carolyn Duffy Marsan
Friday, November 06, 2009
The Obama administration is moving aggressively to open government information to the public, and it wants to use the latest Web-based tools to do it. At the forefront of the transparency movement is www.data.gov, a Web site that aims to be an online clearinghouse for federal data, including healthcare-related information.
“This is a cultural shift,” said Marion Royal, project manager for the data.gov Web site at the U.S. General Services Administration. “The federal government has always provided data to the public. However, we provided data to the public in a format that we thought was appropriate for them to see. This led to some concerns by citizens that we weren’t showing them all the data. We were just showing them the data we wanted them to see.”
With data.gov, the federal government is providing datasets and tools in raw, machine-readable formats so Web developers can create applications for slicing and dicing the data.
“We are encouraging agencies to provide the data in the raw format, which is a little bit different than the way we’ve always provided that data,” Royal said. “A lot of times, the data was behind a database that we provided as an interface for [the public] to search on. Now it’s time for us to step back–and just give it to them.”
Good debut
The data.gov Web site is still in its infancy, and it’s unclear whether it will become a popular destination for the health IT community or spur the creation of innovative new healthcare applications. But overall, the health IT community’s response has been positive.
“It has tremendous prospects,” said Dr. Ken Buetow, the National Cancer Institute’s Associate Director for Bioinformatics and Information Technology. “It has the potential for being an invaluable clearinghouse of all sorts of information and could be a great place for people to be able to go to have the type of information in health that’s been so transformative in other sectors of the economy, such as finance.”
The site has only whetted the appetites of others. “This is a great start, but we want more,” said Fred Trotter, an advocate for open source software in healthcare and a blogger. “I love the fact that it’s a coordinated effort and that it has systems so you can rate the information that they have. But it’s very limited. I’d love to see more geo-encoded healthcare.”
Data.gov was launched on May 21 by White House chief information officer Vivek Kundra. Four months later, the site still has only a limited number of healthcare datasets available, and all of them are already available to the public elsewhere.
One-stop data shop?
Data.gov is designed to be a one-stop shop for federal data. As such, the Web site is an online catalog that links to authoritative data housed elsewhere on individual agency Web sites. “The datasets at data.gov are those that have already been out to the public, but are a little bit difficult to find if one doesn’t know what agencies are responsible for what information,” Royal said.
By the end of September, the site had more than 60 datasets related to health and nutrition, all of which were posted by the Department of Health & Human Services. The site’s tools section includes 20 health-related items from a variety of agencies, including the CDC, the FDA and the National Cancer Institute. The site’s geo-data catalog includes seven datasets related to human health and disease.
Some of the site’s health and nutrition tools—CDC’s FluView National Flu Activity Map and the FDA’s Peanut-Containing Product Recall—are among the most popular items on the site, Royal said.
It’s unlikely that the healthcare data sets are going to be accessed much by average citizens. Instead, they are being downloaded by Web developers and other IT professionals with the expertise to manipulate the raw data. “I have seen mostly application developers and Web site developers being the most excited about this site,” Royal said. “That’s who I see taking advantage of the data.”
The GSA hasn’t been targeting specific types of datasets, such as healthcare, for inclusion in data.gov. Instead, it has opened the door to all federal agencies and asked their CIOs to decide which datasets are appropriate for publication on the data.gov site.
“The ‘value add’ of data.gov is to be a starting point for people who are seeking data and they may not know what category of data they are looking for or what agency might have the data,” Royal said. “Data.gov gives them a place to start.”
Cancer surveillance
One health-related tool on data.gov is the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) database, which provides cancer statistics for the United States. Having data.gov link to these kinds of huge health-related data sets shouldn’t create any technical problems, Buetow said.
“The social and other issues can be addressed by having appropriate access controls and security controls in place,” he said. “The challenge always with health information has to do with the need for having appropriate privacy protection; there are also proprietary issues for individual grantees.”
What excites Buetow about data.gov is the idea that the user base for tools like SEER will grow.
“The attractive thing for me about data.gov is that it would allow non-traditional constituencies for the types of data that already exist,” Buetow said. “You have to know to come to the SEER site to find it. The notion of discovery is broader by having these other aggregator portals point to this type of information.”
Buetow said the National Cancer Institute is looking at posting an array of data on data.gov, including the Cancer Biomedical Informatics Grid (caBIG) and the Cancer Genome Atlas.
“More raw scientific data can and should be distributed through data.gov,” Buetow said. “But there are interesting challenges in the presentation of information. For example, with SEER, do you present data or tools because some of the data in its raw form is almost unusable without processing and manipulation?”
So far, the response to data.gov from users has been positive. In its first four months of operation, the site attracted 8.4 million hits.
Trotter says Web developers will eventually take the raw data on data.gov and make it usable to average citizens.
“We can develop systems that process that data, that correlate that data with other data, and that present it to the general user,” Trotter said. “The best analogy is the Google mash-up, which would allow us to take geo-encoded health data and mash it together with other sources of data to enrich what we would be able to see.”
Among the datasets Trotter would like to see added to data.gov are any healthcare outcome data from the Department of Veterans Affairs. “The VA is the largest healthcare system in the world. If you release that data, it’s a goldmine for everyone,” said Trotter, who put de-identified healthcare data from the VA and “aggregate health data” from the VA, CDC and Defense Department on his data.gov wish list.
Though Trotter isn’t working with any of the healthcare data available on data.gov—nor does he know anybody who is working with it—he’s still pleased that it exists.
“The people I talk to are excited about what it is and what it represents and the direction it’s going,” Trotter said. “It’s as significant to us how they are releasing the data as what they are releasing because they are releasing it in a way that takes in feedback, and that takes in comments, and that takes into account what people want and need.”