Super Crunching

Require agencies to submit datasets in a standard format with common metadata fields, including short and long descriptions, to improve users' understanding of each dataset.
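
To make the proposal concrete, here is a minimal sketch of what one common metadata record might look like. The field names and values are illustrative assumptions for discussion, not an existing data.gov standard; the TRI example is just a familiar dataset from this post.

```python
# Illustrative sketch of a common metadata record for a submitted dataset.
# Field names and values are placeholders, not an existing data.gov schema.
dataset_metadata = {
    "id": "epa-tri-2008",                  # hypothetical identifier
    "agency": "EPA",
    "title": "Toxics Release Inventory (TRI), 2008",
    "short_description": "Facility-level releases of listed toxic chemicals.",
    "long_description": (
        "Annual facility-level reporting of toxic chemical releases and "
        "waste management, suitable for year-over-year comparison."
    ),
    "frequency": "annual",
    "units": "pounds released",
    "time_coverage": {"start": 2000, "end": 2008},   # placeholder range
    "file_format": "CSV",
    "last_updated": "2009-07-01",
}
```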

Create a user interface that lets users easily graph multiple time-series datasets (simple trend graphs), so they can visually compare different datasets on relative scales.
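
As a sketch of the "relative scales" idea, the snippet below indexes each series to 100 in its first year so two series with very different units can share one axis. It assumes pandas and matplotlib are available; the series and numbers are illustrative placeholders, not official statistics.

```python
# Sketch: plot two time series on a relative scale (each indexed to 100 in
# its first year) so very different units can share one axis.
import pandas as pd
import matplotlib.pyplot as plt

unemployment = pd.Series([5.8, 9.3, 9.6], index=[2008, 2009, 2010])
gdp = pd.Series([14.7, 14.4, 15.0], index=[2008, 2009, 2010])

def indexed(series):
    """Rescale a series so its first value equals 100."""
    return 100 * series / series.iloc[0]

fig, ax = plt.subplots()
indexed(unemployment).plot(ax=ax, label="Unemployment rate (first year = 100)")
indexed(gdp).plot(ax=ax, label="GDP (first year = 100)")
ax.set_xlabel("Year")
ax.set_ylabel("Index, first year = 100")
ax.legend()
plt.show()
```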

By aggregating all government data into a standard format and enabling users to select and compare different datasets, more in-depth "super crunching" becomes possible, such as multivariate regression analysis across selected years and datasets.
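
A minimal sketch of that kind of analysis, assuming several series have already been aligned by year: once the structure is shared, a multivariate regression reduces to building a design matrix and solving ordinary least squares. All values below are placeholders.

```python
# Sketch: once several series are aligned by year, "super crunching" such as
# a multivariate regression is a few lines of ordinary least squares.
# All values are illustrative placeholders.
import numpy as np

y  = np.array([3.1, 3.4, 3.2, 2.1, 1.8])   # outcome series (by year)
x1 = np.array([1.2, 1.4, 1.5, 1.1, 0.9])   # predictor series 1
x2 = np.array([0.7, 0.8, 0.8, 0.9, 1.1])   # predictor series 2

# Design matrix with an intercept column: y ~ b0 + b1*x1 + b2*x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the least-squares problem min ||X b - y||^2.
b = np.linalg.lstsq(X, y, rcond=None)[0]
print("intercept, b1, b2 =", b)
```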

Other features: 1) allow users to easily export the underlying selected data from any individual dataset comparison or series of comparisons; 2) allow users to easily export the graphs/tables they create, with a consistent stamp that indicates the source(s) and when the data was extracted.
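
One possible shape for that provenance stamp, sketched under the assumption that exports are CSV files: the sources and extraction time are written as comment lines at the top of the file. The file name and source labels are hypothetical.

```python
# Sketch: export a comparison's underlying data to CSV with a consistent
# provenance stamp (sources and extraction time) as comment lines at the top.
# The file name, column names, and source labels are hypothetical.
import datetime
import pandas as pd

comparison = pd.DataFrame(
    {"series_a": [5.8, 9.3], "series_b": [14.7, 14.4]},   # placeholder values
    index=pd.Index([2008, 2009], name="year"),
)
sources = ["Agency A: series_a", "Agency B: series_b"]

now = datetime.datetime.now(datetime.timezone.utc)
stamp = f"# Extracted from data.gov on {now:%Y-%m-%d %H:%M} UTC\n"
stamp += "".join(f"# Source: {s}\n" for s in sources)

with open("comparison_export.csv", "w") as f:
    f.write(stamp)
    comparison.to_csv(f)
```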

Data.gov is currently a hodge-podge of data sets and random "apps". I work at the EPA and found the TRI data sets horribly unorganized and not very useful. Power users would go to the EPA website, and new users would find it difficult to make sense of them. I think it is imperative to make data.gov a database of standardized data sets instead of just a flea market for whatever agencies submit.

For many data sets, particularly the economic data, this standardization would not be hard to do. The relative burden on each agency is minimal compared to the benefit of being able to access all government data in a structured way from a common database.


Submitted by kohler.jim 4 years ago

Comments (5)

  1. Standardizing these can be a huge challenge. However, by implementing ontologies, data dictionaries, and so on, one might be able to start constructing a framework that could help do so in a more informed way.

    4 years ago
  2. Homogenizing certain types of instance data could be an interesting use case for semantic.data.gov.

    Don't know if that falls into the notion of super-crunching (I am going to have to research that) but this sounds interesting.

    4 years ago
  3. Maybe people could input a minimum set of required metadata (to keep the burden low), while ontologies and data dictionaries improve the ability to compare datasets from different sources (a toy crosswalk along these lines is sketched after this thread).

    4 years ago
  4. Interesting suggestion, but very difficult to achieve in certain problem domains. For example, National Marine Fisheries encounters significant challenges just standardizing data across our six geographic regions, and has had limited success to date. As an example of some common problems, consider species coding in data collections.

    Some data is collected by NMFS scientists and species coding is very specific (do I get credit for a pun?) and accurate. Even in this best case, occasionally scientists decide to establish a new species as they understand populations more thoroughly. When they do decide to delineate a separate species, this decision makes some of the prior data ambiguous -- which species was it, if prior observations didn't differentiate to the same level?

    Some data is collected by trained non-scientists. The data collection protocol in this case may or may not call for precise species determination, depending on requirements, difficulty of the determination, and resources. Some data collection protocols might provide species categories (skates, family Rajidae) because that is the most cost-effective data coding available under the circumstances.

    Some data is collected by industry. Generally the commercial fishing industry is not as concerned about scientific accuracy as they are about price. If they can throw a fish into a category that commands a good price, why would they obsess about classifying it into some obscure category that might not sell?

    The mix of which data is collected under which data collection protocol, and when the protocols (or market-driven categories) are changed, varies around the country. Typically local data users understand these issues and know how to work around them. (In fact the data collection protocols are generally designed to meet local data analysis requirements.) But external users find this complexity challenging, and it presents barriers to wider use.

    This just happens to be an example that I have recent experience with, and it may not be the most significant one, but it may further understanding of some of the issues.

    4 years ago
  5. kohler.jim (Idea Submitter)

    Great comments, everyone. I am still learning about the complex world of database management. I didn't realize how quickly the complexity can escalate beyond a simple table with entries (rows) and fields (columns)!

    The phrase "Supercrunching" comes from a book I am reading by Ian Ayres called "Super Crunchers", which has enlarged my vision of what data.gov could be. I am currently working on my own "tool" that does some of the things described in my post, but I began development with a much narrower concept (easily graphing/comparing time-series datasets). I am just about to post a job proposal on Elance.com to enlist some more professional support. Check out infiniterecast.com, which tracks the development of this venture. Let me know what you think!

    Larry: Thanks for the example. I think there will be many cases like the one you present, i.e., cases where the number of fields needed to describe the nuances of the dataset could become overwhelming. (For example, the pilot database for the tool I'm developing has only 20-30 fields to describe a time-series dataset.) However, I think discussion at the agency level and the data.gov level over which fields to include to accurately capture the data would be fruitful. I don't think the difficulty should be a reason to diminish the vision of what data.gov could be (not that you suggested that!). Thanks again.

    4 years ago
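
Following up on comments 1, 3, and 4 above, here is a toy sketch of what a data-dictionary "crosswalk" could look like: it maps source-specific codes (such as the species codes Larry describes) onto a shared vocabulary while recording when a mapping is only approximate. All codes and names are made up.

```python
# Toy data-dictionary "crosswalk": map each source's local codes onto a shared
# vocabulary and record whether the mapping is exact or only approximate
# (e.g., a market category that resolves only to family level).
# All codes and names are made up.
CROSSWALK = {
    # (source, local_code): (common_code, exact_match)
    ("survey_a", "RAJ01"): ("raja-species-1", True),
    ("survey_b", "SKATE"): ("rajidae-family", False),   # family-level only
    ("landings", "SKT"):   ("rajidae-family", False),   # market category
}

def to_common(source, local_code):
    """Translate a source-specific code to the common vocabulary.

    Raises KeyError for unmapped codes so gaps in the dictionary surface
    instead of being silently dropped.
    """
    return CROSSWALK[(source, local_code)]

print(to_common("survey_b", "SKATE"))   # ('rajidae-family', False)
```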
