Metadata

Metadata (Greek meta "after" and Latin data "information") are data that describe other data. Generally, a set of metadata describes a single set of data, called a resource.

Metadata are of special interest in various fields of computer science, e.g. information retrieval and the semantic web. Although many consider them a powerful tool for bridging the semantic gap, others criticize them severely.

Definitions

The term was introduced intuitively, i.e. without an exact definition. As a result, a whole variety of definitions exists today. The most common one is the literal translation:

Metadata are data on data.

Because most people regard the difference between data and information as a merely philosophical one with no practical relevance, existing definitions also include

"Metadata is information about data"
"Metadata is information about information".

There are also more sophisticated definitions such as:

"Metadata is structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities."^[1] and
"[Metadata is a set of] optional structured descriptions that are publicly available to explicitly assist in locating objects."^[2].

These are used more rarely because they tend to concentrate on one feature of metadata (in general, their use in finding "objects", "entities" or "resources") and ignore other purposes, such as using metadata to optimize compression algorithms.

The metadata concept has been extended into the world of systems to include any "data about data": the names of tables, columns, programs, and the like. Different views of this system metadata are described below. Beyond that, it is recognized that metadata describe all aspects of systems: data, activities, the people and organizations involved, locations of data and processes, access methods, limitations, timing and events, as well as motivation and rules.

Fundamentally, then, metadata are "the data that describe the structure and workings of an organization’s use of information, and which describe the systems it uses to manage that information." To do a model of metadata is to do an "Enterprise model" of the information technology industry itself.

^ Committee on Cataloging, Task Force on Metadata. Summary Report. June 1999
^ D. C. A. Bulterman. Is It Time For a Moratorium on Metadata? IEEE MultiMedia, Oct-Dec 2004

Difference between data and metadata

Usually it is not possible to distinguish between (raw) data and metadata because of the following effects:

Data can be both raw data and metadata at the same time. The headline of a text is both part of the text, i. e. data, and title of the text, i. e. metadata.
Data and metadata can change their roles. A poem, as such, would be regarded data, but if there were a song that used it as lyrics, the whole poem could be attached to an audio file as metadata. Thus, the interpretation depends on the point of view.
It is possible to create meta-meta-...-metadata. Since, according to the common definition, metadata themselves are data, it is possible to create metadata about metadata, metadata about metadata about metadata, and so on. Though at first glance this sounds useless, it can be essential to archive metadata about metadata, e.g. to keep track of where the metadata came from when merging two documents.

These effects apply, no matter which of the above definitions is used.
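
The third effect can be illustrated with a short Python sketch. It is not a standard format: the field names `value` and `source` and the helpers `annotate` and `merge` are assumptions made for this example. Each metadata value carries its own metadata, namely the document it came from, so that provenance survives a merge.

```python
# Hypothetical structure: every metadata value is wrapped in a record
# that also stores where the value came from (metadata about metadata).

def annotate(metadata, source):
    """Wrap each value in a provenance record naming its source."""
    return {key: [{"value": value, "source": source}]
            for key, value in metadata.items()}

def merge(left, right):
    """Merge two annotated metadata sets, keeping every value and its origin."""
    merged = {key: list(records) for key, records in left.items()}
    for key, records in right.items():
        merged.setdefault(key, []).extend(records)
    return merged

doc_a = annotate({"title": "Draft", "author": "Alice"}, source="a.txt")
doc_b = annotate({"title": "Final", "language": "en"}, source="b.txt")
combined = merge(doc_a, doc_b)

# Both titles survive, each labelled with the document it came from.
for record in combined["title"]:
    print(record["value"], "from", record["source"])
```

Keeping conflicting values side by side, rather than picking a winner, is one possible design choice; it defers the decision to whoever reads the merged metadata.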

Use

The main purpose of metadata is to speed up and enrich searching for resources. In general, search queries using metadata can save users from performing more complex filter operations manually.

Metadata help to bridge the semantic gap. By telling a computer how data are related and how these relations can be evaluated automatically, it becomes possible to process even more complex filter and search operations. For example, if a search engine understands that "Van Gogh" was a "Dutch painter", it can answer a search query on "Dutch painters" with a link to a web page about Vincent Van Gogh, although the exact term "Dutch painters" never occurs on that site; today, this is not possible. This approach is also called knowledge representation. It is of special interest to the semantic web and artificial intelligence.
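
The Van Gogh example can be sketched as a tiny fact store in Python. This is only an illustration under stated assumptions: the triple layout, the predicate names `nationality` and `occupation`, and the sample facts are invented for the example and do not follow any particular semantic web standard.

```python
# Hypothetical facts stored as (subject, predicate, object) triples.
FACTS = {
    ("Vincent van Gogh", "nationality", "Dutch"),
    ("Vincent van Gogh", "occupation", "painter"),
    ("Claude Monet", "nationality", "French"),
    ("Claude Monet", "occupation", "painter"),
}

def subjects_with(predicate, obj):
    """Return every subject that has the given predicate/object pair."""
    return {s for s, p, o in FACTS if p == predicate and o == obj}

def dutch_painters():
    """Answer the query "Dutch painters" by combining two separate facts."""
    return sorted(subjects_with("nationality", "Dutch")
                  & subjects_with("occupation", "painter"))

print(dutch_painters())
```

The phrase "Dutch painters" appears nowhere in the stored facts; the answer follows from intersecting two relations, which is the kind of evaluation the paragraph above describes.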

Some metadata are designed to optimize compression algorithms. For example, if there are metadata that enable the computer to tell foreground from background in a video, it can compress both parts independently and thus reach higher compression rates.

Some metadata are intended to enable variable content presentation. For example, if a picture viewer knows the most important region of an image (e.g. the one containing a person), it can reduce the image to that region and thus show the user the most interesting details on a small screen, such as that of a mobile phone. A similar kind of metadata is intended to enable blind people to "read" diagrams and pictures, e.g. by converting them for special output devices or by reading a description aloud using voice synthesis.
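
As a rough sketch of the picture-viewer example, the following Python function picks the part of an image to show on a small screen. The function name `crop_box`, the tuple layouts and the sample numbers are assumptions for illustration; a real viewer would also scale the image.

```python
def crop_box(image_size, region, screen_size):
    """Return (left, top, right, bottom) of a screen-sized window that is
    centred on the region of interest and clamped to the image bounds."""
    img_w, img_h = image_size
    left, top, right, bottom = region
    # The window can never be larger than the image itself.
    win_w, win_h = min(screen_size[0], img_w), min(screen_size[1], img_h)
    cx, cy = (left + right) // 2, (top + bottom) // 2
    x = max(0, min(cx - win_w // 2, img_w - win_w))
    y = max(0, min(cy - win_h // 2, img_h - win_h))
    return (x, y, x + win_w, y + win_h)

# A 4000x3000 photo whose region-of-interest metadata marks a person
# near the right edge, shown on a 640x480 screen.
print(crop_box((4000, 3000), (3500, 200, 3900, 600), (640, 480)))
```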

Other descriptive metadata can be used to automate workflows. For example, if a tool knows the content and structure of data, it can convert them automatically and pass them to another tool as input. In this way, users could avoid many of the copy-and-paste actions that are necessary when analyzing data with different tools.

Types of metadata

Metadata can be distinguished by their ...

content. Metadata can either describe the resource itself, e. g. name and size of a file, or the content of the resource, e. g. "The video shows a boy playing football."
mutability. With respect to the whole resource, metadata can be either immutable, e. g. the title of a file does not change, no matter what part of the file is being considered, or mutable, e. g. the scene descriptions of a video vary.
logical function. There are three layers of logical function lying on top of each other: at the bottom is the subsymbolic layer, which contains the raw data themselves; on the symbolic layer are metadata describing the content of the raw data; and the topmost logical layer contains metadata that allow logical reasoning using the symbolic layer.

Important issues

To successfully develop and use metadata, there are several important issues that must be treated with care:

Metadata lifecycle

The lifecycle of metadata is usually underestimated in its complexity. Yet there are three phases that must be regarded independently:

Creation. Even in the early phases of planning and designing it is necessary to keep track of all metadata created. It is not economical to start attaching metadata only after the production process has been completed. For example, if metadata created by a digital camera at recording time are not stored immediately they must be restored afterwards manually with great effort. Therefore, it is necessary for the different groups of resource producers to cooperate using compatible methods and standards.
Manipulation. Metadata must adapt if their resources change. They must be merged when two resources are merged. These operations are seldom performed by today's software; for example, image editing software usually does not keep track of metadata from digital cameras stored in EXIF format.
Destruction. It can be useful to keep metadata even after their resource has been destroyed, e.g. in change histories within a text document or to archive file deletions due to digital rights management. None of today's metadata standards considers this phase.

Storage

Metadata can be stored either internally, i. e. in the same file as the data, or externally, i. e. in a separate file. Both possibilities have advantages and disadvantages:

Internal storage allows metadata to be transferred together with their data; thus they are always at hand and can be manipulated easily. This method creates high redundancy and does not allow metadata to be held together in one place.
External storage allows metadata to be bundled, e.g. in a database, for more efficient searching. There is no redundancy, and metadata can be transferred simultaneously when using streaming. However, as most formats use URIs for that purpose, the way the metadata are linked to their data must be treated with care: What if a resource does not have a URI, e.g. resources on a local hard disk or web pages that are created on the fly by a content management system? What if metadata can only be evaluated if there is a connection to the WWW, especially when using RDF? How can it be detected that a resource has been replaced by another with the same name but different content?

Moreover, there is the question of data format: storing metadata in a human-readable format such as XML can be useful because users can understand and edit them without any special tools. On the other hand, these formats are not optimized for storage capacity, so it may be useful to store metadata in a binary, non-human-readable format instead to speed up transfer and save memory.
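
The trade-off between the two kinds of format can be made concrete with a small Python sketch. The record, its field names and the binary layout (three unsigned 16-bit integers) are assumptions chosen for illustration, not an existing metadata format.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical metadata record for one image.
record = {"width": 4000, "height": 3000, "iso": 200}

# Human-readable form: one XML element per field.
root = ET.Element("metadata")
for name, value in record.items():
    ET.SubElement(root, name).text = str(value)
xml_bytes = ET.tostring(root, encoding="utf-8")

# Compact form: three unsigned 16-bit integers in a fixed, agreed order.
binary_bytes = struct.pack("<HHH", record["width"], record["height"], record["iso"])

print(xml_bytes.decode("utf-8"))
print(len(xml_bytes), "bytes as XML,", len(binary_bytes), "bytes as binary")

# The binary form can only be read by code that already knows the layout.
width, height, iso = struct.unpack("<HHH", binary_bytes)
```

The XML version names every field and can be edited in a text editor, whereas the binary version is much smaller but meaningless without the agreed layout.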

Criticism

Although the majority of computer scientists see metadata as a chance for better interoperability, there are some critical voices whose main arguments must be taken seriously:

Metadata are too expensive and time-consuming. The argument is that companies will not produce metadata without need because they cost extra money and private users also will not produce complex metadata because their creation is very time-consuming. Thus it is not useful to create formats and standards when no one will use them.
Metadata are too complicated. Private users will not create metadata because existing formats, especially MPEG-7, are too complicated. As long as there are no automatic tools for creating metadata, they will not be created.
Metadata are subjective and depend on context. Most probably, two persons will attach different metadata to the same resource because of their different points of view. Moreover, metadata can be misinterpreted because they depend on context. For example, a search on "post post modern art" may return no result because the term did not exist when a work of art of that period was created, and a search on "pictures taken at 1:00" may lead to confusing results because of local time differences.
There is no end to metadata. For example, when annotating a soccer match with metadata, one can describe all the players and their actions over time and stop there. Or one can also describe the advertising in the background and the clothes the players wear. Or one can also describe each fan in the stands and the clothes they wear. All of these metadata can be interesting to one party or another (e.g. the spectators, sponsors or a counterterrorist unit of the police), and even for a simple resource the amount of possible metadata can be gigantic.
Metadata are useless. Many of today's search engines allow finding texts very efficiently. There are other techniques for finding pictures, videos and music, namely query-by-example, which will become more and more powerful in the future. Thus there is no real need for metadata.
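
The "pictures taken at 1:00" problem from the list above can be reproduced in a few lines of Python. The two fixed UTC offsets and the capture date are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical capture times: both cameras recorded "01:00" local time,
# but they were set to different UTC offsets.
utc_plus_1 = timezone(timedelta(hours=1))
utc_plus_9 = timezone(timedelta(hours=9))

photo_a = datetime(2006, 1, 15, 1, 0, tzinfo=utc_plus_1)
photo_b = datetime(2006, 1, 15, 1, 0, tzinfo=utc_plus_9)

# The wall-clock readings are identical...
same_local = (photo_a.hour, photo_a.minute) == (photo_b.hour, photo_b.minute)
# ...but the two instants are hours apart.
gap = photo_a - photo_b

print(same_local, gap)
```

A search on the local time alone would treat both pictures as simultaneous; only the offset, which is itself context, reveals the difference.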

Opponents of metadata sometimes use the term metacrap to refer to the unsolved problems of metadata in some scenarios.

Uses

Metadata have become important on the World Wide Web because of the need to find useful information from the mass of information available. Manually-created metadata add value because they ensure consistency. If one webpage about a topic contains a word or phrase, then all webpages about that topic should contain that same word or phrase. They also ensure variety, so that if one topic has two names, each of these names will be used. For example, an article about Sport Utility Vehicles would also be given the metadata keywords ‘4 wheel drives’, ‘4WDs’ and ‘four wheel drives’, as this is how they are known in some countries.

Examples of metadata for an audio CD include the MusicBrainz project, and AMG's All Music Guide. Similarly, MP3 files have metadata tags in a format called ID3.
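
As an illustration of such tags, the following Python sketch reads the oldest and simplest variant, ID3v1, which occupies the last 128 bytes of a file: the marker "TAG" followed by fixed-width title, artist, album, year and comment fields and a one-byte genre code. The helper name `parse_id3v1` and the hand-made sample data are assumptions for this example; later ID3 versions use a different, more flexible layout that this sketch does not handle.

```python
import struct

ID3V1_LAYOUT = "<30s30s30s4s30sB"  # title, artist, album, year, comment, genre

def parse_id3v1(data):
    """Return the ID3v1 fields found in the last 128 bytes, or None."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    title, artist, album, year, comment, genre = struct.unpack(
        ID3V1_LAYOUT, data[-125:])

    def text(raw):
        # Fields are padded with zero bytes (or spaces) to their fixed width.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {"title": text(title), "artist": text(artist),
            "album": text(album), "year": text(year),
            "comment": text(comment), "genre": genre}

# Build a fake file: placeholder audio bytes followed by a hand-made tag.
tag = b"TAG" + struct.pack(ID3V1_LAYOUT, b"Example Song", b"Example Artist",
                           b"Example Album", b"1999", b"", 12)
fake_mp3 = b"\xff\xfb" * 100 + tag
print(parse_id3v1(fake_mp3))
```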

Metadata are more properly called an ontology or schema when structured into a hierarchical arrangement. Both terms describe “what exists” for some purpose or to enable some action. For instance, the arrangement of subject headings in a library catalog serves not only as a guide to finding books on a particular subject in the stacks, but also as a guide to what subjects “exist” in the library’s own ontology and how more specialized topics are related to or derived from the more general subject headings.

Metadata are frequently stored in a central location and used to help organizations standardize their data. This information is typically stored in a Metadata Registry.

Types

In general, there are two distinct classes of metadata: structural or control metadata and guide metadata.

Structural metadata is used to describe the structure of computer systems such as tables, columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language.

Relational database metadata

Each relational database system has its own mechanisms for storing metadata. Examples of relational-database metadata include:

Tables of all tables in database, their names, sizes and number of rows in each table.
Tables of columns in each database, what tables they are used in, and the type of data stored in each column.

In database terminology, this set of metadata is referred to as the catalog. The SQL standard specifies a uniform means to access the catalog, called the INFORMATION_SCHEMA, but not all databases implement it, even if they implement other aspects of the SQL standard. For an example of database-specific metadata access methods, see Oracle metadata.
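
A database-specific catalog can be explored with Python's built-in sqlite3 module. SQLite is one of the systems that does not provide INFORMATION_SCHEMA; it exposes its catalog through the `sqlite_master` table and the `table_info` pragma instead. The table names `customer` and `invoice` are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

# Table-level metadata: SQLite lists its schema objects in sqlite_master.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Column-level metadata: PRAGMA table_info returns one row per column
# in the order (cid, name, type, notnull, dflt_value, pk).
columns = {
    table: [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
    for table in tables
}

print(tables)
print(columns)
```

On a system that does implement the standard, the same questions would be asked with ordinary SELECT statements against the INFORMATION_SCHEMA views.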

Data warehouse metadata

Data warehouse metadata systems are sometimes separated into two sections:

back room metadata that are used for Extract, transform, load functions to get OLTP data into a data warehouse
front room metadata that are used to label screens and create reports

Kimball lists the following types of metadata in a data warehouse (See also [1]):

source system metadata
- source specifications, such as repositories, and source schemas
- source descriptive information, such as ownership descriptions, update frequencies, legal limitations, and access methods
- process information, such as job schedules and extraction code
data staging metadata
- data acquisition information, such as data transmission scheduling and results, and file usage
- dimension table management, such as definitions of dimensions, and surrogate key assignments
- transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions
- audit, job logs and documentation, such as data lineage records, data transform logs
DBMS metadata, such as:
- DBMS system table contents
- processing hints

Michael Brackett defines metadata (what he calls "Data resource data") as "any data about the organization’s data resource".

Adrienne Tannenbaum defines metadata as "the detailed description of instance data. The format and characteristics of populated instance data: instances and values, dependent on the role of the metadata recipient." These definitions are characteristic of the "data about data" definition.

Business Intelligence metadata

Business Intelligence is the process of analyzing large amounts of corporate data, usually stored in large databases such as the Data Warehouse, tracking business performance, detecting patterns and trends, and helping enterprise business users make better decisions. Business Intelligence metadata describes how data is queried, filtered, analyzed, and displayed in Business Intelligence software tools, such as Reporting tools, OLAP tools, Data Mining tools.

Examples:

OLAP metadata: The descriptions and structures of Dimensions, Cubes, Measures (Metrics), Hierarchies, Levels, Drill Paths
Reporting metadata: The descriptions and structures of Reports, Charts, Queries, DataSets, Filters, Variables, Expressions
Data Mining metadata: The descriptions and structures of DataSets, Algorithms, Queries

Business Intelligence metadata can be used to understand how corporate financial reports reported to Wall Street are calculated, how the revenue, expense and profit are aggregated from individual sales transactions stored in the data warehouse. A good understanding of Business Intelligence metadata is required to solve complex problems such as compliance with corporate governance standards like Sarbanes Oxley (SOX) or Basel II.

General IT metadata

In contrast, David Marco, another metadata theorist, defines metadata as "all physical data and knowledge from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation." Others have included web services, systems and interfaces. In fact, the entire Zachman framework (see Enterprise Architecture) can be represented as metadata.

Notice such definitions expand metadata's scope considerably, to encompass most or all of the data required by the Management Information Systems capability. In this sense, the concept of metadata has significant overlaps with the ITIL concept of a Configuration Management Database (CMDB), and also with disciplines such as Enterprise Architecture and IT portfolio management.

This broader definition of metadata has precedent. Third generation corporate repository products (such as those eventually merged into the CA Advantage line) not only store information about data definitions (COBOL copybooks, DBMS schema) but also about the programs accessing those data structures, and the JCL and batch job infrastructure dependencies as well. These products (some of which are still in production) can provide a very complete picture of a mainframe computing environment, supporting exactly the kinds of impact analysis required for ITIL-based processes such as Incident and Change Management. The ITIL Back Catalogue includes the Data Management volume which recognizes the role of these metadata products on the mainframe, posing the CMDB as the distributed computing equivalent. CMDB vendors however have generally not expanded their scope to include data definitions, and metadata solutions are also available in the distributed world. Determining the appropriate role and scope for each is thus a challenge for large IT organizations requiring the services of both.

Since metadata is pervasive, centralized attempts at tracking it need to focus on the most highly leveraged assets. Enterprise Assets may only constitute a small percentage of the entire IT portfolio.

Some practitioners have successfully managed IT metadata using the Dublin Core metamodel.

IT metadata management products

First-generation data dictionary/metadata repository tools supported only a specific DBMS; examples are IDMS's IDD (integrated data dictionary), the IMS Data Dictionary, and Adabas's Predict.

Second-generation tools, such as ASG's DATAMANAGER product, could support many different file and DBMS types.

Third-generation repository products became briefly popular in the early 1990s, along with the widespread use of RDBMS engines such as IBM's DB2.

File system metadata

Nearly all file systems keep metadata about files out-of-band. Some systems keep metadata in directory entries; others in specialized structures like inodes or even in the name of a file. Metadata can range from simple timestamps, mode bits, and other special-purpose information used by the implementation itself, to icons and free-text comments, to arbitrary attribute-value pairs.
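
The simplest of these metadata can be read from Python with `os.stat`, which queries the file system without opening the file. The temporary file and the keys of the `summary` dictionary are assumptions for this sketch.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone

# Create a throwaway file so that the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello metadata")
    path = handle.name

info = os.stat(path)  # metadata only; the file content is not read
summary = {
    "size_bytes": info.st_size,
    "mode": stat.filemode(info.st_mode),
    "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    "is_regular_file": stat.S_ISREG(info.st_mode),
}
print(summary)

os.remove(path)
```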

With more complex and open-ended metadata, it becomes useful to search for files based on the metadata contents. The Unix find utility was an early example, although it is inefficient when scanning hundreds of thousands of files on a modern computer system. Apple Computer's current version of its Mac OS X operating system (Tiger) supports cataloging and searching for file metadata through a feature known as Spotlight. Microsoft worked on the development of similar functionality in the WinFS file system, although the project was cancelled. Linux implements file metadata using extended file attributes.
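
A search in the style of find can be sketched in Python as a walk over a directory tree that tests each file's metadata. The function name `find_by_metadata` and its parameters are hypothetical. Like find, the sketch visits every file on each query, which is exactly the inefficiency that indexing approaches are meant to avoid.

```python
import os
import tempfile

def find_by_metadata(root, min_size=0, suffix=""):
    """Yield paths under root whose name and size match, reading only metadata."""
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            if name.endswith(suffix) and os.stat(path).st_size >= min_size:
                yield path

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    samples = [("a.txt", 10), ("big.txt", 5000), (os.path.join("sub", "big.log"), 9000)]
    for relative, size in samples:
        with open(os.path.join(root, relative), "wb") as f:
            f.write(b"x" * size)
    matches = sorted(os.path.relpath(p, root)
                     for p in find_by_metadata(root, min_size=1000, suffix=".txt"))

print(matches)
```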

Image metadata

Examples of image files containing metadata include Exchangeable Image File Format (EXIF) and Tagged Image File Format (TIFF).

Having metadata about images embedded in TIFF or EXIF files is one way of acquiring additional data about an image. Image metadata are recorded through tags. Tagging pictures with subjects, related emotions, and other descriptive phrases helps Internet users find pictures easily rather than having to search through entire image collections. A prime example of an image tagging service is Flickr, where users upload images and then describe the contents. Other patrons of the site can then search for those tags. Flickr uses a folksonomy: a free-text keyword system in which the community defines the vocabulary through use rather than through a controlled vocabulary.
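
A tag search of this kind can be sketched in Python with an index from tags to images. The file names, the tags and the lower-casing rule are assumptions for illustration and do not describe how any particular service works.

```python
from collections import defaultdict

# Hypothetical uploads: image name -> free-text tags chosen by the uploader.
uploads = {
    "img_001.jpg": ["sunset", "beach", "calm"],
    "img_002.jpg": ["Beach", "dog"],
    "img_003.jpg": ["city", "sunset"],
}

# Index: normalized tag -> set of images carrying that tag.
index = defaultdict(set)
for image, tags in uploads.items():
    for tag in tags:
        index[tag.strip().lower()].add(image)

def search(*tags):
    """Return the images that carry every requested tag."""
    sets = [index.get(tag.strip().lower(), set()) for tag in tags]
    return sorted(set.intersection(*sets)) if sets else []

print(search("beach"))
print(search("sunset", "beach"))
```

Because the vocabulary is uncontrolled, even this small sketch has to normalize case; synonyms and spelling variants would still be treated as different tags.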

Digital photography is increasingly making use of metadata tags. Photographers shooting Camera RAW file formats can use applications such as Adobe Bridge or Apple Computer's Aperture to work with camera metadata for post-processing. Users can also tag photos for organization purposes using Adobe's Extensible Metadata Platform (XMP) language, for example.

Program metadata

Metadata is casually used to describe the controlling data used in software architectures that are more abstract or configurable. Most executable file formats include what may be termed "metadata" that specifies certain, usually configurable, behavioral runtime characteristics. However, it is difficult if not impossible to precisely distinguish program "metadata" from general aspects of stored-program computing architecture; if the machine reads it and acts upon it, it is a computational instruction, and the prefix "meta" has little significance.

In Java, the class file format contains metadata used by the Java compiler and the Java virtual machine to dynamically link classes and to support reflection. The J2SE 5.0 version of Java included a metadata facility to allow additional annotations that are used by development tools.

In MS-DOS, the COM file format does not include metadata, while the EXE file and Windows PE formats do. These metadata can include the company that published the program, the date the program was created, the version number and more.

In the Microsoft .NET executable format, extra metadata is included to allow reflection at runtime.

Document metadata: Most programs that create documents, including Microsoft Word and other Microsoft Office products, save metadata with the document files. These metadata can contain the name of the person who created the file (obtained from the operating system), the name of the person who last edited the file, how many times the file has been printed, and even how many revisions have been made on the file. Other saved material, such as deleted text (saved in case of an undelete command), document comments and the like, is also commonly referred to as "metadata", and the inadvertent inclusion of this material in distributed files has sometimes led to undesirable disclosures.

For a list of executable formats, see object file.

Metamodels

Metadata on Models are called Metamodels. In Model Driven Engineering, a Model has to conform to a given Metamodel. According to the MDA guide, a metamodel is a model and each model conforms to a given metamodel. Meta-modeling allows strict and agile automatic processing of models and metamodels.

The Object Management Group (OMG) defines four layers of meta-modeling. Each level of modeling is defined and validated by the next layer:

M0: instance object, data row, record -> "John Smith"
M1: model, schema -> "Customer" UML Class or database Table
M2: metamodel -> Unified Modeling Language (UML), Common Warehouse Metamodel (CWM)
M3: meta-metamodel -> Meta-Object Facility (MOF)
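
The layering can be loosely mirrored in Python's own object model. This is only an analogy and an assumption of this sketch: Python's `type` is neither UML nor MOF, but the chain "instance, class, metaclass" shows the same idea of each level being described by the next.

```python
class Customer:                   # plays the role of M1: the model element "Customer"
    def __init__(self, name):
        self.name = name

john = Customer("John Smith")     # plays the role of M0: one concrete instance

m1 = type(john)                   # the class that describes the instance
m2 = type(m1)                     # the metaclass that describes classes
m3 = type(m2)                     # the metaclass describes itself, closing the chain

print(m1.__name__, m2.__name__, m3.__name__)
```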

Strange metadata

Since metadata are also data, it is possible to have metadata of metadata, or "meta-metadata". Machine-generated meta-metadata, such as the inverted index created by a free-text search engine, is generally not considered metadata, though.

Metadata that are embedded with content are called embedded metadata. A data repository typically stores the metadata detached from the data.

Digital library metadata

There are three categories of metadata that are frequently used to describe objects in a digital library [2][3]:

descriptive - Information describing the intellectual content of the object, such as MARC cataloguing records, finding aids or similar schemes. It is typically used for bibliographic purposes and for search and retrieval.
structural - Information that ties each object to others to make up logical units (e.g., information that relates individual images of pages from a book to the others that make up the book).
administrative - Information used to manage the object or control access to it. This may include information on how it was scanned, its storage format, copyright and licensing information, and information necessary for the long-term preservation of the digital objects.
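
The three categories can be pictured as three sections of one record. In the Python sketch below, the class name `DigitalObjectMetadata` and all field contents are invented for illustration and do not follow MARC or any other cataloguing scheme.

```python
from dataclasses import asdict, dataclass, field

@dataclass
class DigitalObjectMetadata:
    """One record per digital object, split into the three categories."""
    descriptive: dict = field(default_factory=dict)     # what the object is about
    structural: dict = field(default_factory=dict)      # how it relates to other objects
    administrative: dict = field(default_factory=dict)  # how it is managed and accessed

# Hypothetical record for one scanned page of a book.
page = DigitalObjectMetadata(
    descriptive={"title": "Example Book", "subject": "botany"},
    structural={"part_of": "book-0001", "sequence": 12},
    administrative={"format": "image/tiff", "rights": "public domain"},
)
print(asdict(page))
```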

References

William R. Durrell, Data Administration: A Practical Guide to Data Administration, McGraw-Hill, 1985

Ralph Kimball, The Data Warehouse Lifecycle Toolkit, Wiley, 1998, ISBN 0471255475

Guy V Tozer, Metadata Management for Information Control and Business Success, Artech House, 1999, ISBN 0890062803

Michael H. Brackett, Data Resource Quality, Addison-Wesley, 2000, ISBN 0201713063

David Marco, Building and Managing the Meta Data Repository: A Full Lifecycle Guide, Wiley, 2000, ISBN 0471355232

Adrienne Tannenbaum, Metadata Solutions: Using Metamodels, Repositories, XML, and Enterprise Portals to Generate Information on Demand, Addison-Wesley, 2002, ISBN 0201719762

David C. Hay, Data Model Patterns: A Metadata Map, Morgan Kaufmann, 2006, ISBN 0120887983

Bretherton, F. P. and Singley, P. T. 1994, Metadata: A User's View, Proceedings of the International Conference on Very Large Data Bases (VLDB), 1091-1094

R. Todd Stephens (2003). Utilizing Metadata as a Knowledge Communication Tool. Proceedings of the International Professional Communication Conference 2004. Minneapolis, MN: Institute of Electrical and Electronics Engineers, Inc.

External links

"Metacrap" - An opinion by Cory Doctorow on the limitations of metadata on the Internet
A review of Mac OS X v. 10.4's new metadata implementations
Meta Meta Data Data - Article by Ralph Kimball
Marine Metadata Interoperability Project - A collaborative attempt to address marine science metadata needs
Guidance and techniques for tagging and keywording images - Article by Third Light Ltd
"Enterprise Metadata SME" - Articles, Best Practices, and Publications on Enterprise Metadata
"Effective reporting of tacit (soft) information" - Article by Dr. Cyril Brookes
