
Information Integration

Introduction

The progress in distributed computing has played a significant role in facilitating the replacement of large, monolithic databases with smaller, conceptually self-contained databases. This has resulted in efficient yet ever-expanding clusters of heterogeneous sources of data (data sources) such as traditional databases, Web sites, e-mails, and operational applications. The need to view and analyze data from various data sources, unified under one application context, has become more prevalent and has gained prominence among data analysts in both the information technology industry and academia. For example, large-scale enterprises have a growing demand for enterprise integration products, while data mining researchers need a unified view of data from different sources to select and test their algorithms. The task of providing integrated data management under one umbrella is particularly challenging. In this chapter, we describe the need for and evolution of the Weather Tool system, a weather data provisioning and integration system accessible via a standard application programming interface (API). The chapter begins with an introduction to the evolution of sources of data and to the problem of weather data integration.

Some of the previous and existing data integration technologies and products are then summarized. The chapter also addresses some of the newer technologies from which data/information integration systems benefit greatly, including agent technology and the Semantic Web. We then provide an overview of the Weather Tool architecture (and an example application) that enables integration of several data sources to provide a unified data environment. The architecture discussion describes how data processing services are provisioned in a dynamically changing environment. The advantages and pitfalls of such a system are elaborated, followed by a discussion of potential improvements to the architecture. The chapter briefly discusses the core ideas of Quantum Leap Innovation, Inc.'s (QLI) intelligent data management (IDM) framework, the generalized, enhanced version of the Weather Tool effort. This framework is aimed at going beyond the Weather Tool's single-repository approach to data integration by incorporating several strategies for on-demand, remote data access and analysis, thereby avoiding the creation of redundant intermediary repositories. Finally, the chapter highlights the need for, evolution of, and importance of data/information integration efforts.

Evolution from Large Databases to a Large Data Environment

Historically, due to the high set-up and maintenance costs of databases, extremely large volumes of data were stored in one database; this strategy was considered efficient and had the advantage of easy access to all data. In reaction to the scale and complexity of such large databases, data analysts have often created and exploited selective views of subsets of them. Since the advent of cheaper, more flexible, and more efficient data-storage solutions, the large databases have made way for individual databases that store smaller sets of relevant data. These databases have varying data access mechanisms and protocols that usually depend on non-standard, product-specific technology. Some examples of commonly used relational databases are MySQL, Oracle, Postgres, and DB2.

Data-access mechanisms for these databases vary in implementation complexity and user-friendliness. The most common data-access mechanism, structurally straightforward but tricky to use, is the Structured Query Language (SQL), which allows data queries to be formed from a set of special commands and keywords, as in the sketch below.
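As a minimal illustration of SQL-based access, consider the following sketch; the weather table, its columns, and the sqlite3 backend are assumptions made for the example, not part of any system described in this chapter.

    import sqlite3

    # A throwaway in-memory database standing in for a weather data source.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE observations "
                 "(zip TEXT, obs_date TEXT, temp_f REAL, validated INTEGER)")
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?)",
                     [("19716", "2006-03-01", 41.0, 1),
                      ("19716", "2006-03-02", 38.5, 0)])

    # A typical query: validated observations for one zip code, newest first.
    rows = conn.execute("SELECT obs_date, temp_f FROM observations "
                        "WHERE zip = ? AND validated = 1 "
                        "ORDER BY obs_date DESC", ("19716",)).fetchall()
    print(rows)  # [('2006-03-01', 41.0)]

The special commands (SELECT, WHERE, ORDER BY) are expressive but unforgiving: the user must already know the table layout, and a small mistake in the keywords silently changes which data comes back.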

Relatively more expressive and easier-to-use approaches involve graphical user interface (GUI) applications developed using technologies such as Visual Basic, Java, and so forth. However, these involve long implementation times. More recent tools, developed for easier implementation, use, and data access, are Web sites that enable access to remote databases via Web pages. Web sites and portals that are easily customized to create data-access forms have become an important source of information in today's world. Some Web sites function not only as data-access mechanisms but also as sources of intermediate, transient data that can become crucial input to relevant applications; examples include unprocessed weather data, hourly stock market data, and event announcements. Transient data inherently does not require elaborate storage but often becomes invaluable for certain categories of consumers. For instance, unprocessed weather data can help in assessing weather and road conditions in the absence of validated weather data. Event announcements, though not useful once the event is over, can be of great use in estimating population densities and movements in cities and can also help in highlighting the factors that affect cities' economies.
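As an illustration of pulling such transient data from a Web page, the sketch below fetches a page and collects its table cells; the URL and page structure are hypothetical, and real providers each require their own source-specific parsing.

    import urllib.request
    from html.parser import HTMLParser

    class CellCollector(HTMLParser):
        # Gather the text of every table cell on a page.
        def __init__(self):
            super().__init__()
            self.in_cell = False
            self.cells = []
        def handle_starttag(self, tag, attrs):
            if tag == "td":
                self.in_cell = True
        def handle_endtag(self, tag):
            if tag == "td":
                self.in_cell = False
        def handle_data(self, data):
            if self.in_cell and data.strip():
                self.cells.append(data.strip())

    # Hypothetical transient-data page; the data may be gone tomorrow.
    html = urllib.request.urlopen("http://example.org/hourly-weather.html").read()
    parser = CellCollector()
    parser.feed(html.decode("utf-8"))
    print(parser.cells)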

The popular adoption of Web sites as front-ends to relational databases has resulted in the emergence of many heterogeneous data sources: a large data environment easily accessible via the Internet. Beyond the inherent efficiency and easy maintainability of relatively small, self-contained data sources, the industry has recognized the advantages of dynamically composing new data structures from the data in these sources. There is high potential that analyzing previously unknown combinations of data will yield useful results that can even be life-saving in certain situations. These combinations are not known in advance, and the data sources have complex and often unique access mechanisms, as they were not designed to be used in conjunction with each other. This makes it inefficient to integrate such sources statically using traditional IT development mechanisms, especially in view of the computational costs and the configuration overhead involved in run-time addition of new data sources. In the following subsection, we describe one such data integration problem in the weather domain, where all components of weather data are not available at one data source.

The Weather Problem

The data retrieval process assumes knowledge of at least the structure of the data, and at times of other constraints pertinent to the data source. The retrieval process entails the basic data processing steps, starting with data extraction, followed by data validation, completion, and optional transformation and storage. In the case of storage, one must also handle storage and consolidation of the corresponding metadata with the existing data. Increasing the number of data sources only escalates the complexity of providing these data processing services. Adding to the complexity of data retrieval, some types of data, such as news reports, demographic data, and weather data, change over time and impose extra processing to account for the constant updates. Furthermore, the data may need to be extracted from more than one relevant source, each with its own format. Data from different sources vary not just in quality and content, but also in data format and update frequency (how often data is made available). For example, news articles differ in content and quality over a period of time and across different newspapers and Web sites. Moreover, several news providers may write articles on different aspects of the same incident, and news is typically validated anywhere from a few hours to several days after the fact.
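The chain of steps just listed can be pictured as a small pipeline. Every name below is a stand-in chosen for illustration; the record layout and repository interface are assumptions, not a description of an actual system.

    def extract(source):
        # Pull raw records out of a single data source.
        return source.fetch()

    def validate(records):
        # Drop records that fail basic sanity checks.
        return [r for r in records if r.get("value") is not None]

    def complete(records, fallback_source):
        # Fill gaps from a secondary source when the primary came up short.
        return records if records else extract(fallback_source)

    def transform(records):
        # Convert records into the consumer's standard format.
        return [{"when": r["time"], "what": r["value"]} for r in records]

    def process(source, fallback_source, repository):
        records = transform(complete(validate(extract(source)), fallback_source))
        # Storage must also consolidate the corresponding metadata.
        repository.save(records, metadata={"count": len(records)})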

Figure 1. National Climatic Data Center's (NCDC) text-based validated historical data

Weather data is a good example of time-varying data. The National Climatic Data Center's (NCDC) Web site allows its users to view validated historical data in textual form (Figure 1); the National Oceanic and Atmospheric Administration's (NOAA) Web site shows forecast data for the next ten days in textual form (Figure 2); and the National Weather Service's (NWS) Web site has non-validated forecast data in graphical form (Figure 3). These sources all show different aspects of weather, potentially for the same geographical location, and two of them carry non-validated data that has the potential to be inconsistent with the third source, which claims higher quality in terms of completeness, accuracy, and so forth.

Figure 2. National Oceanic and Atmospheric Administration's (NOAA) textual forecast data
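To make the consistency issue concrete, here is a minimal sketch of reconciling records from such sources per location and date, preferring validated data whenever two records collide; the field names and sample values are illustrative assumptions.

    def merge(sources):
        # Merge per-(zip, date) records, letting validated data win.
        merged = {}
        for records in sources:
            for r in records:
                key = (r["zip"], r["date"])
                current = merged.get(key)
                if current is None or (r["validated"] and not current["validated"]):
                    merged[key] = r
        return merged

    noaa = [{"zip": "19716", "date": "2006-03-01", "temp": 43.0, "validated": False}]
    ncdc = [{"zip": "19716", "date": "2006-03-01", "temp": 41.0, "validated": True}]
    best = merge([noaa, ncdc])[("19716", "2006-03-01")]
    print(best["temp"])  # 41.0 -- the validated NCDC value wins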

A weather expert is usually responsible for collecting current weather data from a set of forecast data sources and may also be required to retrieve historical weather data from another source. The often unique formats of these data require the expert to follow source-specific data extraction procedures for every required location. The number of locations, designated by zip codes, grows quickly: even Delaware, one of the smaller states in the U.S., has 99 of them. After collecting the data, the expert may be required to validate it and convert it into a standard format. Sources with transient data entail provisioning of efficient storage in the form of caches or extra data repositories, which also avoids duplicating effort for future access by applications that require weather data.

Furthermore, as forecast weather data is validated (typically within two or three days), the expert must update the repository accordingly. The expert is also responsible for providing weather data to the relevant applications in the formats they require. If an application needs advanced meteorological data such as radiosonde (upper-air) data, the expert must additionally integrate weather data from the different sources per location and date.
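The routine just described is mechanical enough to automate. The sketch below shows one way such a loop could look; the fetch functions, the zip-code list, and the repository interface are all assumed for illustration and are not taken from the Weather Tool itself.

    def refresh_weather(zip_codes, fetch_forecast, fetch_validated, repository):
        # One update cycle over all locations of interest.
        for zip_code in zip_codes:
            # Store today's forecast so consumers have data immediately.
            for record in fetch_forecast(zip_code):
                repository.put(zip_code, record["date"], record, validated=False)
            # Validated data trails by two or three days; when it arrives,
            # it overwrites the provisional forecast for the same date.
            for record in fetch_validated(zip_code):
                repository.put(zip_code, record["date"], record, validated=True)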

Figure 3. National Weather Service HTML-based forecast data

Static and consistent data can be retrieved directly from the data sources. In the case of weather, however, where tools and applications (consumers) require time-variant data, extra care must be taken when automating the retrieval process to account for the series of data processing steps needed to make the data useful for analysis.

Consumers must maintain their own enormous databases in cases where the providers only store the data temporarily. Hence there is a need for a Weather Tool that enables a logical value chain, with individual system components carrying out the various services required for handling time-sensitive information. Although the industry is still looking for better solutions for realizing dynamic, distributed, real-time clusters of data sources, much of the foundational work has already been explored by the research community and recognized by the industry.