Benchmarks Online

Skip Navigation Links


Page One

Campus Computing News

UNTeCampus is Here!

Microsoft Products Available to UNT Employees at Low Cost

Data Warehousing for the UNT System

SPSS 12 Patches, Licenses, and Updates Now Available

ZoomText 8.0 is Now Available

The Macintosh Turns 20

EDUCAUSE in Dallas

Today's Cartoon

RSS Matters

The Network Connection

Link of the Month

WWW@UNT.EDU

Short Courses

IRC News

Staff Activities

Subscribe to Benchmarks Online
    

Data Warehousing for the UNT System*

By Dr. Dave Mukherjee, Programmer Analyst on EIS Project

Creating an infrastructure for managing data, information & knowledge resources


Overview

What is it all about?

Data warehousing is a process of creating an infrastructure and a system that transforms raw data into actionable information and eventually into intelligence and knowledge. Data warehousing provides for data gathering, cleansing, formatting, integration and reconstruction so as to provide information to its users in doing their day-to-day activities like planning, analysis, and decision-making. The ability of the UNT System to efficiently carry out its activities is directly dependent on its ability to access and use clean, consistent and correct data available in a timely fashion.

This document describes the nature of data, information, intelligence, knowledge and data warehousing. Further, it describes the road map to achieving the goals, objectives and strategies of the UNT System and the various options that are available towards accomplishing this goal. It also mentions the major guiding principles and major considerations that impact the delivery of the data warehouse.

Defining our definitions

 However, before we describe how data warehousing will help us create an environment to exploit data and information we need to understand the basic meaning of data, information, knowledge and intelligence.  Like a wise man once said before we argue let’s define our definitions.

Data, information, intelligence and knowledge

Data

Data is everywhere. Data is locked up in different systems, in spreadsheets, desktop databases, flat files, tapes, documents, text files all stored in different formats. This is unprocessed data that needs to be cleaned, formatted and integrated so that we can extract meaning from these sources. Data is like raw ore that needs to be processed so that it may be used for whatever final product the user may desire.

Data is raw, unprocessed and a potential source of information

Information

Information is data with meaning. It is providing data a context and relevance to the users current activities. Data without context is like, using the ore analogy, being provided the wrong type of ore for making metal. For data to be transformed into information, data not only needs to be in context but also needs to be consistent, clean, timely, and in the right format. In short, it needs to be processed so that data is usable for day-to-day activities of the users.

Information is data with meaning and context

Intelligence

One description of Intelligence is the ability to solve problems. Intelligence is the next step in the evolution – data, information and intelligence. When relevant information is processed further into structures and artifacts that enable decision making a system can be said to provide its users intelligence, the ability to aid in solving day-to-day problems.

Intelligence is the ability to solve problems

Knowledge

Knowledge is gained by continuous learning based upon the day-to-day activities of the systems users. As users learn to use the information at their disposal and solve problems and make decisions and gather intelligence they also arrive at the best possible way to do things. This experience of users is a valuable resource that needs to be saved and shared with all within the UNT System.

In fact, this knowledge is the lifeblood of the UNT System that should be carefully gathered and distributed and very zealously protected. Knowledge is what provides the UNT System the edge to excel and propel itself to growth and expansion in the future.

Knowledge is the collective wisdom of experience

Data Warehousing

Data warehousing

Data warehousing is a process of creating an infrastructure and a system that supports the information requirements of an UNT System. It is a process of transforming data to create an integrated repository that provides information to users. Data warehousing involves the creation of a system and infrastructure that enables the access and use of required information in an UNT System’s day-to-day activities.

Data warehousing is the process of creating an infrastructure and a system that leverages data and information to gain intelligence and knowledge

The road to data warehousing is varied. The next section describes the strategy, goals and objectives of the data warehouse implementation and a high-level view of how data warehousing is expected to evolve over time.


The Roadmap

Strategy, Goals, and Objectives

The overall objective

The overall objective of data warehousing is to make information available to all users in the UNT System who need it to accomplish their day-to-day activities. The objective is to provide information in a consistent format, be timely and easily accessible.

Implementation strategy

Implementation will be handled in a phased and incremental manner. The strategy is to evolve the system from phase to phase of the data warehouse implementation cycle. The first phase is to gather data from all known sources into one data staging repository for further processing. Next, the data will be reconstructed into separate business area models and structures that enable meaningful reporting and analysis. Finally, reporting, analysis and access tools and applications will need to be developed or provided to enable the use and access of information.

Extraction, Transformation & Loading and Staging

For the first phase of the data warehouse various tools and utilities are being utilized to access data from the mainframe, tapes, flat files on disk and funneled into a data staging repository also called the Operational Data Store (ODS) which will be a database used for further processing, data conversion, cleansing, operational reporting etc.

The purpose of having a single data repository is to ensure a single version of the truth or a sole source of information so that there is not a “Tower of Babel” effect in the way information is available in the UNT System.

It is important to note that this ODS is a database separate from the operational databases like ADABAS and the PeopleSoft databases that store transactions created in day-to-day operations of the UNT System. An ODS is created to provide a single source of information for further processing and transformation into subject area (Finance, HR, Student etc.) models and structures.

An ODS can be used for operational reporting and analysis for all data that is stored therein and for other uses like conversion and downloads for ad-hoc analysis.

Creating Data marts and Data warehouses

In the next phase the data in the ODS which is the data repository or the data store created in the previous phase will be re-designed, transformed, remodeled, and restructured to make it meaningful and efficient for reporting and analysis.

There is going to be extra effort taken during this phase to create models that reflect the decision support and analytical needs of the business units. This is very important as the value of the data warehouse is directly related to the meaningfulness and value of the information in the business activities of the business unit.

Tools and Interfaces

We will take an evolutionary process of acquiring tools and creating interfaces to information stored in the ODS and the data marts. The current approach is to use available tools like Crystal Reports, Access and Excel to report and analyze information. As needs evolve we will evaluate appropriate tools and interfaces that can be used for the decision support needs of the various business units.

The next section describes how the need for information varies within the UNT System. The decision to use different tools is dependent on how the data is used and by whom in the UNT System.

The decision support pyramid

Decision support information may be used differently by different people in the UNT System based upon their need to either do static reporting, ad-hoc reporting and what-if analyses or for advanced analysis like data mining. This is an important concept to understand as most decision support requirements are usually met by static reporting and with a little ad-hoc reporting and data mining, required by a few power users. Using this 80/20 rule 80% of the users decision support needs can be met with static reporting and only 20% of the users need advanced ad-hoc reporting, data mining capabilities.

This will be a guiding principle for our strategy in addressing general reporting and data access needs and thereafter evolve the product acquisition to ad-hoc reporting and data mining tools.

Decision Support Pyramid


Planning Options

Prioritizing the various initiatives

It is fairly easy to visualize the different phases of delivery described in the road map of creating an ODS, creating data marts and providing access tools and interfaces. However, there are a myriad of tools, products, and methods to achieve the creation of a data warehouse. Over the next year and beyond we need to consider the various product options available to us to evolve from phase to phase and from data and information to intelligence and knowledge.

Legacy data extraction

The current initiative is in legacy data extraction and consolidation in the ODS. This initiative is well under way and involves the use of Extraction tools like Informatica that enable the extraction of data from various sources like the mainframe, transformation into the right format like EBCIDIC to ASCII and loading the data to an Oracle database.

No major changes are expected in the next year in terms of the tools and utilities needed to accomplish extractions and to load the ODS.

Central report repository

The next major initiative that impacts data warehousing is the need for report access and distribution once the data is loaded in the ODS. Users will use reporting tools like Crystal Reports to develop reports and publish such reports for distribution. There needs to be a report access and distribution system in place to first house the reports in a report repository and next enable distribution to those who are authorized to view it.

The time line for this initiative needs to synchronize with the go-live dates for PeopleSoft modules for each team (Finance, HR, Student). Effort is underway to get the product in place by early next year.

Reporting Database Service (RDS)

The creation of subject area data marts is an involved process of understanding the decision support needs of the business units and the specific models for reporting and analysis based on the data objects in the databases. This can be a time consuming process and is the next phase activity for data warehousing after the ODS is created.

One option is to deploy the Reporting Database Service (RDS) provided by PeopleSoft to quickly bring up another database based on the PeopleSoft transactional tables but restructured for decision support performance.

The time line for implementing this option if chosen is early to mid-2004 as some of the business units would have already transitioned to the PeopleSoft system by then.

Enterprise Performance Management (EPM)

EPM is an analytical interface to PeopleSoft data and involves the creation of business warehouses and the support infrastructure to access this information using a web server and reporting application server and end user tools and interfaces. Some modules for EPM have been acquired under the PeopleSoft contract and implementation for HR and Finance.

Deployment of these EPM modules will need to be started by early next year to avail of the benefits of these tools for score carding, budgeting and analysis and workforce analytics.

Customer Relationship Management (CRM)

Related to the EPM modules in PeopleSoft CRM and EPM for CRM is another initiative that needs to be implemented late 2004 and after.

This initiative utilizes history data regarding alumni, students and other prospects to better guide, craft and position our message and offering to customers. This module too has EPM modules that provide analysis and reporting that is crucial to understanding our customers and their behavior vis-à-vis UNT.

PeopleSoft Enterprise Warehouse (EW)

Another option is to take a total PeopleSoft approach to data warehousing and deploy all their products and build a data warehouse. Obviously, there are implications to doing this as there is dependence on one vendor’s solution and this may not be practical or economically viable.

This initiative as others will have to be implemented in phases and in an iterative manner and time lines are expected to be mid-2004 and later.

Informatica PeopleSoft Warehouse

Other vendors like Informatica have developed interfaces to the PeopleSoft data model and can provide reporting and analytics using their reporting interfaces like Informatica’s Power Analyzer product. This may be a competitively priced option and could be an option worth evaluating.

The time line to evaluate this option would be towards the mid-2004 time period and could be compared to RDS and PeopleSoft Enterprise Warehouse.

As you can see there are quite a few initiatives that need to be considered. The next step is to plan the next set of initiatives based on business need and other practical considerations like cost, support and maintenance, etc.

The Data Warehouse Framework

The data warehouse framework presents the various aspects of the data warehouse solution and provides a view of how the various components relate to each other and the flow of data from the various sources to the users.

Data Warehouse Framework diagram


Evolution of the Data Warehouse

Phased delivery and Incremental growth

The most important guiding principle for the planning process for the evolution of the data warehouse is phased delivery.  Data warehousing has been known to evolve successfully in a phased manner and attempts to take a “big bang” approach and do everything in one pass has been unsuccessful.

 Taking and incremental growth approach in each phase like staging and doing each business unit incrementally, for instance, and gradually evolving through each phase is the preferred approach to be employed.

Build vs. Buy

It is important to consider the implications of doing everything by hand like loading tables, creating data marts, reporting and data access methods. Sometimes it is efficient and less time consuming to use pre-built tools and utilities for use in each phase of the data warehouse. In fact, it is advisable to create processes and procedures around the use of tools and minimize custom coding to enable ease of use and maintenance of the system over time.

Management and User Involvement

Another prerequisite for the successful evolution of the data warehouse is the commitment and involvement of the management of each business unit, EIS management and the technical team managers and leads in each phase of the data warehouse implementation.

This needs to be implemented by creating data warehouse advisory and coordination groups consisting of business and technical team members who actively participate in providing input to the various planning, management and implementation aspects of the data warehouse.

Better communication and involvement leads better understanding and helps reduce user resistance to data warehousing as well.

Goals & Objectives

It is critical to clearly establish goals and objectives for each phase and deliverables for each business unit and set expectation about what will be delivered. Nothing is as unattainable as a nebulous set of goals.

Best Practices

All data warehouse implementation will be guided by industry accepted best practices, standards and implementation methodologies.

Finally, the value of data warehousing depends on user perception and acceptance. No system is useful if the users do not use it. To make that possible there needs to constant continuous communication, feedback, training and management support for enhancements and growth which need to be incorporated into the process of evolving the data warehouse.


Major Considerations

Issues for consideration

Some issues to consider that will impact the delivery of the data warehouse are cost, storage, human resources, software and the required time schedule. These are important issues that need to be considered as we proceed down the path of planning and designing for the data warehouse.

Cost vs. Value

One of the most obvious considerations is cost of the solution. There are so many options available to us and the cost of the solution needs to match the requirements and the value that the solution brings to the UNT System.

Storage

Another issue to consider is that of data storage. The legacy data dates back to the early 1990s and sometimes into the 80s. It is important to retain as much data as is needed for the needs of the business units and care needs to taken to estimate the storage space requirements so that storage can be arranged for and costs related to hardware and maintenance can be budgeted.

Human Resources

Data warehouses have human resource and manpower needs as well. Planning, designing, implementing and maintaining data warehouses require dedicated teams for development and implementation; they require support and interaction with network and database support teams, security and user administration teams, application development teams, functional analysts and managers to mention a few. This is something that needs to planned and provided for to have a efficient and fully functioning data warehouse.

Software & Hardware

A data warehouse involves many software and hardware platforms. Data warehouses require dedicated hardware like servers for ETL, Report servers, Business Intelligence servers, and other web and application servers. It also requires dedicated databases for different data repositories. The software that impacts data warehousing ranges from ETL, data cleansing, data profiling, data auditing software to business intelligence tools, reporting and analytical tools, pre-built data warehouses, packaged implementations, data modeling tools, and data mining software.

Each of these hardware and software products and platforms need to be carefully planned for as the success of certain parts of the business depends on the timely availability of these products and platforms.

Schedule

The time line for data being available in the data warehouse is dependent on the needs of each business area. One of the goals of data warehousing is to provide timely data for reporting and analysis. The goal is to synchronize legacy data delivery to the PeopleSoft go live dates so that the data warehouse is populated with legacy data and is ready to be integrated with current data to provide cross period analysis. This is a major consideration as this synchronized delivery is only possible if business units recognize the value and plan for time and human resources to support the data warehouse team to get data validation and verification completed in time. The data warehouse delivery schedule is important as an early delivery reduces the cost to maintain parallel systems and reduces support and maintenance costs for legacy systems and allows administrative computing to focus on current and future platforms while retaining the data from legacy for future use.


* This article is also available at this location (best viewed with Microsoft Explorer): https://projectweb.cc.unt.edu/IMD/White%20Papers/Forms/AllItems.htm.