The purpose of ETL (Extract, Transform, Load) tools is to help with the consolidation
of data that is dispersed throughout the information system. Data is stored in disparate
applications, databases, files, and operating systems, and in incompatible formats. The
consequences of such dispersal of information can be dire: for example, different
business units operating on different data will report conflicting results, and information
cannot be shared across different entities of the same business.
Imagine the marketing department reporting on the success of their latest campaign
while the finance department complains about its lack of efficiency. Both have
numbers to back up their assertions, but the numbers do not match!
What could be worse than a shipping department that struggles to understand
customer orders, or a support department that cannot confirm whether a customer
is current with his/her payment and should indeed receive support? The examples
are endless.
The only way to have a centralized view of the information is to consolidate the
data, whether in a data warehouse, in a series of data marts, or by normalizing
the data across applications with master data management (MDM) solutions. ETL
tools usually come into play when a large volume of data has to be exchanged (as
opposed to Service-Oriented Architecture infrastructures for instance, which would
be more transaction based).
In the early days of ETL, databases had very weak transformation functions. Apart
from using an insert or a select statement, SQL was a relatively limited language. To
perform heavy-duty, complex transformations, vendors put together transformation
platforms: the ETL tools.
Over time, the SQL language has evolved to include more and more transformation
capabilities. You can now go as far as handling hierarchies, manipulating XML
formats, using analytical functions, and so on. It is not by chance that 50 percent of
the ETL implementations in existence today are done in plain SQL scripts—SQL
makes it possible.
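As a simple illustration of these capabilities, here are two hand-written queries (they are not taken from ODI) showing an analytical function and a hierarchical query in Oracle SQL; the SALES and EMPLOYEES tables and their columns are hypothetical:

    -- Running total per customer with an analytical (window) function;
    -- SALES(customer_id, order_date, amount) is a hypothetical table.
    SELECT customer_id,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_date) AS running_total
    FROM   sales;

    -- Walking an organizational hierarchy with Oracle's CONNECT BY;
    -- EMPLOYEES(employee_id, manager_id, last_name) is a hypothetical table.
    SELECT LEVEL, last_name
    FROM   employees
    START WITH manager_id IS NULL
    CONNECT BY PRIOR employee_id = manager_id;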
This is where the ODI ELT architecture (Extract-Load-Transform—the inversion
in the acronym is not a mistake) comes into play. The concept with ELT is that
instead of extracting the data from a source, transforming it with a dedicated
platform, and then loading it into the target database, you will extract from the
source, load into the target, and then transform inside the target database,
leveraging SQL for the transformations.
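To make the pattern concrete, here is a minimal hand-written sketch of the ELT flow, assuming hypothetical CUSTOMERS_STG and CUSTOMERS_DIM tables that both reside in the target database; it is only indicative of the approach, not of the code ODI generates:

    -- Step 1 (extract and load): the source rows are assumed to have been
    -- loaded as-is into a staging table (CUSTOMERS_STG) in the target database.

    -- Step 2 (transform): the transformation itself runs as set-based SQL
    -- inside the target database.
    INSERT INTO customers_dim (customer_id, full_name, country_code)
    SELECT stg.customer_id,
           INITCAP(stg.first_name) || ' ' || INITCAP(stg.last_name),
           UPPER(stg.country_code)
    FROM   customers_stg stg
    WHERE  stg.customer_id IS NOT NULL;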
To some extent, ETL and ELT are marketing acronyms. When you look at ODI
for instance, it can perform transformations on the source side as well as on the
target side. You can also dedicate a database or schema to the staging and
transformation of your data, ending up with something closer to an ETL
architecture. Similarly, some ETL tools also have the ability to generate SQL code
and to push some transformations down to the database level.
The key differences then for a true ELT architecture are as follows:
• The ability to dynamically manage a staging area (location, content,
automatic management of table alterations)
• The ability to generate code on source and target systems alike, in the
same transformation
• The ability to generate native SQL for any database on the market—most
ETL tools will generate code for their own engines, and then translate that
code for the databases—hence limiting their generation capacities to their
ability to convert proprietary concepts
• The ability to generate DML and DDL, and to orchestrate sequences of
operations on the heterogeneous systems
In a way, the purpose of an ELT tool is to provide the comfort of a graphical interface
with all the functionality of traditional ETL tools, to keep the efficiency of SQL
coding with set-based processing of data in the database, and to limit the overhead
of moving data from place to place.
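The following hand-written sketch shows the kind of DDL and DML sequence such an architecture relies on: a staging table is created, loaded, used for a set-based integration, and then dropped. All names are hypothetical and the statements are only indicative of the pattern, not of what ODI actually generates:

    -- DDL: create a temporary loading table in a staging schema
    CREATE TABLE stg.orders_load (
        order_id    NUMBER,
        customer_id NUMBER,
        amount      NUMBER
    );

    -- DML: bring the source extract into the staging table
    -- (here through a database link; a bulk-load utility could be used instead)
    INSERT INTO stg.orders_load (order_id, customer_id, amount)
    SELECT order_id, customer_id, amount
    FROM   orders@src_db;

    -- DML: integrate into the target table with a single set-based statement
    MERGE INTO dwh.orders t
    USING stg.orders_load s
    ON (t.order_id = s.order_id)
    WHEN MATCHED THEN
        UPDATE SET t.customer_id = s.customer_id, t.amount = s.amount
    WHEN NOT MATCHED THEN
        INSERT (order_id, customer_id, amount)
        VALUES (s.order_id, s.customer_id, s.amount);

    -- DDL: clean up the staging area when done
    DROP TABLE stg.orders_load;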
In this chapter we will focus on the architecture of Oracle Data Integrator 11g, as
well as the key concepts of the product. The topics we will cover are as follows:
• The elements of the architecture, namely, the repository, the Studio, the
Agents, the Console, and integration into Oracle Enterprise Manager
• An introduction to key concepts, namely, Execution Contexts, Knowledge
Modules, Models, Interfaces, Packages, Scenarios, and Load Plans
ODI product architecture
Since ODI is an ELT tool, it requires no other platform than the source and target
systems. But there still are ODI components to be deployed: we will see in this
section what these components are and where they should be installed.
The components of the ODI architecture are as follows:
• Repository: This is where all the information handled by ODI is stored,
namely, connectivity details, metadata, transformation rules and scenarios,
generated code, execution logs, and statistics.
• Studio: The Studio is the graphical interface of ODI. It is used by
administrators, developers, and operators.
• Agents: The Agents can be seen as orchestrators for the data movement and
transformations. They are very lightweight Java components that do not
require their own server—we will see in detail where they can be installed.
• Console: The Console is a web tool that lets users browse the ODI
repository, but it is not a tool used to develop new transformations. It can
be used by operators though to review code execution, and start or restart
processes as needed.
• The Oracle Enterprise Manager plugin for ODI integrates the monitoring of
ODI components directly into OEM so that administrators can consolidate
the monitoring of all their Oracle products in one single graphical interface.
At a high level, here is how the different components of the architecture
interact with one another. The administrators, developers, and operators
typically work with the ODI Studio on their machine (operators also have the
ability to use the Console for a more lightweight environment). All Studios
typically connect to a shared repository where all the metadata is stored. At
run time, the ODI Agent receives execution orders (from the Studio, or any
external scheduler, or via a Web Service call). At this point it connects to the
repository, retrieves the code to execute, adds last-minute parameters where
needed (elements like connection strings, schema names where the data
resides, and so on), and sends the code to the databases for execution. Once the
databases have executed the code, the agent updates the repository with the
status of the execution (successful or not, along with any related error message)
and the relevant statistics (number of rows, time to process, and so on).
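To give a feel for that last-minute resolution, the sketch below contrasts a code template with the statement the agent would actually send once the physical details are known. The placeholder notation is purely illustrative (it is not ODI's substitution syntax) and the schema and table names are hypothetical:

    -- Code template stored in the repository (the <...> placeholders are
    -- purely illustrative, not actual ODI substitution syntax):
    --   INSERT INTO <TARGET_SCHEMA>.customers_dim
    --   SELECT ... FROM <STAGING_SCHEMA>.customers_stg
    --
    -- Statement sent to the target database at run time, once the agent has
    -- resolved the physical schema names for the chosen environment:
    INSERT INTO dwh_prod.customers_dim (customer_id, full_name)
    SELECT customer_id, first_name || ' ' || last_name
    FROM   stg_prod.customers_stg;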
ODI repository
To store all its information, ODI requires a repository. The repository is by default a
pair of schemas (called Master and Work repositories) stored in a database. Unless
ODI is running in a near real-time fashion, continuously generating SQL code for
the databases to execute, there is no need to dedicate a database to the
ODI repository. Most customers leverage existing database installations, even if
they create a dedicated tablespace for ODI.
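As an illustration, here is a minimal sketch of the kind of preparation a DBA might perform on an Oracle database before the repository schemas are created; the names and sizes are hypothetical, and this is not the official installation procedure:

    -- A dedicated tablespace for the repository (name and size are hypothetical)
    CREATE TABLESPACE odi_repo DATAFILE 'odi_repo01.dbf' SIZE 2G;

    -- Database users that will host the Master and Work repository schemas
    CREATE USER odi_master IDENTIFIED BY "a_strong_password" DEFAULT TABLESPACE odi_repo;
    CREATE USER odi_work   IDENTIFIED BY "a_strong_password" DEFAULT TABLESPACE odi_repo;
    GRANT CONNECT, RESOURCE TO odi_master, odi_work;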
Repository overview
The only element you will never find in the repository is the actual data processed
by ODI. The data will be in the source and target systems, and will be moved
directly from source to target. This is a key element of the ELT architecture. All other
elements that are handled through ODI are stored into the repository. An easy way
to remember this is that everything that is visible in the ODI Studio is stored in the
repository (except, of course, for the actual data), and everything that is saved in the
ODI Studio is actually saved into the repository (again, except for the actual data).
The repository is made up of two entities, which can be stored in two separate
database schemas, namely, the Master repository and the Work repository.
We will look at each one of these in more detail later, but for now you can consider
that the Master repository will host sensitive data whereas the Work repository will
host project-related data. A limited version of the Work repository can be used in
production environments, where the source code is not needed for execution.
Repository location
Before going into the details of the Master and Work repositories, let's first look into
where to install the repository.
The repository is usually installed in an existing database, often in a separate
tablespace. Even though ODI is an Oracle product, the repository does not have to
be stored in an Oracle database (but who would not use the best database in the
world?). Generally speaking, the databases supported for the ODI repository are
Oracle, Microsoft SQL Server, IBM/DB2 (LUW and iSeries), Hypersonic SQL, and
Sybase ASE. Specific versions and platforms for each database are published by
Oracle and are available at:
http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html.
It is common for the repository to share the same system as the target database.
We will now look into the specifics of Master and Work repositories.
Master repository
As stated earlier, the Master repository is where the sensitive data will be stored.
This information is of the following types:
• All the information that pertains to ODI user privileges will be saved
here. This information is controlled by administrators through the Security
Navigator of the ODI Studio. We will learn more about this navigator when
we look into the details of the Studio.
• All the information that pertains to connectivity to the different systems
(sources and targets), and in particular the requisite usernames and
passwords, will be stored here. This information will be managed by
administrators through the Topology Navigator.
• In addition, whenever a developer creates several versions of the same object,
the successive versions of that object are stored in the Master repository.
Versioning is typically accessed from the Designer Navigator.
Work repository
Work repositories will store all the data that is required for the developers to design
their data transformations. All the information stored in the Work repository is
managed through the Designer Navigator and the Operator Navigator. The Work
repository contains the following components:
• The Metadata that represents the source and target tables, files, applications,
and message buses. These will be organized in Models in the Designer Navigator.
• The transformation rules and data movement rules. These will be organized
in Interfaces in the Designer Navigator.
• The workflows designed to orchestrate the transformations and data
movement. These are organized in Packages and Load Plans in the
Designer Navigator.
• The job schedules, if the ODI Agent is used as the scheduler for the
integration tasks. These can be defined either in the Designer Navigator
or in the Operator Navigator.
• The logs generated by ODI, where the generated code can be reviewed,
along with execution statistics and statuses of the different executions
(running, done successfully or in error, queued, and so on). The logs
are accessed from the Operator Navigator.
Execution repository
In a production environment, most customers do not need to expose the source
code for the processes that are running. Modifications to the processes that run
in production will have to go through a testing cycle anyway, so why store the
source code where one would never access it? For that purpose, ODI offers an
execution repository that only stores the operational metadata, namely, generated
code, execution results, and statistics. The type of Work repository (execution or
development) is selected at installation time. A Work repository cannot be converted
from development to execution or execution to development—a new installation will
be required if a conversion is needed.
Studio
The ODI Studio is the graphical interface provided to all users to interact with ODI.
People who need to use the Studio usually install the software on their own
machine and connect to a shared repository. The only exception would be when
the repository is not on the same LAN as the Studio. In that case, most customers
use Remote Terminal Service technologies to ensure that the Studio is local to the
repository (same LAN). Only the actual display is then sent over the WAN.
Agent
The ODI Agent is the component that will orchestrate all the operations. If SQL code
must be executed by a database (source or target), the agent will connect to that
database and will send the code (DDL and DML, as needed) for that database to
perform the transformations. If utilities must be used as part of the transformations
(or, more likely, as part of the data transfer) then the agent will generate whatever
configuration files or parameter files are required for the utility, and will invoke this
utility with the appropriate parameters. SQL*Loader, BCP, MultiLoad, and NZload
are just a few examples of such utilities.
There are two types of ODI Agent, namely, the standalone agent (available in all
releases of ODI) and the JEE agent (available with ODI 11g and after) that runs on
top of WebLogic Server. Each type has its own benefits, and both types of agents
can co-exist in the same environment:
• The JEE agent will take advantage of WebLogic in terms of high availability
and pooling of the connections
• The standalone agents are very lightweight and can easily be installed on any
platform. They are small Java applications that do not require a server.
A common configuration is to use the JEE agent as a "Master" agent, whose sole
purpose is to distribute execution requests across several child agents. These
children can very well be standalone agents. The master agent will know at all
times which children are up or down. The master agent will also balance the
load across all child agents.