
SOLUTION OF EXTRACTION STAGE

To meet the data integration requirements of a large-scale environment such as a data warehouse, it is necessary to develop an enhanced ETL framework that integrates heterogeneous data sources. The enhanced ETL solution includes two core components: an operational data store (ODS) and a two-step ETL subcomponent. The ODS integrates the various operational data sources (ODSs) into a single relational database, while the corresponding ETL tool takes two steps to dynamically load the ODS into the detailed data warehouse (DCDW). Because the data sources are heterogeneous and come in many formats (e.g. Excel, Oracle, and SQL Server), the proposed enhanced ETL system was designed on the technology of Eclipse Rich Client Platform (RCP) plug-ins to meet these requirements. The ETL component is described below, beginning with the ODS implementation.
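
As a rough, hypothetical illustration of this two-step flow, the Python sketch below extracts records from several source formats into a single relational ODS and then transforms them into the DCDW. It is a minimal sketch under stated assumptions, not the actual RCP plug-in implementation; the table names, the read_rows stub, and the trivial transform are all invented for illustration.

    import sqlite3

    def read_rows(source_kind):
        # Stand-in for the per-format readers (Excel, Oracle, SQL Server).
        return [f"{source_kind}-row-{i}" for i in range(3)]

    def main():
        ods = sqlite3.connect(":memory:")   # step 1 target: one relational ODS
        dcdw = sqlite3.connect(":memory:")  # step 2 target: the DCDW
        ods.execute("CREATE TABLE ods_records (source TEXT, payload TEXT)")
        dcdw.execute("CREATE TABLE dcdw_facts (source TEXT, value TEXT)")

        # Step 1: integrate each heterogeneous operational source into the ODS.
        for kind in ("excel", "oracle", "sqlserver"):
            ods.executemany("INSERT INTO ods_records VALUES (?, ?)",
                            [(kind, row) for row in read_rows(kind)])

        # Step 2: transform the consolidated ODS records into the DCDW.
        for source, payload in ods.execute("SELECT source, payload FROM ods_records"):
            dcdw.execute("INSERT INTO dcdw_facts VALUES (?, ?)",
                         (source, payload.upper()))  # trivial stand-in transform

    main()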

OPERATIONAL DATA STORE (ODS) DATA MODEL
An ODS is a subject-oriented, integrated, volatile data set used to store the heterogeneous data. With the ODS data model, data from the different sources can be integrated easily.
Following data warehouse architecture, the ODS treats the data, its type, its content, and its level of detail as subject domains, modeled around the main activities of the business. The data is then divided into tables according to its content, with each table grouping records that share some kind of uniqueness. All tables in the ODS are related to one another through a relational schema, and each record in each table carries a unique identifier, so data can be extracted easily without any extra operations. With the support of the ODS data model, the ETL functionality can extract data records directly from the original operational data sources into the ODS database. The data stored in the ODS database can then be further transformed into the DCDW using specialized procedures, while the ODS itself continues to provide direct access to the original source data.
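
As a concrete (and purely illustrative) example of such a model, the sketch below creates two related subject-area tables whose records are keyed by unique identifiers, so a record can be extracted directly by key; the table and column names are invented, not taken from the paper's schema.

    import sqlite3

    ods = sqlite3.connect(":memory:")
    ods.executescript("""
        CREATE TABLE customer (                  -- one subject-area table
            customer_id INTEGER PRIMARY KEY,     -- unique record identifier
            name        TEXT
        );
        CREATE TABLE purchase (                  -- related table in the schema
            purchase_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(customer_id),
            amount      REAL
        );
    """)

    # With a unique identifier, a record is extracted directly by key,
    # with no extra operations:
    ods.execute("INSERT INTO customer VALUES (1, 'Acme Ltd.')")
    row = ods.execute("SELECT * FROM customer WHERE customer_id = ?", (1,)).fetchone()
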
To avoid the data source overloading problem, the warehouse needs to maintain a correct mapping between the ODS and the user requests so that operational activities are managed properly.
The read-operation queue should be managed properly by using a change data capture (CDC) processor to perform the transactions. Merely limiting the thread pool cannot overcome this problem on its own; instead, the pool should sort incoming requests into different queues so that they can be managed easily. By limiting the HTTP sessions among the different queues on the basis of similar performance, the load on the data source is kept to a minimum. Because requests of the same type are grouped together in the thread pool, the processors do not have to handle different types of requests simultaneously.
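
A minimal sketch of this grouping, assuming two request types and arbitrary queue capacities, is shown below; the handler is a stand-in for the real processors.

    import queue
    import threading

    # One bounded queue per request type; each worker thread drains only
    # its own queue, so it never mixes different kinds of operations.
    queues = {"read": queue.Queue(maxsize=50),    # capacity limits are assumed
              "report": queue.Queue(maxsize=10)}

    def handle(kind, request):
        print(f"{kind} worker processed: {request}")

    def worker(kind):
        q = queues[kind]
        while True:
            request = q.get()       # only ever receives one type of request
            if request is None:     # shutdown signal
                break
            handle(kind, request)
            q.task_done()

    threads = [threading.Thread(target=worker, args=(kind,)) for kind in queues]
    for t in threads:
        t.start()

    queues["read"].put("SELECT * FROM ods_records")  # routed by request type
    queues["report"].put("monthly summary")

    for q in queues.values():       # wait for the work, then shut down
        q.join()
    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
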
To keep the data source from getting stuck under overload, the data warehouse also needs to manage the thread pool itself, since the pool may hold many requests of the same type demanding the same operation at the same time. To avoid the pool getting stuck, a request should be terminated after a specific time, and an out-of-memory exception should raise an alert.
When memory is full, the out-of-memory handling must reject any further requests from entering the thread pool.
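
The sketch below illustrates both safeguards under stated assumptions: a bounded queue stands in for the memory limit and rejects requests when it is full, and the caller stops waiting for a request after a time limit; the bounds and the timeout value are arbitrary.

    from concurrent.futures import ThreadPoolExecutor, TimeoutError
    import queue
    import time

    pending = queue.Queue(maxsize=100)  # bound stands in for available memory

    def run_request(request):
        time.sleep(0.01)                # stand-in for the real operation
        pending.get_nowait()            # free the slot when work completes
        return f"done: {request}"

    def submit(pool, request):
        try:
            pending.put_nowait(request) # reject instead of exhausting memory
        except queue.Full:
            raise RuntimeError("pool full: request rejected")
        return pool.submit(run_request, request)

    with ThreadPoolExecutor(max_workers=8) as pool:
        future = submit(pool, "read #1")
        try:
            print(future.result(timeout=5))  # stop waiting after 5 seconds
        except TimeoutError:
            future.cancel()                  # drop the request if still queued
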
Solution of the Transformation Stage:
The main problem in the transformation stage is master data overhead. Data is divided into two types: the relatively static data in storage, called master data, and the transactional data. Because the master data changes far less often than the transactional data, data management becomes one of the most important and complex operations facing any data warehouse. Faced with the non-integrity and inaccuracy of manually assembled, inconsistent, redundant, and outdated data, many organizations are seeking a new data management solution that can seamlessly convert hundreds of data sources into powerful data assets shared across the data warehouse.
Master Data Management (MDM) would be the best solution to this problem, since it also keeps the aggregation of the data under control. Master views are created by integrating data from a variety of internal data sources, such as enterprise resource planning (ERP) systems; MDM virtualizes the data sources and operates on them. MDM is a mechanism that divides the data into three core parts:
Unique Identifiers:
The master data usually describes the customers or clients of the organization, who remain constant across all the transactions they make with it. Since this master data, the client, remains constant, MDM gives it a unique identifier: a unique identifier number is assigned to each client so that the client can easily be recognized. A table is set up keyed on this unique identifier, so these records cannot be changed. The unique identifier is matched against the related transactions without requiring any change to them.
Attributes:
Attributes are the qualities on the basis of which an entity is uniquely identified. MDM manages each entity with the help of its attributes, which may describe a type of transaction or any other type of request made by the customer to the data warehouse. MDM creates and manages a table of attributes for each identified entity; this table groups transactions of the same type, which reduces the aggregation load on the data sources. The attribute table is joined with the identified entity and the transaction is then completed within a specific refresh cycle.
Transactions:
Transactions are the operations that an entity performs in the organization. MDM also divides the transactions into different tables on the basis of their nature, since transactions can be of many kinds, such as a financial transaction or a sales-representative transaction.
This division of transactions reduces the master data overhead, because the transaction tables keep data aggregation under control. For the transactions of a uniquely identified entity, data from the different data sources can be placed into the same type of transaction table. Thanks to MDM, the data sources are managed easily and without inconsistency, and data arriving from the sources can be placed straight into the table where it belongs.
The aggregation of the data is easy to manage because of MDM's division into tables, so clients' data requests can easily be matched with what they demand, and records in the transaction tables can easily be updated. The data sources may differ across the many operations, but because MDM uniquely identifies each entity on the basis of its attributes, the data can be placed into the transaction group that matches its own nature.
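
Putting the three parts together, a purely illustrative schema might look like the following sketch; the two transaction natures (financial and sales) come from the examples above, while the table and column names are assumptions.

    import sqlite3

    mdm = sqlite3.connect(":memory:")
    mdm.executescript("""
        CREATE TABLE master_entity (       -- uniquely identified clients
            entity_id INTEGER PRIMARY KEY
        );
        CREATE TABLE attribute (           -- qualities identifying an entity
            entity_id INTEGER REFERENCES master_entity(entity_id),
            name  TEXT,
            value TEXT
        );
        CREATE TABLE financial_txn (       -- one table per transaction nature
            entity_id INTEGER REFERENCES master_entity(entity_id),
            amount REAL
        );
        CREATE TABLE sales_txn (
            entity_id INTEGER REFERENCES master_entity(entity_id),
            item TEXT
        );
    """)

    # A record arriving from any source is filed under the table that
    # matches its nature, keyed to the unchanging master identifier:
    mdm.execute("INSERT INTO master_entity VALUES (42)")
    mdm.execute("INSERT INTO attribute VALUES (42, 'region', 'north')")
    mdm.execute("INSERT INTO financial_txn VALUES (42, 99.50)")
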
Problem in the Loading Stage:
The issue of query contention and scalability is the most difficult issue facing organizations deploying real-time data warehouse solutions. Data warehouses were separated from transactional systems in the first place because the types of complex analytical queries run against warehouses don’t “play well” with lots of simultaneous inserts, updates, or deletes.
Usually the scalability of data warehouse and OLAP solutions is a direct function of the amount of data being queried and the number of users simultaneously running queries. Given a fixed amount of data, query response time grows in proportion to the number of users on the system: heavy concurrent usage causes reports to take longer to execute.
While this is still true in a real-time system, the additional burden of continuously loading and updating data further strains system resources. Unfortunately, the additional burden of a continuous data load is not just equivalent to one or two additional simultaneously querying users, because of the contention between data inserts and typical OLAP select statements. While it depends on the database, the contention between complex selects and continuous inserts tends to severely limit scalability. Surprisingly quickly, the continuous data loading process may become blocked, or what used to be fast queries may begin to take intolerably long to return.
There are ways to get around this problem, including the near-real-time approaches described in previous sections, but where true real-time is a hard and fast requirement, the approaches described below help to address this problem in various ways.
Solution 4a: Simplify and Limit Real-time Reporting
Many real-time warehousing applications are relatively simple. Users who want to see up-to-the-second data may have relatively simple reporting requirements. If reports based on real-time data can be limited to simple and quick single-pass queries, many relational database systems will be able to handle the contention that is introduced. Frequently the most complex queries in a data warehouse access data across a long span of time. If these queries can be based only on the non-changing historical data, contention with the real-time load is eliminated.
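
As a hedged sketch of this split, the example below routes simple real-time reports to a small, actively loaded table and complex analysis to the unchanging historical table; the table names and the report flag are invented for illustration.

    import sqlite3
    from dataclasses import dataclass

    @dataclass
    class Report:
        needs_realtime: bool

    def run_report(db, report):
        if report.needs_realtime:
            # real-time reports stay simple: one quick single-pass query
            # against the small, actively loaded table
            return db.execute("SELECT COUNT(*) FROM facts_today").fetchone()
        # complex analysis touches only the non-changing historical table,
        # so it never contends with the continuous real-time load
        return db.execute("SELECT SUM(amount) FROM facts_history").fetchone()

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE facts_today (amount REAL)")
    db.execute("CREATE TABLE facts_history (amount REAL)")
    print(run_report(db, Report(needs_realtime=True)))
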
Another important consideration is to examine who really needs to be able to access the real-time information. While real-time data may be interesting to a large group of users within an organization, the needs of many users may be adequately met with non-real-time data, or with near-real-time solutions.
Also, many users who may be interested in real-time data may be better served by an alert notification application that sends them an email or wireless message alerting them to real-time data conditions that meet their pre-defined thresholds. Designed properly, these types of systems can be scaled to 100 or 1,000 times more users than could possibly run their own concurrent real-time warehouse queries (see Section 5 for more details).
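
A minimal sketch of such an alerting application, with invented subscriber addresses and thresholds, might look like this: one watcher checks each incoming value against every subscriber's pre-defined threshold, so no user has to run their own real-time query.

    def send_alert(user, value):
        # Stand-in for the e-mail or wireless notification channel.
        print(f"alert to {user}: value reached {value}")

    def check_alerts(new_value, subscriptions):
        # One watcher evaluates every subscriber's threshold per update,
        # instead of each user polling the warehouse directly.
        for user, threshold in subscriptions.items():
            if new_value >= threshold:
                send_alert(user, new_value)

    subscriptions = {"alice@example.com": 100, "bob@example.com": 250}
    for value in (80, 120, 300):    # values arriving from the real-time load
        check_alerts(value, subscriptions)
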
Solution 4b: Apply More Database Horsepower
There is always the option of adding more hardware to deal with scalability problems. More nodes can be added to a high-end SMP database system, or a stand-alone warehouse box can be upgraded with faster processors and more memory. While this approach may overcome short-term scalability problems, it is likely to be only a band-aid: real-time query contention often has more to do with the fundamental design of an RDBMS than with the system resources available.