Most of the DBAs are involved in daily routine jobs such as checking backups, security issues, performance tunings and many other tasks that are defiled to be line of DBA responsibilities.
On the other hand Data Warehousing is a challenge for database developers and programmers. Import, export and processing data is part of each DW task. I found myself very interested and amused when I’m involved in such a task. Most of the small companies don’t have a dedicated DW developer, so the requests are directed to DBA, and the expectation is that we (as DBAs and Data guys) should be able to complete any requests in a short amount of time.
In a world that each company is trying to expand and inquires another companies and competitors, it’s obvious that synchronizing the technologies of those companies are challenges for all IT personal, and when it comes to Data and information synchronization, then DBA’s and DW developers have important role to be able to provide solid data synchronization and reporting solutions.
I was involved in many projects that importing and processing the data from other databases and even from other platforms were essential for business owners. We had multi platform production database servers and DW farms. It’s obvious that reports are of the most imports piece for management. We had to create a solid DW and reporting solution which be able to provide real-time and financial reports on both sides. We start by building a new DW farm, utilizing MSSQL 2005 SSAS, SSRS and SSIS. Loading data from MySQL and Oracle databases, in a daily or hourly base, becomes a challenge for me and after many attempts and different approaches; I was able to build and maintain a solid DW farm.
Most of the schemas I was importing had INSERTDATETIME (Timestamp) field. I took 2 different approaches and used each in different scenarios. I will try to describe each method below and list the pros and cons for each one.
Importing the records within a certain period of time
Importing records based on the previous import’s timestamp
1- Importing the records within a certain period of time
In this method, I built a table where I keep the history of the pull (or push) . StartTime / EndTime is basically the range that records are affected. The insert query will look like this:
Insert into “TargetTable”
(Field1, Field2,Field3)
Select Field1, Field2,Field3 from “SourceTable” where InsertDateTime >= @StartTime and InsertDateTime < @EndTime As you can see only records in the range are selected. StartTime is always calculated based on the previous successful history, plus the desired Interval. Select @interval = interval from PackageInformation Select top 1 @StartTime = EndTime , @EndTime = EndTime + @interval from PackageExecutionHostory order by InsertDateTime desc After defining the processing date range, package will execute and upon successful completion the new date range will be recorded. This will become the start point for next time. 2-Importing records based on the previous import’s timestamp In the method, I record only the max of timestamp, which is processed. Select top 1 @StartTime = LastMaxDate from PackageExecutionHistory order by InsertDateTime desc The insert query will look like this: Insert into “TargetTable” (Field1, Field2,Field3) Select Field1, Field2,Field3 from “SourceTable” where InsertDateTime >= @StartTime
For recording successful transactions history, will need to run another query as :
Select @LastMaxDate = max(TimeStamp) from TargetTable
Inert into PackageExecutionHistory (LastMaxDate, InsertDateTime) values (@lastMaxDate,getdate())
I need to mention that each of these processes are depending on Job Schedule and will be executed from a Job.
Pros & Cons of each method.
Importing records based on the previous import’s timestamp
No matter when you run the job, all the records, since last execution, will be pulled (or inserted)
No control for pulling records between a certain periods of time. If there are high volumes of transactions, this might be a slow process. Inserting and selecting large number of records is always a bottleneck.
If any record in target table is getting inserted, while the select statement is trying to prepare the data set, then there is potential deadlock situation.
It’s possible to not capture some of the records if using select statement with (nolock) option to prevent performance impact.
Importing the records within a certain period of time
Certain data range to be selected
The data size is mostly predictable and by tuning interval period, it’s possible to minimize the impact of loading large data set
If the SQL job is not executed, the next job will pull only the records which are within the range of selection, which were supposed to be pulled by earlier job. This is a potential problem if the SQL jobs are not executed for a while, and will cause delay on data delivery.
Need more controlling and trapping procedures to prevent false execution on not valid date range.
No comments:
Post a Comment