SSIS – its the metadata that’ll kill ya

Introduction

There is something to be said for older computer languages.  They didn’t try to protect you from yourself and do a lot of thinking for you. Because they’re not very smart, they can be far more fault tolerant. Take for example, Perl. Anybody remember what that was originally said to stand for? That’s right, the “Practical Extraction and Reporting Language.” The original ETL, although, Larry Wall, doesn’t even think of it that way.  (He’d be the guy who invented Perl if you’re wondering).

Just with any language, there are those who will defend it to the last.  Hey, there’s still COBOL guys out there somewhere.  But, Perl is slipping off its perch a bit. It’s not used for most new web applications anymore, Powershell is replacing it as a batch language on Windows platforms, and most new programmers don’t know how to spell it let alone want to use it. Over the years, I’ve found it to be very useful to quickly grab data from one place, twist it a bit and put it in another place. Yep, that’d be ETL alright. But, when I stack it up against another newcomer — if it can still actually be considered new — SSIS, the performance from Perl isn’t on the same level. Unless of course, I know nothing about the data.

Huh? What a minute. Isn’t SSIS supposed to be better at that? It figures out the metadata for you and then does the rest, doesn’t it? From a development standpoint that might be true. However, data needs change over time. Data definitions change over time. And, what once was easy to develop with SSIS is now breaking in a production environment where the Perl version that does the same thing and simply chugs along.

See, it’s that need to know everything about the metadata that shoots SSIS (and it’s developers) in the foot.

Contrived Example Time

Let’s say, you (App Guy) build this wonderful SSIS package to quickly grab data and move it somewhere else. Further, you built it to replace this old, antiquated Perl job that the IT dinosaur (we’ll call him T-Rex) in the corner cube keeps saying is just fine the way it is.  Only problem is that T-Rex is the only guy in the department capable of reading all those regular expressions and no one else wants to touch it. The solution is to use this shiny WYSIWYG ETL tool that Microsoft is kind enough to ship for free with SQL Server — SQL Server Integration Services. Looks simple enough that even someone fresh out of school could use it. The stored proc ol’ T-Rex uses to get the data out of the database for his Perl job works just fine, so App Guy decides to re-use that. The new package is constructed and tested in days. It gets deployed to production and is running in parallel with the old Perl job just to make sure the execution is the same.

App Guy [smiles to himself]:”T-Rex couldn’t have done it that fast.”

Now the fun begins. Someone (Data Designer Guy) decides that the AdventureWorks.HumanResources.Employee table should be accessed only through a view for security reasons. Not such a bad idea. So, the proc is modified to select from our new view: AdventureWorks.HumanResources.vEmployee2 (Why ‘2’? ‘Cuz we want to change it for our contrived example without messing up the existing vEmployee).

create view HumanResources.vEmployee2
as
SELECT [EmployeeID]      
,[NationalIDNumber]      
,[ContactID]      
,[LoginID]      
,[ManagerID]      
,[Title]      
,[BirthDate]      
,[MaritalStatus]      
,[Gender]      
,[HireDate]      
,[SalariedFlag]      
,[VacationHours]      
,[SickLeaveHours]      
,[CurrentFlag]      
,[rowguid]      
,[ModifiedDate]  
FROM [AdventureWorks].[HumanResources].[Employee]
GO

Which gives us the following view when we look at it in the Object Explorer of SSMS:



Great! The procedure now simply selects from the view and it gets all the same data it did from the table with all the same data types.  Perl still works and so does the SSIS package.

App Guy: “One step closer to making your Perl job extinct, T-Rex.”

Now, Data Designer Guy comes back and notices the data in the Title column has underscores instead of spaces in it.

Data Designer Guy: “That’s so 1990’s. It’s been a new millenium for 11 years now, it’s about time we put spaces in there.”

So, Data Designer Guy leaps into action and quickly updates the data in the table using REPLACE(). (Hey, I said it was contrived and there are still way too many people with access to production systems out there.)

App Guy: “Wait! SSIS might be able to handle a change to the data, but I can’t refactor the app that uses the data fast enough to handle this change and it will blow up when it tries to parse a string with a space!”

Data Designer Guy: “No worries. This is another reason why we created the view. We’ll just modify the view to put the underscores back in there for this app, we’ll create a new view for the new apps and no one will have to change any code.”

App guy: “Really? That’s awesome! I’ve got so many other projects to worry about. Data Designer Guy, you rock!”

T-Rex [whispers to himself]: “Careful what you wish for.”

And Data Designer Guy executes the following code:

alter view HumanResources.vEmployee2
as
SELECT [EmployeeID]      
,[NationalIDNumber]      
,[ContactID]      
,[LoginID]      
,[ManagerID]      
,REPLACE([Title], ' ', '_') as 'Title'      
,[BirthDate]      
,[MaritalStatus]      
,[Gender]      
,[HireDate]      
,[SalariedFlag]      
,[VacationHours]      
,[SickLeaveHours]      
,[CurrentFlag]      
,[rowguid]      
,[ModifiedDate]  
FROM [AdventureWorks].[HumanResources].[Employee]
GO

Beautiful. To test it out, Data Designer Guy executes the proc inside SSMS and viola, the same data is returned. The Perl job runs and reads the data with no issues whatsoever.  The SSIS package, however, blows up and returns the following error:

Warning: 0x800470C8 at DTSTask_DTSDataPumpTask_1 1, OLE DB Destination [1296]: The external metadata column collection is out of synchronization with the data source columns. The column “Title” needs to be updated in the external metadata column collection.

Data Designer Guy: “Huh? I’m returning the exact same column I was before, I just have underscores rather than spaces. How could it be out of sync? The Perl code runs fine. What the hell is wrong with SSIS?”

App Guy: “Data Designer Guy, you suck!”

T-Rex: “BWA-HA-HA-HA!!!”

And, here’s the problem

SSIS runs through it’s validation phase, making sure that everything it thinks it knows about the metadata is still valid.  And, despite the Perl code working just fine the metadata has indeed changed thanks to the REPLACE() function call. Object Explorer is now displaying:

While Title is still a nvarchar data type, it’s length is quite different. Unless you cast the results of the REPLACE() function call, you’re going to get the maximum sized nvarchar available because that’s what REPLACE() returns, or nvarchar(4000). The function doesn’t know that you’re simply changing instances of one character for another character. You could be replacing one character for the the Preamble to the Constitution. So, the definition of the length of the datatype in the view will change and SSIS really doesn’t like change. Perl on the other hand just treats it as a string like it always has and moves merrily on its way.  And while Data Designer Guy sat there swearing at SSIS and App Guy was swearing at Data Designer Guy, T-Rex sat in the corner laughing at you all the way. If you’re the actor playing all 3 roles, it’s time for your psych eval.

The moral of the story

So, the moral is we should never use SSIS and stick with Perl forever, right? Despite T-Rex’s agreement, that’s not the moral at all. The moral is this: Be very attentive to the data types being returned, not just the data. The metadata is SSIS’s interface contract. When you change the datatypes (even the length of them), SSIS thinks you’ve violated the contract and stops running until you update the contract. When you execute scalar functions against returned columns, it is a wise idea to make sure you check more than just the data the query is returning as the datatype itself may have changed. The function may not return the same datatype you thought it would.

4 thoughts on “SSIS – its the metadata that’ll kill ya

  1. I have a gdg file extension as ‘G0001V00’. How can i read this file through ETL SSIS and create a output file?

Leave a comment