29
Aug 10

The poor man’s ETL (Python)

So many times I’ve been faced with a project to perform ETL (Extract, Transform, Load) between two or more databases. And everytime it grates at my soul because there’s just no good way to handle it. On the one hand, you can use some third party tool that probably comes with a hefty license fee, and comes loaded with wizards and its own custom syntax for handling external logic (Because ETL would be too easy if it was just a one-to-one transfer).

OR, you can go with the other, more popular method: Scripting the entire painstaking process by hand. You’ve probably had to deal with this, too. You need to get Table A from Database 1 into Tables B and C in Database 2. But wait, Table C takes a non-nullable value for one of its columns, but Table A allows null values. So for this one case you have to write additional logic.

As it turns out, you end up having to write additional logic for most parts of the system, and you have to maintain it. Time passes, schemas change, and so will the monkey-job of spaghetti code that handles the process.

Faced with this task yet again, I decided to try something different.

Now, I want to make it clear that I am not in any way endorsing the ORM philosophy. I will neither extol its virtues nor berate its flaws in this article. I will also not claim that this code is pretty, or should ever be used by another human being. Maybe a chimp. It’s merely a prototype that made a dull job slightly more interesting. I’m posting it here because I happen to think it was a neat way to solve the problem and there’s many places I can re-use this code with minimal changes and effort.

How it’s laid out:

def setDefaults(result):
    for row in result:
        row['campaign_name'] = 'Default Campaign'
        row['sec_mode'] = 3 if row['url'].find('https://', 0, 8 ) else 2
    return result

etl(
   self.oracle
).table(
    name='wc_active_listing', alias='wca'
).join(
    name='campaign', alias='c',
    predicate='wca.campaign_id=c.campaign_id'
).fetchAll(
   'status = 100'
).callback(setDefaults).into.database(
    self.mssql
).table(
    name='wc_active_listing', alias='bsa', primary_key=['listing_id','category_id'],
    mapping = {
        'listing_id': 'listing_id',
        'category_id': 'category_id',
        'bid': 'bid',
        [....]
    }
).commit()

It’s inspired by LINQ, but is much, much simpler. Currently it supports multiple callbacks, inner joins, and outputs to another database connection.
It’s actually more complex than it needs to be, but I encountered many issues with the pymssql package, so a few work-arounds had to be made. You’ll laugh or cry when you see them.
I’ll also note that due to our server being SQL Server 2005, we didn’t have access to MERGE either.

The full code, in all it’s hideous glory:

class etl(object):

    _instance = None

    def __init__(self, connection, result = None):
        etl._instance = connection
        self._instance = connection
        self._result = result

    def table(self, name, alias = None, result = None, mapping = None, columns = None, primary_key = None):
        if self._result is not None and mapping is not None:
            return insert(
                conn=self._instance,
                name=name,
                alias=alias,
                result=self._result,
                mapping=mapping,
                primary_key=primary_key)

        return select(
            conn=self._instance,
            name=name,
            alias=alias,
            mapping=mapping,
            columns=columns)

    def query(self, sql):
        return query(
            conn=self._instance,
            sql=sql)

    @staticmethod
    def fetchAll(sql, params = None):
        """Returns a list of dicts with columns-as-keys from a query."""
        cursor = etl._instance.cursor()
        cursor.execute(sql, params if params is not None else {})

        rows = cursor.fetchall()
        desc = cursor.description
        result = list()
        for data in rows:
            if data == None:
                return None

            #Get the field descriptors for this row
            dic = {}

            #zip returns a list of tuple where each nTh item
            #contains the nTh item of each argument.
            for (name, value) in zip(desc, data):
                dic[name[0].lower()] = value
            result.append(dic)

        print "Found %d rows" % len(result)
        print "---------------------------------------"

        return result

class select(object):

    def __init__(self, conn, name, alias, mapping, columns = None):
        column_list = '*'

        if type(columns) == type([]):
            column_list = ', '.join(columns)

        alias_sql = ' ' + alias + ' ' if alias is not None else ' '
        self.sql = "SELECT " + column_list + " FROM " + name + alias_sql
        self.database = conn
        self.name = name
        self.alias = alias

    def join(self, name, predicate, alias = None):

        alias_sql = ' ' + alias + ' ' if alias is not None else ' '
        self.sql += 'JOIN ' + name + alias_sql + 'ON (' + predicate + ') '

        return self;

    def fetchAll(self, predicate = None, params = None):
        predicate_sql = ' WHERE ' + predicate if predicate is not None else ' '
        self.sql += predicate_sql
        print "----------------------------------------"
        print "SQL: " + self.sql
        print "----------------------------------------"

        return result(etl.fetchAll(self.sql, params))

class query(object):

    def __init__(self, conn, sql):
        self.sql = sql
        self.database = conn

    def fetchAll(self, params = None):
        print self.sql
        return result(etl.fetchAll(self.sql, params))

class insert(object):
    def __init__(self, conn, name, alias, result, mapping, primary_key):
        """Blame the crappy freetds support for sql server on this mess."""
        self.database = conn
        self.cursor = self.database.cursor()

        primary_key_sql = ''
        if type(primary_key) == type(''):
            primary_key = [primary_key]

        for key in primary_key:
            primary_key_sql += key + ' = %s AND '

        primary_key_sql += '1=1' #I'm lazy

        update_sql = "UPDATE " + name + " SET %s WHERE "
        insert_sql = "INSERT INTO " + name + " (%s) VALUES (%s)"
        select_sql = "SELECT 1 FROM " + name + " WHERE "

        i = 0
        for row in result:
            i = i+1
            nMap = dict()
            for (key, val) in row.items():
                if key in mapping:
                    nMap[ mapping[key] ] = str(val)

            #Because pymssql sucks and rowcount doesn't work, we have to do a select
            self.cursor.execute(select_sql + primary_key_sql, tuple([nMap[key] for key in primary_key]))
            row_exists = True if self.cursor.fetchone() is not None else False

            print select_sql + primary_key_sql % tuple(nMap[key] for key in primary_key)

            if row_exists:
                #Do an update first, then check rows affected
                col_map_list = [x + '=%s' for x in nMap.keys()]
                col_map = ', '.join(col_map_list)

                print "-----------------------------------------"
                print "Row #%d" % i
                print "SQL: " + ((update_sql % col_map) + primary_key_sql) % (tuple(nMap.values() + [nMap[key] for key in primary_key]))

                self.cursor.execute(
                    ((update_sql % col_map) + primary_key_sql),
                    tuple(nMap.values() + [nMap[key] for key in primary_key])
                )
                print "Affected: " + str(self.cursor._source.rows_affected)

            else:
                #Do the insert if nothing was updated
                col_map = ("%s, " * len(nMap)).rstrip(', ')
                print "SQL: " + (insert_sql % (", ".join(nMap.keys()), col_map)) % tuple(nMap.values())

                try:
                    self.cursor.execute(
                        insert_sql % (", ".join(nMap.keys()), col_map),
                        tuple(nMap.values())
                    )
                except Exception as e:
                    print e

                print "-----------------------------------------"

    def commit(self):
        return self.database.commit()

    def rollback(self):
        return self.database.rollback()

class result(object):
    def __init__(self, res):
        self._result = res
        self.into = into(res)

    def callback(self, callback):
        self.into = into(callback.__call__(self._result))
        return self

class into(object):
    def __init__(self, result):
        self._result = result

    def database(self, connection):
        return etl(
            connection=connection,
            result=self._result)

29
Aug 10

Another CSS Tab tutorial

I’ve noticed that many times when a page’s content calls for tabs, the most often used approach is something like the following:

<div class="tabs">
	<a href="?tab=tab1" class="tab activeTab">Tab 1</a>
	<a href="?tab=tab2" class="tab">Tab 2</a>
	<a href="?tab=tab3" class="tab">Tab 3</a>
	<a href="?tab=tab4" class="tab">Tab 4</a>
</div>

There’s nothing inherently wrong with this approach, but it tends to involve more logic than is necessary to solve the problem. With this system, you have to write logic that tracks each individual tab’s state from request to request. Something that looks like this:

<?php
	$activeTab = 'tab1';
	if(isset($_GET['tab']))
	{
		$activeTab = $_GET['tab'];
	}
	$tabs = array(
		'tab1' => 'Tab 1',
		'tab2' => 'Tab 2',
		'tab3' => 'Tab 3',
		'tab4' => 'Tab 4'
	);

	echo '<div class="tabs">';
	foreach($tabs as $tab=>$label)
	{
		if($activeTab == $tab)
		{
			echo '<a href="?tab='.$tab.'" class="tab activeTab">'.$label."</a>\n";
		}
		else
		{
			echo '<a href="?tab='.$tab.'" class="tab">'.$label."</a>\n";
		}
	}
	echo "</div>\n";
?>

It doesn’t have to be like this. A trick I picked up (from a very talented coworker) allows for a more elegant solution.

Consider the following HTML:

<div class="tabs tab1">
	<a href="?tab=tab1" class="tab tab1">Tab 1</a>
	<a href="?tab=tab2" class="tab tab2">Tab 2</a>
	<a href="?tab=tab3" class="tab tab3">Tab 3</a>
	<a href="?tab=tab4" class="tab tab4">Tab 4</a>
</div>

Now, through the magic of CSS we can make something uh…”magic”… happen. Here’s the CSS:

.tabs { overflow: hidden; }
.tabs .tab {
	float: left;
	margin-left: 1px;
	padding: 5px;
	background: #EEE;
}

div.tab1 a.tab1,
div.tab2 a.tab2,
div.tab3 a.tab3,
div.tab4 a.tab4
{
	background: #D4ED9F;
}

We can control the graphical state of the active tab simply having CSS determine a tab’s style according to its parent class.

<div class="tabs <?php echo $activeTab; ?>">
The active tab can be changed by modifying one class name

The active tab can be changed by modifying one class name

There’s likely a name for this technique that I’m ignorant to, so feel free to point it out to me. I in no way claim to have invented this method, but I do see many developers that don’t know CSS enough to leverage it like this. Hopefully this will be helpful to some of you out there.


01
Jul 09

Why does SOAP leave me feeling so dirty?

I’ve had my head buried in SOAP for the last few months. In particular, ext/soap with PHP. There’s a great opportunity out there for PHP find itself as an incredibly easy and robust way to implement a web service environment through SOAP. Unfortunately, the PHP5 soap extension has let this opportunity slip by again and again.

Sure, it’s fairly easy to implement a simple rpc/encoded SOAP framework, especially if you leverage the Zend_Soap (and Autodiscover) classes. If you plan on only consuming the web service with PHP (or even perl and python) then it’s an excellent choice. But the whole point of a webservice is to offer interoperability. This is something where SOAP itself falls short.

The SOAP “standard” isn’t a standard at all. Rather, it’s a generic, fuzzy guideline for implementing one of a dozen other standards. You’ve got your rpc/encoded, your rpc/literal, document/literal, document/encoded and document/literal wrapped conventions. Each offers various caveats and perks.

Here’s where ext/soap comes in. While the rest of the world has been trying to come up with a SOAP implementation that everyone can agree upon, PHP’s soap extension has been fairly stagnant. The package maintainer, Dmitry, does some casual bug fixes but overall there are several gaping holes that leave a lot to be desired.

As it turns out, most of the world HAS agreed on a SOAP implementation. It’s called “document/literal wrapped”. The wrapped portion dictates that in the WSDL, a binding operation’s parameters must refer to a complex type that defines those parameters, thus avoiding a nasty issue with soap-enc:Array. I won’t attempt to explain the breadth of the issue here, but a good article can be found at http://atmanes.blogspot.com/2005/03/wrapped-documentliteral-convention.html

For the last few months I’ve been in a back-and-forth match between PHP, Java and .NET. The struggle has been in getting a WSDL implementation that can play nice with PHP, Perl, Python, as well as the more robust languages like Java and .NET. It’s not just a casual “wish”, it’s part of my project requirements.

The problems have been numerous. Firstly, Java doesn’t support RPC/Encoded anymore. It doesn’t even support vanilla document/literal, in fact; it requires the wrapped convention to even be able to compile. .NET had similar issues. Unfortunately, my WSDLs were being auto generated by Zend_Soap_Autodiscover, which, as of writing this, is incapable of producing WSI compliant WSDLs.

And so began the hacking. I’ve rearranged, rewritten and added a very large chunk of code to the Zend Framework/Soap code base. I was finally able to get a compliant WSDL out of PHP. It compiled in Java and .NET just fine. It worked in python, perl, even ruby. And then I checked PHP.

So far PHP had been pretty agnostic when it came to my WSDL format. It didn’t seem to care whether or not it was in rpc or document, or even if my namespaces mismatched (Something which python and perl DID care about). However, after implementing my “wrapped” changes, something funny happened. PHP was now throwing a SoapException from the client, telling me that the procedure I was calling didn’t exist. A quick var_dump() of SoapClient::__getFunctions() showed that they were indeed there, but PHP couldn’t interact with them for some reason.

After some quick googling, I was able to turn up a few PHP bug reports describing exactly my issue. Most of these bugs were over two years old, with virtually no feedback from the maintainer. An example is http://bugs.php.net/bug.php?id=30302

I contacted Andi and Dmitry personally to ask about the status of the issue. The response I got wasn’t very optimistic:

Hi Matt,

It’s not supported, because it wasn’t a part of SOAP specification (I’m not sure about present time).

I didn’t plan to implement it. May be someone else…

Thanks. Dmitry.

So when I have some time, basically my only option is to implement it myself, and submit the patch to internals.

The problems with ext/soap are far more numerous than just that, however. In the past few months I’ve encountered issues that lead me to believe the people writing these implementations know even less about SOAP than I do. Here are the main two issues I encountered:

  • ext/soap has no support for retrieving Soap Headers. You’re stuck with one of two hacks. You can either pass an object through from the client that maps to a method call on the server — which is less than helpful when using headers for authentication — or parse the XML from HTTP_RAW_POST_DATA manually in your SOAP request. I opted for the latter.
  • ext/soap’s fault support is buggy and inconsistent. Defining custom Soap Faults in an intelligent way means hacking the response envelope.

In conclusion, PHP’s soap extension really needs a complete rewrite. I’m probably not the person to do it, but it’s an idea I’ve been toying with. It’s just not ready to be used in a production environment.


01
Jul 09

Compiling the OCI8 and PDO_OCI PHP extensions on OS X and Linux

Yet another case of too much misinformation polluting the illustrious intarwebs, this article will, very briefly, discuss how to compile PHP with oracle support.

Firstly, if you haven’t already, follow my guide for installing the oracle instantclient (found here). Once that’s done, you’re ready to begin.

Step 1: Download PHP

You’ll need to get the PHP source from http://www.php.net/download. Extract the source into a location of your choice. cd into the ext/ directory in the source tree.

1
cd /tmp/src/php-5.3.0/ext

Step 2: Compile oci8

Now cd into the ext/oci8 directory. Complete the following steps:

1
2
3
phpize
./configure --with-oci8=instantclient,$ORACLE_HOME,10.2
make && make install

Be sure that you have the three environment variables discussed in the previous article set. Those were $ORACLE_HOME, $TNS_ADMIN, and $LD_LIBRARY PATH (or $DYLD_LIBRARY_PATH for OS X)

If you installed a different version of the instantclient, replace the ’10.2′ in the configure line with your version. The correct version can be quickly obtained by glancing at the files in $ORACLE_HOME (for instance, the file libocci.so.10.2 would indicate the 10.2 release)

Step 3: Installing PDO_OCI

The steps for installing PDO_OCI are basically the same, with one minor revision. cd into the ext/pdo_oci directory, and complete the following steps:

1
2
3
phpize
./configure --with-pdo_oci=instantclient,$ORACLE_HOME,10.2
make && make install

Again, make sure that your environment variables are set. This is imperative; without those variables set PHP doesn’t know where to look for the appropriate libraries.

Step 4: Edit php.ini

The final step is to edit php.ini, and add the following lines

1
2
extension=oci8.so
extension=pdo_oci.so

Once that’s done, you can test your installation by running the following:

1
php -r '$db = new PDO("oci:dbname=dbname", "user", "password");'

If you get an error, check to make sure you’ve followed each step, and that $ORACLE_HOME , $TNS_ADMIN and $LD_LIBRARY_PATH or $DYLD_LIBRARY_PATH are set.

Restart your web server and enjoy.


01
Jul 09

Installing the oracle client on OS X and Linux

There are a lot of existing tutorials for installing the oracle instant client out there, but many of them require you to go through several unnecessary steps, or are simply wrong. This leads to the misconception that installing the instant client is hard, lengthy, and generally a huge pain in the ass.

In reality, it takes about two minutes and is incredibly simple. I’ll concede that, like any piece of software, some of you will run into an edge case where my method doesn’t work. I had a fair amount of trouble getting it to work on SuSE64, but in the end it turned out to be a conflict with an existing RPM (Which no one else should really run into, since it was a proprietary RPM)

In this article I’ll be covering how to install the oracle instant client on linux (32 or 64 bit) and OS X.

So lets begin.

Step 1: The O-Files

The first thing you’ll need to do is get the zip files for the client. These can be found at http://www.oracle.com/technology/software/tech/oci/instantclient/index.html

Select the appropriate package for your target operating system. You’ll then have a choice of files to download (after agreeing to Oracle’s License Agreement).

There’s four or five packages available on this page. For our purposes, we’ll be getting the Instant Client Package – Basic and the Instant Client Package – SDK files. You may also choose to download the SQL*Plus package.

Note that you’ll need to sign up for an account in order to download these.

When you have these files downloaded, choose a location for the instantclient to live. In my case, I use /opt/local/oracle/instantclient

Copy the files you downloaded to /opt/local/oracle and unzip. This will create the directory /opt/local/oracle/instantclient_x_x (where x_x is the version number). You can keep the directory as-is, or rename to instantclient. In my example, I’ve renamed the directory.

The entire process should look like this:

matt@localmachine:/opt/local/oracle$ unzip instantclient-basic-10.2.0.4.0-macosx-x86.zip
Archive:  instantclient-basic-10.2.0.4.0-macosx-x86.zip
  inflating: instantclient_10_2/BASIC_README
  inflating: instantclient_10_2/classes12.jar
  inflating: instantclient_10_2/genezi
  inflating: instantclient_10_2/libclntsh.dylib.10.1
  inflating: instantclient_10_2/libnnz10.dylib
  inflating: instantclient_10_2/libocci.dylib.10.1
  inflating: instantclient_10_2/libociei.dylib
  inflating: instantclient_10_2/ojdbc14.jar
  inflating: instantclient_10_2/libocijdbc10.dylib
  inflating: instantclient_10_2/libocijdbc10.jnilib
matt@localmachine:/opt/local/oracle$ unzip instantclient-sdk-10.2.0.4.0-macosx-x86.zip
Archive:  instantclient-sdk-10.2.0.4.0-macosx-x86.zip
   creating: instantclient_10_2/sdk/
   creating: instantclient_10_2/sdk/demo/
  inflating: instantclient_10_2/sdk/demo/cdemo81.c
  inflating: instantclient_10_2/sdk/demo/demo.mk
  inflating: instantclient_10_2/sdk/demo/occidemo.sql
  inflating: instantclient_10_2/sdk/demo/occidemod.sql
  inflating: instantclient_10_2/sdk/demo/occidml.cpp
  inflating: instantclient_10_2/sdk/demo/occiobj.cpp
  inflating: instantclient_10_2/sdk/demo/occiobj.typ
   creating: instantclient_10_2/sdk/include/
  inflating: instantclient_10_2/sdk/include/nzerror.h
  inflating: instantclient_10_2/sdk/include/nzt.h
  inflating: instantclient_10_2/sdk/include/occi.h
  inflating: instantclient_10_2/sdk/include/occiAQ.h
  inflating: instantclient_10_2/sdk/include/occiCommon.h
  inflating: instantclient_10_2/sdk/include/occiControl.h
  inflating: instantclient_10_2/sdk/include/occiData.h
  inflating: instantclient_10_2/sdk/include/occiObjects.h
  inflating: instantclient_10_2/sdk/include/oci.h
  inflating: instantclient_10_2/sdk/include/oci1.h
  inflating: instantclient_10_2/sdk/include/oci8dp.h
  inflating: instantclient_10_2/sdk/include/ociap.h
  inflating: instantclient_10_2/sdk/include/ociapr.h
  inflating: instantclient_10_2/sdk/include/ocidef.h
  inflating: instantclient_10_2/sdk/include/ocidem.h
  inflating: instantclient_10_2/sdk/include/ocidfn.h
  inflating: instantclient_10_2/sdk/include/ociextp.h
  inflating: instantclient_10_2/sdk/include/ocikpr.h
  inflating: instantclient_10_2/sdk/include/ocixmldb.h
  inflating: instantclient_10_2/sdk/include/odci.h
  inflating: instantclient_10_2/sdk/include/oratypes.h
  inflating: instantclient_10_2/sdk/include/ori.h
  inflating: instantclient_10_2/sdk/include/orid.h
  inflating: instantclient_10_2/sdk/include/orl.h
  inflating: instantclient_10_2/sdk/include/oro.h
  inflating: instantclient_10_2/sdk/include/ort.h
  inflating: instantclient_10_2/sdk/include/xa.h
  inflating: instantclient_10_2/sdk/ott
 extracting: instantclient_10_2/sdk/ottclasses.zip
  inflating: instantclient_10_2/sdk/SDK_README 

matt@localmachine:/opt/local/oracle$ mv instantclient_10_2 instantclient

Step 2: The missing link

Now that the files have been unzipped, there’s a key step here that most people over look. By default, the files included in the package are named with their version number prepended to the library file. This means that any piece of software looking for it won’t find it, because the filename has something like .10.1 after it.

The files you’ll need to symlink differ on Linux and OS X. In particular, on linux, the files are .so, whereas on OS X, the files are .dylib

The two files that need to be symlinked are libocci, and libclntsh.

On OS X:

matt@localmachine:/opt/local/oracle/instantclient$ ln -s libocci.dylib.10.1 libocci.dylib
matt@localmachine:/opt/local/oracle/instantclient$ ln -s libclntsh.dylib.10.1 libclntsh.dylib

On Linux:

matt@localmachine:/opt/local/oracle/instantclient$ ln -s libocci.so.10.1 libocci.so
matt@localmachine:/opt/local/oracle/instantclient$ ln -s libclntsh.so.10.1 libclntsh.so

This is one of the most important parts of the installation, so it’s important not to forget it.

Step 3: There’s no place like $ORACLE_HOME

Now that you have your files in the correct location, you need to set a few environment variables. The first two are the same across all platforms, but the third differs depending on whether or not you’re using Linux or OS X.

On OS X:

matt@localmachine:/opt/local/oracle$ export ORACLE_HOME=/opt/local/oracle/instantclient
matt@localmachine:/opt/local/oracle$ export TNS_ADMIN=$ORACLE_HOME
matt@localmachine:/opt/local/oracle$ export DYLD_LIBRARY_PATH=$ORACLE_HOME

On Linux:

matt@localmachine:/opt/local/oracle$ export ORACLE_HOME=/opt/local/oracle/instantclient
matt@localmachine:/opt/local/oracle$ export TNS_ADMIN=$ORACLE_HOME
matt@localmachine:/opt/local/oracle$ export LD_LIBRARY_PATH

Additionally, it would be wise to place these variables in a script that will be run when a user logs in. I generally keep these in /etc/profile (assuming your using a bash shell). ~/.bashrc or ~/.profile are two other good locations, but those are on a per user basis. Anything and everything that will use the instantclient needs these variables set (that includes the user who runs apache)

Step 4: TNSNames.ora (Not very clever, is it?)

Next you’ll need to set up your tnsnames.ora file. If you don’t know what this is, check with your DBA. If you don’t have a DBA and you’re not sure what it is, you probably won’t be using it, and you can skip this step.

Place your tnsnames.ora file in the location that you installed the instantclient. This should be the same as the path that $TNS_ADMIN points to. I have noticed that certain applications (like SQLDeveloper) seem to disregard the $TNS_ADMIN variable (or it simply doesn’t see it) and looks for the file in your home directory. For that case, you can provide a symlink to the file in ~

matt@localmachine:/opt/local/oracle/instantclient$ ln -s tnsnames.ora ~/tnsnames.ora

And that’s it. You’re done.