"Linux Gazette...making Linux just a little more fun!"
Large Objects and Perl DBI
By Mark Nielsen
If this document changes, it will be available here: http://genericbooks.com/Literature/Articles/2 .
Index:
- Resources
- Introduction
- Using Perl DBI
- Using Large Objects and Perl
- Using the database server with a webserver.
- Comments
Resources
I assume that you have DBI and DBD::Pg installed with Perl.
- man DBI
- man DBD::Pg
- ftp://ftp.perl.com/CPAN/modules/by-category/07_Database_Interfaces/
- A search for "large" and "dbi" on one of the postgresql newsgroups.
- http://www.postgresql.org/doxlist.html -- A chapter on Large Objects in the Programmer's Guide for PostgreSQL.
- Two examples using psql which explain how to use large objects.
Introduction
I work at ZING Is Not GNU, where we sell and distribute free books and literature. Besides selling books inexpensively, we also want people to be able to download them, which means putting megabytes of data into a database. I have loaded data into databases before, but never more than about 100k per transaction; this was the first time I needed to store really large files. My biggest problem was that the process isn't well documented when you use Perl. I had used Large Objects with other database servers, but never with PostgreSQL.
What are Large Objects? Large Objects are "things" (text or binary data) which cannot fit into a normal field in a table. For example, a 100 megabyte file cannot really fit into a field in a table.
What were the technologies I was using? RedHat Linux, postgresql-6.5.3, Perl 5 (with DBI, DBD::Pg, and Pg), and apache_1.3.9.
In general, how do you use Large Objects in PostgreSQL? You store a large object much like a file: the database server imports the file and hands you back a number in return, and when you want to retrieve the data, you use that number to export it to a temporary file on your hard drive. It is kind of weird. Extracting a Large Object therefore takes two steps: first you copy the data to a file on your computer, and then you read that file. The drawback, from my point of view, is that the data is read twice: once to create the file and once to read it. As far as I know there is no way around this, but I am new to Large Objects in PostgreSQL (though I have used them in other database servers), so perhaps there are better ways.
Using Perl DBI
Perl is a programming language used by many web and database professionals. It took a couple of years, but people are finally recognizing that free and open software like Perl, PHP, and Python can serve commercial companies as an alternative to bloated, inefficient, and unstable proprietary programming languages (for which you usually don't have source code, so you can never be sure the software truly does everything it claims to do; there is one company I am talking about in particular). Being a contractor, I have noticed an explosion of Perl jobs (it pleases me greatly to see the rise of ethically clean software), so I am sure this will be useful to other contractors who prefer PostgreSQL over the alternatives. Perl DBI is a generic database interface that works with many database servers and files. To connect to a specific database, you also need a Perl DBD driver; for example, to connect to PostgreSQL, I use Perl DBI with Perl DBD::Pg. Here is a brief example that prints the first and last names from the table "people" in a PostgreSQL database.
#!/usr/bin/perl
use strict;
use DBI;
use vars qw($dbh);
### zing is the database I am connecting to.
$dbh ||= DBI->connect("dbi:Pg:dbname=zing")
    or die "Cannot connect: $DBI::errstr";
my $Command = "select first_name,last_name from people
               order by last_name,first_name";
my $sth = $dbh->prepare($Command);
my $Result = $sth->execute;
while (my @row_ary = $sth->fetchrow_array)
  {print "<br> $row_ary[0] $row_ary[1]\n";}
The problem with Perl DBI is that the Large Object interface isn't fine-tuned yet, so you have to use the database-specific DBD driver for each type of database. The problem with DBD::Pg is that Large Objects are not well documented, and it took me a long time searching newsgroups to find the answer I was looking for. Eventually, I also managed to find this note on Large Objects.
Using Large Objects and Perl
ONE WARNING: Temporary files are actually a good thing here, for the following reason: with big files, if you slurp all the data into memory at once, Perl running under Apache with mod_perl will claim a lot of system memory and won't give it back (even though Perl will reuse the memory itself). Imagine 10 people downloading 10 megabyte files while your Perl script loads each file into memory before printing it (instead of printing it line by line): Apache (really Perl) will hold 100 megabytes of system memory and won't release it. This can be bad. With other database servers I could load Large Objects directly into memory, but I would not want to do that with really large files anyway. Read this performance guide for more info.
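To make the line-by-line idea concrete, here is a minimal self-contained sketch (the file name and contents are invented for illustration; no database is involved) of streaming a file one line at a time instead of slurping it into memory:

```perl
#!/usr/bin/perl
# Sketch: stream a file line by line so only one line is ever in memory.
# The file and its contents are stand-ins created by the script itself.
use strict;
use warnings;
use File::Spec;

# Stand-in for a large exported file.
my $file = File::Spec->catfile(File::Spec->tmpdir(), "lg_demo.$$");
open(my $out, '>', $file) or die "Cannot write $file: $!";
print $out "line $_\n" for 1 .. 5;
close($out);

# BAD for big files:  my @whole = <$fh>;  reads everything at once.
# GOOD: read and print one line at a time.
open(my $in, '<', $file) or die "Cannot read $file: $!";
my $lines = 0;
while (my $line = <$in>) {
    print $line;    # under mod_perl this goes straight to the client
    $lines++;
}
close($in);
unlink($file);
```

The same loop works whether the file is five lines or five million; memory use stays flat either way.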
Okay, how do we import and export large objects from PostgreSQL? The following example comes straight from the PostgreSQL documentation and uses the program psql.
CREATE TABLE image (
name text,
raster oid
);
INSERT INTO image (name, raster)
VALUES ('beautiful image', lo_import('/etc/motd'));
SELECT lo_export(image.raster, '/tmp/motd') from image
WHERE name = 'beautiful image';
Now, we need to convert this into Perl. Here is a Perl script which does the exact same thing.
#!/usr/bin/perl
use DBI;
use vars qw($dbh);
$dbh ||= DBI->connect("dbi:Pg:dbname=zing")
    or die "Cannot connect: $DBI::errstr";
my $Command = "INSERT INTO image (name, raster)
VALUES ('beautiful image', lo_import('/etc/motd'));";
my $sth = $dbh->prepare($Command);
$sth->execute();
$Command = "SELECT lo_export(image.raster, '/tmp/motd') from image
WHERE name = 'beautiful image'";
$sth = $dbh->prepare($Command);
$sth->execute();
Those two examples do the following. The first command loads the file "/etc/motd" into a table. The second command takes the data in the table and exports it to a file called "/tmp/motd". If you want to get the "oid" of data in the table and export it to a file, you could also do this.
## This command will return a numeric "oid" number
$Command = "SELECT raster from image WHERE name = 'beautiful image'";
$sth = $dbh->prepare($Command);
$sth->execute();
my @row_ary = $sth->fetchrow_array;
my $Oid = $row_ary[0];
## This command will export the data with the "oid" to a file
$Command = "SELECT lo_export($Oid, '/tmp/motd')";
$sth = $dbh->prepare($Command);
$sth->execute();
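After lo_export has written the temporary file, a CGI script still has to send it to the browser. Here is a hedged, self-contained sketch of that last step (the temporary file and its contents are stand-ins created by the script itself, since a live database isn't assumed here):

```perl
#!/usr/bin/perl
# Sketch: stream an exported large object back to a browser.
# In real use, lo_export would have written $TempFile; here we fake it.
use strict;
use warnings;
use File::Spec;

my $TempFile = File::Spec->catfile(File::Spec->tmpdir(), "motd.$$");
open(my $out, '>', $TempFile) or die "Cannot write $TempFile: $!";
print $out "line one\nline two\n";
close($out);

# Send a header, then stream the file line by line.
print "Content-Type: text/plain\n\n";
open(my $in, '<', $TempFile) or die "Cannot read $TempFile: $!";
my $bytes = 0;
while (my $line = <$in>) {
    print $line;
    $bytes += length $line;
}
close($in);
unlink($TempFile);    # remove the temporary file as soon as we are done
```

Unlinking the file right after sending it is the simplest cleanup, when the webserver has permission to do so; the next section covers what to do when it doesn't.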
For live examples, ZING at www.genericbooks.com has all of its Perl scripts available for public viewing. Look for the scripts that view documents extracted from the database. ZING has now set up scripts that let people upload and download documents, so there should be real live examples floating around on the website.
Using the database server with a webserver.
Okay, what problems can you encounter when you use a webserver to input and output data from a database server? The biggest problem I saw was that if the database server and the webserver run under two different accounts (like "www" and "pgsql"), each may have trouble reading the other's temporary files. Here is a list of problems and their solutions.
- The webserver cannot upload the file to PostgreSQL because it doesn't have superuser privileges to use lo_import and lo_export. You can give the webserver its own database server (running under the same account), give the webserver superuser status with the database server, or look into the client-side commands concerning lo_import and lo_export, which I haven't had a chance to explore yet.
- The webserver cannot delete the exported files once the database server has exported them and they are no longer needed. You can run the webserver and database server under the same account, set up a "group" that lets the webserver delete the database server's files, or set up a cron job to delete the files periodically, naming them by the pid of the webserver process so that newer processes tend to overwrite them. ZING just names files by pid number so that the next time a document is pulled, it overwrites the previous one (which isn't needed anymore).
At ZING, we name exported files after the pid of the webserver's child process. When the webserver needs to export another document, it hands the request to one of its children, and if that child has already exported a document, the new export simply overwrites the previous one, so we don't end up with piles of undeleted export files. Then, once an hour, we go through and delete any files older than 15 minutes. It is not as elegant as I would like, but it works fine.
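That pid-plus-sweep scheme can be sketched in a few lines of Perl. This is a self-contained illustration, not ZING's actual script: the directory and file names are stand-ins, and the 15-minute threshold just follows the convention described above.

```perl
#!/usr/bin/perl
# Sketch: name exports after the child pid, and sweep away old ones.
use strict;
use warnings;
use File::Spec;

my $dir    = File::Spec->tmpdir();
my $export = File::Spec->catfile($dir, "export.$$");  # one file per child

# A fresh export for this child simply overwrites any previous one.
open(my $fh, '>', $export) or die "Cannot write $export: $!";
print $fh "exported large object data\n";
close($fh);

# Hourly cron-style sweep: delete exports older than 15 minutes.
my $max_age_days = 15 / (24 * 60);    # -M reports file age in days
for my $old (glob(File::Spec->catfile($dir, 'export.*'))) {
    my $age = -M $old;
    unlink($old) if defined $age && $age > $max_age_days;
}

my $survived = -e $export ? 1 : 0;    # the fresh export should still exist
unlink($export);
```

The sweep only needs delete permission on the export directory, which is why running it under the right group (or the same account) matters.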
In general, it is a bad idea to give your webserver superuser status or to run the webserver and database server under the same username. For security, you don't want your webserver to have the power to blow away the database server. Oh well.
Comments
I really don't like the way Large Objects are handled in PostgreSQL, or in any other database server I have used. There should be a way to treat importing and exporting data as STDIN or STDOUT, so you can grab the data line by line instead of the whole darn thing at once. I don't like having to use intermediate files to get at the data. It would be nice to choose between dumping data directly into memory and using temporary files when the data is too big to hold in memory; CHOICE is the key word here. What does this mean? Since PostgreSQL is the best free and open database server out there, help the guys out by becoming a developer for PostgreSQL! Perhaps there is a way to dump Large Objects directly into memory, and if there is, write a follow-up to this article and prove me wrong! Being able to get Large Objects one line at a time would be really cool.
Anyways, I will keep trying to find better solutions, so let me know if you hear of any!
Large Objects have always been a pain for me. I have always wanted to use Large Objects in PostgreSQL. Now that I had a reason, I finally did it. Hopefully it will save other people headaches. If it saved you from headaches, donate time or money to ZING, or do something charitable for some other cause! If we all do a little, it makes a big impact.
My next goal is to handle Large Objects in PostgreSQL using the programming languages PHP and Python. After that, I want to see if there is a way to use Large Objects without being a superuser, and then to compare PostgreSQL with MySQL. For licensing reasons, and because PostgreSQL has always been 100% free and open, I prefer PostgreSQL over MySQL, but I want to compare and contrast them to help make PostgreSQL better. Recently, MySQL has relaxed its license, but I am sticking with PostgreSQL since it has always had the best license.
Mark Nielsen works for The Computer Underground as a receptionist and as a book binder at ZING. In his spare time, he does volunteer stuff, like writing these documents for The Linux Gazette (and other magazines).
Copyright © 2000, Mark Nielsen
Published in Issue 50 of Linux Gazette, February 2000