
I am currently importing a lot of data into a Postgres Database. Initially things where quickish, but now it seems to be running very slowly. I put this down to index scans. The machine is a Dual P4 with 2 GB of ram and a Raid array (14 Fibre Channel Harddrives) Top is reporting the disk cache is 1.7 GB and system idle around 70%. The postgres backend is using 10% to 20% of one CPU. I don't know if the bottleneck is the Disk IO or Ram. What would people suggest I can look at to see what is holding up the process?

Raymond Burgess wrote:
I am currently importing a lot of data into a Postgres Database. Initially things where quickish, but now it seems to be running very slowly. I put this down to index scans. The machine is a Dual P4 with 2 GB of ram and a Raid array (14 Fibre Channel Harddrives)
Top is reporting the disk cache is 1.7 GB and system idle around 70%. The postgres backend is using 10% to 20% of one CPU. I don't know if the bottleneck is the Disk IO or Ram.
What would people suggest I can look at to see what is holding up the process?
If it's IO bound, you should see a high system percentage (%system in top). Tools like vmstat, procinfo and iostat can let you follow through what is happening, although IIRC they apply to the whole system not a particular process. Daniel

Daniel Lawson wrote:
Raymond Burgess wrote:
I am currently importing a lot of data into a Postgres Database. Initially things where quickish, but now it seems to be running very slowly. I put this down to index scans. The machine is a Dual P4 with 2 GB of ram and a Raid array (14 Fibre Channel Harddrives)
Top is reporting the disk cache is 1.7 GB and system idle around 70%. The postgres backend is using 10% to 20% of one CPU. I don't know if the bottleneck is the Disk IO or Ram.
What would people suggest I can look at to see what is holding up the process?
If it's IO bound, you should see a high system percentage (%system in top).
Tools like vmstat, procinfo and iostat can let you follow through what is happening, although IIRC they apply to the whole system not a particular process.
I wouldn't expect sys% to be up with a scsi disk sub system. The system is running software raid, but raid5d isn't using a lot of cputime either I have run vmstat. 'vmstat 5 50' => http://hostility.deeper.co.nz/~raymond/vmstat.out There is often 1 proc listed as blocked so I guess the harddrives just can't keep up.

Raymond Burgess wrote:
I wouldn't expect sys% to be up with a scsi disk sub system. The system is running software raid, but raid5d isn't using a lot of cputime either
I have run vmstat.
'vmstat 5 50' => http://hostility.deeper.co.nz/~raymond/vmstat.out
There is often 1 proc listed as blocked so I guess the harddrives just can't keep up.
20% system is "high".
When doing mass imports, it's much better to run with no indexes, import the entire data, then regenerate the indexes at the end. Also, make sure if you're running a recent postgres that the autovacuum daemon isn't running revacuuming your data regularly. As Oliver mentioned, using COPY instead of INSERT makes a significant difference (I think it does the above itself, instead of having to do it manually).

Perry Lorier wrote:
There is often 1 proc listed as blocked so I guess the harddrives just can't keep up.
20% system is "high".
When doing mass imports, it's much better to run with no indexes, import the entire data, then regenerate the indexes at the end. Also, make sure if you're running a recent postgres that the autovacuum daemon isn't running revacuuming your data regularly.
As Oliver mentioned, using COPY instead of INSERT makes a significant difference (I think it does the above itself, instead of having to do it manually).
I don't think that COPY could be an option. A lot of the data is repeated, so store the string once and then use an ID for each string. The strings need to be indexed, as to the IDs so I can find the ID for an existing string. Currently the import is reporting a 85% hit ratio of existing strings. For each string I have statistics for different periods and sources which I store the stats for. I have run vmstat again and looked at the sys% and its not looking good at all. Yesterday during the import it was around 40% (ouch!) currently it is sitting around 20%. I might suggest that we get some more ram for the machine. Try and cache more of the indicies.

Raymond Burgess wrote:
Perry Lorier wrote:
There is often 1 proc listed as blocked so I guess the harddrives just can't keep up.
20% system is "high".
When doing mass imports, it's much better to run with no indexes, import the entire data, then regenerate the indexes at the end. Also, make sure if you're running a recent postgres that the autovacuum daemon isn't running revacuuming your data regularly.
As Oliver mentioned, using COPY instead of INSERT makes a significant difference (I think it does the above itself, instead of having to do it manually).
I don't think that COPY could be an option. A lot of the data is repeated, so store the string once and then use an ID for each string. The strings need to be indexed, as to the IDs so I can find the ID for an existing string. Currently the import is reporting a 85% hit ratio of existing strings. For each string I have statistics for different periods and sources which I store the stats for.
I have run vmstat again and looked at the sys% and its not looking good at all. Yesterday during the import it was around 40% (ouch!) currently it is sitting around 20%. I might suggest that we get some more ram for the machine. Try and cache more of the indicies.
oh if your data isn't too critical (you can reimport it?) turn of postgres fsync'ing, you'll get a MASSIVE speed increase and your sys% will drop like a stone. Also turn off atime on the filesystem the data is stored on. ALTER TABLE <foo> ALTER COLUMN <blah> SET STATISTICS may also help a lot on your string index table (if you have 100 strings that occur a lot more commonly than the rest, set the number to something over 100... if it doesn't help, set it low. VACUUM ANALYSE might also help (it updates the statistics for the table).

Have you tried not using inserts but rather use COPY? Just a random idea. Regards On Fri, 2004-05-07 at 12:41, Raymond Burgess wrote:
I am currently importing a lot of data into a Postgres Database. Initially things where quickish, but now it seems to be running very slowly. I put this down to index scans. The machine is a Dual P4 with 2 GB of ram and a Raid array (14 Fibre Channel Harddrives)
Top is reporting the disk cache is 1.7 GB and system idle around 70%. The postgres backend is using 10% to 20% of one CPU. I don't know if the bottleneck is the Disk IO or Ram.
What would people suggest I can look at to see what is holding up the process? _______________________________________________ wlug mailing list | wlug(a)list.waikato.ac.nz Unsubscribe: http://list.waikato.ac.nz/mailman/listinfo/wlug
-- Oliver Jones » Director » oliver.jones(a)deeperdesign.com » +64 (21) 41 2238 Deeper Design Limited » +64 (7) 377 3328 » www.deeperdesign.com
participants (4)
-
Daniel Lawson
-
Oliver Jones
-
Perry Lorier
-
Raymond Burgess