Hi list,
I have compiled a single-threaded (and quite non MP aware) pascal program that does a lot of numerical computations. One calculation takes about 5 days on my 1.25 GHz DDR G4 (single cpu) and I need about 200 computations to do. Well, when I run the program on my G4 it takes up all CPU time (cca 93%-95%). I thought of speeding the whole thing by running it on a dual processor machine so I got someone to run it on a dual 2 GHz G5. So here comes the problem. If only one instance is run on a dual G5 the Activity monitor shows 50% CPU load (which is understandable) but the graph alternates between 100% load on 1st CPU and 100% load on 2nd CPU. I thought that is an anomaly of Activity Monitor. However, when two instances are run, the time required to execute each task doubles even though the Activity monitor shows both cpus at 100% load.
My program doesn't really do anything much - it performs gazillions of computations in arrays that all together consume less than 20 Mb of memory.
Am I doing anything wrong or shouldn't I expect better performance on a dual cpu unit?
Highest regards, Boris
Just quickly - IMO this isn't really a Pascal Q so much as a program design Q (assuming there are no bizarre issues in threading with GPC!). You'll want your threads to have little data exchange or dependence on one another, or they'll either bottleneck over the data bus (MP programs can become data bus bound rather than CPU bound, in principle at least; some folks claim this is an issue on the dual G5's) or end up waiting on completion of another thread (blocked) in order to proceed. It's possible that what you describe is blocking in a big way. You'll have to think about how your code is designed to understand its behaviour and to make good use of both CPUs. Just sticking in threads won't automatically give you, say, a 1.5X speed gain.
Since you're doing array work, I presume you've looked into making the code use the vector processor instructions on the G4/G5? That could give you bigger gains than the threading. If you're not already on it, look on the scitech mailing list at Apple for further ideas/discussions. Look out for Ian Ollman's posts in particular.
Have fun,
Grant
At 9:16 PM +0200 29/3/04, Boris Herman wrote:
Hi list,
I have compiled a single-threaded (and quite non MP aware) pascal program that does a lot of numerical computations. One calculation takes about 5 days on my 1.25 GHz DDR G4 (single cpu) and I need about 200 computations to do. Well, when I run the program on my G4 it takes up all CPU time (cca 93%-95%). I thought of speeding the whole thing by running it on a dual processor machine so I got someone to run it on a dual 2 GHz G5. So here comes the problem. If only one instance is run on a dual G5 the Activity monitor shows 50% CPU load (which is understandable) but the graph alternates between 100% load on 1st CPU and 100% load on 2nd CPU. I thought that is an anomaly of Activity Monitor. However, when two instances are run, the time required to execute each task doubles even though the Activity monitor shows both cpus at 100% load.
My program doesn't really do anything much - it performs gazillions of computations in arrays that all together consume less than 20 Mb of memory.
Am I doing anything wrong or shouldn't I expect better performance on a dual cpu unit?
Highest regards, Boris
Boris Herman wrote:
Hi list,
I have compiled a single-threaded (and quite non MP aware) pascal program that does a lot of numerical computations. One calculation takes about 5 days on my 1.25 GHz DDR G4 (single cpu) and I need about 200 computations to do. Well, when I run the program on my G4 it takes
My program doesn't really do anything much - it performs gazillions of computations in arrays that all together consume less than 20 Mb of memory.
Am I doing anything wrong or shouldn't I expect better performance on a dual cpu unit?
If you really care about speed you should verify where the bottleneck is. When you compute, you first need to move data from memory to the processor, then compute, then write back the result. With small data (fitting in the processor cache) you stress the processor (still, fewer accesses to the cache make things faster). With larger data, memory (DRAM) speed matters. If you move slowly through your data then you can fully utilize the processor speed. However, if you make many "fast" passes, memory bandwidth is the bottleneck. Your processor should deliver more than a gigaflop (2.5 GF???), but a double precision dot product needs 8 bytes per flop (each multiply-add, i.e. two flops, reads two 8-byte operands), and I bet that your memory is unable to deliver 8 GB per second. Also, DRAM normally delivers a block of 32-128 bytes (a cache line), so if you access scattered data you transfer much more than you need.
If your bottleneck is memory bandwidth then adding CPUs does not help, you still have the same memory. If your memory is fast enough then a new process on a new CPU should work in parallel with the one on the old CPU, giving the speedup.
Both memory access and actual computations count as "CPU time" in the OS. To know which dominates you may simply count the various operations your program is doing. Little experiments changing the size of your arrays may help. There are also special tools -- I know many for PC, but there must be something for Mac too.
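One way to set up the array-size experiment suggested above is sketched below. This is not Boris's program: the name SweepTest and the values of MaxN and TotalOps are made up for illustration, and the size comments assume an 8-byte Real. The program reads N from standard input and sweeps the first N elements repeatedly, so every run performs about the same number of updates whatever N is.

```pascal
program SweepTest;
{ Sketch only. MaxN and TotalOps are hypothetical values; tune them. }
const
  MaxN = 4000000;        { about 32 MB if Real is 8 bytes }
  TotalOps = 400000000;  { element updates per run, held roughly constant }
var
  A: array [1..MaxN] of Real;
  N, Passes, p, i: LongInt;
begin
  ReadLn(N);
  if (N < 1) or (N > MaxN) then
  begin
    WriteLn('N must be between 1 and ', MaxN:1);
    Halt(1);
  end;
  for i := 1 to N do
    A[i] := 0.0;
  { Fewer passes for larger N, so the total work stays about the same. }
  Passes := TotalOps div N;
  for p := 1 to Passes do
    for i := 1 to N do
      A[i] := A[i] + 1.0;
  { Printing results keeps the compiler from discarding the loops. }
  WriteLn('N = ', N:1, '  passes = ', Passes:1,
          '  A[1] = ', A[1]:1:0, '  A[N] = ', A[N]:1:0);
end.
```

Run it under the shell's time command for a few values of N, from a few thousand (fits in cache) up to a few million (does not). If the run time stays about flat, the loop is limited by the processor; if it climbs once N outgrows the cache, memory bandwidth is the limit, and a second CPU sharing the same memory is unlikely to help much.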
Boris Herman wrote:
I have compiled a single-threaded (and quite non MP aware) pascal program that does a lot of numerical computations. One calculation takes about 5 days on my 1.25 GHz DDR G4 (single cpu) and I need about 200 computations to do. Well, when I run the program on my G4 it takes up all CPU time (cca 93%-95%). I thought of speeding the whole thing by running it on a dual processor machine so I got someone to run it on a dual 2 GHz G5. So here comes the problem. If only one instance is run on a dual G5 the Activity monitor shows 50% CPU load (which is understandable) but the graph alternates between 100% load on 1st CPU and 100% load on 2nd CPU. I thought that is an anomaly of Activity Monitor.
I suppose that's normal. Either the system balances the load intentionally, or it just happens that it gets scheduled on both CPUs now and then.
However, when two instances are run, the time required to execute each task doubles even though the Activity monitor shows both cpus at 100% load.
My program doesn't really do anything much - it performs gazillions of computations in arrays that all together consume less than 20 Mb of memory.
Am I doing anything wrong or shouldn't I expect better performance on a dual cpu unit?
I don't really have much MP experience, but ISTM the memory speed is the bottleneck (as is often the case on single processor machines already). Are the programs very memory intensive? (I suppose so, i.e., the working set is generally larger than the processor cache.) Then I think the size of allocated memory doesn't matter much, if both processes have to wait for access to the memory bus. Localizing the memory accesses may help quite a bit -- if possible. Or run them on two separate machines with their own memory. ;-)
(I assume they don't use shared memory. If they do, this might be a factor as well, of course.)
Frank
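As an illustration of what localizing the memory accesses can mean in practice, here is a minimal sketch. It assumes a computation that currently makes two separate passes over a large array; the program and its names (FusePasses, TwoPasses, OnePass) are hypothetical and not taken from Boris's code. Merging the passes means each element is brought in from memory once instead of twice.

```pascal
program FusePasses;
{ Sketch only: two ways to scale and then offset every element. }
const
  N = 4000000;  { about 32 MB if Real is 8 bytes, well beyond cache size }
var
  A: array [1..N] of Real;
  i: LongInt;

procedure TwoPasses;
var
  k: LongInt;
begin
  { Each loop streams the whole array through the cache once. }
  for k := 1 to N do
    A[k] := A[k] * 2.0;
  for k := 1 to N do
    A[k] := A[k] + 1.0;
end;

procedure OnePass;
var
  k: LongInt;
begin
  { Same arithmetic, but each element is fetched and stored only once. }
  for k := 1 to N do
    A[k] := A[k] * 2.0 + 1.0;
end;

begin
  for i := 1 to N do
    A[i] := i;
  TwoPasses;
  OnePass;
  WriteLn('A[1] = ', A[1]:1:0, '  A[N] = ', A[N]:1:0);
end.
```

Both procedures apply the same update; only the memory traffic differs. Whether a real program can be rearranged this way depends on its data dependencies, so treat this as a pattern to look for rather than a recipe.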
Apple's performance tools may be helpful, see <http://developer.apple.com/documentation/Performance/Conceptual/PerformanceFundamentals/Concepts/Tools.html>
Regards,
Adriaan van Os
On Mon, 29 Mar 2004, Boris Herman wrote: [..]
My program doesn't really do anything much - it performs gazillions of computations in arrays that all together consume less than 20 Mb of memory.
20 Mb is several times the size of your L2 cache. The CPU works out of the cache, which at any moment holds only a portion of the program plus some of the data. If the data the program needs next isn't in the cache, the CPU has to write a cache line back and load another one from main memory.
For example, this simple program took 1.8 sec to run:
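program huge;
var
  i, j : integer;
  BA : array [ 1..10000, 1..10000 ] of integer;
begin
  for i := 1 to 10000 do
    for j := 1 to 10000 do
      BA [ i,j ] := i+j;
end.
Change one line to read: BA [ j,i ] := i+j; and the program now takes 27 sec to run. Pascal stores the last index contiguously, so with [ j,i ] each step of the inner loop jumps a whole row (10000 integers) ahead in memory and touches a new cache line.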
Hope this helps, Russ