Parallel Data Mining in Java

This paper discusses several techniques used in developing a parallel, production-quality data mining application in Java. Three sequential versions of the data mining application were developed: a sequential Fortran 90 version used as a performance reference, a plain Java implementation that uses only the primitive array structures of the language, and a baseline Java implementation that uses a Java Array package designed by us. When desired, this Array package provides parallelism at the level of individual Array and BLAS operations. We have also developed two parallel Java versions: one that relies entirely on the parallelism from the Array package, and another that is explicitly parallel at the application level. We discuss the design of the Array package, as well as the design of the data mining application. We compare the trade-offs between performance and the abstraction level presented to the application programmer across the different Java versions. Our studies show that the Java implementation of this program is not only competitive with the Fortran implementation, but also parallelizes well in both the implicit and the explicit form. The Java implementation achieves a performance of 109 Mflops, or 91% of Fortran performance, on a 332 MHz PowerPC 604e processor. On an SMP with four of those processors, the implicitly parallel form achieves 290 Mflops with no effort from the application programmer, while the explicitly parallel form achieves 340 Mflops.
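The distinction between the implicit and the explicit parallel forms can be illustrated with a minimal sketch. It is not the Array package described in the report: the class `ParallelDotSketch` and the methods `dotImplicit` and `dotExplicit` are hypothetical names chosen for illustration, and the code uses modern Java streams and threads, which postdate the report. In the implicit form the caller makes an ordinary call and the library-style routine parallelizes internally; in the explicit form the application itself partitions the work across threads.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelDotSketch {
    // Implicit form (hypothetical stand-in for a library operation): the
    // caller writes a plain call; the routine runs in parallel internally.
    static double dotImplicit(double[] x, double[] y) {
        return IntStream.range(0, x.length)
                .parallel()
                .mapToDouble(i -> x[i] * y[i])
                .sum();
    }

    // Explicit form: the application splits the index range into one chunk
    // per thread and combines the partial sums itself.
    static double dotExplicit(double[] x, double[] y, int nThreads) {
        double[] partial = new double[nThreads];
        Thread[] workers = new Thread[nThreads];
        int chunk = (x.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            final int lo = Math.min(x.length, id * chunk);
            final int hi = Math.min(x.length, lo + chunk);
            workers[t] = new Thread(() -> {
                double s = 0.0;
                for (int i = lo; i < hi; i++) {
                    s += x[i] * y[i];
                }
                partial[id] = s; // each thread writes only its own slot
            });
            workers[t].start();
        }
        double total = 0.0;
        for (int t = 0; t < nThreads; t++) {
            try {
                workers[t].join(); // join makes the partial sum visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        double[] x = new double[1000];
        double[] y = new double[1000];
        Arrays.fill(x, 1.0);
        Arrays.fill(y, 2.0);
        System.out.println(dotImplicit(x, y));    // prints 2000.0
        System.out.println(dotExplicit(x, y, 4)); // prints 2000.0
    }
}
```

The explicit form gives the application control over how work is partitioned, at the cost of thread management code that the implicit form hides. This mirrors the trade-off between performance and abstraction level that the abstract describes.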

By: J. E. Moreira, S. P. Midkiff, M. Gupta, R. D. Lawrence

Published in: RC21326 in 1998

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

RC21326.ps

Questions about this service can be mailed to reports@us.ibm.com.