We have developed a data-parallel dialect of Java for expressing common data mining algorithms at a high level. Our compiler generates a middleware specification from programs written in this dialect. The middleware supports both distributed-memory and shared-memory parallelization, and performs a number of I/O optimizations to support efficient processing of disk-resident datasets.
In this paper, we describe the commonality among different data mining algorithms, the middleware and its interface, the data-parallel dialect of Java, and the compilation techniques required for generating the middleware specification.