Optimizing Performance

Consider the following to optimize performance on your MapD Core system.

Hardware

  • Even though MapD is an “in-memory” database, it reads data from disk when the database first starts up. A large database can take a long time to read from a slow hard disk. Import and query execution performance also depend on fast disks. As a starting point, MapD recommends fast SSD drives on a good hardware controller in a RAID 10 configuration. If you run on a cloud virtual machine, such as an Amazon Web Services instance, MapD recommends Provisioned IOPS SSD disks in a RAID configuration for storage.
  • Do not run unnecessary daemons. Ideally, only MapD services run on your MapD server.
  • For a production server, set the power profile to performance rather than power saving. This setting is typically controlled in the system BIOS and prevents the CPU from throttling back. You must also set the Linux CPU frequency governor to performance.
  • A large amount of swap activity on the machine probably indicates a memory shortage. Compare the amount of data the database is attempting to process in memory to the amount of memory available.
  • Because some work is always done on the CPUs, speed is important. MapD recommends you use systems that balance a high core count with high CPU speed.
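  • Use the nvidia-smi -pm and nvidia-smi -ac commands to enable persistence mode and set the GPU application clocks to their maximum. The -ac argument takes the memory clock followed by the graphics clock, in MHz. Run nvidia-smi -q -d SUPPORTED_CLOCKS to list the values your GPUs accept. On an NVIDIA Tesla K80, the commands look like this:
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 2505,875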
sudo nvidia-smi --ecc-config=0
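
The CPU governor recommendation above can be checked from the shell. The following is a minimal sketch, assuming a Linux host that exposes the standard cpufreq files under /sys/devices/system/cpu. The function name non_performance_cpus is hypothetical and not part of MapD.

```shell
#!/usr/bin/env bash
# Sketch: list CPUs whose frequency governor is not "performance".
# Assumes the standard Linux cpufreq sysfs layout; the optional argument
# lets you point the check at another directory tree.
non_performance_cpus() {
  local root="${1:-/sys/devices/system/cpu}" f
  for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    # Skip CPUs (or unmatched globs) that expose no governor file.
    [ -r "$f" ] || continue
    if [ "$(cat "$f")" != "performance" ]; then
      echo "${f%/cpufreq/scaling_governor}"
    fi
  done
}

# Print any CPUs that are still using another governor.
non_performance_cpus
```

If the function prints any CPUs, one common way to switch them is sudo cpupower frequency-set -g performance, although the tool available depends on your distribution.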

Database Design

Review a representative sample of the data from which your table is to be created. This helps you determine the datatypes best suited to your columns. Where possible, place data into columns with the smallest representation that can fit the cardinality involved.

Look for these areas of potential optimization:

  • Can you apply fixed encoding to TIMESTAMP fields?
  • Can you apply a fixed dictionary size to dictionary-encoded TEXT fields, for example ENCODING DICT(8) or ENCODING DICT(16)?
  • What kind of INTEGER is appropriate for the values involved?
  • Is DOUBLE required, or is FLOAT enough to store expected values?
  • Is ENCODING NONE set for high-cardinality TEXT fields?
  • Can the data be converted from its current form to a more denormalized form?
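
As an illustration of the questions above, the sketch below prints a table definition that uses compact types and encodings. The table name trips and all column names are hypothetical, and the type choices assume the value ranges noted in the comments.

```shell
#!/usr/bin/env bash
# Sketch: emit a CREATE TABLE statement that applies the checklist above.
# Table and column names are hypothetical examples.
#   pickup_time   - fixed 32-bit encoding for a TIMESTAMP
#   passenger_cnt - SMALLINT, assuming values fit in 16 bits
#   fare_amount   - FLOAT, assuming single precision is enough
#   payment_type  - 8-bit dictionary, assuming at most 255 distinct values
#   trip_comment  - ENCODING NONE for high-cardinality free text
emit_trips_ddl() {
  cat <<'SQL'
CREATE TABLE trips (
  pickup_time   TIMESTAMP ENCODING FIXED(32),
  passenger_cnt SMALLINT,
  fare_amount   FLOAT,
  payment_type  TEXT ENCODING DICT(8),
  trip_comment  TEXT ENCODING NONE
);
SQL
}

emit_trips_ddl
# To run the statement, pipe it to the SQL client, for example
# (assumed invocation; check your installation):
#   emit_trips_ddl | mapdql -p "$MAPD_PASSWORD"
```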

Using the smallest possible encoding speeds up all aspects of MapD, from initial load to query execution.
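
To see how much column widths matter, the rough estimate below multiplies a row count by the sum of the per-column byte widths. It is a sketch only: the helper table_bytes is hypothetical, the row count is an example, and the estimate ignores dictionaries, metadata, and any other overhead.

```shell
#!/usr/bin/env bash
# Sketch: estimate raw column storage as rows * sum(column widths in bytes).
# Usage: table_bytes ROWS WIDTH [WIDTH...]
table_bytes() {
  local rows="$1" total=0 w
  shift
  for w in "$@"; do
    total=$((total + w))
  done
  echo $((rows * total))
}

rows=1000000000  # example: one billion rows

# Wide types: TIMESTAMP (8), BIGINT (8), DOUBLE (8), TEXT ENCODING DICT (4)
echo "wide:   $(table_bytes "$rows" 8 8 8 4) bytes"
# Compact: TIMESTAMP FIXED(32) (4), SMALLINT (2), FLOAT (4), DICT(8) (1)
echo "narrow: $(table_bytes "$rows" 4 2 4 1) bytes"
```

With these example widths, the compact layout needs 11 bytes per row instead of 28, which also reduces how much data has to fit in CPU and GPU memory.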

Loading Data

  • Loading large flat files of 100M or more is the most efficient way to import data to MapD.
  • Consider increasing the block sizes of StreamInserter or SQLImporter to reduce the per-record overhead when loading or streaming records.
  • If you use a particular column on a regular basis to restrict the queries to a table, load the table sorted on the data in that column. For example, if most queries have a DATE dimension, then load data in date order for best performance.
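
The sorted-load advice above can be applied before import with standard tools. This is a minimal sketch, assuming a simple CSV file with a single header line, no quoted commas or embedded newlines, and dates in ISO format (YYYY-MM-DD) so that text order matches date order. The function and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: sort a CSV file on one column while keeping the header first.
# Usage: sort_csv_by_column FILE COLUMN_NUMBER
sort_csv_by_column() {
  local file="$1" col="$2"
  head -n 1 "$file"
  # LC_ALL=C gives a plain byte-order sort; -s keeps ties in input order.
  tail -n +2 "$file" | LC_ALL=C sort -s -t, -k"${col},${col}"
}

# Example (hypothetical file names):
#   sort_csv_by_column trips.csv 1 > trips_sorted.csv
# Then load the sorted file, for example with:
#   COPY trips FROM '/data/trips_sorted.csv';
```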

Parallel GPUs

Parsing, optimization, and parts of rendering can overlap between queries, but most of the execution occurs serially, one query at a time. In general, you get the most throughput on the GPU by letting each query have all the resources, because queries then do not contend for buffer or cache memory. If queries complete very quickly, latency stays low even with many simultaneous queries.

For simple queries on relatively small datasets, consider executing queries on subsets of GPUs. Different GPU groups can execute at the same time. The benefit comes from running the “fixed overheads” of each query in parallel across the MapD servers on the same node.

You can implement this behavior by running multiple MapD servers on the same node and mapping each to different sets of GPUs with the --start-gpu and --num-gpus flags (see Configuration file).
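
As a sketch of how such a layout might be planned, the function below prints one mapd_server command line per GPU group without starting anything. The --start-gpu and --num-gpus flags come from the text above. The data directories, the port numbers, and the assumption that each server needs its own data directory and port are illustrative and should be checked against your configuration. Other ports, such as the HTTP and Calcite ports, would likely also need to be distinct.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) one server command per group of GPUs.
# Usage: plan_gpu_groups TOTAL_GPUS GPUS_PER_SERVER [BASE_PORT]
# Data paths and port numbers below are hypothetical placeholders.
plan_gpu_groups() {
  local total="$1" per="$2" base_port="${3:-19091}"
  local start=0 idx=0
  # Only emit full groups; leftover GPUs are not assigned.
  while [ $((start + per)) -le "$total" ]; do
    echo "mapd_server --data /var/lib/mapd/data${idx} --port $((base_port + idx * 10)) --start-gpu ${start} --num-gpus ${per}"
    start=$((start + per))
    idx=$((idx + 1))
  done
}

# Example: an 8-GPU node split into four servers with two GPUs each.
plan_gpu_groups 8 2
```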