Optimizing Performance

These are some ways you can get peak performance from your MapD Core system.

Hardware

  • Even though MapD is an “in-memory” database, it must read data from disk when it first starts up, and a large database can take a long time to load from a slow hard disk. Import and execution performance also depend on high-performance storage. MapD recommends fast SSD drives on a good hardware controller in a RAID 10 configuration as reasonable starting hardware. If you use a virtual machine such as Amazon Web Services, MapD recommends Provisioned IOPS SSD disks in a RAID configuration for storage.
  • Do not run unnecessary daemons. Ideally, only MapD services would run on your MapD server.
  • For a production server, set the power profile to performance rather than power saving. This setting is typically controlled in the system BIOS and prevents the CPU from throttling back. You also have to set the Linux power governor to performance; see the sketch after this list.
  • If there is a large amount of swap activity on the machine, you probably have a memory shortage. Review the amount of data the database is attempting to process in memory compared with how much memory is available.
  • CPU speed matters to MapD, because some work is always done on the CPUs. MapD recommends systems that balance high core count with high CPU speed.
  • Use the nvidia-smi -pm and nvidia-smi -ac commands to set the clock speeds of the GPUs to their maximum. On a K80, the commands look like this:
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 3004,875
sudo nvidia-smi --ecc-config=0
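
The following is a minimal sketch of the CPU and GPU settings described above. It assumes a Linux host with the cpupower utility installed and the NVIDIA driver's nvidia-smi tool available; the clock values accepted by nvidia-smi -ac vary by GPU model, so list the supported clocks first.

# Set the CPU frequency governor to performance on every core
# (assumes the cpupower utility from the kernel tools package is installed).
sudo cpupower frequency-set -g performance

# Alternatively, write the governor directly through sysfs.
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$gov" > /dev/null
done

# List the supported memory,graphics clock pairs for your GPUs, then pass the
# highest supported pair to nvidia-smi -ac as shown above.
nvidia-smi -q -d SUPPORTED_CLOCKS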

Database Design

Review a representative sample of the data from which your table will be created. This helps you determine the datatypes best suited to each column. Where possible, place data into columns with the smallest datatype and encoding that can represent the values and cardinality involved.

Look for these areas of potential optimization:

  • Can you apply FIXED ENCODING to TIMESTAMP fields?
  • Can you apply fixed sizes to FIXED ENCODING DICT fields?
  • What kind of INTEGER is appropriate for the values involved?
  • Is DOUBLE required, or is FLOAT enough to store expected values?
  • Can you set ENCODING NONE for high-cardinality TEXT fields?
  • Can the data be converted from its current form to a more denormalized form?

Using the smallest possible encoding speeds up all aspects of MapD from the initial load to query execution.
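
As an illustration, the following sketch creates a table whose columns use the smallest types and encodings that fit the data. The mapdql invocation, table name, and column names are assumptions for this example only; choose encodings based on your own data.

# Illustrative only: the mapdql invocation and schema are assumptions for this example.
# dep_timestamp uses a 4-byte fixed encoding instead of 8 bytes, the low-cardinality
# carrier_name uses an 8-bit dictionary, dep_delay fits in a SMALLINT, FLOAT is
# sufficient for distance, and the high-cardinality comment_text is left unencoded.
mapdql mapd -u mapd -p HyperInteractive <<'SQL'
CREATE TABLE flights (
  dep_timestamp TIMESTAMP ENCODING FIXED(32),
  carrier_name  TEXT ENCODING DICT(8),
  dep_delay     SMALLINT,
  distance      FLOAT,
  comment_text  TEXT ENCODING NONE
);
SQL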

Loading Data

  • Loading large flat files of 100M or more is the most efficient way to import data to MapD.
  • Consider increasing the batch size used by StreamInserter or SQLImporter to reduce the overhead for each set of records loaded or streamed.
  • If you regularly use a particular column to filter queries against a table, load the table sorted on that column. For example, if most queries have a DATE dimension, load the data in date order for the best performance; see the sketch after this list.

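A minimal sketch of the pre-sorted bulk load described above follows. The file name, date column position, table name, and mapdql invocation are assumptions for the example.

# Assumed layout: raw_events.csv with a header row and its date column in field 3.
# Sort the data rows by date so queries that filter on date touch fewer fragments.
(head -n 1 raw_events.csv && tail -n +2 raw_events.csv | sort -t, -k3,3) > events_sorted.csv

# Load the sorted file in one large COPY rather than many small inserts.
mapdql mapd -u mapd -p HyperInteractive <<'SQL'
COPY events FROM '/path/to/events_sorted.csv' WITH (header = 'true');
SQL
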
Parallel GPUs

Parsing, optimization, and parts of rendering can overlap between queries, but most of the execution happens one query at a time. In general, you get the most throughput on the GPUs by giving each query all of the resources, with no contention for things like buffer or cache memory. If individual queries finish very quickly, you get low latency even with many simultaneous queries.

For simple queries on relatively small datasets, consider executing queries on subsets of the GPUs rather than on all of them. Different GPU groups can then execute queries at the same time. This configuration parallelizes the “fixed overhead” of each query across MapD servers running on the same node.

You can implement this behavior by running multiple MapD servers on the same node, mapping each to a different set of GPUs with the --start-gpu and --num-gpus flags (see Configuration file).
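
For example, the following sketch runs two MapD servers on one eight-GPU node. The ports, data directories, and $MAPD_PATH location are assumptions for the example; only --start-gpu and --num-gpus come from the description above.

# Two mapd_server instances on the same node, each owning four of the eight GPUs.
# Each instance needs its own data directory and port (values here are illustrative).
$MAPD_PATH/bin/mapd_server --data /var/lib/mapd/data1 --port 9091 \
    --start-gpu 0 --num-gpus 4 &
$MAPD_PATH/bin/mapd_server --data /var/lib/mapd/data2 --port 9092 \
    --start-gpu 4 --num-gpus 4 &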