Loading Data

COPY FROM

COPY <table> FROM '<file pattern>' [WITH (<property> = value, ...)];

<file pattern> must be a path local to the server. The file pattern can contain wildcards if you want to load multiple files. In addition to CSV, TSV, and TXT files, you can import compressed files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.

You can import client-side files (using the \copy command in mapdql), but doing so is significantly slower. For large files, MapD recommends that you first scp the file to the server and then issue the COPY command.

<property> in the optional WITH clause can be:

  • delimiter: a single-character string for the delimiter between input fields. The default is "," (that is, a CSV file).
  • nulls: a string pattern indicating a field is NULL. By default, an empty string or \N means NULL.
  • header: either 'true' or 'false', indicating whether the input file has a header line that should be skipped. The default is 'true'.
  • escape: a single-character string for escaping quotes. The default is the quote character itself.
  • quoted: 'true' or 'false' indicating whether the input file contains quoted fields. The default is 'true'.
  • quote: a single-character string for quoting a field. The default quote character is double quote ". All characters inside quotes are imported “as is,” except for line delimiters.
  • line_delimiter: a single-character string for terminating each line. The default is "\n".
  • array: a two-character string consisting of the start and end characters surrounding an array. The default is {}. For example, a row for a table whose columns are BOOLEAN, STRING[], INTEGER (a string array in the second column) can be written as true,{value1,value2,value3},3.
  • array_delimiter: a single-character string for the delimiter between input values contained within an array. The default is ",".
  • threads: the number of threads to use for the data import. The default is the number of CPU cores on the system.
  • max_reject: the number of records that the COPY statement allows to be rejected before the COPY command terminates. Records can be rejected for a number of reasons: for example, invalid content in a field, or an incorrect number of columns. Details of the rejected records are reported in the ERROR log. COPY returns a message identifying how many records were rejected. Records that are not rejected are inserted into the table, even if COPY stops because the max_reject count is reached. The default is 100,000.
  • plain_text: indicates that the input file is plain text, bypassing the libarchive decompression utility. CSV, TSV, and TXT files are always handled as plain text by default.

Note: by default, the CSV parser assumes one row per line. To import a file that contains line breaks within quoted fields, specify threads = 1 in the WITH clause.
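Independent of MapD, the effect of a quoted field that spans lines can be seen with Python's csv module: two physical lines parse as one logical record, which is why chunk-per-thread parallel parsing must be disabled for such files.

```python
import csv
import io

# Two physical lines, but one logical CSV record: the quoted
# second field contains an embedded newline.
data = 'id,comment\n1,"first line\nsecond line"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['id', 'comment'], ['1', 'first line\nsecond line']]
```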

Examples:

COPY tweets from '/tmp/tweets.csv' WITH (nulls = 'NA');
COPY tweets from '/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'false');
COPY tweets from '/tmp/*'          WITH (header='false');

SQL Importer

java -cp [MapD JDBC driver]:[3rd party JDBC driver]
com.mapd.utility.SQLImporter -t [MapD table name] -su [external source user]
-sp [external source password] -c "jdbc:[external
source]://server:port;DatabaseName=some_database" -ss "[select statement]"
usage: SQLImporter
-b,--bufferSize <arg>      Transfer buffer size
-c,--jdbcConnect <arg>     JDBC Connection string
-d,--driver <arg>          JDBC driver class
-db,--database <arg>       MapD Database
-f,--fragmentSize <arg>    Table fragment size
-i <arg>                   Path to initialization file.
-p,--passwd <arg>          MapD Password
--port <arg>               MapD Port
-r <arg>                   Row Load Limit
-s,--server <arg>          MapD Server
-sp,--sourcePasswd <arg>   Source Password
-ss,--sqlStmt <arg>        SQL Select statement
-su,--sourceUser <arg>     Source User
-t,--targetTable <arg>     MapD Target Table
-tr,--truncate             Drop and recreate the table, if it exists
-u,--user <arg>            MapD User

SQL Importer executes a select statement on another database via JDBC and brings the result set into MapD Core.

If the table does not exist, SQL Importer creates it in MapD Core.

If the truncate flag is set, SQL Importer truncates the contents of the target table before loading.

If the table exists and the truncate flag is not set, the import fails if the table schema does not match the metadata of the SELECT statement.

MapD recommends that you use a service account with read-only permissions when accessing data from a remote database.

The -i argument provides a path to an initialization file. Each line of the file is sent as a SQL statement to the remote server from which the data is copied. This can be used to set additional custom parameters before the data is loaded.
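For example, a hypothetical initialization file for a MySQL source might pin the session time zone and SQL mode before the SELECT statement runs (whether any such statements are needed, and which, depends on the source database):

```
SET SESSION time_zone = '+00:00';
SET SESSION sql_mode = 'ANSI_QUOTES';
```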

MySQL Example:

java -cp mapd-1.0-SNAPSHOT-jar-with-dependencies.jar:
mysql/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar
com.mapd.utility.SQLImporter -t test1 -sp mypassword -su myuser
-c jdbc:mysql://localhost -ss "select * from employees.employees"

SQLServer Example:

java -cp
/path/to/mapd/bin/mapd-1.0-SNAPSHOT-jar-with-dependencies.jar:/path/to/sqljdbc4.jar
com.mapd.utility.SQLImporter -d com.microsoft.sqlserver.jdbc.SQLServerDriver -t
mapd_target_table -su source_user -sp source_pwd -c
"jdbc:sqlserver://server:port;DatabaseName=some_database" -ss "select top 10 *
from dbo.some_table"

PostgreSQL Example:

java -cp
/p/to/mapd/bin/mapd-1.0-SNAPSHOT-jar-with-dependencies.jar:
/p/to/postgresql-9.4.1208.jre6.jar
com.mapd.utility.SQLImporter -t mapd_target_table -su source_user -sp
source_pwd -c "jdbc:postgresql://server/database" -ss "select * from some_table
where transaction_date > '2014-01-01'"

StreamInsert

Stream data into MapD Core by attaching the StreamInsert program to the end of a data stream. The data stream could be another program printing to standard out, a Kafka endpoint, or any other real-time stream output. You can specify the appropriate batch size, according to the expected stream rates and your desired insert frequency. The target table must exist before you attempt to stream data into the table.

 <data stream> | StreamInsert <table name> <database name> \
 {-u|--user} <user> {-p|--passwd} <password> [{--host} <hostname>] \
 [--port <port number>][--delim <delimiter>][--null <null string>] \
 [--line <line delimiter>][--batch <batch size>][{-t|--transform} \
 transformation ...][--retry_count <num_of_retries>] \
 [--retry_wait <wait in secs>][--print_error][--print_transform]

usage: <data stream> | StreamInsert <table name> <database name>
-u,--user          User name
-p,--passwd        Password
--host             Name of MapD host
--port             Port number for MapD Core (default 9091)
--delim            Field delimiter. Default is comma.
--null             Null string definition
--line             Line delimiter
--batch            Batch size
-t,--transform     Regex transformation
--retry_count      Number of retries before reporting a failure
--retry_wait       Wait time between retries, in seconds
--print_error      Print error messages
--print_transform  Print description of transform
--help             List options

For more information on creating regex transformation statements, see RegEx Replace.
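Conceptually, each --transform is a regular-expression search and replace applied to every incoming line before insert. The MapD transform syntax itself is covered in RegEx Replace; as a rough sketch, the substitution behaves like Python's re.sub (the field values and pattern here are hypothetical):

```python
import re

# Hypothetical input line: strip a "user:" prefix from the second field.
line = "2015-12-25\tuser:alice\t42"
transformed = re.sub(r"user:(\w+)", r"\1", line)
print(transformed)  # -> "2015-12-25\talice\t42"
```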

Example:

cat file.tsv | /path/to/mapd/SampleCode/StreamInsert stream_example \
mapd --host localhost --port 9091 -u mapd -p MapDRocks! \
--delim '\t' --batch 1000

Using StreamInsert with AWS S3 Files

MapD recommends that you load S3 files into a MapD table using the StreamInsert utility. For example, this command loads a CSV file into a table named 'taxi_xmas_2015' with a batch insert size of 100,000 records. Because comma is the default delimiter, no further parameters are required.

aws s3 cp s3://mapd.com/streamTest/aws_xmas2015.csv - | \
/raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 \
mapd -u mapd -p MapDRocks! --batch 100000

HDFS

Use Sqoop to load a CSV or Parquet file residing in HDFS into MapD Core.

Copy the MapD JDBC driver into the Sqoop lib directory, normally /usr/lib/sqoop/lib/.

Example:

sqoop-export --table alltypes --export-dir /user/cloudera/ \
--connect "jdbc:mapd:192.168.122.1:9091:mapd" \
--driver com.mapd.jdbc.MapDDriver --username mapd \
--password HyperInteractive --direct --batch

Troubleshooting: How to Avoid Duplicate Rows

To detect duplication prior to loading data into MapD Core Database, you can perform the following steps. For this example, the files are labeled A, B, C, ..., Z.

  1. Load file A into table MYTABLE.

  2. Run the following query.

    select uniqueCol from MYTABLE group by uniqueCol having count(*) > 1;
    

    There should be no rows returned; if rows are returned, file A contains duplicate values.

  3. Load file B into table TEMPTABLE.

  4. Run the following query.

    select t1.uniqueCol from MYTABLE t1 join TEMPTABLE t2 on t1.uniqueCol = t2.uniqueCol;
    

    There should be no rows returned if file B is unique. If rows are returned, use them to identify and fix the duplicate records in file B.

  5. Load the fixed file B into MYTABLE.

  6. Drop table TEMPTABLE.

  7. Repeat steps 3-6 for each remaining file in the set.
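The workflow above can be rehearsed against any SQL engine; this sketch uses Python's sqlite3 as a stand-in for MapD Core, with illustrative table and column names and inline rows in place of files A and B:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE MYTABLE (uniqueCol TEXT)")
cur.execute("CREATE TABLE TEMPTABLE (uniqueCol TEXT)")

# Step 1: load "file A" (here, inline rows) into MYTABLE.
cur.executemany("INSERT INTO MYTABLE VALUES (?)", [("a",), ("b",), ("c",)])

# Step 2: check that file A itself has no duplicate key values.
dups_in_a = cur.execute(
    "SELECT uniqueCol FROM MYTABLE GROUP BY uniqueCol HAVING COUNT(*) > 1"
).fetchall()

# Steps 3-4: load "file B" into TEMPTABLE and look for keys
# that already exist in MYTABLE.
cur.executemany("INSERT INTO TEMPTABLE VALUES (?)", [("c",), ("d",)])
overlap = cur.execute(
    "SELECT t1.uniqueCol FROM MYTABLE t1 "
    "JOIN TEMPTABLE t2 ON t1.uniqueCol = t2.uniqueCol"
).fetchall()

print(dups_in_a)  # [] -> file A is unique
print(overlap)    # [('c',)] -> 'c' would be a duplicate row
```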