Dallas H. Snider
Verified Expert in Engineering

Dallas has 22 years of experience developing database applications. He has worked with SQL Server and Oracle, on both Windows and Linux.

PREVIOUSLY AT

Northrop Grumman

Introduction

By now, you have probably heard of the Hadoop Distributed File System (HDFS), especially if you are a data analyst or someone who is responsible for moving data from one system to another. However, what benefits does HDFS have over relational databases?

HDFS is a scalable, open-source solution for storing and processing large volumes of data. HDFS has been proven to be reliable and efficient across many modern data centers.

HDFS utilizes commodity hardware along with open-source software to reduce the overall cost per byte of storage.

With its built-in replication and resilience to disk failures, HDFS is an ideal system for storing and processing data for analytics. It does not require the underpinnings and overhead to support transaction atomicity, consistency, isolation, and durability (ACID) as is necessary with traditional relational database systems.

Moreover, compared to enterprise and commercial databases, such as Oracle, utilizing Hadoop as the analytics platform avoids any extra licensing costs.

One of the questions many people ask when first learning about HDFS is: How do I get my existing data into the HDFS?

In this article, we will examine how to import data from a PostgreSQL database into HDFS. We will use Apache Sqoop, which is currently the most efficient, open-source solution for transferring data between HDFS and relational database systems. Apache Sqoop is designed to bulk-load data from a relational database to the HDFS (import) and to bulk-write data from the HDFS to a relational database (export).

Speed up your analytics by migrating your data into HDFS.

The steps in this tutorial are written for someone with a basic knowledge of executing SQL queries and an elementary knowledge of HDFS commands.

The database system used is PostgreSQL 9.5 for Windows, and the HDFS version is Cloudera Hadoop 2.5.0-cdh5.2.0 on a CentOS 6.4 Linux virtual machine.

Apache Sqoop relies on the JDBC driver JAR files that are specific to the relational database vendor and database version.
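For example, if the PostgreSQL JDBC driver is not already on Sqoop's classpath, it can simply be copied into Sqoop's library directory. The driver version and the /var/lib/sqoop path below are illustrative assumptions (they vary by PostgreSQL and Cloudera release), not values taken from the environment used in this article:

[hdfs@localhost:/sqoop]$ sudo cp postgresql-9.4.1208.jar /var/lib/sqoop/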

To perform the steps shown in this article, the user will need permissions to connect remotely to the PostgreSQL database, SELECT permissions on the relational database, write permissions on the HDFS, and execute permissions on the Sqoop executable.

For the purpose of this tutorial, we created a PostgreSQL database, named it Toptal, and made it accessible through port 5432.
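Enabling those remote connections is outside the scope of this tutorial, but as a rough sketch (the file locations, client subnet, and authentication method below are assumptions, not values from our setup), the relevant PostgreSQL settings usually look like the following, applied before restarting the server:

# postgresql.conf
listen_addresses = '*'     # accept connections on all interfaces, not just localhost
port = 5432
ssl = on                   # uses the OpenSSL certificate and key mentioned in the next section

# pg_hba.conf - allow SSL connections to the Toptal database from an assumed client subnet
hostssl    Toptal    postgres    192.168.1.0/24    md5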

PostgreSQL Data Source

First, in our PostgreSQL Toptal database, we will create a table named sales. We will assume that the OpenSSL certificate and private key files already exist on the PostgreSQL server.

Server [localhost]:
Database [postgres]: Toptal
Port [5432]:
Username [postgres]:
Password for user postgres:
psql (9.5.3)
Toptal=# create table sales
Toptal-# (
Toptal(#    pkSales integer constraint salesKey primary key,
Toptal(#    saleDate date,
Toptal(#    saleAmount money,
Toptal(#    orderID int not null,
Toptal(#    itemID int not null
Toptal(# );
CREATE TABLE

Next, we will insert 20 rows into the table:

Toptal=# insert into sales values (1, '2016-09-27', 1.23, 1, 1);
INSERT 0 1
Toptal=# insert into sales values (2, '2016-09-27', 2.34, 1, 2);
INSERT 0 1
Toptal=# insert into sales values (3, '2016-09-27', 1.23, 2, 1);
INSERT 0 1
Toptal=# insert into sales values (4, '2016-09-27', 2.34, 2, 2);
INSERT 0 1
Toptal=# insert into sales values (5, '2016-09-27', 3.45, 2, 3);
INSERT 0 1
Toptal=# insert into sales values (6, '2016-09-28', 3.45, 3, 3);
INSERT 0 1
Toptal=# insert into sales values (7, '2016-09-28', 4.56, 3, 4);
INSERT 0 1
Toptal=# insert into sales values (8, '2016-09-28', 5.67, 3, 5);
INSERT 0 1
Toptal=# insert into sales values (9, '2016-09-28', 1.23, 4, 1);
INSERT 0 1
Toptal=# insert into sales values (10, '2016-09-28', 1.23, 5, 1);
INSERT 0 1
Toptal=# insert into sales values (11, '2016-09-28', 1.23, 6, 1);
INSERT 0 1
Toptal=# insert into sales values (12, '2016-09-29', 1.23, 7, 1);
INSERT 0 1
Toptal=# insert into sales values (13, '2016-09-29', 2.34, 7, 2);
INSERT 0 1
Toptal=# insert into sales values (14, '2016-09-29', 3.45, 7, 3);
INSERT 0 1
Toptal=# insert into sales values (15, '2016-09-29', 4.56, 7, 4);
INSERT 0 1
Toptal=# insert into sales values (16, '2016-09-29', 5.67, 7, 5);
INSERT 0 1
Toptal=# insert into sales values (17, '2016-09-29', 6.78, 7, 6);
INSERT 0 1
Toptal=# insert into sales values (18, '2016-09-29', 7.89, 7, 7);
INSERT 0 1
Toptal=# insert into sales values (19, '2016-09-29', 7.89, 8, 7);
INSERT 0 1
Toptal=# insert into sales values (20, '2016-09-30', 1.23, 9, 1);
INSERT 0 1

Let's select the data to verify that it looks correct:

Toptal=# select * from sales;
 pksales |  saledate  | saleamount | orderid | itemid
---------+------------+------------+---------+--------
       1 | 2016-09-27 |      $1.23 |       1 |      1
       2 | 2016-09-27 |      $2.34 |       1 |      2
       3 | 2016-09-27 |      $1.23 |       2 |      1
       4 | 2016-09-27 |      $2.34 |       2 |      2
       5 | 2016-09-27 |      $3.45 |       2 |      3
       6 | 2016-09-28 |      $3.45 |       3 |      3
       7 | 2016-09-28 |      $4.56 |       3 |      4
       8 | 2016-09-28 |      $5.67 |       3 |      5
       9 | 2016-09-28 |      $1.23 |       4 |      1
      10 | 2016-09-28 |      $1.23 |       5 |      1
      11 | 2016-09-28 |      $1.23 |       6 |      1
      12 | 2016-09-29 |      $1.23 |       7 |      1
      13 | 2016-09-29 |      $2.34 |       7 |      2
      14 | 2016-09-29 |      $3.45 |       7 |      3
      15 | 2016-09-29 |      $4.56 |       7 |      4
      16 | 2016-09-29 |      $5.67 |       7 |      5
      17 | 2016-09-29 |      $6.78 |       7 |      6
      18 | 2016-09-29 |      $7.89 |       7 |      7
      19 | 2016-09-29 |      $7.89 |       8 |      7
      20 | 2016-09-30 |      $1.23 |       9 |      1
(20 rows)

The data looks good, so let's continue.

Importing Into HDFS Using Sqoop

With the data source defined, we are now ready to import the data into the HDFS. The sqoop command we will examine is listed below, and we will break down each argument in the bullet points that follow. Note that the command is to be entered on one complete line or, as shown below, with the backslash (the Linux command line continuation character) at the end of each line except the last.

sqoop import --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/Toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--table 'sales' \
--target-dir 'sales' \
--split-by 'pksales' 
  • sqoop import - The executable is named sqoop, and we are instructing it to import the data from a table or view from a database to the HDFS.
  • --connect - With the --connect argument, we are passing in the JDBC connect string for PostgreSQL. In this case, we use the IP address, port number, and database name. We also need to specify that SSL is being utilized and need to supply the SSLSocketFactory class to be used.
  • --username - In this example, the username is a PostgreSQL login, not a Windows login. The user must have permissions to connect to the specified database and to select from the specified table.
  • -P - This will prompt the command line user for the password. If Sqoop is rarely executed, this might be a good option. There are multiple other ways to pass the password to the command automatically, but we are trying to keep it simple for this article (one such alternative is sketched after this list).
  • --table - This is where we pass in the name of the PostgreSQL table.
  • --target-dir - This argument specifies the HDFS directory where the data is to be stored.
  • --split-by - We must provide Sqoop with a unique identifier to help it distribute the workload. Later in the job output, we will see where Sqoop selects the minimum and maximum values to help set split boundaries.
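As an example of one such alternative to the interactive -P prompt, Sqoop also accepts a --password-file argument, which reads the password from a file (by default, a path on the HDFS). The file name, location, and password below are purely illustrative; in a real environment, lock the file down appropriately:

[hdfs@localhost:/sqoop]$ echo -n 'examplePassword' | hdfs dfs -put - /user/hdfs/.sqoop.pwd
[hdfs@localhost:/sqoop]$ hdfs dfs -chmod 400 /user/hdfs/.sqoop.pwd

The script could then use --password-file '/user/hdfs/.sqoop.pwd' in place of -P and avoid the prompt entirely. The echo -n matters, because a trailing newline would become part of the password.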

It is a good idea to put the command in a script for repeatability and editing purposes, as shown below:

[hdfs@localhost:/sqoop]$ cat sqoopCommand.sh
sqoop import --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--table 'sales' \
--target-dir 'sales' \
--split-by 'pksales' 
[hdfs@localhost:/sqoop]$

Now, it is time to execute the Sqoop command script above. The output from the Sqoop command is shown below.

[hdfs@localhost:/sqoop]$ ./sqoopCommand.sh
16/10/02 18:58:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
16/10/02 18:58:40 INFO manager.SqlManager: Using default fetchSize of 1000
16/10/02 18:58:40 INFO tool.CodeGenTool: Beginning code generation
16/10/02 18:58:41 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "sales" AS t LIMIT 1
16/10/02 18:58:41 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/77f9452788024792770d52da72ae871f/sales.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/10/02 18:58:43 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/77f9452788024792770d52da72ae871f/sales.jar
16/10/02 18:58:43 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
16/10/02 18:58:43 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
16/10/02 18:58:43 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
16/10/02 18:58:43 INFO mapreduce.ImportJobBase: Beginning import of sales
16/10/02 18:58:45 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/10/02 18:58:46 INFO db.DBInputFormat: Using read committed transaction isolation
16/10/02 18:58:46 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN("pksales"), MAX("pksales") FROM "sales"
16/10/02 18:58:47 INFO mapred.JobClient: Running job: job_201609280401_0005
16/10/02 18:58:48 INFO mapred.JobClient:  map 0% reduce 0%
16/10/02 18:59:04 INFO mapred.JobClient:  map 50% reduce 0%
16/10/02 18:59:14 INFO mapred.JobClient:  map 75% reduce 0%
16/10/02 18:59:15 INFO mapred.JobClient:  map 100% reduce 0%
16/10/02 18:59:18 INFO mapred.JobClient: Job complete: job_201609280401_0005
16/10/02 18:59:18 INFO mapred.JobClient: Counters: 23
16/10/02 18:59:18 INFO mapred.JobClient:   File System Counters
16/10/02 18:59:18 INFO mapred.JobClient:     FILE: Number of bytes read=0
16/10/02 18:59:18 INFO mapred.JobClient:     FILE: Number of bytes written=1190344
16/10/02 18:59:18 INFO mapred.JobClient:     FILE: Number of read operations=0
16/10/02 18:59:18 INFO mapred.JobClient:     FILE: Number of large read operations=0
16/10/02 18:59:18 INFO mapred.JobClient:     FILE: Number of write operations=0
16/10/02 18:59:18 INFO mapred.JobClient:     HDFS: Number of bytes read=438
16/10/02 18:59:18 INFO mapred.JobClient:     HDFS: Number of bytes written=451
16/10/02 18:59:18 INFO mapred.JobClient:     HDFS: Number of read operations=4
16/10/02 18:59:18 INFO mapred.JobClient:     HDFS: Number of large read operations=0
16/10/02 18:59:18 INFO mapred.JobClient:     HDFS: Number of write operations=4
16/10/02 18:59:18 INFO mapred.JobClient:   Job Counters 
16/10/02 18:59:18 INFO mapred.JobClient:     Launched map tasks=4
16/10/02 18:59:18 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=48877
16/10/02 18:59:18 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
16/10/02 18:59:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/10/02 18:59:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/10/02 18:59:18 INFO mapred.JobClient:   Map-Reduce Framework
16/10/02 18:59:18 INFO mapred.JobClient:     Map input records=20
16/10/02 18:59:18 INFO mapred.JobClient:     Map output records=20
16/10/02 18:59:18 INFO mapred.JobClient:     Input split bytes=438
16/10/02 18:59:18 INFO mapred.JobClient:     Spilled Records=0
16/10/02 18:59:18 INFO mapred.JobClient:     CPU time spent (ms)=3980
16/10/02 18:59:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=481574912
16/10/02 18:59:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2949685248
16/10/02 18:59:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=127401984
16/10/02 18:59:18 INFO mapreduce.ImportJobBase: Transferred 451 bytes in 33.7555 seconds (13.3608 bytes/sec)
16/10/02 18:59:18 INFO mapreduce.ImportJobBase: Retrieved 20 records.
[hdfs@localhost:/sqoop]$ 

Note that the last line of the output above shows that 20 records were retrieved, which corresponds to the 20 records in the table in the PostgreSQL database.
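As a quick sanity check (a minimal sketch, assuming the import wrote the default part-m-* files that we will list in a moment), we can also count the imported lines directly on the HDFS and expect the result to match:

[hdfs@localhost:/sqoop]$ hdfs dfs -cat sales/part-m-* | wc -l
20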

After executing the Sqoop command, we can execute the hdfs dfs -ls command to see the directory that was created by default with the table name on the HDFS.

[hdfs@localhost:/sqoop]$ hdfs dfs -ls
Found 1 items
drwxrwxrwx   - toptal data          0 2016-10-02 18:59 sales
[hdfs@localhost:/sqoop]$

We can use the hdfs dfs -ls command to list the contents of the sales directory. If you look on the HDFS, you can notice that the data is partitioned and spread across four files by default, not just contained in one.

[hdfs@localhost:/sqoop]$ hdfs dfs -ls sales
Found 6 items
-rw-rw-rw-   1 toptal data          0 2016-10-02 18:59 sales/_SUCCESS
drwxrwxrwx   - toptal data          0 2016-10-02 18:58 sales/_logs
-rw-rw-rw-   1 toptal data        110 2016-10-02 18:59 sales/part-m-00000
-rw-rw-rw-   1 toptal data        111 2016-10-02 18:59 sales/part-m-00001
-rw-rw-rw-   1 toptal data        115 2016-10-02 18:59 sales/part-m-00002
-rw-rw-rw-   1 toptal data        115 2016-10-02 18:59 sales/part-m-00003
[hdfs@localhost:/sqoop]$ 

The hdfs dfs -cat command will display all of the records in the first partition of the sales data on the HDFS.

[hdfs@localhost:/sqoop]$ hdfs dfs -cat sales/part-m-00000
1,2016-09-27,1.23,1,1
2,2016-09-27,2.34,1,2
3,2016-09-27,1.23,2,1
4,2016-09-27,2.34,2,2
5,2016-09-27,3.45,2,3
[hdfs@localhost:/sqoop]$

Note that the default file delimiter is a comma. Also, notice that there are only five rows in each partition, because the 20 rows in the source table have been equally distributed across the four partitions.
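Both the delimiter and the number of output files can be changed at import time, if needed. The following re-run is only a sketch: the pipe delimiter, the two-mapper setting, and the salesPipeDelimited target directory are arbitrary choices for illustration (the target directory must not already exist on the HDFS):

sqoop import --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--table 'sales' \
--target-dir 'salesPipeDelimited' \
--split-by 'pksales' \
--fields-terminated-by '|' \
--num-mappers 2

With --num-mappers 2, the 20 rows would land in two part-m files instead of four, and each field would be separated by a pipe character instead of a comma.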

To limit the number of rows that are output to the screen, we can pipe the output of the cat command to the head command and check the contents of the other three partitions, as shown below.

The -n 5 argument to the head command limits the screen output to the first five rows.

(Note that in our case, this is unnecessary since there are only five rows in each partition to begin with. In practice, though, you will probably have many more rows than this in each partition and will want to just check the first few to make sure that they look right, and this shows you how to do that.)

[hdfs@localhost:/sqoop]$ hdfs dfs -cat sales/part-m-00001 | head -n 5
6,2016-09-28,3.45,3,3
7,2016-09-28,4.56,3,4
8,2016-09-28,5.67,3,5
9,2016-09-28,1.23,4,1
10,2016-09-28,1.23,5,1
[hdfs@localhost:/sqoop]$ hdfs dfs -cat sales/part-m-00002 | head -n 5
11,2016-09-28,1.23,6,1
12,2016-09-29,1.23,7,1
13,2016-09-29,2.34,7,2
14,2016-09-29,3.45,7,3
15,2016-09-29,4.56,7,4
[hdfs@localhost:/sqoop]$ hdfs dfs -cat sales/part-m-00003 | head -n 5
16,2016-09-29,5.67,7,5
17,2016-09-29,6.78,7,6
18,2016-09-29,7.89,7,7
19,2016-09-29,7.89,8,7
20,2016-09-30,1.23,9,1
[hdfs@localhost:/sqoop]$

If you need to run a query to extract data from multiple tables in the PostgreSQL database, this can be accomplished with the following command:

[hdfs@localhost:/sqoop]$ cat sqoopCommand.sh
sqoop import --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--target-dir 'creditCardOrders' \
--split-by 'pksales' \
--query "select s.pksales, s.saledate, s.saleamount, o.shippingtype, o.methodofpayment from sales s inner join orders o on s.orderid=o.orderid where o.methodofpayment='credit card' and \$CONDITIONS"
[hdfs@localhost:/sqoop]$

In the above command, we use some of the same arguments from the earlier Sqoop command, but they take on a different importance when used with a SQL query.

  • --target-dir - The target directory tells Sqoop in which directory on the HDFS to store the selected data. This argument is required by Sqoop when using a free-form query.
  • --split-by - Even though we are selecting the primary key of the sales table, we still have to provide Sqoop with a unique identifier to help it distribute the workload.
  • --query - This is the argument in which we supply the SQL query. The query above is enclosed in double quotes. Notice that there is not a backslash (the line continuation character) in the multiple lines containing the query. Also notice the and \$CONDITIONS at the end of the WHERE clause. This is required by Sqoop because Sqoop will automatically replace the $CONDITIONS token with a unique expression (a rough illustration follows this list).
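To make the $CONDITIONS mechanism more concrete, here is a rough illustration of the per-mapper statements Sqoop generates behind the scenes; the boundary values below are made up for illustration and depend on your data and the number of mappers:

-- Each mapper runs the free-form query with $CONDITIONS replaced by a range
-- predicate on the --split-by column, for example:
select s.pksales, s.saledate, s.saleamount, o.shippingtype, o.methodofpayment
from sales s inner join orders o on s.orderid=o.orderid
where o.methodofpayment='credit card' and ( pksales >= 1 ) AND ( pksales < 6 )

Sqoop determines those ranges by first running a bounding query that selects the minimum and maximum of the --split-by column over the same free-form query, with $CONDITIONS replaced by a condition that is always true.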

Question or No Question: You Should Consider HDFS

HDFS has many advantages over a relational database. If you are doing data analysis, you should consider migrating your data to HDFS, today.

With the skills learned here, importing data from a relational database system into HDFS is a simple and straightforward process that can be accomplished with a single command. While these examples have a small number of rows, the mechanics of importing large volumes of data to HDFS from a PostgreSQL database table remain the same.

You can even experiment with importing large tables and varying storage delimiters. Using Apache Sqoop is more efficient than exporting the database data to a file, transferring the file from the database server to the HDFS, and then loading the file into the HDFS.
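The reverse direction mentioned in the introduction works in much the same way. As a hedged sketch (it assumes an empty salesCopy table with a matching column layout already exists in the PostgreSQL database), exporting the files we just imported back into PostgreSQL would look something like this:

sqoop export --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--table 'salesCopy' \
--export-dir 'sales'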
