Column chunks
Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over pages they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata.
https://parquet.apache.org/documentation/latest/
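The column-chunk and page metadata described above can be inspected directly with parquet-mr. A minimal sketch, assuming parquet-mr and Hadoop are on the classpath; the file path comes from the command line, and readFooter is the older (now deprecated) footer API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class InspectColumnChunks
{
    public static void main(String[] args) throws Exception
    {
        // Path to a Parquet file (assumption: passed as the first argument)
        Path path = new Path(args[0]);
        ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), path);
        for (BlockMetaData rowGroup : footer.getBlocks()) {
            for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
                // Each column chunk records its codec, encodings and value count
                System.out.printf("%s codec=%s encodings=%s values=%d%n",
                        chunk.getPath(), chunk.getCodec(), chunk.getEncodings(), chunk.getValueCount());
            }
        }
    }
}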
https://events.static.linuxfound.org/sites/events/files/slides/Presto.pdf
(1) read only required columns in Parquet and build columnar blocks on the fly, saving CPU and memory to transform row-based Parquet records into columnar blocks, and (2) evaluate the predicate using columnar blocks in the Presto engine.
Engineering Data Analytics with Presto and Parquet at Uber
New Hive Parquet Reader
We have added a new Parquet reader implementation. The new reader supports vectorized reads, lazy loading, and predicate push down, all of which make the reader more efficient and typically reduce wall clock time for a query. Although the new reader has been heavily tested, it is an extensive rewrite of the Apache Hive Parquet reader, and may have some latent issues, so it is not enabled by default. If you are using Parquet we suggest you test out the new reader on a per-query basis by setting the <hive-catalog>.parquet_optimized_reader_enabled session property, or you can enable the reader by default by setting the Hive catalog property hive.parquet-optimized-reader.enabled=true. To enable Parquet predicate push down there is a separate session property <hive-catalog>.parquet_predicate_pushdown_enabled and configuration property hive.parquet-predicate-pushdown.enabled=true.
https://prestodb.io/docs/current/release/release-0.138.html
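Concretely, assuming the Hive connector catalog is named hive (the catalog name and file path below are assumptions; the property names come from the release note above), enabling the new reader might look like this:

# etc/catalog/hive.properties -- enable the new reader and predicate pushdown by default
hive.parquet-optimized-reader.enabled=true
hive.parquet-predicate-pushdown.enabled=true

-- or per query, via session properties
SET SESSION hive.parquet_optimized_reader_enabled = true;
SET SESSION hive.parquet_predicate_pushdown_enabled = true;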
Hadoop Internals for Oracle Developers and DBAs: Strata Conference + Hadoop World 2013 - O'Reilly Conferences, October 28 - 30, 2013, New York, NY
package com.facebook.presto.parquet.reader;

import com.facebook.presto.parquet.RichColumnDescriptor;
import com.facebook.presto.spi.block.BlockBuilder;
import com.facebook.presto.spi.type.Type;
import io.airlift.slice.Slice;
import parquet.io.api.Binary;

import static com.facebook.presto.spi.type.Chars.isCharType;
import static com.facebook.presto.spi.type.Chars.truncateToLengthAndTrimSpaces;
import static com.facebook.presto.spi.type.Varchars.isVarcharType;
import static com.facebook.presto.spi.type.Varchars.truncateToLength;
import static io.airlift.slice.Slices.EMPTY_SLICE;
import static io.airlift.slice.Slices.wrappedBuffer;

public class BinaryColumnReader
        extends PrimitiveColumnReader
{
    public BinaryColumnReader(RichColumnDescriptor descriptor)
    {
        super(descriptor);
    }

    @Override
    protected void readValue(BlockBuilder blockBuilder, Type type)
    {
        if (definitionLevel == columnDescriptor.getMaxDefinitionLevel()) {
            Binary binary = valuesReader.readBytes();
            Slice value;
            if (binary.length() == 0) {
                value = EMPTY_SLICE;
            }
            else {
                value = wrappedBuffer(binary.getBytes());
            }
            if (isVarcharType(type)) {
                value = truncateToLength(value, type);
            }
            if (isCharType(type)) {
                value = truncateToLengthAndTrimSpaces(value, type);
            }
            type.writeSlice(blockBuilder, value);
        }
        else if (isValueNull()) {
            blockBuilder.appendNull();
        }
    }

    @Override
    protected void skipValue()
    {
        if (definitionLevel == columnDescriptor.getMaxDefinitionLevel()) {
            valuesReader.readBytes();
        }
    }
}
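The same readValue()/skipValue() contract is followed by the other primitive readers. A sketch of what an INT32 reader looks like under that contract, reconstructed from the pattern above (not copied from the Presto source; the real IntColumnReader may differ):

// Sketch only: an INT32 reader written against the same readValue()/skipValue()
// contract, assuming the same package and imports as BinaryColumnReader above.
public class IntColumnReaderSketch
        extends PrimitiveColumnReader
{
    public IntColumnReaderSketch(RichColumnDescriptor descriptor)
    {
        super(descriptor);
    }

    @Override
    protected void readValue(BlockBuilder blockBuilder, Type type)
    {
        if (definitionLevel == columnDescriptor.getMaxDefinitionLevel()) {
            // value is present at this position: decode it and append to the block
            type.writeLong(blockBuilder, valuesReader.readInteger());
        }
        else if (isValueNull()) {
            // definition level says the value is missing: append a null
            blockBuilder.appendNull();
        }
    }

    @Override
    protected void skipValue()
    {
        if (definitionLevel == columnDescriptor.getMaxDefinitionLevel()) {
            // consume the encoded value without materializing it
            valuesReader.readInteger();
        }
    }
}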
package com.facebook.presto.parquet.reader;

import com.facebook.presto.parquet.DataPage;
import com.facebook.presto.parquet.DataPageV1;
import com.facebook.presto.parquet.DataPageV2;
import com.facebook.presto.parquet.DictionaryPage;
import com.facebook.presto.parquet.Field;
import com.facebook.presto.parquet.ParquetEncoding;
import com.facebook.presto.parquet.ParquetTypeUtils;
import com.facebook.presto.parquet.RichColumnDescriptor;
import com.facebook.presto.parquet.dictionary.Dictionary;
import com.facebook.presto.spi.PrestoException;
import com.facebook.presto.spi.block.BlockBuilder;
import com.facebook.presto.spi.type.DecimalType;
import com.facebook.presto.spi.type.Type;
import io.airlift.slice.Slice;
import it.unimi.dsi.fastutil.ints.IntArrayList;
import it.unimi.dsi.fastutil.ints.IntList;
import parquet.bytes.BytesUtils;
import parquet.column.ColumnDescriptor;
import parquet.column.values.ValuesReader;
import parquet.column.values.rle.RunLengthBitPackingHybridDecoder;
import parquet.io.ParquetDecodingException;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Optional;
import java.util.function.Consumer;

import static com.facebook.presto.parquet.ParquetTypeUtils.createDecimalType;
import static com.facebook.presto.parquet.ValuesType.DEFINITION_LEVEL;
import static com.facebook.presto.parquet.ValuesType.REPETITION_LEVEL;
import static com.facebook.presto.parquet.ValuesType.VALUES;
import static com.facebook.presto.spi.StandardErrorCode.NOT_SUPPORTED;
import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Verify.verify;
import static java.util.Objects.requireNonNull;

public abstract class PrimitiveColumnReader
{
    private static final int EMPTY_LEVEL_VALUE = -1;

    protected final RichColumnDescriptor columnDescriptor;

    protected int definitionLevel = EMPTY_LEVEL_VALUE;
    protected int repetitionLevel = EMPTY_LEVEL_VALUE;
    protected ValuesReader valuesReader;

    private int nextBatchSize;
    private LevelReader repetitionReader;
    private LevelReader definitionReader;
    private long totalValueCount;
    private PageReader pageReader;
    private Dictionary dictionary;
    private int currentValueCount;
    private DataPage page;
    private int remainingValueCountInPage;
    private int readOffset;

    protected abstract void readValue(BlockBuilder blockBuilder, Type type);

    protected abstract void skipValue();

    protected boolean isValueNull()
    {
        return ParquetTypeUtils.isValueNull(columnDescriptor.isRequired(), definitionLevel, columnDescriptor.getMaxDefinitionLevel());
    }

    public static PrimitiveColumnReader createReader(RichColumnDescriptor descriptor)
    {
        switch (descriptor.getType()) {
            case BOOLEAN:
                return new BooleanColumnReader(descriptor);
            case INT32:
                return createDecimalColumnReader(descriptor).orElse(new IntColumnReader(descriptor));
            case INT64:
                return createDecimalColumnReader(descriptor).orElse(new LongColumnReader(descriptor));
            case INT96:
                return new TimestampColumnReader(descriptor);
            case FLOAT:
                return new FloatColumnReader(descriptor);
            case DOUBLE:
                return new DoubleColumnReader(descriptor);
            case BINARY:
                return createDecimalColumnReader(descriptor).orElse(new BinaryColumnReader(descriptor));
            case FIXED_LEN_BYTE_ARRAY:
                return createDecimalColumnReader(descriptor)
                        .orElseThrow(() -> new PrestoException(NOT_SUPPORTED, " type FIXED_LEN_BYTE_ARRAY supported as DECIMAL; got " + descriptor.getPrimitiveType().getOriginalType()));
            default:
                throw new PrestoException(NOT_SUPPORTED, "Unsupported parquet type: " + descriptor.getType());
        }
    }

    private static Optional<PrimitiveColumnReader> createDecimalColumnReader(RichColumnDescriptor descriptor)
    {
        Optional<Type> type = createDecimalType(descriptor);
        if (type.isPresent()) {
            DecimalType decimalType = (DecimalType) type.get();
            return Optional.of(DecimalColumnReaderFactory.createReader(descriptor, decimalType.getPrecision(), decimalType.getScale()));
        }
        return Optional.empty();
    }

    public PrimitiveColumnReader(RichColumnDescriptor columnDescriptor)
    {
        this.columnDescriptor = requireNonNull(columnDescriptor, "columnDescriptor");
        pageReader = null;
    }

    public PageReader getPageReader()
    {
        return pageReader;
    }

    public void setPageReader(PageReader pageReader)
    {
        this.pageReader = requireNonNull(pageReader, "pageReader");
        DictionaryPage dictionaryPage = pageReader.readDictionaryPage();

        if (dictionaryPage != null) {
            try {
                dictionary = dictionaryPage.getEncoding().initDictionary(columnDescriptor, dictionaryPage);
            }
            catch (IOException e) {
                throw new ParquetDecodingException("could not decode the dictionary for " + columnDescriptor, e);
            }
        }
        else {
            dictionary = null;
        }
        checkArgument(pageReader.getTotalValueCount() > 0, "page is empty");
        totalValueCount = pageReader.getTotalValueCount();
    }

    public void prepareNextRead(int batchSize)
    {
        readOffset = readOffset + nextBatchSize;
        nextBatchSize = batchSize;
    }

    public ColumnDescriptor getDescriptor()
    {
        return columnDescriptor;
    }

    public ColumnChunk readPrimitive(Field field)
            throws IOException
    {
        IntList definitionLevels = new IntArrayList();
        IntList repetitionLevels = new IntArrayList();
        seek();
        BlockBuilder blockBuilder = field.getType().createBlockBuilder(null, nextBatchSize);
        int valueCount = 0;
        while (valueCount < nextBatchSize) {
            if (page == null) {
                readNextPage();
            }
            int valuesToRead = Math.min(remainingValueCountInPage, nextBatchSize - valueCount);
            readValues(blockBuilder, valuesToRead, field.getType(), definitionLevels, repetitionLevels);
            valueCount += valuesToRead;
        }
        checkArgument(valueCount == nextBatchSize, "valueCount %s not equals to batchSize %s", valueCount, nextBatchSize);

        readOffset = 0;
        nextBatchSize = 0;
        return new ColumnChunk(blockBuilder.build(), definitionLevels.toIntArray(), repetitionLevels.toIntArray());
    }

    private void readValues(BlockBuilder blockBuilder, int valuesToRead, Type type, IntList definitionLevels, IntList repetitionLevels)
    {
        processValues(valuesToRead, ignored -> {
            readValue(blockBuilder, type);
            definitionLevels.add(definitionLevel);
            repetitionLevels.add(repetitionLevel);
        });
    }

    private void skipValues(int valuesToRead)
    {
        processValues(valuesToRead, ignored -> skipValue());
    }

    private void processValues(int valuesToRead, Consumer<Void> valueConsumer)
    {
        if (definitionLevel == EMPTY_LEVEL_VALUE && repetitionLevel == EMPTY_LEVEL_VALUE) {
            definitionLevel = definitionReader.readLevel();
            repetitionLevel = repetitionReader.readLevel();
        }
        int valueCount = 0;
        for (int i = 0; i < valuesToRead; i++) {
            do {
                valueConsumer.accept(null);
                valueCount++;
                if (valueCount == remainingValueCountInPage) {
                    updateValueCounts(valueCount);
                    if (!readNextPage()) {
                        return;
                    }
                    valueCount = 0;
                }
                repetitionLevel = repetitionReader.readLevel();
                definitionLevel = definitionReader.readLevel();
            }
            while (repetitionLevel != 0);
        }
        updateValueCounts(valueCount);
    }

    private void seek()
    {
        checkArgument(currentValueCount <= totalValueCount, "Already read all values in column chunk");
        if (readOffset == 0) {
            return;
        }
        int valuePosition = 0;
        while (valuePosition < readOffset) {
            if (page == null) {
                readNextPage();
            }
            int offset = Math.min(remainingValueCountInPage, readOffset - valuePosition);
            skipValues(offset);
            valuePosition = valuePosition + offset;
        }
        checkArgument(valuePosition == readOffset, "valuePosition %s must be equal to readOffset %s", valuePosition, readOffset);
    }

    private boolean readNextPage()
    {
        verify(page == null, "readNextPage has to be called when page is null");
        page = pageReader.readPage();
        if (page == null) {
            return false;
        }
        remainingValueCountInPage = page.getValueCount();
        if (page instanceof DataPageV1) {
            valuesReader = readPageV1((DataPageV1) page);
        }
        else {
            valuesReader = readPageV2((DataPageV2) page);
        }
        return true;
    }

    private void updateValueCounts(int valuesRead)
    {
        if (valuesRead == remainingValueCountInPage) {
            page = null;
            valuesReader = null;
        }
        remainingValueCountInPage -= valuesRead;
        currentValueCount += valuesRead;
    }

    private ValuesReader readPageV1(DataPageV1 page)
    {
        ValuesReader rlReader = page.getRepetitionLevelEncoding().getValuesReader(columnDescriptor, REPETITION_LEVEL);
        ValuesReader dlReader = page.getDefinitionLevelEncoding().getValuesReader(columnDescriptor, DEFINITION_LEVEL);
        repetitionReader = new LevelValuesReader(rlReader);
        definitionReader = new LevelValuesReader(dlReader);
        try {
            byte[] bytes = page.getSlice().getBytes();
            rlReader.initFromPage(page.getValueCount(), bytes, 0);
            int offset = rlReader.getNextOffset();
            dlReader.initFromPage(page.getValueCount(), bytes, offset);
            offset = dlReader.getNextOffset();
            return initDataReader(page.getValueEncoding(), bytes, offset, page.getValueCount());
        }
        catch (IOException e) {
            throw new ParquetDecodingException("Error reading parquet page " + page + " in column " + columnDescriptor, e);
        }
    }

    private ValuesReader readPageV2(DataPageV2 page)
    {
        repetitionReader = buildLevelRLEReader(columnDescriptor.getMaxRepetitionLevel(), page.getRepetitionLevels());
        definitionReader = buildLevelRLEReader(columnDescriptor.getMaxDefinitionLevel(), page.getDefinitionLevels());
        return initDataReader(page.getDataEncoding(), page.getSlice().getBytes(), 0, page.getValueCount());
    }

    private LevelReader buildLevelRLEReader(int maxLevel, Slice slice)
    {
        if (maxLevel == 0) {
            return new LevelNullReader();
        }
        return new LevelRLEReader(new RunLengthBitPackingHybridDecoder(BytesUtils.getWidthFromMaxInt(maxLevel), new ByteArrayInputStream(slice.getBytes())));
    }

    private ValuesReader initDataReader(ParquetEncoding dataEncoding, byte[] bytes, int offset, int valueCount)
    {
        ValuesReader valuesReader;
        if (dataEncoding.usesDictionary()) {
            if (dictionary == null) {
                throw new ParquetDecodingException("Dictionary is missing for Page");
            }
            valuesReader = dataEncoding.getDictionaryBasedValuesReader(columnDescriptor, VALUES, dictionary);
        }
        else {
            valuesReader = dataEncoding.getValuesReader(columnDescriptor, VALUES);
        }

        try {
            valuesReader.initFromPage(valueCount, bytes, offset);
            return valuesReader;
        }
        catch (IOException e) {
            throw new ParquetDecodingException("Error reading parquet page in column " + columnDescriptor, e);
        }
    }
}
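Put together, the surrounding ParquetReader drives a PrimitiveColumnReader roughly like this. A simplified sketch of the calling sequence only, not the actual ParquetReader source; pageReader, columnDescriptor and field are assumed to be set up already:

// Simplified sketch (not the actual ParquetReader code).
PrimitiveColumnReader columnReader = PrimitiveColumnReader.createReader(columnDescriptor);
columnReader.setPageReader(pageReader);       // loads the dictionary page, if any
columnReader.prepareNextRead(1024);           // declare how many values the next batch needs
ColumnChunk columnChunk = columnReader.readPrimitive(field);  // decodes pages until the batch is full
Block block = columnChunk.getBlock();         // columnar block plus definition/repetition levels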
package com.facebook.presto.hive.parquet;

import com.facebook.presto.hive.HiveColumnHandle;
import com.facebook.presto.parquet.Field;
import com.facebook.presto.parquet.ParquetCorruptionException;
import com.facebook.presto.parquet.reader.ParquetReader;
import com.facebook.presto.spi.ConnectorPageSource;
import com.facebook.presto.spi.Page;
import com.facebook.presto.spi.PrestoException;
import com.facebook.presto.spi.block.Block;
import com.facebook.presto.spi.block.LazyBlock;
import com.facebook.presto.spi.block.LazyBlockLoader;
import com.facebook.presto.spi.block.RunLengthEncodedBlock;
import com.facebook.presto.spi.predicate.TupleDomain;
import com.facebook.presto.spi.type.Type;
import com.facebook.presto.spi.type.TypeManager;
import com.google.common.collect.ImmutableList;
import parquet.io.MessageColumnIO;
import parquet.schema.MessageType;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Optional;
import java.util.Properties;

import static com.facebook.presto.hive.HiveColumnHandle.ColumnType.REGULAR;
import static com.facebook.presto.hive.HiveErrorCode.HIVE_BAD_DATA;
import static com.facebook.presto.hive.HiveErrorCode.HIVE_CURSOR_ERROR;
import static com.facebook.presto.hive.parquet.ParquetPageSourceFactory.getParquetType;
import static com.facebook.presto.parquet.ParquetTypeUtils.getFieldIndex;
import static com.facebook.presto.parquet.ParquetTypeUtils.lookupColumnByName;
import static com.google.common.base.Preconditions.checkState;
import static java.util.Objects.requireNonNull;
import static parquet.io.ColumnIOConverter.constructField;

public class ParquetPageSource
        implements ConnectorPageSource
{
    private static final int MAX_VECTOR_LENGTH = 1024;

    private final ParquetReader parquetReader;
    private final MessageType fileSchema;
    private final List<String> columnNames;
    private final List<Type> types;
    private final List<Optional<Field>> fields;

    private final Block[] constantBlocks;
    private final int[] hiveColumnIndexes;

    private int batchId;
    private boolean closed;
    private long readTimeNanos;
    private final boolean useParquetColumnNames;

    public ParquetPageSource(
            ParquetReader parquetReader,
            MessageType fileSchema,
            MessageColumnIO messageColumnIO,
            TypeManager typeManager,
            Properties splitSchema,
            List<HiveColumnHandle> columns,
            TupleDomain<HiveColumnHandle> effectivePredicate,
            boolean useParquetColumnNames)
    {
        requireNonNull(splitSchema, "splitSchema is null");
        requireNonNull(columns, "columns is null");
        requireNonNull(effectivePredicate, "effectivePredicate is null");
        this.parquetReader = requireNonNull(parquetReader, "parquetReader is null");
        this.fileSchema = requireNonNull(fileSchema, "fileSchema is null");
        this.useParquetColumnNames = useParquetColumnNames;

        int size = columns.size();
        this.constantBlocks = new Block[size];
        this.hiveColumnIndexes = new int[size];

        ImmutableList.Builder<String> namesBuilder = ImmutableList.builder();
        ImmutableList.Builder<Type> typesBuilder = ImmutableList.builder();
        ImmutableList.Builder<Optional<Field>> fieldsBuilder = ImmutableList.builder();
        for (int columnIndex = 0; columnIndex < size; columnIndex++) {
            HiveColumnHandle column = columns.get(columnIndex);
            checkState(column.getColumnType() == REGULAR, "column type must be regular");

            String name = column.getName();
            Type type = typeManager.getType(column.getTypeSignature());

            namesBuilder.add(name);
            typesBuilder.add(type);
            hiveColumnIndexes[columnIndex] = column.getHiveColumnIndex();

            if (getParquetType(column, fileSchema, useParquetColumnNames) == null) {
                constantBlocks[columnIndex] = RunLengthEncodedBlock.create(type, null, MAX_VECTOR_LENGTH);
                fieldsBuilder.add(Optional.empty());
            }
            else {
                String columnName = useParquetColumnNames ? name : fileSchema.getFields().get(column.getHiveColumnIndex()).getName();
                fieldsBuilder.add(constructField(type, lookupColumnByName(messageColumnIO, columnName)));
            }
        }
        types = typesBuilder.build();
        fields = fieldsBuilder.build();
        columnNames = namesBuilder.build();
    }

    @Override
    public long getCompletedBytes()
    {
        return parquetReader.getDataSource().getReadBytes();
    }

    @Override
    public long getReadTimeNanos()
    {
        return readTimeNanos;
    }

    @Override
    public boolean isFinished()
    {
        return closed;
    }

    @Override
    public long getSystemMemoryUsage()
    {
        return parquetReader.getSystemMemoryContext().getBytes();
    }

    @Override
    public Page getNextPage()
    {
        try {
            batchId++;
            long start = System.nanoTime();

            int batchSize = parquetReader.nextBatch();

            readTimeNanos += System.nanoTime() - start;

            if (closed || batchSize <= 0) {
                close();
                return null;
            }

            Block[] blocks = new Block[hiveColumnIndexes.length];
            for (int fieldId = 0; fieldId < blocks.length; fieldId++) {
                if (constantBlocks[fieldId] != null) {
                    blocks[fieldId] = constantBlocks[fieldId].getRegion(0, batchSize);
                }
                else {
                    Type type = types.get(fieldId);
                    Optional<Field> field = fields.get(fieldId);
                    int fieldIndex;
                    if (useParquetColumnNames) {
                        fieldIndex = getFieldIndex(fileSchema, columnNames.get(fieldId));
                    }
                    else {
                        fieldIndex = hiveColumnIndexes[fieldId];
                    }
                    if (fieldIndex != -1 && field.isPresent()) {
                        blocks[fieldId] = new LazyBlock(batchSize, new ParquetBlockLoader(field.get()));
                    }
                    else {
                        blocks[fieldId] = RunLengthEncodedBlock.create(type, null, batchSize);
                    }
                }
            }
            return new Page(batchSize, blocks);
        }
        catch (PrestoException e) {
            closeWithSuppression(e);
            throw e;
        }
        catch (RuntimeException e) {
            closeWithSuppression(e);
            throw new PrestoException(HIVE_CURSOR_ERROR, e);
        }
    }

    private void closeWithSuppression(Throwable throwable)
    {
        requireNonNull(throwable, "throwable is null");
        try {
            close();
        }
        catch (RuntimeException e) {
            if (e != throwable) {
                throwable.addSuppressed(e);
            }
        }
    }

    @Override
    public void close()
    {
        if (closed) {
            return;
        }
        closed = true;

        try {
            parquetReader.close();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private final class ParquetBlockLoader
            implements LazyBlockLoader<LazyBlock>
    {
        private final int expectedBatchId = batchId;
        private final Field field;
        private boolean loaded;

        public ParquetBlockLoader(Field field)
        {
            this.field = requireNonNull(field, "field is null");
        }

        @Override
        public final void load(LazyBlock lazyBlock)
        {
            if (loaded) {
                return;
            }

            checkState(batchId == expectedBatchId);

            try {
                Block block = parquetReader.readBlock(field);
                lazyBlock.setBlock(block);
            }
            catch (ParquetCorruptionException e) {
                throw new PrestoException(HIVE_BAD_DATA, e);
            }
            catch (IOException e) {
                throw new PrestoException(HIVE_CURSOR_ERROR, e);
            }
            loaded = true;
        }
    }
}
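This is where the "lazy loading" from the release note lives: getNextPage() hands back LazyBlocks, and a column is only decoded when the engine actually touches it. A hedged sketch of what that looks like at the call site (pageSource is an assumed variable holding a ParquetPageSource):

// Hedged sketch: no Parquet decoding happens until the block is loaded.
Page page = pageSource.getNextPage();
Block lazy = page.getBlock(0);            // still lazy; nothing has been read from the column chunk yet
Block loaded = lazy.getLoadedBlock();     // triggers ParquetBlockLoader.load() -> parquetReader.readBlock(field)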