
[Hadoop] Where Cloudera Altus Director writes its logs


On macOS Sierra, Cloudera Altus Director 6.0 writes its logs to the following locations.

  • /usr/local/Cellar/cloudera-director-client/6.0.0/libexec/logs
  • /usr/local/Cellar/cloudera-director-server/6.0.0/libexec/logs

Below are my notes from tracking this down.

  • /usr/local/bin/cloudera-director-server-start
SOURCE=$0 ★
while [ -h "$SOURCE" ]; do
    LOOKUP=$(ls -ld "$SOURCE")
    TARGET=$(expr "$LOOKUP" : '.*-> \(.*\)$')
    if expr "${TARGET:-.}/" : '/.*/$' > /dev/null; then
        SOURCE=${TARGET:-.}
    else
        SOURCE=$(dirname "$SOURCE")/${TARGET:-.}
    fi
done
CLOUDERA_DIRECTOR_HOME=$(cd "$(dirname "$SOURCE")"; cd ..; pwd) ★
cd "${CLOUDERA_DIRECTOR_HOME}" || exit 1

if [ -f "/etc/default/cloudera-director-server" ]; then
  # shellcheck disable=SC1091
  . /etc/default/cloudera-director-server
fi

SERVER_OUTPUT="${CLOUDERA_DIRECTOR_HOME}/logs/output.txt"
PID_DIR=${DIRECTOR_SERVER_PID_DIR:-$CLOUDERA_DIRECTOR_HOME}
PID_FILE="${PID_DIR}/application.pid"

# Do a quick check to see if the process is already running

if [ -f "${PID_FILE}" ] && kill -0 "$(cat "${PID_FILE}")"; then
    echo "Cloudera Altus Director Server is already running"
    exit 0
fi

# If not start in the background

# Wait for it to start (or exit on failure)
# We will perform up to MAX_ITERATIONS iterations of 3 seconds waits
MAX_ITERATIONS=${MAX_ITERATIONS:-10}
SLEEP_TIME=3
(( timeout = MAX_ITERATIONS * SLEEP_TIME ))

printf "Starting Cloudera Altus Director Server in background with timeout %d seconds ..." "${timeout}"

mkdir -p "${CLOUDERA_DIRECTOR_HOME}/logs" ★

$ which cloudera-director-server-start
/usr/local/bin/cloudera-director-server-start
$ ls -l /usr/local/bin/cloudera-director-server-start
lrwxr-xr-x  1 azekyohe  admin  75  9 21 16:03 /usr/local/bin/cloudera-director-server-start -> ../Cellar/cloudera-director-server/6.0.0/bin/cloudera-director-server-start
$ find /usr/local/Cellar -type d -name logs
/usr/local/Cellar/awscli/1.11.74/libexec/lib/python2.7/site-packages/awscli/examples/logs
/usr/local/Cellar/awscli/1.11.74/libexec/lib/python2.7/site-packages/botocore/data/logs
/usr/local/Cellar/awscli/1.11.74/share/awscli/examples/logs
/usr/local/Cellar/cloudera-director-client/6.0.0/libexec/logs
/usr/local/Cellar/cloudera-director-server/6.0.0/libexec/logs

[Hadoop] Cluster creation fails in Cloudera Altus Director


Symptom

Checking the log shows the message "In order to use this AWS Marketplace product you need to accept terms and subscribe. To do so please visit https://aws.amazon.com/marketplace/pp?sku=aw0evgkw8e5c1q413zgy5pjce".

  • /usr/local/Cellar/cloudera-director-server/6.0.0/libexec/logs/application.log
[2018-09-22 16:26:40.129 +0900] WARN  [p-4510363df311-AllocateInstances] 79706ff4-5da3-4981-8f05-7d8b9d1a54c9 POST /api/d6.0/import com.cloudera.launchpad.bootstrap.AllocateInstances$AllocateAndWaitForInstan
cesToRun - c.c.l.bootstrap.AllocateInstances: Error while attempting to allocate instances for group workers. Attempting to continue.
com.cloudera.director.spi.v2.model.exception.TransientProviderException: Problem allocating on-demand instances
        at com.cloudera.director.aws.AWSExceptions.propagate(AWSExceptions.java:137)
        at com.cloudera.director.aws.ec2.allocation.ondemand.OnDemandAllocator.allocateOnDemandInstances(OnDemandAllocator.java:307)
        at com.cloudera.director.aws.ec2.allocation.ondemand.OnDemandAllocator.allocate(OnDemandAllocator.java:100)
        at com.cloudera.director.aws.ec2.provider.EC2Provider.allocate(EC2Provider.java:582)
        at com.cloudera.director.aws.ec2.provider.EC2Provider.allocate(EC2Provider.java:1)
        at com.cloudera.launchpad.pluggable.compute.PluggableComputeProvider.allocate(PluggableComputeProvider.java:890)
        at com.cloudera.launchpad.pluggable.compute.PluggableComputeProvider.allocateInstancesForTemplate(PluggableComputeProvider.java:708)
        at com.cloudera.launchpad.pluggable.compute.PluggableComputeProvider.allocate(PluggableComputeProvider.java:644)
        at com.cloudera.launchpad.pluggable.compute.PluggableComputeProvider.allocate(PluggableComputeProvider.java:351)
        at com.cloudera.launchpad.bootstrap.AllocateInstances$AllocateAndWaitForInstancesToRun.run(AllocateInstances.java:228)
        at com.cloudera.launchpad.bootstrap.AllocateInstances$AllocateAndWaitForInstancesToRun.run(AllocateInstances.java:203)
        at com.cloudera.launchpad.pipeline.job.Job3.runUnchecked(Job3.java:32)
        at com.cloudera.launchpad.pipeline.job.Job3$$FastClassBySpringCGLIB$$54178503.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:746)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:88)
        at com.cloudera.launchpad.pipeline.PipelineJobProfiler.profileJobRun(PipelineJobProfiler.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:644)
        at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:633)
        at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
        at com.cloudera.launchpad.bootstrap.AllocateInstances$AllocateAndWaitForInstancesToRun$$EnhancerBySpringCGLIB$$e12d2e0b.runUnchecked(<generated>)
        at com.cloudera.launchpad.pipeline.util.PipelineRunner$JobCallable.call(PipelineRunner.java:202)
        at com.cloudera.launchpad.pipeline.util.PipelineRunner$JobCallable.call(PipelineRunner.java:173)
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
        at com.github.rholder.retry.Retryer.call(Retryer.java:160)
        at com.cloudera.launchpad.pipeline.util.PipelineRunner.attemptMultipleJobExecutionsWithRetries(PipelineRunner.java:136)
        at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.doRun(DatabasePipelineRunner.java:214)
        at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:154)
        at com.cloudera.launchpad.ExceptionHandlingRunnable.run(ExceptionHandlingRunnable.java:57)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
        Suppressed: com.cloudera.launchpad.pluggable.common.ExceptionConditions$DetailHolderException: Exception details:
  key: null
    PluginExceptionCondition{type=ERROR, exceptionInfo={message=Encountered AWS exception, awsErrorCode=OptInRequired, awsErrorMessage=In order to use this AWS Marketplace product you need to accept terms and subscribe. To do so please visit https://aws.amazon.com/marketplace/pp?sku=aw0evgkw8e5c1q413zgy5pjce}} ★

Solution

Open the URL shown in the message, accept the AWS Marketplace terms and subscribe to the product, then run the cluster creation again.

Cloudera Altus Director complains "java.net.ConnectException: Connection refused" while creating a cluster


Symptom

While creating a cluster, Cloudera Altus Director fails with "java.net.ConnectException: Connection refused".

  • /usr/local/Cellar/cloudera-director-server/6.0.0/libexec/logs/application.log
[2018-09-23 02:16:09.087 +0900] ERROR [p-201411dce9c9-WaitForSshToSucceed] fc6ce2d0-02ab-4933-92db-6f1fe5d011ff POST /api/d6.0/import com.cloudera.launchpad.bootstrap.AllocateInstances$WaitForSshCredentialInstallation - c.c.l.pipeline.util.PipelineRunner: Attempt to execute job failed
java.lang.RuntimeException: java.net.ConnectException: Connection refused (Connection refused)
at com.google.common.base.Throwables.propagate(Throwables.java:241)
at com.cloudera.launchpad.sshj.SshJClient.connect(SshJClient.java:258)
at com.cloudera.launchpad.common.ssh.ForwardingSshClient.connect(ForwardingSshClient.java:68)
at com.cloudera.launchpad.bootstrap.AllocateInstances$WaitForSshCredentialInstallation.run(AllocateInstances.java:599)
at com.cloudera.launchpad.bootstrap.AllocateInstances$WaitForSshCredentialInstallation.run(AllocateInstances.java:568)
at com.cloudera.launchpad.pipeline.job.Job2.runUnchecked(Job2.java:31)
at com.cloudera.launchpad.pipeline.job.Job2$$FastClassBySpringCGLIB$$54178502.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:746)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:88)
at com.cloudera.launchpad.pipeline.PipelineJobProfiler.profileJobRun(PipelineJobProfiler.java:60)
at sun.reflect.GeneratedMethodAccessor384.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:644)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:633)
at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
at com.cloudera.launchpad.bootstrap.AllocateInstances$WaitForSshCredentialInstallation$$EnhancerBySpringCGLIB$$5e6d2516.runUnchecked(<generated>)
at com.cloudera.launchpad.pipeline.util.PipelineRunner$JobCallable.call(PipelineRunner.java:202)
at com.cloudera.launchpad.pipeline.util.PipelineRunner$JobCallable.call(PipelineRunner.java:173)
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
at com.github.rholder.retry.Retryer.call(Retryer.java:160)
at com.cloudera.launchpad.pipeline.util.PipelineRunner.attemptMultipleJobExecutionsWithRetries(PipelineRunner.java:136)
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.doRun(DatabasePipelineRunner.java:214)
at com.cloudera.launchpad.pipeline.DatabasePipelineRunner.run(DatabasePipelineRunner.java:154)
at com.cloudera.launchpad.ExceptionHandlingRunnable.run(ExceptionHandlingRunnable.java:57)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at net.schmizz.sshj.SocketClient.connect(SocketClient.java:126)
at com.cloudera.launchpad.sshj.SshJClient.attemptConnection(SshJClient.java:332)
at com.cloudera.launchpad.sshj.SshJClient.attemptConnection(SshJClient.java:307)
at com.cloudera.launchpad.sshj.SshJClient.access$000(SshJClient.java:67)
at com.cloudera.launchpad.sshj.SshJClient$1.call(SshJClient.java:245)
at com.cloudera.launchpad.sshj.SshJClient$1.call(SshJClient.java:240)
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
at com.github.rholder.retry.Retryer.call(Retryer.java:160)
at com.cloudera.launchpad.sshj.SshJClient.connect(SshJClient.java:240)
... 34 common frames omitted

Cause

  • The OS username was set to ec2-user even though a CentOS AMI was being used.

Solution

  • Change the OS username to centos.

f:id:yohei-a:20180923023130p:image:w640


Related

db tech showcase 2018 Day 2


Notes from db tech showcase 2018 Day 2, held on Thursday, September 20, 2018.


Pushing the limits of PostgreSQL with GPU and NVME - beyond 10 GB/s query processing

Overview
  • Speaker: 海外 浩平 (Kohei KaiGai), HeteroDB, Inc. - Chief Architect and CEO
  • Speaker bio: A Major Contributor in the PostgreSQL developer community who has contributed to core functionality such as security features, FDW and CustomScan. For the past several years he has been developing PG-Strom, a GPU-based query acceleration module, and in 2017 founded HeteroDB to bring the technology to practical use.
  • Abstract: This session first introduces "SSD-to-GPU Direct SQL execution", the core feature of PG-Strom, a GPU-based acceleration module for PostgreSQL, which couples the GPU tightly with NVME SSDs to achieve query processing throughput close to the limit of the PCIe bus. It then presents a new approach that combines this feature with PostgreSQL partitioning and I/O expansion boxes to reach processing performance far beyond previous limits, together with benchmark results.
Slides


SSD Q&A for database engineers

Overview
  • Speaker: 浅野 浩延 (Insight Technology, Inc.)
  • Speaker bio: Joined the Japanese subsidiary of the US company DEC (now HP) in 1987 and worked across the computer and storage systems business. Later handled sales and planning roles at SMART, Microsoft, PSTC, Solnac, Fixstars and others. Speaks at Nikkei BP seminars on SSDs and writes an online column for Nikkei xTECH. Since joining Insight Technology in July 2016 he has been involved in the Insight Qube business end to end, from product planning to support.
Slides
  • Not public
References

Apache Spark 2.3 and beyond - What's new? -

f:id:yohei-a:20180923050422j:image:w640

Overview
  • Speaker: Kousuke Saruta (NTT DATA Corporation - Technology Development Headquarters / Apache Spark committer)
  • Speaker bio: Has long been involved in technical support and development around OSS parallel distributed processing platforms such as Apache Hadoop and Spark. Recently he has been watching how OSS middleware evolves with paradigm shifts such as the arrival of new types of hardware.
  • Abstract: Apache Spark 2.3, released this February, added an unusually large number of major improvements, including Kubernetes support and a new execution mode for low-latency stream processing. In 2.4, expected shortly, work continues under the name "Project Hydrogen" on efficient data processing for machine learning / deep learning workloads. This session picks out the updates in Spark 2.3 and in 2.4 and beyond that have the biggest impact on users.
Slides
  • Not public


P.S.

Lunch was the soupless dandan noodles at 雲林坊 in Akihabara, for the first time in a while.

f:id:yohei-a:20180923050821j:image:w640

[AWS]Parquet


Results

  • Athena
# | Query                                                       | Execution time | I/O
1 | select count(*) from amazon_reviews_parquet                |                |
2 | select count(year) from amazon_reviews_parquet             |                |
3 | select count(review_body) from amazon_reviews_parquet      |                |
4 | select * from amazon_reviews_parquet limit 10000           |                |
5 | select year from amazon_reviews_parquet limit 10000        |                |
6 | select review_body from amazon_reviews_parquet limit 10000 |                |
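The execution-time and I/O columns above are meant to be filled in from the Athena console; as a rough sketch (assuming boto3 credentials, that the table lives in the default database, and a result bucket of your own — s3://my-athena-results/ below is a placeholder), the same numbers can also be pulled from the Athena API:

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

queries = [
    "select count(*) from amazon_reviews_parquet",
    "select count(year) from amazon_reviews_parquet",
    "select count(review_body) from amazon_reviews_parquet",
]

for sql in queries:
    # Start the query; the output location is a placeholder bucket.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        q = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if q["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    stats = q.get("Statistics", {})
    print(sql)
    print("  time(ms):", stats.get("EngineExecutionTimeInMillis"),
          "scanned(bytes):", stats.get("DataScannedInBytes"))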

Setup steps

$ aws s3 mb s3://amazon-reviews-pds-az
$ aws s3 cp --recursive s3://amazon-reviews-pds/ s3://amazon-reviews-pds-az
CREATE EXTERNAL TABLE amazon_reviews_parquet(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string, 
  star_rating int, 
  helpful_votes int, 
  total_votes int, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date bigint, 
  year int)
PARTITIONED BY (product_category string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://amazon-reviews-pds-az/parquet/';
MSCK REPAIR TABLE amazon_reviews_parquet;

Test matrix

  • Athena / PySpark
  • S3 / HDFS / local file system
  • Queries
select * from amazon_reviews_parquet limit 10000
select year from amazon_reviews_parquet limit 10000
select product_title from amazon_reviews_parquet limit 10000
select count(*) from amazon_reviews_parquet
select count(year) from amazon_reviews_parquet
select sum(year) from amazon_reviews_parquet
select * from amazon_reviews_parquet


References

[Parquet] Trying PyArrow on Amazon Linux


Notes from trying PyArrow on Amazon Linux.


Setup

  • Install PyArrow.
$ sudo pip install --upgrade pip
$ sudo yum install python36 python36-virtualenv python36-pip
$ sudo python3 -m pip install pandas pyarrow
  • Copy the data.
$ mkdir amazon-reviews-pds-az
$ cd amazon-reviews-pds-az/
$ aws s3 cp --recursive s3://amazon-reviews-pds/parquet ./
  • Create test.py.
#!/usr/bin/python

import os

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# "~" is not expanded automatically, so expand it explicitly.
path = os.path.expanduser(
    '~/amazon-reviews-pds-az/product_category=Apparel/'
    'part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet')

table = pq.read_table(path)
df = table.to_pandas()

print(len(df))
print(df.describe())

Run it

$ python3 test.py
589900
         star_rating  helpful_votes    total_votes           year
count  589900.000000  589900.000000  589900.000000  589900.000000
mean        4.105531       0.985847       1.179207    2013.943150
std         1.258572      10.724705      11.296609       1.374692
min         1.000000       0.000000       0.000000    2001.000000
25%         4.000000       0.000000       0.000000    2014.000000
50%         5.000000       0.000000       0.000000    2014.000000
75%         5.000000       0.000000       1.000000    2015.000000
max         5.000000    3846.000000    3882.000000    2015.000000
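Because Parquet is a columnar format, read_table can also be told to read only specific columns; a minimal sketch against the same file as above:

import os

import pyarrow.parquet as pq

path = os.path.expanduser(
    '~/amazon-reviews-pds-az/product_category=Apparel/'
    'part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet')

# Read only two small columns; the large review_body column is never touched.
table = pq.read_table(path, columns=['star_rating', 'year'])
print(table.num_rows, table.num_columns)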

Environment

  • Amazon Linux AMI release 2018.03 (4.14.62-65.117.amzn1.x86_64)

Related

[AWS] Querying Parquet files on HDFS from PySpark


Setup

  • Create an EMR cluster.
  • Allow ssh access in the EMR security group.
  • Log in to the master node over ssh.
$ ssh -i ~/us-east-1.pem hadoop@ec2-**-***-**-**.compute-1.amazonaws.com
  • Create a directory on HDFS and copy the data over from S3.
$ hadoop fs -mkdir /amazon-reviews-pds-az/
$ s3-dist-cp --src s3://amazon-reviews-pds/ --dest /amazon-reviews-pds-az/
  • Check the copied files.
$ hadoop fs -ls -h -R /amazon-reviews-pds-az

Run

  • Run the following code.
from pyspark.sql.types import *

df = sqlContext.read.parquet("/amazon-reviews-pds-az/parquet/")
df.createOrReplaceTempView("reviews")

print sqlContext.sql("SELECT * FROM reviews where product_category == 'Books'").count()

f:id:yohei-a:20180924023501p:image:w640
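A further sketch on the same temp view (assuming the same pyspark session as above); an aggregation like this only has to read the columns it references from the Parquet files:

# Count reviews per partition column; only the needed columns are read,
# not the whole rows.
sqlContext.sql("""
    SELECT product_category, count(*) AS cnt
    FROM reviews
    GROUP BY product_category
    ORDER BY cnt DESC
""").show(10)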



References

[Hadoop] Peeking at the OS layer beneath HDFS

Big Data Forensics: Learning Hadoop Investigations

  • HDFS collections through the host operating system

Targeted collection from a Hadoop client

The third method for collecting HDFS data from the host operating system is a targeted collection. The HDFS data is stored in defined locations within the host operating system. This data can be collected on a per-node basis through logical file copies. Every node needs to be collected to ensure the HDFS files can be reconstructed in the analysis phase.

The same process is conducted for both targeted collections and imaging collections, except for a couple of differences. With imaging collections, entire disk volumes are collected and hashed. Targeted collections involve the copying of individual files and directories. In both methods, the investigator collects the data, documents the process, and computes MD5/SHA-1 hash values. However, there are differences. In targeted collections, MD5/SHA-1 is computed on the files but not the volumes, the collection process requires multiple copies rather than a single image file, and certain metadata is not preserved. Also, investigators typically perform the targeted collection using scripts rather than manually typing the commands at runtime.

The first step for performing the targeted collection is to identify the location where the host operating system stores the HDFS files. For Linux, Unix, OS X, and other Unix variants, this can be found in the hdfs-site.xml file. While typically stored in the /etc/hadoop directory, it can be stored in other locations, so the investigator first needs to find this location before beginning. In Windows, this information is typically located in the Windows Hadoop installation directory c:\hadoop. To find the directory location from the command line, run the following command:


(snip)


The investigator should collect the entire DataNode tree structure. The structure is comprised of the following directories and files:

  • BP-<integer>-<IP Address>-<creation time>: This directory is the block pool that collects the blocks of data belonging to that DataNode.
  • finalized/rbw: The actual data blocks are stored in these directories. The finalized directory stores the blocks that have been completely written to disk. The rbw directory stands for replica being written and stores the blocks that are currently being written to HDFS.
  • VERSION: This text file stores property information. Each DataNode has a DataNode-wide VERSION file and also VERSION files for each block pool.
  • blk_<block ID>: The binary data blocks content files.
  • blk_<block ID>.meta: The binary data blocks metadata files.
  • dncp_block_verification: This file tracks the times in which the block was last verified via checksum.
  • in_use.lock: This is a lock file used by the DataNode process to prevent multiple DataNode processes from modifying the directory.
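The book notes that targeted collections are usually scripted; a minimal sketch of such a script (the data directory below is an assumption — use the dfs.data.dir value from your own hdfs-site.xml):

import hashlib
import os

DATA_DIR = "/mnt/hdfs"  # assumption: dfs.data.dir on this DataNode

def md5sum(path, chunk=1024 * 1024):
    """Hash a file in chunks so large block files do not have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Record path, size and MD5 for every block, block metadata and VERSION file.
for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if name.startswith("blk_") or name == "VERSION":
            full = os.path.join(root, name)
            print(md5sum(full), os.path.getsize(full), full)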

I took a quick look at this myself.

  • /etc/hadoop/conf/hdfs-site.xml
(snip)

  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/namenode</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hdfs</value>
  </property>

(snip)
  • Directory tree under /mnt/hdfs
[root@ip-***-**-*-133 hdfs]# tree -d /mnt/hdfs
/mnt/hdfs
└── current
    └── BP-747367826-172.31.6.167-1537719042716
        ├── current
        │   ├── finalized
        │   │   └── subdir0
        │   │       ├── subdir0
        │   │       ├── subdir1
        │   │       ├── subdir3
        │   │       ├── subdir4
        │   │       ├── subdir5
        │   │       ├── subdir6
        │   │       ├── subdir7
        │   │       └── subdir8
        │   └── rbw
        └── tmp

15 directories
  • Check the files
[root@ip-***-**-*-133 subdir7]# pwd
/mnt/hdfs/current/BP-747367826-***.**.*.167-1537719042716/current/finalized/subdir0/subdir7
[root@ip-***-**-*-133 subdir7]# ls -lh|head
total 15G
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743618
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743618_2794.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743619
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743619_2795.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743620
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743620_2796.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743622
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743622_2798.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:24 blk_1073743624
  • /mnt is large because it holds the HDFS data.
[root@ip-***-**-*-133 hdfs]# df
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G   76K   16G   1% /dev
tmpfs            16G     0   16G   0% /dev/shm
/dev/xvda1       99G  3.7G   95G   4% /
/dev/xvdb1      5.0G   37M  5.0G   1% /emr
/dev/xvdb2      495G   43G  452G   9% /mnt ★
  • The file system is XFS.
[root@ip-***-**-*-133 hdfs]# mount
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=16460148k,nr_inodes=4115037,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
/dev/xvda1 on / type ext4 (rw,noatime,data=ordered)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/xvdb1 on /emr type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdb2 on /mnt type xfs★ (rw,relatime,attr2,inode64,noquota)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/devices type cgroup (rw,relatime,devices)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/hugetlb type cgroup (rw,relatime,hugetlb)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/perf_event type cgroup (rw,relatime,perf_event)

Environment

  • Release label: emr-5.17.0
  • Hadoop distribution: Amazon 2.8.4

[Hadoop] HDFS caching


Because HDFS blocks are stored on the local file system, they have always benefited from the Linux kernel page cache, but that cache cannot be controlled from user space; to address this, HDFS caching was added in Hadoop 2.3.0 and later.

When data on HDFS is read or written, data read from disk is cached in the Linux kernel page cache (the original text calls it the buffer cache). (This is expected to avoid going to disk on every access.)

HDFSが高速に?キャッシュメカニズムの追加 | Tech Blog

Hadoop 2.3.0 and later ship with a caching mechanism for HDFS called "HDFS caching".

(snip)

HDFS centralized cache management is a mechanism by which HDFS explicitly caches paths that the user explicitly specifies. The NameNode communicates with the DataNodes that hold the blocks on disk and has those blocks cached in an off-heap cache.

The off-heap cache is a memory area on each DataNode outside the JVM heap. When the user registers a path for caching from the command line, its blocks are cached in this area.

HDFSの新しい機能 - HDFSキャッシング | Tech Blog

Using https://www.ibm.com/support/knowledgecenter/ja/SSPT3X_4.1.0/com.ibm.swg.im.infosphere.biginsights.dev.doc/doc/biga_hdfscache.html as a reference, I checked the master node of my EMR cluster (emr-5.17.0): hdfs-site.xml has no dfs.client.mmap.enabled or dfs.datanode.max.locked.memory entries and there were no cache pools, so it appears HDFS caching is not used unless you set it up deliberately.

$ hdfs cacheadmin -listPools
Found 0 results.

Notes on Parquet

[AWS] EMR web interfaces


[Hadoop] HDFS I/O size


On emr-5.17.0, /etc/hadoop/conf/core-site.xml contains the following*1:

  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>

Note that HDFS Readers do not read whole blocks of data at a time, and instead stream the data via a buffered read (64k-128k typically). That the block size is X MB does not translate into a memory requirement unless you are explicitly storing the entire block in memory when streaming the read.

Solved: Hadoop read IO size - Cloudera Community
  @Override
  public FSDataInputStream open(Path f, final int bufferSize)
      throws IOException {
    statistics.incrementReadOps(1);
    Path absF = fixRelativePart(f);
    return new FileSystemLinkResolver<FSDataInputStream>() {
      @Override
      public FSDataInputStream doCall(final Path p)
          throws IOException, UnresolvedLinkException {
        final DFSInputStream dfsis =
          dfs.open(getPathName(p), bufferSize, verifyChecksum);
        return dfs.createWrappedInputStream(dfsis);
      }
      @Override
      public FSDataInputStream next(final FileSystem fs, final Path p)
          throws IOException {
        return fs.open(p, bufferSize);
      }
    }.resolve(this, absF);
  }
https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L294-L303

This is presumably because each Mapper issues its sequential I/O independently in buffer-size units (note 4), so the access pattern seen by the disk drive is not perfectly sequential, and sequentiality drops as the number of Mappers (concurrent I/O streams) grows. In other words, the performance ceiling described above appears to be caused by the extra disk seeks that multi-stream I/O introduces.

In general, this kind of multi-stream I/O bottleneck is addressed either by increasing the I/O buffer size or by I/O scheduling. Below are the results of trying each approach with Hadoop. Figure 5 shows the I/O transfer rate as the I/O buffer size is varied: the transfer rate tends to fall as the buffer size increases, with a large drop at 32 MB. This is likely due to each Mapper's serialized I/O processing model (note 5). When I/O is issued in small units, the interval between requests is short, so the OS read-ahead works effectively; on top of better sequentiality at the disk, I/O is also likely to overlap with CPU work. When I/O is issued in large units, sequentiality at the disk improves, but the long interval between requests keeps the OS read-ahead from working, so I/O and CPU work are processed serially and performance ends up worse than with small units. For this reason, under Hadoop's current I/O processing model, increasing the I/O buffer size is not a solution to this problem.

(Note 4): The default is 4 KB.

(Note 5): Each Hadoop Mapper does not execute I/O and CPU processing concurrently; it alternates between them serially.

(Note 6): Hadoop normally treats each HDFS block as a Split and assigns a Mapper to each Split.

f:id:yohei-a:20180924183353p:image:w640

並列データインテンシブ処理基盤のI/O性能評価に関する実験的考察

*1: Checked on both the master node and a core node.

Public datasets

[Hadoop] Querying Parquet files with Presto

  • Copy the data.
$ s3-dist-cp --src s3://amazon-reviews-pds/parquet/ --dest /amazon-reviews-pds/parquet/
  • Start the hive shell.
$ hive
  • Create the table.
hive> CREATE EXTERNAL TABLE parquet.amazon_reviews_parquet(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string, 
  star_rating int, 
  helpful_votes int, 
  total_votes int, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date bigint, 
  year int)
PARTITIONED BY (product_category string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs:///amazon-reviews-pds/parquet';
  • Register the partitions.
hive> MSCK REPAIR TABLE amazon_reviews_parquet;
  • Exit the hive shell.
hive> quit;

Running queries from Presto

  • Start presto-cli.
$ presto-cli
  • Specify the catalog and schema.
presto> use hive.parquet;
  • Run a query.
presto:parquet> select count(star_rating) from amazon_reviews_parquet2;
   _col0
-----------
 160796570
(1 row)

Query 20180930_061857_00021_ypxzu, FINISHED, 2 nodes
http://ip-172-31-13-113.ec2.internal:8889/ui/query.html?20180930_061857_00021_ypxzu
Splits: 1,127 total, 1,127 done (100.00%)
CPU Time: 15.3s total, 10.5M rows/s, 3.64MB/s, 32% active
Per Node: 1.3 parallelism, 14.2M rows/s,  4.9MB/s
Parallelism: 2.7
Peak Memory: 24B
0:06 [161M rows, 55.7MB★] [28.3M rows/s, 9.8MB/s]

f:id:yohei-a:20180930153032p:image:w640

presto:parquet> select count(review_body) from amazon_reviews_parquet2;
   _col0
-----------
 160789772
(1 row)

Query 20180930_060907_00020_ypxzu, FINISHED, 2 nodes
http://ip-172-31-13-113.ec2.internal:8889/ui/query.html?20180930_060907_00020_ypxzu
Splits: 1,143 total, 1,143 done (100.00%)
CPU Time: 335.7s total,  479K rows/s,  104MB/s, 8% active
Per Node: 0.7 parallelism,  330K rows/s, 71.6MB/s
Parallelism: 1.4
Peak Memory: 24B
4:03 [161M rows, 34GB★] [661K rows/s, 143MB/s]

f:id:yohei-a:20180930153026p:image:w640


Installing tools

  • Install iotop.
$ sudo yum -y install iotop
  • Install htop.
$ sudo yum -y install htop

Data

$ hadoop fs -ls -h -R /amazon-reviews-pds/parquet/|head -50
drwxr-xr-x   - hadoop hadoop          0 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel
-rw-r--r--   1 hadoop hadoop    115.0 M 2018-09-29 20:11 /amazon-reviews-pds/parquet/product_category=Apparel/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    115.3 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    114.9 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    115.2 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    115.3 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    115.3 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    115.4 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    114.8 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    115.3 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    115.3 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Apparel/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
drwxr-xr-x   - hadoop hadoop          0 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive
-rw-r--r--   1 hadoop hadoop     80.8 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Automotive/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     81.2 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Automotive/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     80.9 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Automotive/part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     81.1 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive/part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     81.1 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     80.8 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive/part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     81.1 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive/part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     80.6 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive/part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     80.9 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive/part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     81.3 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Automotive/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
drwxr-xr-x   - hadoop hadoop          0 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Baby
-rw-r--r--   1 hadoop hadoop     48.9 M 2018-09-29 20:11 /amazon-reviews-pds/parquet/product_category=Baby/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.5 M 2018-09-29 20:11 /amazon-reviews-pds/parquet/product_category=Baby/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.3 M 2018-09-29 20:11 /amazon-reviews-pds/parquet/product_category=Baby/part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.0 M 2018-09-29 20:11 /amazon-reviews-pds/parquet/product_category=Baby/part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.1 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Baby/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.1 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Baby/part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.3 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Baby/part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.0 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Baby/part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     49.0 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Baby/part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop     48.9 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Baby/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
drwxr-xr-x   - hadoop hadoop          0 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Beauty
-rw-r--r--   1 hadoop hadoop    127.1 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Beauty/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    127.3 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Beauty/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    127.2 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Beauty/part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    126.9 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Beauty/part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    127.0 M 2018-09-29 20:12 /amazon-reviews-pds/parquet/product_category=Beauty/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    127.0 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Beauty/part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    126.8 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Beauty/part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    127.0 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Beauty/part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    127.4 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Beauty/part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop    127.5 M 2018-09-29 20:13 /amazon-reviews-pds/parquet/product_category=Beauty/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
drwxr-xr-x   - hadoop hadoop          0 2018-09-29 20:14 /amazon-reviews-pds/parquet/product_category=Books
-rw-r--r--   1 hadoop hadoop      1.0 G 2018-09-29 20:14 /amazon-reviews-pds/parquet/product_category=Books/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop      1.0 G 2018-09-29 20:14 /amazon-reviews-pds/parquet/product_category=Books/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop      1.0 G 2018-09-29 20:15 /amazon-reviews-pds/parquet/product_category=Books/part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop      1.0 G 2018-09-29 20:15 /amazon-reviews-pds/parquet/product_category=Books/part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
-rw-r--r--   1 hadoop hadoop      1.0 G 2018-09-29 20:15 /amazon-reviews-pds/parquet/product_category=Books/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

[Hadoop] Turning off the pager in presto-cli


[Hadoop] "java.lang.IllegalArgumentException: java.net.UnknownHostException" when creating a Hive table


Symptom

Trying to create a Hive table fails with "FAILED: SemanticException java.lang.IllegalArgumentException: java.net.UnknownHostException: ".

hive> CREATE TABLE parquet.amazon_reviews_parquet(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string, 
  star_rating int, 
  helpful_votes int, 
  total_votes int, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date bigint, 
  year int)
PARTITIONED BY (product_category string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://amazon-reviews-pds/parquet';

FAILED: SemanticException java.lang.IllegalArgumentException: java.net.UnknownHostException: amazon-reviews-pds

Cause

"hdfs://ホスト名/パス" という書式なので、"hdfs://amazon-reviews-pds/parquet" と書くと "amazon-reviews-pds" はホスト名になるため。ローカルパスの場合は "hdfs:///amazon-reviews-pds/parquet" と書けば良い。

There is one additional / after hdfs://, which is a protocol name. You must go to /tmp/... via hdfs:// protocol, that's why URL needs additional /. Without this, Spark is trying to reach host tmp, not folder

scala - java.lang.IllegalArgumentException: java.net.UnknownHostException: tmp - Stack Overflow
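The host/path split can be seen with a quick check (a sketch using Python's standard urlparse; Hadoop does its own URI parsing, but the handling of the authority part is the same idea):

from urllib.parse import urlparse

# Two slashes: "amazon-reviews-pds" is parsed as the host (authority) part.
print(urlparse("hdfs://amazon-reviews-pds/parquet").hostname)   # amazon-reviews-pds
print(urlparse("hdfs://amazon-reviews-pds/parquet").path)       # /parquet

# Three slashes: empty authority, so the whole remainder is the path.
print(urlparse("hdfs:///amazon-reviews-pds/parquet").hostname)  # None
print(urlparse("hdfs:///amazon-reviews-pds/parquet").path)      # /amazon-reviews-pds/parquet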

Solution

"hdfs://amazon-reviews-pds/parquet" を "hdfs:///amazon-reviews-pds/parquet" に書き換える。

hive> CREATE TABLE parquet.amazon_reviews_parquet(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string, 
  star_rating int, 
  helpful_votes int, 
  total_votes int, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date bigint, 
  year int)
PARTITIONED BY (product_category string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs:///amazon-reviews-pds/parquet';
OK
Time taken: 0.051 seconds

[Hadoop] Writing a Presto result set to a file


Simple answer :

presto --execute "select * from foo" --output-format CSV > foo.csv

You can use these formats :

ALIGNED
VERTICAL
CSV
TSV
CSV_HEADER
TSV_HEADER
How to export result of select statement in prestodb.io - Stack Overflow

That, apparently, is all it takes.


On EMR, this worked:

$ presto-cli --catalog hive --schema parquet --execute "select count(*) from amazon_reviews_parquet" --output-format CSV > foo.csv

[AWS] Aggregating CloudTrail events with Athena


With CloudTrail logs being delivered to S3 (setup instructions are here), I aggregated the events with Athena.

  • Aggregate by eventsource
select  eventsource, count(1) as cnt 
from default.cloudtrail_logs_cloudtrail_do_not_delete 
group by eventsource
order by cnt desc
eventsource | count
s3.amazonaws.com | 1111063
ec2.amazonaws.com | 86762
sts.amazonaws.com | 52597
athena.amazonaws.com | 10359
ssm.amazonaws.com | 8277
glue.amazonaws.com | 2114
cloudformation.amazonaws.com | 1882
kms.amazonaws.com | 1604
elasticmapreduce.amazonaws.com | 1136
cloudtrail.amazonaws.com | 1100
monitoring.amazonaws.com | 991
autoscaling.amazonaws.com | 634
iam.amazonaws.com | 447
rds.amazonaws.com | 430
logs.amazonaws.com | 262
lambda.amazonaws.com | 216
config.amazonaws.com | 136
elasticloadbalancing.amazonaws.com | 120
signin.amazonaws.com | 95
redshift.amazonaws.com | 88
sns.amazonaws.com | 79
quicksight.amazonaws.com | 28
sqs.amazonaws.com | 9
route53.amazonaws.com | 7
dynamodb.amazonaws.com | 6
elasticbeanstalk.amazonaws.com | 6
route53domains.amazonaws.com | 4
xray.amazonaws.com | 2
ds.amazonaws.com | 1

  • Aggregate by eventsource and eventname
select  eventsource, eventname, count(1) as cnt 
from default.cloudtrail_logs_cloudtrail_do_not_delete 
group by eventsource, eventname
order by cnt desc
eventsource | eventname | count
s3.amazonaws.comGetObject856835
s3.amazonaws.comHeadObject92374
s3.amazonaws.comPutObject79119
sts.amazonaws.comAssumeRole52501
ec2.amazonaws.comDescribeAddresses30794
s3.amazonaws.comListObjects30767
ec2.amazonaws.comDescribeInstances22475
ec2.amazonaws.comDescribeInstanceStatus15614
s3.amazonaws.comHeadBucket10675
s3.amazonaws.comUploadPartCopy10362
ec2.amazonaws.comDescribeNetworkInterfaces9756
athena.amazonaws.comGetQueryExecution9040
s3.amazonaws.comCopyObject6353
ssm.amazonaws.comUpdateInstanceInformation5658
ssm.amazonaws.comListInstanceAssociations2612
ec2.amazonaws.comDescribeVolumes2446
cloudformation.amazonaws.comDescribeStackResource1793
ec2.amazonaws.comDescribeInstanceAttribute1249
ec2.amazonaws.comDescribeKeyPairs1053
monitoring.amazonaws.comDescribeAlarms990
kms.amazonaws.comGenerateDataKey885
s3.amazonaws.comGetBucketPolicy709
ec2.amazonaws.comDescribeVolumeStatus694
s3.amazonaws.comGetBucketAcl693
kms.amazonaws.comDecrypt616
cloudtrail.amazonaws.comGetTrailStatus543
ec2.amazonaws.comDescribeSecurityGroups504
s3.amazonaws.comDeleteObject437
s3.amazonaws.comCreateMultipartUpload421
athena.amazonaws.comGetQueryResults421
s3.amazonaws.comCompleteMultipartUpload420
athena.amazonaws.comStartQueryExecution417
elasticmapreduce.amazonaws.comDescribeCluster407
cloudtrail.amazonaws.comDescribeTrails394
s3.amazonaws.comListBuckets379
ec2.amazonaws.comDescribeTags338
autoscaling.amazonaws.comDescribeAutoScalingGroups282
glue.amazonaws.comGetCatalogImportStatus266
autoscaling.amazonaws.comDescribeNotificationConfigurations264
s3.amazonaws.comGetBucketEncryption264
glue.amazonaws.comGetCrawlerMetrics258
ec2.amazonaws.comDescribeImages246
elasticmapreduce.amazonaws.comListInstanceGroups234
glue.amazonaws.comGetDatabases226
elasticmapreduce.amazonaws.comListBootstrapActions224
athena.amazonaws.comListQueryExecutions204
glue.amazonaws.comGetJobRuns202
glue.amazonaws.comGetCrawler182
ec2.amazonaws.comRunInstances181
ec2.amazonaws.comDescribeVpcs156
glue.amazonaws.comGetConnections149
ec2.amazonaws.comDescribeRegions140
lambda.amazonaws.comListFunctions20150331132
glue.amazonaws.comGetClassifiers131
logs.amazonaws.comCreateLogStream124
rds.amazonaws.comDescribeDBEngineVersions122
athena.amazonaws.comBatchGetQueryExecution122
glue.amazonaws.comGetCrawlers121
elasticloadbalancing.amazonaws.comDescribeLoadBalancers120
rds.amazonaws.comDescribeOrderableDBInstanceOptions115
ec2.amazonaws.comDescribeSubnets112
ec2.amazonaws.comDescribeAvailabilityZones109
glue.amazonaws.comGetTable107
ec2.amazonaws.comDescribeSnapshots94
ec2.amazonaws.comDescribeStaleSecurityGroups93
s3.amazonaws.comUploadPart91
rds.amazonaws.comDescribeDBInstances90
ec2.amazonaws.comDescribeAccountAttributes77
logs.amazonaws.comDescribeMetricFilters76
glue.amazonaws.comGetJobs76
glue.amazonaws.comGetTriggers75
kms.amazonaws.comListAliases74
ec2.amazonaws.comDescribeRouteTables71
cloudtrail.amazonaws.comLookupEvents65
config.amazonaws.comDescribeConfigurationRecorders65
elasticmapreduce.amazonaws.comListSteps64
ec2.amazonaws.comDescribeIdFormat62
glue.amazonaws.comGetSecurityConfigurations61
config.amazonaws.comDescribeConfigurationRecorderStatus61
signin.amazonaws.comRenewRole59
ec2.amazonaws.comDescribeLaunchTemplates56
elasticmapreduce.amazonaws.comListEventsPrivate54
elasticmapreduce.amazonaws.comListYarnApplicationsPrivate54
glue.amazonaws.comGetTables51
s3.amazonaws.comGetBucketVersioning50
iam.amazonaws.comListRolePolicies50
s3.amazonaws.comGetBucketWebsite46
s3.amazonaws.comGetBucketTagging44
sns.amazonaws.comListTopics43
iam.amazonaws.comListInstanceProfiles42
s3.amazonaws.comListObjectVersions41
s3.amazonaws.comGetObjectAcl40
iam.amazonaws.comListInstanceProfilesForRole40
cloudtrail.amazonaws.comGetEventSelectors38
s3.amazonaws.comGetBucketLocation37
athena.amazonaws.comGetQueryResultsStream37
autoscaling.amazonaws.comDescribeScalingPolicies36
autoscaling.amazonaws.comDescribePolicies36
ec2.amazonaws.comDescribeNetworkAcls35
glue.amazonaws.comGetTableVersions35
ec2.amazonaws.comTerminateInstances34
s3.amazonaws.comGetBucketNotification34
glue.amazonaws.comGetPartitions32
sns.amazonaws.comListSubscriptions32
elasticmapreduce.amazonaws.comListSecurityConfigurations32
logs.amazonaws.comDescribeExportTasks32
ec2.amazonaws.comDescribeDhcpOptions31
iam.amazonaws.comGetRole31
iam.amazonaws.comListRoles30
s3.amazonaws.comGetBucketCors30
cloudformation.amazonaws.comDescribeStacks30
iam.amazonaws.comListAttachedRolePolicies30
iam.amazonaws.comGetPolicyVersion29
ec2.amazonaws.comCreateTags29
signin.amazonaws.comSwitchRole28
ec2.amazonaws.comDescribeHosts27
ec2.amazonaws.comDescribeVolumesModifications27
ec2.amazonaws.comDescribePlacementGroups27
redshift.amazonaws.comDescribeClusters24
cloudformation.amazonaws.comDescribeStackEvents24
ec2.amazonaws.comDescribeInstanceCreditSpecifications24
lambda.amazonaws.comGetPolicy20150331v223
athena.amazonaws.comCreateNamedQuery22
rds.amazonaws.comDescribeDBSecurityGroups22
iam.amazonaws.comListAccountAliases22
iam.amazonaws.comGetAccountPasswordPolicy21
iam.amazonaws.comGetAccountSummary21
s3.amazonaws.comGetBucketRequestPayment20
s3.amazonaws.comGetBucketLogging20
lambda.amazonaws.comGetFunction20150331v219
glue.amazonaws.comGetJob19
ec2.amazonaws.comDescribeVpcAttribute19
cloudformation.amazonaws.comListStacks19
iam.amazonaws.comListPolicyVersions18
s3.amazonaws.comGetBucketLifecycle18
elasticmapreduce.amazonaws.comListReleases18
ec2.amazonaws.comAuthorizeSecurityGroupIngress18
elasticmapreduce.amazonaws.comListInstances16
logs.amazonaws.comDescribeLogStreams15
iam.amazonaws.comGetRolePolicy15
ec2.amazonaws.comDescribeSpotPriceHistory15
iam.amazonaws.comGetPolicy13
kms.amazonaws.comListKeys13
kms.amazonaws.comDescribeKey13
ec2.amazonaws.comDescribeVpcEndpoints13
glue.amazonaws.comGetDevEndpoints13
glue.amazonaws.comUpdateCrawler12
glue.amazonaws.comStartCrawler12
redshift.amazonaws.comDescribeEvents12
s3.amazonaws.comGetBucketReplication12
cloudformation.amazonaws.comDescribeStackResources11
iam.amazonaws.comListPolicies11
glue.amazonaws.comStartJobRun11
s3.amazonaws.comDeleteObjects10
quicksight.amazonaws.comGetAnalysis10
rds.amazonaws.comDescribeOptionGroups10
autoscaling.amazonaws.comDescribeScalingActivities10
lambda.amazonaws.comListTags201703319
rds.amazonaws.comDescribeDBClusters9
redshift.amazonaws.comDescribeLoggingStatus9
iam.amazonaws.comListEntitiesForPolicy9
logs.amazonaws.comCreateLogGroup9
redshift.amazonaws.comDescribeClusterDbRevisions9
redshift.amazonaws.comDescribeClusterParameterGroups9
rds.amazonaws.comDescribeEvents9
lambda.amazonaws.comListVersionsByFunction201503319
iam.amazonaws.comAttachRolePolicy9
lambda.amazonaws.comListAliases201503319
lambda.amazonaws.comListEventSourceMappings201503319
glue.amazonaws.comGetDevEndpoint8
elasticmapreduce.amazonaws.comListSparkStagesPrivate8
ec2.amazonaws.comModifyInstanceAttribute8
s3.amazonaws.comCreateBucket8
quicksight.amazonaws.comCreateDataSource8
signin.amazonaws.comExitRole8
rds.amazonaws.comDescribePendingMaintenanceActions7
glue.amazonaws.comGetDataflowGraph7
rds.amazonaws.comDescribeDBClusterSnapshots7
rds.amazonaws.comDescribeCertificates7
glue.amazonaws.comBatchDeleteTable7
rds.amazonaws.comDescribeRecommendationGroups7
athena.amazonaws.comBatchGetNamedQuery6
ec2.amazonaws.comDescribeVpcPeeringConnections6
athena.amazonaws.comListNamedQueries6
ec2.amazonaws.comDescribeVpcEndpointServiceConfigurations6
ec2.amazonaws.comDeleteNetworkInterface6
ec2.amazonaws.comDescribeEgressOnlyInternetGateways6
glue.amazonaws.comCreateTable6
glue.amazonaws.comGetDatabase6
elasticmapreduce.amazonaws.comListSparkExecutorsPrivate6
ec2.amazonaws.comDescribeFlowLogs6
s3.amazonaws.comPutBucketNotification6
ec2.amazonaws.comDescribeCustomerGateways6
glue.amazonaws.comCreateCrawler6
iam.amazonaws.comGetInstanceProfile6
ec2.amazonaws.comDescribeNatGateways6
rds.amazonaws.comDescribeDBSnapshots6
ec2.amazonaws.comDescribeVpnConnections6
ec2.amazonaws.comDescribeInternetGateways6
ec2.amazonaws.comDescribeVpnGateways6
redshift.amazonaws.comDescribeClusterSecurityGroups6
cloudtrail.amazonaws.comListTags6
ec2.amazonaws.comRevokeSecurityGroupIngress6
glue.amazonaws.comUpdateConnection5
rds.amazonaws.comDescribeDBClusterParameterGroups5
glue.amazonaws.comGetDataCatalogEncryptionSettings5
logs.amazonaws.comDescribeLogGroups5
rds.amazonaws.comDescribeAccountAttributes5
rds.amazonaws.comDescribeDBParameterGroups5
ec2.amazonaws.comDescribeVpcClassicLinkDnsSupport5
dynamodb.amazonaws.comDescribeTable4
elasticmapreduce.amazonaws.comDescribeSparkApplicationPrivate4
elasticmapreduce.amazonaws.comListSparkJobsPrivate4
iam.amazonaws.comCreateRole4
ec2.amazonaws.comAssociateAddress4
glue.amazonaws.comCreateJob4
quicksight.amazonaws.comCreateAnalysis4
iam.amazonaws.comDetachUserPolicy4
iam.amazonaws.comListGroupsForUser4
iam.amazonaws.comListUsers4
glue.amazonaws.comCreateDevEndpoint4
glue.amazonaws.comGetConnection4
sqs.amazonaws.comDeleteQueue4
redshift.amazonaws.comDescribeClusterSubnetGroups4
quicksight.amazonaws.comCreateDataSet4
iam.amazonaws.comCreatePolicyVersion4
redshift.amazonaws.comDescribeEventSubscriptions3
redshift.amazonaws.comDescribeReservedNodes3
ec2.amazonaws.comDescribePrefixLists3
glue.amazonaws.comGetPlan3
redshift.amazonaws.comDescribeHsmClientCertificates3
redshift.amazonaws.comDescribeHsmConfigurations3
glue.amazonaws.comGetMapping3
ec2.amazonaws.comCreateNetworkInterface3
redshift.amazonaws.comDescribeClusterSnapshots3
elasticmapreduce.amazonaws.comRunJobFlow3
sqs.amazonaws.comCreateQueue3
iam.amazonaws.comListSSHPublicKeys2
iam.amazonaws.comDeleteAccessKey2
elasticbeanstalk.amazonaws.comDescribeEnvironments2
iam.amazonaws.comListGroups2
route53domains.amazonaws.comListDomains2
iam.amazonaws.comListServiceSpecificCredentials2
glue.amazonaws.comDeleteJob2
lambda.amazonaws.comAddPermission20150331v22
autoscaling.amazonaws.comDeleteAutoScalingGroup2
cloudformation.amazonaws.comDeleteStack2
iam.amazonaws.comListUserPolicies2
elasticmapreduce.amazonaws.comListSparkTasksPrivate2
route53.amazonaws.comGetHealthCheckCount2
sns.amazonaws.comDeleteTopic2
rds.amazonaws.comDescribeDBLogFiles2
config.amazonaws.comDescribeConfigRules2
sqs.amazonaws.comSetQueueAttributes2
elasticmapreduce.amazonaws.comListSparkExecutorSummaryPrivate2
ec2.amazonaws.comDeleteSecurityGroup2
elasticmapreduce.amazonaws.comSetTerminationProtection2
cloudformation.amazonaws.comGetTemplateSummary2
lambda.amazonaws.comRemovePermission20150331v22
s3.amazonaws.comPutBucketPolicy2
ec2.amazonaws.comCreateSecurityGroup2
iam.amazonaws.comDeleteLoginProfile2
sns.amazonaws.comGetTopicAttributes2
iam.amazonaws.comListAttachedUserPolicies2
elasticbeanstalk.amazonaws.comDescribeApplications2
route53.amazonaws.comListTrafficPolicies2
dynamodb.amazonaws.comDeleteTable2
iam.amazonaws.comListMFADevices2
config.amazonaws.comDescribePendingAggregationRequests2
elasticmapreduce.amazonaws.comTerminateJobFlows2
s3.amazonaws.comDeleteBucket2
iam.amazonaws.comListSigningCertificates2
iam.amazonaws.comListAccessKeys2
rds.amazonaws.comListTagsForResource2
iam.amazonaws.comPutRolePolicy2
iam.amazonaws.comDeleteUser2
route53domains.amazonaws.comListOperations2
quicksight.amazonaws.comUpdateAnalysis2
xray.amazonaws.comGetEncryptionConfig2
elasticbeanstalk.amazonaws.comDeleteApplication2
autoscaling.amazonaws.comUpdateAutoScalingGroup2
route53.amazonaws.comGetHostedZoneCount2
ec2.amazonaws.comReleaseAddress2
autoscaling.amazonaws.comDeleteLaunchConfiguration2
iam.amazonaws.comDetachRolePolicy2
cloudtrail.amazonaws.comStartLogging1
iam.amazonaws.comRemoveUserFromGroup1
ec2.amazonaws.comDescribeVpcClassicLink1
cloudtrail.amazonaws.comUpdateTrail1
ec2.amazonaws.comCreateKeyPair1
s3.amazonaws.comDeleteBucketPolicy1
glue.amazonaws.comCreateConnection1
glue.amazonaws.comCreateDatabase1
s3.amazonaws.comAbortMultipartUpload1
monitoring.amazonaws.comPutDashboard1
ds.amazonaws.comDescribeDirectories1
iam.amazonaws.comAddRoleToInstanceProfile1
ec2.amazonaws.comAssociateIamInstanceProfile1
athena.amazonaws.comStopQueryExecution1
iam.amazonaws.comCreateInstanceProfile1
iam.amazonaws.comCreateServiceLinkedRole1
cloudtrail.amazonaws.comCreateTrail1
lambda.amazonaws.comUpdateFunctionCode20150331v21
cloudtrail.amazonaws.comPutEventSelectors1
route53.amazonaws.comGetTrafficPolicyInstanceCount1
lambda.amazonaws.comCreateFunction201503311
glue.amazonaws.comDeleteCrawler1

Notes

I converted the CSV to Hatena table notation with the following perl one-liner.

perl -i.org -pe 's/(\",\"|^\"|\"$)/|/g' 08dadeff-aae3-42f8-95ce-716d9a52ab21.csv

References

[AWS] The lag between account activity and CloudTrail log delivery


CloudTrail typically delivers log files within 15 minutes of account activity. In addition, CloudTrail publishes log files multiple times an hour, about every five minutes. These log files contain API calls from services in the account that support CloudTrail.

How CloudTrail Works - AWS CloudTrail

So CloudTrail typically delivers logs within 15 minutes of account activity and publishes log files roughly every five minutes; but when I checked the logs CloudTrail had delivered to S3 with Athena, the lag looked quite a bit smaller than that.


  • Query
select now() AT TIME ZONE 'Asia/Tokyo' as now_tokyo, now() now_utc, eventtime ,eventsource, eventname
from default.cloudtrail_logs_cloudtrail_269419664770_do_not_delete 
order by eventtime desc limit 10
  • Result
now_tokyo | now_utc | eventtime | eventsource | eventname
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:13:41Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:13:38Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:13:38Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:12:58Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:09:43Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:09:42Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:09:31Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:09:31Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:09:31Z | s3.amazonaws.com | PutObject
2018-10-01 00:13:50.110 Asia/Tokyo | 2018-09-30 15:13:50.110 UTC | 2018-09-30T15:09:25Z | s3.amazonaws.com | HeadObject

[AWS] Automatically disabling static website hosting on an S3 bucket via CloudWatch Events -> Lambda when it gets enabled


Notes on detecting, via CloudWatch Events, when static website hosting is enabled on an S3 bucket and then disabling it again with Lambda.


Setup

Create the Lambda function
  • Name: S3DeleteBucketWebsite
  • Runtime: Python 2.7
  • Role: S3FullAccess
  • Function code:
import boto3
s3 = boto3.client('s3')

def lambda_handler(event, context):
    print( 'event: ', event )
    bucket_name=event['detail']['requestParameters']['bucketName']
    print( bucket_name )
    s3.delete_bucket_website(Bucket=bucket_name)
    return 'S3DeleteBucketWebsite finished!'

CloudWatch Events
  • Go to [CloudWatch] - [Events] - [Rules] and click [Create rule].
  • Select [Event Pattern].
  • Service Name: S3
  • Event Type: Bucket Level Operations
  • Specify the Lambda function "S3DeleteBucketWebsite" as the target (the same rule can also be created from code; see the sketch below).
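A rough sketch of the same rule created with boto3 (the function ARN below is a placeholder, and the pattern is narrowed here to PutBucketWebsite, whereas the console's Bucket Level Operations type matches a broader set of event names):

import json

import boto3

events = boto3.client("events")

# S3 bucket-level API calls recorded by CloudTrail, narrowed to PutBucketWebsite.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutBucketWebsite"],
    },
}

events.put_rule(Name="s3-put-bucket-website", EventPattern=json.dumps(pattern))
events.put_targets(
    Rule="s3-put-bucket-website",
    Targets=[{
        "Id": "S3DeleteBucketWebsite",
        # Placeholder ARN -- replace region/account with your own.
        "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:function:S3DeleteBucketWebsite",
    }],
)

The Lambda function also needs a resource-based policy that lets events.amazonaws.com invoke it (lambda add-permission), which the console wires up for you when you pick the target there.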

Try it out

  • Enable static website hosting on any S3 bucket.
  • Check the event in CloudTrail.

f:id:yohei-a:20181001143030p:image

  • Event details when static website hosting was enabled manually
{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "...:yoheia",
        "arn": "arn:aws:sts::...:assumed-role/Admin/yoheia",
        "accountId": "...",
        "accessKeyId": "...",
        "sessionContext": {
            "attributes": {
                "mfaAuthenticated": "false",
                "creationDate": "2018-10-01T01:23:41Z"
            },
            "sessionIssuer": {
                "type": "Role",
                "principalId": "...",
                "arn": "arn:aws:iam::...:role/Admin",
                "accountId": "...",
                "userName": "Admin"
            }
        }
    },
    "eventTime": "2018-10-01T01:29:24Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "PutBucketWebsite",
    "awsRegion": "ap-northeast-1",
    "sourceIPAddress": "***.***.164.95",
    "userAgent": "[S3Console/0.4, aws-internal/3 aws-sdk-java/1.11.408 Linux/4.9.93-0.1.ac.178.67.327.metal1.x86_64 OpenJDK_64-Bit_Server_VM/25.181-b13 java/1.8.0_181]",
    "requestParameters": {
        "bucketName": "...",
        "website": [
            ""
        ],
        "WebsiteConfiguration": {
            "IndexDocument": {
                "Suffix": "index.html"
            },
            "xmlns": "http://s3.amazonaws.com/doc/2006-03-01/",
            "ErrorDocument": {
                "Key": "error.html"
            }
        }
    },
    "responseElements": null,
    "additionalEventData": {
        "vpcEndpointId": "vpce-..."
    },
    "requestID": "5713892F375F06E7",
    "eventID": "8bf143b4-0c11-4d2f-ade2-8d8a48fd264d",
    "eventType": "AwsApiCall",
    "recipientAccountId": "...",
    "vpcEndpointId": "vpce-..."
}
  • Event details when Lambda automatically disabled it
{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "...:S3DeleteBucketWebsite",
        "arn": "arn:aws:sts::...:assumed-role/S3FullAccess/S3DeleteBucketWebsite",
        "accountId": "...",
        "accessKeyId": "...",
        "sessionContext": {
            "attributes": {
                "mfaAuthenticated": "false",
                "creationDate": "2018-09-30T23:48:23Z"
            },
            "sessionIssuer": {
                "type": "Role",
                "principalId": "...",
                "arn": "arn:aws:iam::...:role/S3FullAccess",
                "accountId": "...",
                "userName": "S3FullAccess"
            }
        }
    },
    "eventTime": "2018-10-01T01:29:46Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "DeleteBucketWebsite",
    "awsRegion": "ap-northeast-1",
    "sourceIPAddress": "**.***.125.247",
    "userAgent": "[Boto3/1.7.74 Python/2.7.12 Linux/4.14.67-66.56.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 Botocore/1.10.74]",
    "requestParameters": {
        "bucketName": "az-www-test",
        "website": [
            ""
        ]
    },
    "responseElements": null,
    "requestID": "82F29B6174AE2EEF",
    "eventID": "90c11fd3-c30e-4423-b429-9e9aae689df3",
    "eventType": "AwsApiCall",
    "recipientAccountId": "..."
}

Notes

  • When testing, it is easiest to copy and paste a CloudTrail event into [Configure test events]; that lets you test and debug with the data the event carries without actually triggering it. Note that the event details shown in the CloudTrail console map to the namespace under event['detail'].
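For example, a minimal local sketch of such a test event (only the field the handler actually reads is filled in; running it really does call delete_bucket_website, so point it at a throw-away bucket):

# Mirrors the detail.requestParameters.bucketName path of the CloudTrail event.
test_event = {
    "detail": {
        "eventName": "PutBucketWebsite",
        "requestParameters": {"bucketName": "az-www-test"},
    }
}

print(lambda_handler(test_event, None))  # assumes the handler above is in scope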

References

Delete a Bucket Website Configuration

  • The example below shows how to:

Delete a bucket website configuration using delete_bucket_website.

Example

import boto3

# Create an S3 client
s3 = boto3.client('s3')

# Call S3 to delete the website policy for the given bucket
s3.delete_bucket_website(Bucket='my-bucket')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-static-web-host.html