Monday, September 11, 2017

USEFUL QUERIES FOR TROUBLESHOOTING IN AMAZON REDSHIFT

 USEFUL QUERIES FOR TROUBLESHOOTING IN AMAZON REDSHIFT

TO CHECK LIST OF RUNNING QUERIES AND USERNAMES:
select a.userid, cast(u.usename as varchar(100)), a.query, a.label, a.pid, a.starttime, b.duration,
b.duration/1000000 as duration_sec, b.query as querytext
from stv_inflight a, stv_recents b, pg_user u
where a.pid = b.pid and a.userid = u.usesysid

**----To get  all schema table
select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a join pg_class as pgc on pgc.oid = a.id join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id join (select tbl, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl --where pgn.nspname='stg'
order by a.db_id, a.name;
---To check Recents Running Queries
select * from stv_recents where status='Running';

--TO Kill the Queries
cancel <PID>
---To Check copy command
select getdate()- starttime as CurrentDuration ,user_name, duration, query from stv_recents where status = 'Running' and query like 'COPY%';

---To check Cluster Total disk space and Free space
select   sum(capacity)/1024 as capacity_gbytes,   sum(used)/1024 as used_gbytes,   (sum(capacity) - sum(used))/1024 as free_gbytes from   stv_partitions where part_begin=0;
-- Table Sizes & Rows
select    trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,trim(a.name) as Table,b.mbytes,a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a  group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id join pg_namespace as pgn on pgn.oid = pgc.relnamespace join pg_database as pgdb on pgdb.oid = a.db_id
join ( select tbl, count(*) as mbytes  from stv_blocklist group by tbl ) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
---To Check DB Infro and user
select * from pg_database_info;
select * from pg_user_info;
--TO find databaseSize;
select sum(mbytes)/1024 as db_size_in_gb, database from (select trim(pgdb.datname) as Database,trim(a.name) as Table, b.mbytes from stv_tbl_perm a join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
where a.slice=0 order by db_id, name) group by database;



TO LOOK FOR ALERTS:
select * from STL_ALERT_EVENT_LOG
where query = 1011
order by event_time desc
limit 100;


TO CHECK TABLE SIZE:
select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by b.mbytes desc, a.db_id, a.name;


TO CHECK FOR TABLE COMPRESSION:
analyze <tablename>;
analyze compression <tablename>;


TO ANALYZE ENCODING:
select "column", type, encoding
from pg_table_def where tablename = 'biglist';


TO CHECK LIST OF FILES COPIED:
select * from stl_load_errors

select * from stl_load_commits


select query, trim(filename) as file, curtime as updated, *
from stl_load_commits
where query = pg_last_copy_id();


TO CHECK LOAD ERRORS
select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();

TO CHECK FOR DISKSPACE USED IN REDSHIFT:
select owner as node, diskno, used, capacity
from stv_partitions
order by 1, 2, 3, 4;
select query, trim(querytxt) as sqlquery
from stl_query
order by query desc limit 5;

UNDERSTANDING THE KEYS IN AMAZON REDSHIFT

UNDERSTANDING THE KEYS IN AMAZON REDSHIFT



1)   Table design (Data distribution and Sorting) - In redshift the data is distributed on all the nodes. Whenever it is executing the query it has to bring the data from the different nodes and use it in joins, aggregation, sorting, group by, etc. If the amount of data that has to grabbed from these nodes is high then it results in lot of traffic and poor query performance. To reduce the amount of traffic between nodes, the data has to be distributed properly between the nodes. This can be achieved through the proper table design, data compression,  distribution key, sort keys, and query joins. First lets looks at the table design. Below is the table create statement. The distribution style (diststyle), distribution key (distkeys), sort keys are very important in defining how the data is distributed among the nodes.

 CREATE [ [LOCAL ] { TEMPORARY | TEMP } ]
TABLE table_name [ ( column_name1 encode ,  ]
| DISTSTYLE { EVEN | ALL | KEY }
| DISTKEY ( distkey_identifier )
| SORTKEY ( sortkey_identifier [, ... ] )


DISTSTYLE ALL will put a copy of the data in all the nodes and make it available for joins without having to move the data between nodes. This is ideal for dimension tables which does not have many records in it.
DISTSTYLE KEY will distribute data based on a defined key column and better for tables that are huge and you can use a column that is used in joins to       distribute the data.
DISTSTYLE EVEN is used mostly to evenly distribute the data and when you cant deside between ALL or KEY.

*Trick*: USE DISTSTYLE ALL for all the dimension tables that are not big.

DISTKEY (column name)- The column name used here should be the one that is used in joins and should be defined for all the large fact tables. The distribution key (distkey) will help in distributing the data based on the distkey column and during joins only the required data is brought the nodes. There can be only one distkey (distribution key) column defined.

*Trick*: Define Distkey for all the large fact tables and the dist key column should be the one used in the joins.

If the distribution key from the large fact table is joined to the any column of a dimension table that is in diststyle ALL then in your explain plan you will see the attribute DS_DIST_ALL_NONE which mean there is no data redistribution and the cost of the query is very low.  This is the state you want most part of the query to be in for best performance.

SORT KEY (column name1, ..):  There can be multiple sort keys and the sort keys help a lot in storing the data in the nodes in sorted order which helps in group by or order by operation.

*Trick*: If there are columns such as date or timestamp that you use in order by operation then define those as sort keys.

2) Table design (Compression) - In the amazon redshift tables the encoding that can be used for compression can be defined along with the column name. Format shown below. There are various encodings available.

CREATE TABLE table_name (column_name
            data_type ENCODE encoding-type)[, ...]

*Trick*: Easy way to decide on the encoding if you are using the copy command to load data  from s3to redshift is to use COMPUPDATE ON option set in the copy command as shown below. The copy command choosed the best compression to use for the columns that it is loading data to. 

copy <tablename> from 's3://mybucket/files.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' COMPUPDATE ON

If you want to explicitly define the encoding like when you are inserting data from another table or set of tables, then load some 200K records to the table and use the command ANALYZE COMPRESSION <tablename> to make redshift suggest the best compression for each of the columns. You can use those suggestion while recreating the table.


3) Table design (Constraints) - Defining the primary keys and foreign keys though is not enforced by redshift apparently is used by the query planner to execute the query efficiently. Hence, define those keys in your create table statements for better performance.


4) Query design -
Now we have discussed about table design, the actual query design will all use all those table design features to execute the query. As discussed before the trick in designing good queries is to have proper joins. i.e use distribution key from the large fact table to join to  any column of a dimension table that is in diststyle ALL. If dimension table is not in (distribution style) diststyle all, then use the distkey (distribution key) from the dimension table in the joins.

5) WLM queue setting:

The WLM queue setting on your cluster configuration determines the number of queries running on your cluster and essential part of performance tuning. Keeping it low means not many queries can run at the same time and provided more memory for each query. Set the concurrency and memory setting for each user group in the cluster for better performance




6) Copy Command Performance tuning:
If you want your copy command to be faster then split the files into multiple files so that they can get loaded in parallel to the redshift database.