AtScale Aggregate Best Practices

Overview

For a comprehensive overview of the available aggregate options and their descriptions, review the AtScale documentation. This document focuses solely on the appropriate use of aggregates, outlining when and how each type should be used.

Currently, we offer five distinct types of aggregates, with the possibility of introducing more in the future:

  • Distinct Count Aggregates
  • Predictive Defined Aggregates
  • System Defined Aggregates
  • User Defined Aggregates
  • Dimensionally Modified Aggregates

By following these guidelines, you can optimize the efficiency and effectiveness of aggregates in your operations.

By default, certain aggregate types are not enabled in the system. Instead, they are activated selectively when a specific use case aligns with their intended purpose. Most of these settings can be customized at the cube level, allowing you to enable or disable aggregate types per cube; a few are adjustable only at the system level and therefore govern aggregate usage across the entire installation. This flexible approach ensures that aggregates are used efficiently and effectively, tailored to the needs of individual use cases while optimizing system-wide performance.

Distinct Count Aggregates

The distinct count aggregate is turned off by default because of its unique characteristics. It performs best when a dashboard's filters contain relatively few variables. To enable it, you must access the custom engine settings; doing so lets you fine-tune the configuration for your use case and take advantage of the distinct count aggregate's capabilities when the filtering conditions are conducive to its efficient operation.

 aggregates.create.allowExactDistinctCountMeasures.enabled = true

Predictive Defined Aggregates

The default setting for this aggregate is off because it is aggressive about building dimension-only aggregates. This characteristic is especially relevant for projects with many ad hoc users who use arbitrary attributes as pick lists in their filters.

You can enable this aggregate at the project level: open the project settings, edit the project, and enable the Predictive Defined Aggregate option. After making this change, republish the project to apply the new configuration. This lets you take advantage of the Predictive Defined Aggregate's capabilities, tailored to the specific needs of your project, while ensuring optimal performance and responsiveness under dynamic filtering requirements.

Dimensionally Modified Aggregates

By default, the setting for this aggregate is turned on. However, if you are migrating from an older version of AtScale, you will be prompted when you edit the project whether you want to update the dimension with these settings to enable the aggregate specifically for the date dimension hierarchy.

This prompt helps you decide how to use the aggregate in the context of the date dimension. You can retain the default setting, or update the dimension and enable the aggregate for the date hierarchy based on your specific requirements and use cases. This migration flexibility ensures your system maintains its desired configuration while allowing for potential enhancements in aggregate usage when appropriate.

System Defined Aggregates

The setting for utilizing aggregates is enabled by default. However, you have the flexibility to disable the use of aggregates for specific scenarios:

Option 1:

You can disable aggregates in the user profile settings for a specific duration (e.g., "n" minutes). This allows you to temporarily bypass the aggregate functionality for queries initiated by that user during the specified time frame.

Option 2:

Alternatively, when executing a query using API calls or interacting with the system at runtime, you can set a runtime parameter for that particular query. This runtime parameter will enable you to explicitly specify whether aggregates should be used for that specific query, overriding the default behavior.

By offering these options, you can control aggregates at a granular level, tailoring the query optimization to match the specific needs of users and queries as required.

 "aggregation": {
   "useAggregates": false,
   "genAggregates": false
 },
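The runtime parameter from Option 2 travels in the body of the query request. A minimal sketch in Python of building such a payload; only the "aggregation" block comes from this article, while the endpoint-agnostic `query` and `project` fields are illustrative placeholders:

```python
import json

def build_query_payload(sql: str, project: str) -> str:
    """Build a query request body that bypasses aggregates for one query.

    Only the "aggregation" block is taken from the article; the "query"
    and "project" fields are hypothetical placeholders for illustration.
    """
    payload = {
        "query": sql,
        "project": project,
        "aggregation": {
            "useAggregates": False,  # do not answer this query from aggregates
            "genAggregates": False,  # do not trigger new aggregate builds
        },
    }
    return json.dumps(payload)

body = build_query_payload("SELECT 1", "sales_cube")
print(body)
```

Because the flags are set per request, the engine's default behavior is untouched for every other query.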

User Defined Aggregates

The ability to define User-Defined Aggregates (UDAs) in AtScale provides a valuable option when you want to take more control over the aggregation process. There are several use cases where UDAs become particularly beneficial:

  • Optimizing Dashboard Aggregates: In cases where AtScale uses too many or too few aggregates for a set of dashboards, creating custom UDAs allows you to fine-tune the aggregation strategy and ensure optimal performance for those specific dashboards.
  • QDS and Degenerate Dimensions: When dealing with Quick Data Sources (QDS) and degenerate dimensions, it is advantageous to utilize UDAs. UDAs enable you to handle these specialized scenarios efficiently, tailoring the aggregation process to suit their unique characteristics.
  • Missing System-Defined Aggregates: There might be situations where AtScale does not generate system-defined aggregates for specific data due to various reasons:
    • Compression Scores: If the data's compression scores do not meet the criteria for automatic aggregation, you can define custom UDAs to fill in the gaps and ensure efficient aggregation.
    • Attribute Complexity: If the attributes used in a dashboard are too complex for AtScale to generate system-defined aggregates, creating custom UDAs can help address this challenge.

By leveraging User-Defined Aggregates, you can proactively address these use cases, ensuring that the aggregation strategy aligns precisely with the specific requirements of your data and queries, leading to improved performance and query response times.

User-defined aggregates (UDAs) play a crucial role in enabling the creation of additional system-defined aggregates when statistics are not generated, whether due to timing constraints (data availability or resource limitations) or because the data's compression ratio does not meet the automatic aggregation criteria.

By utilizing UDAs strategically, you can address these limitations and enhance the aggregation process in two primary ways:

  • Seeding System-Defined Aggregates: When statistics are unavailable, you can use UDAs as a stat substitute. By defining custom UDAs, you provide AtScale with the necessary information to generate aggregates effectively, even without complete statistical data.
  • Enabling More System-Defined Aggregates: UDAs can be used to create a narrower version of an aggregate, which allows AtScale to generate additional system-defined aggregates. This narrower version better fits specific use cases, producing a more comprehensive set of aggregates.

To maximize the benefits of UDAs, it is crucial to ensure that you select the partition attribute when defining them. This selection is vital as it improves the aggregate creation process in AtScale, leading to more efficient and effective aggregation and, consequently, improved performance for your queries and reports.

Incremental Aggregate

Building aggregates with the incremental indicator enabled has both advantages and disadvantages. Let's focus on the disadvantages:

Disadvantages of Incremental Aggregates:

  • Performance Variability: Incremental aggregates are served from a "union all" view over the underlying aggregate tables, so query performance varies with the day of execution. A query run on day 5 (when the data from the previous five days has been appended as separate tables) may be slower than one run on day 1 or day 6.
  • Query Performance on Day 5: Because the aggregate tables are not appended back into a single table until day 6, queries on day 5 read the largest number of unioned tables and may experience performance issues.
  • Incremental Build Time: The day-6 build, which appends the aggregate tables back into one table, takes longer than the previous five builds, potentially causing slight delays in data availability.
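The append-and-compact cycle described above can be modeled with a toy sketch; the five-day cycle length is taken from the example above, and the code uses no AtScale API:

```python
def tables_in_union_view(day: int, cycle: int = 5) -> int:
    """Toy model: number of tables the union-all view reads on a build day.

    Each daily build appends one more table; on day `cycle + 1` the tables
    are compacted back into a single table and the cycle restarts.
    """
    position = (day - 1) % (cycle + 1)  # days 1..6 repeat
    if position == cycle:               # day 6: compacted into one table
        return 1
    return position + 1                 # days 1..5: one more table each day

# Day 5 reads five unioned tables (slowest); day 6 is compacted back to one.
print([tables_in_union_view(d) for d in range(1, 7)])  # → [1, 2, 3, 4, 5, 1]
```

The spike at day 5 is exactly the performance variability the bullet list warns about: the more tables behind the view, the more work each query does.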

While incremental aggregates can offer advantages such as reducing build times and optimizing storage, it's essential to be aware of these potential disadvantages and consider them when deciding whether to enable the incremental indicator for specific aggregates. Using incremental aggregates should align with your specific performance and data availability requirements.

Incremental aggregates apply exclusively to fact tables, whether physical tables, views, or Quick Data Sources (QDS). Including a datetime attribute as a calculated column in the fact table enables incremental aggregates. This datetime attribute serves as a reference point for determining incremental updates.

For a comprehensive understanding of how incremental aggregates behave and the guidelines for their implementation, review the documentation on How AtScale Handles Incremental Aggregate Builds. It provides valuable insights into the behavior, advantages, limitations, and best practices of using incremental aggregates in your AtScale environment.

Cache

The aggregate cache can be a powerful tuning option to improve query performance in an AtScale environment. The cache entries are indicated with a flash icon in the queries history view, making it easy to identify when the cache is being used, either solely for the query or in combination with aggregates.

When working with the cache, there are several tuning options to consider, specifically related to the settings in the engine section of AtScale:

  • query.cache.eviction: This setting determines the eviction strategy for the cache entries, dictating how and when cached data is removed to make space for new entries.
  • query.cache.entry.sizeLimit: This setting controls the maximum size of each cache entry, limiting the amount of memory used by individual cached queries.

As you adjust these settings, keep in mind that the cache lives in the engine instance on each AtScale host. Careful consideration of each setting's advantages and disadvantages is necessary, since they can significantly impact system performance.

Key considerations include:

  • Memory Allocation: Understand how much physical memory is allocated to AtScale and the JVM size (31G). These settings affect the memory usage of the query cache and can impact overall system performance.
  • Performance Implications: While using the cache can improve query response times for frequently accessed queries, improper configurations can degrade performance or lead to memory-related issues.
  • Documentation Review: To make informed decisions about cache tuning, consult the AtScale documentation, which provides insights into the behavior and best practices related to query cache usage.

By carefully considering these factors and understanding the behavior of the query cache, you can optimize the performance of your AtScale environment and leverage the benefits of cache while mitigating potential performance degradation or memory-related concerns.

Final Note

Flushing the cache after every ETL (Extract, Transform, Load) job ensures data consistency and accuracy in the AtScale environment. By clearing the cache, you ensure that any outdated or invalid data is removed, and the system can populate the cache with fresh information from the latest ETL process.

To flush the cache, you can use the appropriate API command provided by AtScale. Running this command will effectively delete the existing cache entries, allowing the system to rebuild the cache with updated data during subsequent queries.

Regularly flushing the cache after ETL jobs helps maintain data integrity, prevent stale results, and ensure users work with the most recent information in their queries and reports. Incorporating this step as part of your ETL workflow is essential to maintain a consistent and reliable data environment in AtScale.

curl --insecure -X DELETE -H "Authorization:Bearer $jwt" https://ubuntu-atscale.atscaledomain.com:10502/query-cache/flush
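The same flush can be issued from Python with the standard library. A sketch mirroring the curl command above; the hostname, port, and path come from that command, and the token value is a placeholder:

```python
import urllib.request

def build_flush_request(host: str, jwt: str) -> urllib.request.Request:
    """Build the DELETE request that flushes the AtScale query cache.

    Mirrors the curl command above; host and jwt are caller-supplied.
    """
    return urllib.request.Request(
        url=f"https://{host}:10502/query-cache/flush",
        method="DELETE",
        headers={"Authorization": f"Bearer {jwt}"},
    )

req = build_flush_request("ubuntu-atscale.atscaledomain.com", "<jwt>")
# To actually send it (equivalent of curl --insecure, which skips TLS
# certificate verification — use only against trusted hosts):
#   import ssl
#   urllib.request.urlopen(req, context=ssl._create_unverified_context())
```

Running this as the last step of an ETL pipeline automates the cache flush described above.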

 
