.. index:: Ledger Format; Advanced

.. _advanced-ledger:

Advanced Ledger Format
======================

Graphene can perform better if the parquet input is organized with some
constraints in mind. Some of the basic requirements of the parquet input are
also more flexible if certain alternate conditions can be met. This section
describes these considerations in detail.

Columns
-------

.. note::

   Column names are case-sensitive and required columns must match those
   specified in the table.

.. list-table:: Additional details on the input columns
   :widths: 10 10 10 10 60
   :header-rows: 1

   * - Column Name
     - Data Type
     - Required
     - Nullable?
     - Description
   * - Trial
     - INT64
     - Yes
     - No
     - Also known as Year. In the general case of a multi-year analysis,
       Trial is a more appropriate term to avoid confusion.
   * - Time
     - DOUBLE
     - Yes
     - No
     - In models that represent the time dimension of a scenario as
       day-of-year, this must be converted to an absolute timestamp.
   * - Type
     - STRING
     - Yes
     - No
     - Any choice may be made for Type values that distinguish records from
       each other, as long as they are consistent across all inputs.
       Choosing standard values like "Loss" and "Premium" will prevent
       having to further customize modelling, since built-in Graphene
       features will be expecting standard values.
   * - Value
     - DOUBLE
     - Yes
     - No
     - The actual monetary value associated with the transaction, in the
       currency defined for the Ledger. If the Ledger currency is set to
       USD, the value in this field is US dollars. Also, Ledgers use a
       Profit & Loss perspective of the insured, which means losses are
       represented as negative values and income as positive values.
   * - Metadata Field 1...n
     - INT64, FLOAT32, DOUBLE (see note below), STRING
     - No
     - Yes
     - Each Ledger can contain any number of metadata fields. For a loss
       model, this would typically be information about the event that
       caused the loss (e.g. EventId, Peril, Severity) or information about
       the exposure affected (e.g. LineOfBusiness, TIV, Location, etc.).
.. note::

   **DOUBLE format**: Metadata columns will be handled as single precision
   floating point values (32 bit).

Since we support ``DOUBLE`` data types, and since a number of special values
can appear in these columns, it is important to note that ``NaN`` is
converted to and treated as a ``null`` value in columns whose values can be
``null``. When columns such as Time or Value do not allow ``null``, ``NaN``
is not permitted.

Since Graphene is case-sensitive to column names (i.e. ``"Trials"`` and
``"trials"`` are different columns), be sure to provide the correct column
names in the parquet input. AWS Athena treats parquet columns as
case-insensitive, so when using data directly with Athena or when requesting
SQL queries via the Graphene Analysis API (which uses Athena), columns whose
names differ only by case cannot be distinguished.

Data types
----------

When multiple ledger inputs are provided, Graphene expects column data types
to be defined identically for columns with the same name. An exception is
made for integer columns: 32-bit integer metadata columns will be
automatically converted to 64-bit integers when mixed inputs are detected.
It is recommended that all input ledgers use the same integer data types for
any given metadata column, since relying on automatic handling of mixed
integer types results in sub-optimal performance.

.. note::

   When multiple ledger inputs are provided, Graphene will not attempt to
   handle a column that has differing types among multiple files (beyond the
   integer conversion described above). Instead, a user-facing error will be
   returned.

.. _occurrence-key:

Occurrence Key
~~~~~~~~~~~~~~

Since independent events in models can occur at the same time, an otherwise
optional metadata column is required. This is known as the ``Occurrence
Key``, and it is provided as a metadata column in the Ledger. It does not
need to have a specific name, but it must consist of exactly one column of
``integer`` type that uniquely separates individual events within a Trial.
.. note::

   If every event is guaranteed to have a distinct Time, then providing an
   Occurrence Key is optional.

An illustration of the handling of this column is as follows:

+-------+-----------+------+-------+----------+----------------+
| Trial | Time      | Type | Value | Event ID | Location       |
+=======+===========+======+=======+==========+================+
| 1     | 3842394.0 | Loss | 10M   | 32497    | North Carolina |
+-------+-----------+------+-------+----------+----------------+
| 1     | 3842394.0 | Loss | 20M   | 32497    | South Carolina |
+-------+-----------+------+-------+----------+----------------+
| 1     | 3842394.0 | Loss | 5M    | 55206    | North Carolina |
+-------+-----------+------+-------+----------+----------------+

During application of financial terms, the above Ledger uses an Occurrence
Key of ``"Event ID"``, so that the first two records together represent one
event that causes different losses in two different locations. The ground-up
loss for event 32497 is 30M, and for event 55206 it is 5M in the above
example. Since these losses are distinguished by the Occurrence Key, they are
processed as two distinct events for financial modelling, even though there
are three records with identical Time. Without the Occurrence Key, this would
be treated as a single event with a 35M ground-up loss for the purposes of
financial modelling.

.. _currency-column:

Currency column
~~~~~~~~~~~~~~~

It is possible to specify the currency for each individual record in the
input ledger if necessary. This can be achieved by providing a metadata
column representing ``Currency`` in the ledger. The ``Currency`` column can
then be defined by the parameter ``currency_column`` in the ledger load
template, such as ScaledLoad. It does not need to have a specific name, but
it must consist of exactly one column of ``string`` type with values that
match the currencies specified in the exchange rate table.
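As an illustrative sketch of what per-record conversion via a currency column means (this is not Graphene's implementation, and the rates and names below are invented for illustration):

```python
# Hypothetical sketch of per-record currency conversion driven by a Currency
# metadata column. Graphene performs conversion internally when
# `currency_column` is configured; the rate table here is invented.
rates_to_usd = {"USD": 1.0, "EUR": 1.25, "GBP": 1.5}  # assumed rate table

def convert_value(value: float, currency: str, base: str = "USD") -> float:
    """Convert a ledger Value from its record's currency to the base currency."""
    usd = value * rates_to_usd[currency]
    return usd / rates_to_usd[base]

records = [
    {"Value": -1_000_000.0, "Currency": "EUR"},  # a loss booked in EUR
    {"Value": 5_000.0, "Currency": "USD"},       # income already in base
]
converted = [convert_value(r["Value"], r["Currency"]) for r in records]
```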
If a currency column is used, each value in the ledger is converted from the
specified currency to the base currency.

.. note::

   It is not allowed to specify both ``currency`` and ``currency_column`` at
   the same time.

.. index:: Parquet; Recommendations, Input; Data Organization

Data Organization
-----------------

Graphene assumes some ordering of the rows within parquet files for
performance. Other internal details of the format are suggested to further
enhance performance of the system.

.. index:: Input; Row Order, Parquet; Order

Row Order
~~~~~~~~~

Records need to be ordered by Trial, Time, Occurrence Key. There are
exceptions when using filtering parameters at the input file level, but in
general this ordering is essential to producing correct metrics output.
Ordering all records by Trial is preferred for performance reasons,
regardless of special exceptions.

Column Order
~~~~~~~~~~~~

No specific column ordering is required.

Row Group Size
~~~~~~~~~~~~~~

Splitting data across row groups is recommended such that each row group
contains records from exactly one trial. Otherwise, row groups of 1000-10000
records are appropriate.

Compression
~~~~~~~~~~~

The parquet format includes built-in support for data compression. We
support snappy compression, which is one of the original Apache Parquet
format compression schemes.

.. index:: Partitioning, Input; Partitioning, Parquet; Partitioning

File Partitioning
~~~~~~~~~~~~~~~~~

Ledgers may be provided as collections of files. Graphene expects Ledger
data to be accessible via Amazon S3 storage, and data may be partitioned
into multiple files within a single folder. Some restrictions are:

- a folder may contain only one independent set of losses (for all its
  trials)
- trials do not span multiple files

File extensions can follow a prescribed naming convention to enable the
system to handle them efficiently without additional metadata being
provided.
Validation errors will result when trying to run an analysis on a data set
that contains ambiguous partitions. For example::

    s3://example_bucket/uploads/ledgers/example_ledger_upload/
    |-- ground_up_loss.parquet.1of4
    |-- ground_up_loss.parquet.2of4
    |-- ground_up_loss.parquet.3of4
    |-- ground_up_loss.parquet.4of4

When providing a reference to the partitioned files as in the above example,
only the folder URL is used (it must include the trailing slash). This is in
contrast to referencing the full file URL when data is not partitioned. In
non-partitioned cases, the XofY extension is omitted. Examples of malformed
partitioned data sets include folders containing gaps (missing the .2of4) or
inconsistent total counts (.3of5, .4of6).

Partitioned parquet files in the same folder are not required to use the
prescribed file extension above. The name of each file must be unique within
the folder in which it is stored. For example::

    s3://example_bucket/uploads/ledgers/example_ledger_upload/
    |-- ground_up_loss_1.parquet
    |-- ground_up_loss_2.parquet
    |-- ground_up_loss_3.parquet
    |-- ground_up_loss_4.parquet

When using multiple partitioned parquet files without the prescribed file
extension, the system will not have the information needed to check for
malformed partitioned data containing gaps or missing files.
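The consistency rules implied by the XofY convention can be sketched as follows. This is an illustration of the checks described above, not Graphene's actual validation code:

```python
import re

# Illustrative check of the XofY partition naming convention: every file in
# a folder must agree on the total Y, and parts 1..Y must all be present.
# This mirrors the validation errors described above; it is not Graphene code.
PART_RE = re.compile(r"\.(\d+)of(\d+)$")

def validate_partitions(filenames: list[str]) -> None:
    parts, totals = set(), set()
    for name in filenames:
        m = PART_RE.search(name)
        if not m:
            raise ValueError(f"{name} does not follow the XofY convention")
        parts.add(int(m.group(1)))
        totals.add(int(m.group(2)))
    if len(totals) != 1:
        # e.g. mixing .3of5 and .4of6 in one folder
        raise ValueError(f"inconsistent totals: {sorted(totals)}")
    missing = set(range(1, totals.pop() + 1)) - parts
    if missing:
        # e.g. a folder with .1of4, .3of4, .4of4 but no .2of4
        raise ValueError(f"missing parts: {sorted(missing)}")

# A complete 1of4..4of4 set passes without raising:
validate_partitions([f"ground_up_loss.parquet.{i}of4" for i in range(1, 5)])
```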