    Description

      To reach this goal: https://jira.talendforge.org/browse/TDC-3876

      1. In TCK/Record, add a new method on the schema: getLabel()
      2.a AvroSchema is a TCK schema that wraps an Avro schema. The label metadata must be set when it exists, so setLabel(String l) must be added as well.
      https://github.com/Talend/component-runtime/blob/7a4f24e0c876d6b0cd9d2686b6740cb960bc39ce/component-runtime-beam/src/main/java/org/talend/sdk/component/runtime/beam/spi/record/AvroSchema.java#L102
      2.b Which Avro property will hold the label (the original column name)?
      3. TCK AvroRecord will also need some adjustments to retrieve the Avro property, for example "talend.component.label", that will contain the real name of the column.
      The property can be retrieved using: https://github.com/Talend/component-runtime/blob/7a4f24e0c876d6b0cd9d2686b6740cb960bc39ce/component-runtime-beam/src/main/java/org/talend/sdk/component/runtime/beam/spi/record/AvroPropertyMapper.java
      4. Fileio first generates an Avro IndexRecord, and the field names are already computed: https://github.com/Talend/cloud-components/blob/3ac7d1c50bccab3bc38eaefc7c200ba115166c3a/fileio/src/main/java/org/talend/components/fileio/runtime/SimpleRecordFormatCsvIO.java#L557
      :warning: We should add metadata to also keep the original name.
      5. Should we integrate Avro's naming limitations into TCK? For example, should we enforce that technical field names be Avro compliant?
      We currently sanitize names in TCK Record: https://github.com/Talend/component-runtime/blob/7a4f24e0c876d6b0cd9d2686b6740cb960bc39ce/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/record/SchemaImpl.java#L132
      But it is not fully Avro compliant.
      Since we will have "label" metadata, technical field names could be fully Avro compliant. That is, the name given to a field will be stored in "label", then sanitized and stored as the technical name.
      /!\ @undx must validate this update of TCK
      6. If the technical name already exists, e.g. "foo", an incremental index could be appended: "foo_2".
      7. To retrieve a value in TCK, e.g. getString(name), we still use the technical name and so loop over the schema. The label (original column name) should be used in outputs to get the real field names (for example, to update a column name).
      8. We have some TCK/Record->Avro/IndexRecord & Avro/IndexRecord->TCK/Record converters that should handle this "label". This needs checking.
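
To make points 5 and 6 concrete, here is a minimal sketch of an Avro-compliant sanitizer with the "foo_2" style deduplication. It is not the existing SchemaImpl code: the class and method names (AvroNameSketch, sanitize, unique, technicalNames) are hypothetical, and replacing every invalid character with '_' is an assumption. The only rule taken as given is the Avro one: a name starts with [A-Za-z_] and continues with [A-Za-z0-9_].

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of TCK.
final class AvroNameSketch {

    private AvroNameSketch() {
    }

    // Replaces every character that Avro does not accept at its position with '_'.
    // Assumption: '_' is the replacement character, and an empty label becomes "_".
    static String sanitize(final String label) {
        if (label == null || label.isEmpty()) {
            return "_";
        }
        final StringBuilder out = new StringBuilder(label.length());
        for (int i = 0; i < label.length(); i++) {
            final char c = label.charAt(i);
            final boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
            final boolean digit = c >= '0' && c <= '9';
            // Digits are valid everywhere except in first position.
            out.append(letter || (digit && i > 0) ? c : '_');
        }
        return out.toString();
    }

    // Appends an incremental index ("_2", "_3", ...) until the name is unused,
    // and records the returned name in the given set.
    static String unique(final String candidate, final Set<String> used) {
        String name = candidate;
        int index = 2;
        while (!used.add(name)) {
            name = candidate + "_" + index++;
        }
        return name;
    }

    // Maps each technical name to its label (original column name), in column order.
    static Map<String, String> technicalNames(final List<String> labels) {
        final Set<String> used = new HashSet<>();
        final Map<String, String> result = new LinkedHashMap<>();
        for (final String label : labels) {
            result.put(unique(sanitize(label), used), label);
        }
        return result;
    }
}
```

With this sketch, the labels "a$b" and "a b" both sanitize to "a_b", so the second column gets the technical name "a_b_2" while both original labels are kept.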

      Actions:
      A/ Add get/setLabel in TCK API
      B/ Improve the name sanitizing method (/!\ it has been copied in at least 2 places) to be fully Avro compliant
      C/ Work on fileio to make it "label" compliant
      D/ Fix TCK/Record<-->Avro/IndexRecord converters if not already done for fileIO
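
As an illustration of action A/ and of point 7, the following sketch shows one possible shape for the label accessors. Everything here is hypothetical: LabelSketch and its Entry class are stand-ins, not the TCK API, and the plain Map only mimics Avro field properties. The key "talend.component.label" is the example given in this ticket (the actual key is still open, see 2.b), and falling back to the technical name when no label is set is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the TCK schema API, for illustration only.
final class LabelSketch {

    // Example key from the ticket; the final property name is not decided yet.
    static final String LABEL_PROP = "talend.component.label";

    private LabelSketch() {
    }

    static final class Entry {

        private final String name; // technical, sanitized name

        // Stand-in for the properties carried by an Avro field.
        private final Map<String, String> props = new HashMap<>();

        Entry(final String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }

        // Assumption: without a label, the technical name is used as the label.
        String getLabel() {
            return props.getOrDefault(LABEL_PROP, name);
        }

        Entry setLabel(final String label) {
            props.put(LABEL_PROP, label);
            return this;
        }
    }

    // Read side: values are still looked up by technical name, looping over the schema.
    static Optional<Entry> find(final List<Entry> schema, final String technicalName) {
        for (final Entry entry : schema) {
            if (entry.getName().equals(technicalName)) {
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }

    // Output side: the real column names come from the labels.
    static List<String> headers(final List<Entry> schema) {
        final List<String> result = new ArrayList<>();
        for (final Entry entry : schema) {
            result.add(entry.getLabel());
        }
        return result;
    }
}
```

In this sketch an entry created as new Entry("unit_price").setLabel("unit price") is found through find(schema, "unit_price"), while an output writing headers(schema) would emit "unit price".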

      • /!\ The full (cross-team) feature is due for the end of May, so we should deliver this as soon as possible.
      • /!\ This will impact TCK; TPD must be bumped to the latest version of TCK.
      • OK: B/ will generate new column names that could break Studio jobs (only for column names containing '$', which is allowed in Java variables but not in Avro) => :heavy_check_mark: Not really, since Studio doesn't support '$' in schema fields. Studio's name limitation is the same as Avro's.

            People

              wwang Wei Wang
              ypiel Yves Piel