Talend Component Kit / TCOMP-2019

Sanitized columns name collision support

Details

    Description

      Currently, TCK column names are sanitized to be compliant with Avro & Studio.

      In some cases, this can generate column name collisions.
      For instance:

      final Record record_c = this.factory.newRecordBuilder()
              .withString("1name_b", "a_value") 
              .withString("2name_b", "b_value")
              .build();
      

      Both raw names start with a digit, which is not Avro compatible. Sanitizing them produces the same column name (name_b), which is a collision.

      The proposal to resolve this collision is to append a suffix "_${index}" to the sanitized name, as follows:

      "schema":[
        {"name": "name_b", "rawname": "1name_b"},
        {"name": "name_b_1", "rawname": "2name_b"}
      ]
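The suffix proposal above can be sketched as follows. This is a minimal illustration, not the actual TCK implementation: the class name `SuffixOnCollision`, the method `uniqueNames`, and the simplified `sanitize` rule (drop an invalid first character, replace other invalid characters with `_`) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: not the actual TCK implementation.
final class SuffixOnCollision {

    // Hypothetical, simplified sanitizer: drops the first character when it is
    // not a letter or '_', and replaces any other invalid character with '_'.
    static String sanitize(final String raw) {
        if (raw.isEmpty()) {
            return raw;
        }
        final char first = raw.charAt(0);
        final boolean validStart =
                first == '_' || (first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z');
        final String rest = raw.substring(1).replaceAll("[^A-Za-z0-9_]", "_");
        return validStart ? first + rest : rest;
    }

    // Returns one unique name per raw name, in order, appending "_<index>"
    // when the sanitized name is already taken.
    static List<String> uniqueNames(final List<String> rawNames) {
        final Set<String> used = new HashSet<>();
        final List<String> result = new ArrayList<>();
        for (final String raw : rawNames) {
            final String base = sanitize(raw);
            String candidate = base;
            int index = 1;
            while (!used.add(candidate)) { // add() returns false if the name is taken
                candidate = base + "_" + index++;
            }
            result.add(candidate);
        }
        return result;
    }
}
```

With the raw names "1name_b" and "2name_b", this sketch assigns "name_b" to the first column and "name_b_1" to the second, as in the schema above.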
      

      There should be a fallback to the rawname when retrieving a value from a record:

      record.getString("2name_b")
      

      should return "b_value".
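A possible shape for this fallback, as a sketch only: `RawNameFallback` is a hypothetical stand-in for a record, not the TCK `Record` API. It looks a key up by column name first and, if that fails, resolves it as a rawname.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a stand-in for a record, not the TCK Record API.
final class RawNameFallback {
    // column name (after sanitization and suffixing) -> value
    private final Map<String, String> valuesByName = new HashMap<>();
    // rawname -> column name
    private final Map<String, String> nameByRawName = new HashMap<>();

    void put(final String name, final String rawName, final String value) {
        valuesByName.put(name, value);
        if (rawName != null && !rawName.isEmpty()) {
            nameByRawName.put(rawName, name);
        }
    }

    // Tries the column name first, then falls back to the rawname.
    String getString(final String key) {
        if (valuesByName.containsKey(key)) {
            return valuesByName.get(key);
        }
        final String name = nameByRawName.get(key);
        return name == null ? null : valuesByName.get(name);
    }
}
```

After `put("name_b", "1name_b", "a_value")` and `put("name_b_1", "2name_b", "b_value")`, both `getString("name_b_1")` and `getString("2name_b")` return "b_value" in this sketch.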

      Scope of collision support
      (a) Should we limit this fix to collisions generated by sanitization? (b) Or do we also allow adding two columns with the same name?
      => emmanuel_g / clesaec: allowing (b) may change the current behavior, so we limit the fix to (a): only collisions caused by sanitization are supported.
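To make the distinction between (a) and (b) concrete, here is a small sketch that classifies a pair of raw names. The class `CollisionScope`, the enum `Kind`, and the simplified `sanitize` rule are assumptions for illustration and do not reflect the actual TCK code.

```java
// Illustrative sketch only: names and sanitize rule are assumptions.
final class CollisionScope {

    enum Kind { NONE, DUPLICATE, SANITIZE_COLLISION }

    // Hypothetical, simplified sanitizer: drops the first character when it is
    // not a letter or '_', and replaces any other invalid character with '_'.
    static String sanitize(final String raw) {
        if (raw.isEmpty()) {
            return raw;
        }
        final char first = raw.charAt(0);
        final boolean validStart =
                first == '_' || (first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z');
        final String rest = raw.substring(1).replaceAll("[^A-Za-z0-9_]", "_");
        return validStart ? first + rest : rest;
    }

    // DUPLICATE is case (b): the same raw name is added twice.
    // SANITIZE_COLLISION is case (a): different raw names, same sanitized name.
    static Kind classify(final String rawA, final String rawB) {
        if (rawA.equals(rawB)) {
            return Kind.DUPLICATE;
        }
        return sanitize(rawA).equals(sanitize(rawB)) ? Kind.SANITIZE_COLLISION : Kind.NONE;
    }
}
```

Under the decision above, only pairs classified as `SANITIZE_COLLISION` would receive a suffix; how `DUPLICATE` pairs are handled stays unchanged.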

      Need to validate

      • A given column's name equals the sanitized name of another column:
        final Record record_c = this.factory.newRecordBuilder()
                .withString("1name_b", "a_value") 
                .withString("2name_b", "b_value")
                .withString("name_b", "c_value")
                .build();
        

        Should generate:

        "schema":[
        	{"name": "name_b_1", "rawname": "1name_b"}, // a_value
        	{"name": "name_b_2", "rawname": "2name_b"}, // b_value
        	{"name": "name_b", "rawname": ""},          // c_value
        ]
        
      • A given column's name equals the suffixed name generated for another column:
        final Record record_c = this.factory.newRecordBuilder()
                .withString("1name_b", "a_value") 
                .withString("2name_b", "b_value")
                .withString("name_b_1", "c_value")
                .build();
        

        May generate:

        "schema":[
        	{"name": "name_b", "rawname": "1name_b"}, // a_value
        	{"name": "name_b_2", "rawname": "2name_b"}, // b_value
        	{"name": "name_b_1", "rawname": ""},        // c_value
        ]
        

        If we generalize:

        final Record record_c = this.factory.newRecordBuilder()
                .withString("1name_b", "a_value")    // name_b
                .withString("2name_b", "b_value")    // name_b_1
                ...
                .withString("{n}name_b", "b_value")  // name_b_{n-1}
                .withString("name_b_x", "c_value")   // if 1 <= x < n
                .build();
        

        Should we support this case?
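If these cases are to be supported, one possible approach (a sketch, not a decision recorded in this ticket) is a two-pass allocation: first reserve every name that needs no sanitization, then assign the sanitized names around them. The class `TwoPassNames`, the method `resolve`, and the simplified `sanitize` rule are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: not the actual TCK implementation.
final class TwoPassNames {

    // Hypothetical, simplified sanitizer: drops the first character when it is
    // not a letter or '_', and replaces any other invalid character with '_'.
    static String sanitize(final String raw) {
        if (raw.isEmpty()) {
            return raw;
        }
        final char first = raw.charAt(0);
        final boolean validStart =
                first == '_' || (first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z');
        final String rest = raw.substring(1).replaceAll("[^A-Za-z0-9_]", "_");
        return validStart ? first + rest : rest;
    }

    static List<String> resolve(final List<String> rawNames) {
        // Pass 1: names that are already valid keep their name, so reserve them.
        final Set<String> used = new HashSet<>();
        for (final String raw : rawNames) {
            if (sanitize(raw).equals(raw)) {
                used.add(raw);
            }
        }
        // Pass 2: sanitized names take the first free "<base>" or "<base>_<index>".
        final List<String> result = new ArrayList<>();
        for (final String raw : rawNames) {
            final String base = sanitize(raw);
            if (base.equals(raw)) {
                result.add(raw);
                continue;
            }
            String candidate = base;
            int index = 1;
            while (!used.add(candidate)) {
                candidate = base + "_" + index++;
            }
            result.add(candidate);
        }
        return result;
    }
}
```

For the two validation cases above, this sketch yields name_b_1, name_b_2, name_b and name_b, name_b_2, name_b_1 respectively. Note that it needs all column names up front; a builder that receives columns one by one would have to rename earlier columns when a conflicting name arrives later.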

          People

            emmanuel_g emmanuel gallois
            ypiel Yves Piel
            Christophe LeSaec