Details
Bug
Resolution: Fixed
Major
None
All
Small
Description
Currently, sanitizing a tck column name to make it compliant with Avro & Studio does the following:
- If the first character is not Avro-compliant, it is deleted: this raises issues!
- Non-Avro-compliant ASCII characters are replaced with '_'
- Non-ASCII characters, such as Japanese ones, are replaced with '_'
- '+' and '/' from base64 are replaced with '_'
https://github.com/Talend/component-runtime/blob/a2d9069e2c5b820630e5d535d9b04b077e37fae1/component-api/src/main/java/org/talend/sdk/component/api/record/Schema.java#L298
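As a rough illustration of the rules listed above, here is a minimal sketch. It is not the code behind the link: the class and method names (SanitizeSketch, sanitize) are hypothetical, and it assumes that "Avro-compliant" means a name starting with [A-Za-z_] and continuing with [A-Za-z0-9_]. The first-character rule is applied once, as the description states; what happens with a second leading digit is not covered here.

```java
// Hypothetical sketch of the sanitization rules described in this ticket.
final class SanitizeSketch {

    // Assumption: Avro names start with a letter or '_'.
    static boolean isValidFirst(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    // Assumption: following characters may also be ASCII digits.
    static boolean isValidNext(char c) {
        return isValidFirst(c) || (c >= '0' && c <= '9');
    }

    static String sanitize(String rawName) {
        if (rawName == null || rawName.isEmpty()) {
            return rawName;
        }
        StringBuilder out = new StringBuilder(rawName.length());
        for (int i = 0; i < rawName.length(); i++) {
            char c = rawName.charAt(i);
            if (i == 0) {
                // A non-compliant first character is dropped, not replaced.
                if (isValidFirst(c)) {
                    out.append(c);
                }
            } else {
                // Any other non-compliant character (ASCII or not) becomes '_'.
                out.append(isValidNext(c) ? c : '_');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("1name_b"));
        System.out.println(sanitize("2name_b"));
        System.out.println(sanitize("a+b/c"));
    }
}
```

With these rules, "1name_b" and "2name_b" both lose their leading digit and end up with the same sanitized name, which is the collision this ticket is about.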
In some cases, this can generate column name collisions.
For instance:
final Record record_c = this.factory.newRecordBuilder()
    .withString("1name_b", "a_value")
    .withString("2name_b", "b_value")
    .build();
The first character is a digit and so is not Avro-compatible. Both column names are sanitized to name_b, which generates a collision.
The proposal to fix this collision is to add a suffix "_${index}", as follows:
"schema":[
  {"name": "name_b", "rawname": "1name_b"},
  {"name": "name_b_1", "rawname": "2name_b"}
]
There should be a fallback to rawname when retrieving a value from a record:
record.getString("2name_b")
should return "b_value".
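The proposal can be sketched as follows. This is only an illustration under stated assumptions, not the component-runtime implementation: CollisionSketch, its Entry class, and the simplified sanitize helper are all hypothetical, and the sketch does not distinguish a post-sanitize collision from two columns added with the same raw name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: "_${index}" suffix on collision, plus rawname fallback on lookup.
final class CollisionSketch {

    static final class Entry {
        final String name;
        final String rawName;
        final String value;

        Entry(String name, String rawName, String value) {
            this.name = name;
            this.rawName = rawName;
            this.value = value;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    // How many times each sanitized base name has been requested so far.
    private final Map<String, Integer> uses = new HashMap<>();

    // Simplified stand-in for the real sanitization (assumption).
    static String sanitize(String raw) {
        if (raw.isEmpty()) {
            return raw;
        }
        String head = raw.substring(0, 1).matches("[A-Za-z_]") ? raw.substring(0, 1) : "";
        return head + raw.substring(1).replaceAll("[^A-Za-z0-9_]", "_");
    }

    CollisionSketch withString(String rawName, String value) {
        String base = sanitize(rawName);
        int seen = uses.merge(base, 1, Integer::sum) - 1;
        // First use keeps the base name; later uses get "_1", "_2", ...
        String name = seen == 0 ? base : base + "_" + seen;
        // rawname is left empty when the name did not change.
        entries.add(new Entry(name, name.equals(rawName) ? "" : rawName, value));
        return this;
    }

    String getString(String key) {
        for (Entry e : entries) {
            if (e.name.equals(key)) {
                return e.value;
            }
        }
        // Fallback: the caller used the original (raw) column name.
        for (Entry e : entries) {
            if (e.rawName.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    List<String> names() {
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            result.add(e.name);
        }
        return result;
    }
}
```

In this sketch the lookup tries the sanitized name first and the raw name second, so a key such as "2name_b" resolves through the rawname fallback.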
Scope of collision support
(a) Should we limit this fix to collisions generated by sanitization?
(b) Or do we allow adding two columns with the same name?
=> emmanuel_g / clesaec: allowing (b) may change the current behavior, so we limit the fix to (a): we only support collisions that arise after sanitization.
Need to validate
- A given column has a name equal to the sanitized name of another column:
final Record record_c = this.factory.newRecordBuilder()
    .withString("1name_b", "a_value")
    .withString("2name_b", "b_value")
    .withString("name_b", "c_value")
    .build();
Should generate:
"schema":[
  {"name": "name_b_1", "rawname": "1name_b"}, // a_value
  {"name": "name_b_2", "rawname": "2name_b"}, // b_value
  {"name": "name_b", "rawname": ""}, // c_value
]
- A given column has a name equal to the suffixed name generated for another column:
final Record record_c = this.factory.newRecordBuilder()
    .withString("1name_b", "a_value")
    .withString("2name_b", "b_value")
    .withString("name_b_1", "c_value")
    .build();
May generate:
"schema":[
  {"name": "name_b", "rawname": "1name_b"}, // a_value
  {"name": "name_b_2", "rawname": "2name_b"}, // b_value
  {"name": "name_b_1", "rawname": ""}, // c_value
]
If we generalize:
final Record record_c = this.factory.newRecordBuilder()
    .withString("1name_b", "a_value") // name_b
    .withString("2name_b", "b_value") // name_b_1
    ...
    .withString("{n}name_b", "b_value") // name_b_{n-1}
    .withString("name_b_x", "c_value") // if 1 <= x < n
    .build();
Should we support this case?
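If these cases were to be supported, one possible approach (an assumption for discussion, not a decision recorded in this ticket) is a two-pass naming: first reserve every name that is already valid as-is, then give each sanitized name the first free candidate among base, base_1, base_2, and so on. The class TwoPassNamingSketch and its helpers below are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: two-pass name assignment that avoids clashes with existing columns.
final class TwoPassNamingSketch {

    // Simplified stand-in for the real sanitization (assumption).
    static String sanitize(String raw) {
        if (raw.isEmpty()) {
            return raw;
        }
        String head = raw.substring(0, 1).matches("[A-Za-z_]") ? raw.substring(0, 1) : "";
        return head + raw.substring(1).replaceAll("[^A-Za-z0-9_]", "_");
    }

    // Returns raw name -> final name, in column order.
    static Map<String, String> assignNames(List<String> rawNames) {
        Set<String> taken = new HashSet<>();
        // Pass 1: names that need no sanitization keep their name.
        for (String raw : rawNames) {
            if (sanitize(raw).equals(raw)) {
                taken.add(raw);
            }
        }
        // Pass 2: sanitized names take the first free candidate.
        Map<String, String> result = new LinkedHashMap<>();
        for (String raw : rawNames) {
            String base = sanitize(raw);
            if (base.equals(raw)) {
                result.put(raw, raw);
                continue;
            }
            String candidate = base;
            int index = 0;
            // Set.add returns false when the candidate is already taken.
            while (!taken.add(candidate)) {
                index++;
                candidate = base + "_" + index;
            }
            result.put(raw, candidate);
        }
        return result;
    }
}
```

Applied to the two examples under "Need to validate", this sketch yields the names listed there (name_b_1, name_b_2, name_b for the first; name_b, name_b_2, name_b_1 for the second). The trade-off is that all column names must be known before any name is assigned, so a sanitized column's final name depends on the full set of columns rather than only on the ones added before it.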