  1. Talend DI components
  2. TDI-46778

[Runtime convergence] New tck/join processor



    • New Feature
    • Status: Done
    • Major
    • Resolution: Fixed
    • None
    • connectors/1.28.0
    • None



      Goal: provide a full tck join processor that is iso-functional with the legacy tdp/join.

      Expected join behavior

      • It has to be a left outer join
      • Only the first match is joined
      • A match can be checked on several columns:
         inputRecord.pivotColumnA == lookup.pivotColumnA && inputRecord.pivotColumnB == lookup.pivotColumnB && inputRecord.pivotColumnC == lookup.pivotColumnC ...
      • For the same pivot value, the join should always return the same lookup row. This means the lookup should be ordered, or at least always return its values in the same order
        • => What about an option to order the lookup on its pivot keys? Ordering could be done in the join connector, but it could cause performance issues at load time, so it should probably be optional.
      • The user should be able to define which lookup columns to merge into the original record. Those columns should be inserted right after the input record pivot column.
      • The current naming collision handling should be reproduced to keep the legacy behavior
        • adding _1, _2, ... when a collision occurs
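
      The matching rules above (left outer join, first match only, several pivot columns linked by AND) can be sketched as follows. This is a minimal illustration, not the connector's code: the class FirstMatchLookupSketch, its method names, and the representation of records as Map<String, Object> are all assumptions made for the example. It also assumes the lookup rows arrive in a stable order, and that pivot values are compared with equals(), so null pivots match each other here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating first-match lookup on several pivot columns. */
final class FirstMatchLookupSketch {

    // Extracts the pivot values of a row, in the order of the given columns.
    static List<Object> pivotKey(Map<String, Object> row, List<String> pivotColumns) {
        List<Object> key = new ArrayList<>();
        for (String column : pivotColumns) {
            key.add(row.get(column));
        }
        return key;
    }

    // Indexes the lookup by pivot key, keeping only the first row seen per key.
    // Assumption: lookupRows is always iterated in the same order.
    static Map<List<Object>, Map<String, Object>> index(
            List<Map<String, Object>> lookupRows, List<String> lookupPivots) {
        Map<List<Object>, Map<String, Object>> index = new LinkedHashMap<>();
        for (Map<String, Object> row : lookupRows) {
            index.putIfAbsent(pivotKey(row, lookupPivots), row);
        }
        return index;
    }

    // Returns the first matching lookup row, or null when nothing matches
    // (the input record is kept either way: left outer join).
    static Map<String, Object> firstMatch(
            Map<String, Object> inputRecord,
            List<String> inputPivots,
            Map<List<Object>, Map<String, Object>> index) {
        return index.get(pivotKey(inputRecord, inputPivots));
    }
}
```

      The input pivots and lookup pivots are paired by position, so the two lists must be given in the same order. Building the index once per lookup keeps each input record to a single hash lookup; if the lookup order is not stable, sorting the lookup rows before indexing would be one way to provide the optional ordering discussed above.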

      Example for naming collision:

      • input record schema
        ["id", "name", "age", "zipcode", "street"]
      • lookup record schema
        ["id", "zip", "name", "population", "surface"]
      • The user configures the join connector:
        • Input record pivot: zipcode
        • Lookup pivot : zip
        • Retrieved columns : name, surface
      • Result should be
        ["id", "name", "age", "zipcode", "name_1", "surface", "street"]
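
      The schema merge in this example can be sketched as below. The class JoinSchemaSketch and its methods are hypothetical names for illustration; in particular, the exact suffix rule of the legacy tdp/join (for instance how it behaves when name_1 already exists) is an assumption here, implemented as "take the first free suffix".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper illustrating column insertion and collision renaming. */
final class JoinSchemaSketch {

    // Returns name itself if unused, else name_1, name_2, ... (first free suffix).
    static String uniqueName(String name, Set<String> used) {
        String candidate = name;
        int suffix = 1;
        while (used.contains(candidate)) {
            candidate = name + "_" + suffix++;
        }
        return candidate;
    }

    // Inserts the retrieved lookup columns right after the last input pivot column.
    static List<String> mergedSchema(
            List<String> inputColumns, List<String> inputPivots, List<String> retrievedColumns) {
        List<String> result = new ArrayList<>(inputColumns);
        Set<String> used = new HashSet<>(inputColumns);
        int insertAt = 0;
        for (String pivot : inputPivots) {
            // Position just after the right-most pivot column.
            insertAt = Math.max(insertAt, inputColumns.indexOf(pivot) + 1);
        }
        for (String column : retrievedColumns) {
            String name = uniqueName(column, used);
            used.add(name);
            result.add(insertAt++, name);
        }
        return result;
    }
}
```

      Called with the input schema above, the pivot list ["zipcode"] and the retrieved columns ["name", "surface"], mergedSchema returns ["id", "name", "age", "zipcode", "name_1", "surface", "street"].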

      For information

      • Dataprep only allows preparations to be created on flat datasets. Datainventory sets the flag flat=true in its dataset description
      • In the same way, lookups are also flat

      Initially, this connector is dedicated to preparation executions. It should not be part of the Pipeline Designer list of available processors.
      It should therefore be defined as a technical processor with:

      @Metadatas({ @Metadata(key = "isTechnical", value = "true") })

      Like: https://github.com/Talend/connectors-ee/blob/b51ec38364eccf8e8b5e9a392d2f2c0393025880/processing-functions/src/main/java/org/talend/components/processing/functions/metadata/setrowid/SetRowIdProcessor.java#L25

      Acceptance Criteria

      • The processor should result in a Left Outer Join
      • Only first match should be joined.
      • The processor can define:
        • which input columns should match which lookup columns
          • All pairs are linked by an AND relation
        • parameters of another dataset (the lookup dataset)
        • which lookup columns to merge.
          • Columns to merge should be added after the last input record pivot column.
            • 1.28.0: they will be added at the end of the record.
            • Afterwards: they will be added after the last input record pivot column, as stated above.
          • Columns should be suggested -> This point is not supported by TDI but by the TDS integration of this component.
          • Lookup values that are not found are set to null.
      • Naming collision management for columns:
        • adding _1, _2, ... when collision occurs
      • The processor should not appear in Pipeline Designer.
        • Should be defined as a technical processor.
        • The user can add a Join processor to a pipeline like any other dataprep function -> Cannot be checked by TDI as long as it is not integrated by TDP
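
      The 1.28.0 variant of these criteria (merged columns appended at the end of the record, null values when no lookup row is found) can be sketched as below. AppendJoinSketch and its join method are hypothetical names, records are modelled as ordered maps for the example, and the suffix rule is the same assumption as for the legacy behavior: the first free _1, _2, ... suffix is used.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating the 1.28.0 merge: append at the end, null when unmatched. */
final class AppendJoinSketch {

    // match is the first matching lookup row, or null when the lookup has no match.
    static Map<String, Object> join(
            Map<String, Object> input, Map<String, Object> match, List<String> retrievedColumns) {
        // Copy the input record; a LinkedHashMap keeps the iteration order of the input.
        Map<String, Object> out = new LinkedHashMap<>(input);
        for (String column : retrievedColumns) {
            String name = column;
            int suffix = 1;
            while (out.containsKey(name)) {
                name = column + "_" + suffix++; // collision: try _1, _2, ...
            }
            // Left outer join: the column is always present, with null if unmatched.
            out.put(name, match == null ? null : match.get(column));
        }
        return out;
    }
}
```

      Keeping the merged columns present with null values, rather than omitting them, gives every output record the same schema whether or not a lookup row was found.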


      Topic | Description | DoR
      Description | Is the description enough for all stakeholders? | Version scope confirmed for 1.28.0. Marked topics will be covered in a following update.
      Acceptance Criteria | Are they defined? Were they validated by PO, Dev and QA? | Version scope confirmed for 1.28.0. Marked topics will be covered in a following update.
      Jira information | Is the Jira information correct? (fix version, labels, security level) |
      Environment | Environment ready: need TDP presentation. Support SSL: not needed. Reachable for QA/Doc/automation (TTP, JUnit): not yet. | Won't be integrated by TDP during validation. Will require local modifications to test the processor before release.
      License | Is the license (EE or SE) clearly identified? |
      Technical Analysis | Does the developer understand how it will be implemented? Do we have a solution? Approved/discussed with architecture (in-team, global or security, depending on the scope)? |
      Dependencies | Are all dependencies linked to JIRA (link "depends on")? Are they done? Including SRE/DevOps/IT. | Not done yet
      Migrations | Is migration needed? | No
      Doc | Is DOCT created and linked under the epic? | No doc needed: the processor is not available to users directly. It will replace the current Join from TDP, which is already documented in TDP.
      Communication channel | Is slack feat- created? With all the correct owners involved? (QA/Doc/PO/SM) | #feat-runtime-convergence -> Technical; #eng-runtime-convergence-sync -> Confirmations
      UX | Are there changes in the UI & were they added in the DOCT? For new forms, was it approved by UX? For new connectors, was a TUX ticket created for a new icon? | The icon already exists. The tck/join form will be done by TDP (as the processor replaces the back end of the current Join feature).


        1. image-2021-12-01-17-30-01-828.png
          42 kB
        2. image-2021-12-10-16-52-44-737.png
          43 kB
        3. majLivy.sh
          5 kB
        4. majLivyVersion.sh
          0.0 kB
        5. multiInputBeam.zip
          11 kB




              pteyssier pierre teyssier
              clesaec Christophe LeSaec
              Christophe LeSaec, Fabien Desiles, Yves Piel