Deduplication Algorithm

Problem Description

Sellers offering the same products create similar SKUs in Sellers Center, which produces many catalog entries for the same item.

  • Duplicate entries make search and catalog ranking more difficult.
  • This results in bad UX and leads to a decrease in sales.

Current Solution

Manual checking and fixing is ineffective due to the size of the catalog and the speed at which new items appear.

In addition, only 6k items are labeled as masters and 13k items have corresponding master information. This is less than 1% of the total catalog.

  • The algorithm will behave differently on each database, so it has to be analyzed in staging before it is used in production.

Algorithm

Step Description Screenshot
Short Overview
The proposed solution is based on novel approaches in text mining that show outstanding performance in various contests and production applications.
In general, the whole process consists of two stages:

  1. Training stage
  2. Requests processing stage
  • The training stage is required for correct system performance and for training all models.
    • It needs to be performed once per day.
  • The request processing (search) stage is the main process: it responds to a request containing SKU data with a list of similar items.
    Both stages have similar steps, described here:
[Screenshot: steps of the training and request processing stages]
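The split between a periodic training stage and an on-demand request processing stage can be illustrated with a small sketch. This page does not say which models are used, so the example below swaps in plain TF-IDF cosine similarity over item titles as a stand-in. The names `DedupIndex`, `train`, `find_similar`, the `min_score` threshold, and the sample SKUs are hypothetical and are not taken from the actual system.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a title and split it into alphanumeric tokens."""
    return "".join(c if c.isalnum() else " " for c in text.lower()).split()


class DedupIndex:
    """Toy two-stage index (assumption: TF-IDF stands in for the real models)."""

    def __init__(self):
        self.idf = {}      # token -> inverse document frequency weight
        self.vectors = {}  # sku -> L2-normalized TF-IDF vector

    def _vectorize(self, text):
        # Tokens never seen during training are ignored.
        counts = Counter(t for t in tokenize(text) if t in self.idf)
        vec = {t: n * self.idf[t] for t, n in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def train(self, catalog):
        """Training stage: fit weights on a dict of sku -> title (e.g. once per day)."""
        n_docs = len(catalog)
        doc_freq = Counter()
        for title in catalog.values():
            doc_freq.update(set(tokenize(title)))
        # Smoothed IDF so that very common tokens still get a positive weight.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        self.vectors = {sku: self._vectorize(title)
                        for sku, title in catalog.items()}

    def find_similar(self, title, top_k=3, min_score=0.5):
        """Request processing stage: return (sku, cosine similarity), best first."""
        query = self._vectorize(title)
        scored = []
        for sku, vec in self.vectors.items():
            score = sum(w * vec.get(t, 0.0) for t, w in query.items())
            if score >= min_score:
                scored.append((sku, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Hypothetical catalog: SKU-1 and SKU-2 are the same product listed twice.
catalog = {
    "SKU-1": "Apple iPhone 7 32GB Black",
    "SKU-2": "iphone 7 black 32gb apple",
    "SKU-3": "Samsung Galaxy S8 64GB Blue",
}
index = DedupIndex()
index.train(catalog)
matches = index.find_similar("Apple iPhone 7 (32GB, Black)")
```

Here `matches` contains `SKU-1` and `SKU-2` but not `SKU-3`, because the two iPhone listings use the same tokens in a different order. A linear scan over all vectors is only workable for a toy catalog; a catalog of the size described above would need an approximate nearest-neighbour index or a blocking step, which is outside the scope of this sketch.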