How to Avoid Duplicates in a Database?

  • Every 10 minutes I scrape a news website for new articles and save the title + content into a database. The problem is that instead of ignoring the already existing titles/content, it just adds the same entries again below. How can I avoid that?
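One general way to do this (a minimal sketch in plain JavaScript, not the tool's own blocks) is to filter the scraped batch against the titles already stored before inserting. `scraped` and `existingTitles` are hypothetical names for the new batch and the titles loaded from the database:

```javascript
// Keep only articles whose title is not already in the database.
// `existingTitles` is assumed to be an array of stored titles.
function filterNewArticles(scraped, existingTitles) {
  const seen = new Set(existingTitles);           // fast membership lookup
  return scraped.filter(a => !seen.has(a.title)); // drop already-stored titles
}
```

Only the articles this returns would then be inserted, so re-running the scrape every 10 minutes adds nothing twice.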

  • Good question. Out of curiosity, I checked databases and tried to check for a string with JavaScript `.match` or `.indexOf`, but it either returned false or threw an error even when the string was there.

    I also took the list from the database and checked with the list contains option to see if it's there, but it returned false.
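One possible reason exact-match checks like `.indexOf` or list contains return false is a small mismatch (trailing whitespace, different capitalization) between the stored string and the scraped one. A minimal sketch of a normalized comparison, with illustrative names:

```javascript
// Normalize both sides so whitespace/case differences don't make an
// existing title look new. `storedTitles` is a hypothetical array of
// titles read from the database.
const normalize = s => s.trim().toLowerCase();

function containsTitle(storedTitles, candidate) {
  return storedTitles.some(t => normalize(t) === normalize(candidate));
}
```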

  • OK, I managed to do it this way:

    [screenshot of the solution]

    Be careful with the `notreuse` option.

  • Thank you! Will try it now

  • I have the same problem.

    Unfortunately, I don't see how the string matches regex function can help.
    It lets me check a single variable, but I don't see how I could check all the records in a table.

    Any ideas or other solutions?

  • @zw Were you able to find a way around it?

  • There are many ways to handle this issue. I'll show you one way.

    For a news website, recently added articles almost never have 100% identical content. Therefore, you only need to compare the latest data added to your database with the data you're checking: if they are 100% identical, the article has already been added, so you can skip it and check the next piece of data the same way.

    So, for your case, you can convert the database to a list, then retrieve the last element, which corresponds to the latest data added (or retrieve the latest row directly from the database if you know how to work with databases). Then use Parse CSV string to split that string into the article title and content.

    Compare the title of the latest article with the title you're checking. If they are 100% identical, it has already been added to the database. To be more certain, you can also compare the content (although this may not be necessary).

    I hope you understand what I'm saying.
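    The steps above can be sketched in plain JavaScript. `lastRowCsv` stands for the last element of the database-as-list (a "title,content" CSV string), and the naive split is only a stand-in for the Parse CSV string block:

```javascript
// Naive CSV split as a stand-in for Parse CSV string; real titles may
// contain commas and need a proper CSV parser.
function parseCsvRow(row) {
  const [title, ...rest] = row.split(",");
  return { title, content: rest.join(",") };
}

// True if the newly scraped title matches the latest stored article,
// i.e. it was already added and can be skipped.
function isAlreadyStored(lastRowCsv, newTitle) {
  return parseCsvRow(lastRowCsv).title === newTitle;
}
```

    Note this only catches a duplicate of the single most recent row, which matches the assumption above that the scraper always sees the newest article first.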
