Is this website really built with protection against scraping?



  • The website is https://www.proxyrotator.com/free-proxy-list/.

    After months of mastering BAS (and learning JavaScript, Node, regex, XPath, etc.) I became confident there was no website I couldn't extract information from... until I tried the one above.

    If someone knows, or has an idea of, how to take the proxies from it, please share.

    I'm really curious what solutions we have in BAS for it (apart from the obvious one: a screenshot and OCR over it).



  • @hungrym said in Is this website really built with protection against scraping?:

    The website is https://www.proxyrotator.com/free-proxy-list/.
    After months of mastering BAS (and learning JavaScript, Node, regex, XPath, etc.) I became confident there was no website I couldn't extract information from... until I tried the one above.
    If someone knows, or has an idea of, how to take the proxies from it, please share.
    I'm really curious what solutions we have in BAS for it (apart from the obvious one: a screenshot and OCR over it).

    What exactly is the problem? I looked at the site; in my opinion it is simple, a little inconvenient to parse, but in general not a problem.



  • @usertrue And how exactly? Did you check the source code? It's not possible to copy/paste the proxies (just try it in a normal browser), let alone write a BAS script to parse them. I mean, what XPath, CSS, or regex would you use to grab the full proxy and add it to a list in BAS?



  • @hungrym said in Is this website really built with protection against scraping?:

    And how exactly? Did you check the source code? It's not possible to copy/paste the proxies (just try it in a normal browser), let alone write a BAS script to parse them. I mean, what XPath, CSS, or regex would you use to grab the full proxy and add it to a list in BAS?

    Yes, I looked at the page source; it has everything you need.



  • @usertrue: OK. In that case, can you give an idea of how to parse the IP address and the port of each proxy? If you're busy, don't go into details, just an overview of how you would do it.
    I really can't find a way, and you're saying it's simple...



  • @hungrym I wrote a JS snippet that runs in the browser and collects the data. But the port comes as a base64-encoded image, so you'll have to figure out the rest yourselves. There are recognition modules in Node.js, but I don't have time for that (a sketch of that step follows after the code and attachment below).

    {
    	// Collect only the real proxy rows; rows carrying a class are decoys and are skipped.
    	let proxy = [];
    	let rows = Array.from(document.querySelectorAll('tbody tr:not([class])'));
    	rows.forEach(row => {
    		// The IP cell is padded with hidden fake fragments; keep only elements that are
    		// actually visible at their own coordinates, drop the trailing fragment, and join the rest.
    		let ip = Array.from(row.querySelectorAll('td:nth-of-type(2)>*')).filter(el => {
    			let xy = el.getBoundingClientRect();
    			return el == document.elementFromPoint(xy.x, xy.y);
    		}).map(el => el.textContent).slice(0, -1).join('');
    		// The port is rendered as an image with a data URI; keep the base64 payload after
    		// the comma (assuming a standard "data:image/png;base64,..." src) for later OCR.
    		let port = row.querySelectorAll('td:nth-of-type(3)>img')[0].src.split(',')[1];
    		let loc = row.querySelectorAll('td:nth-of-type(4)')[0].textContent.trim();
    		let type = row.querySelectorAll('td:nth-of-type(6)')[0].textContent.trim();

    		proxy.push({ ip, port, type, loc });
    	});
    	// The last expression is the value returned to BAS from this snippet.
    	JSON.stringify(proxy)
    }
    

    0_1565013145122_proxyrotator.xml
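
    For completeness, here is a minimal Node.js sketch of the missing OCR step, assuming the port field collected by the snippet above is the bare base64 payload of a PNG. It uses tesseract.js only as one example of the "recognition modules in Node.js" mentioned above; the module choice and the proxies.json input file are illustrative assumptions, not part of the original script.

    // ocr-ports.js - minimal sketch, assuming proxies.json holds the array produced
    // by the browser snippet above ({ip, port, type, loc}), where "port" is the
    // base64 payload of the port image.
    const fs = require('fs');
    const Tesseract = require('tesseract.js');

    async function main() {
    	const proxies = JSON.parse(fs.readFileSync('proxies.json', 'utf8'));
    	for (const p of proxies) {
    		// Rebuild the image bytes from the base64 payload and run OCR on them.
    		const image = Buffer.from(p.port, 'base64');
    		const { data } = await Tesseract.recognize(image, 'eng');
    		// The image contains nothing but the port number, so keep digits only.
    		p.port = data.text.replace(/\D/g, '');
    		console.log(`${p.ip}:${p.port} (${p.type}, ${p.loc})`);
    	}
    	fs.writeFileSync('proxies-resolved.json', JSON.stringify(proxies, null, 2));
    }

    main().catch(console.error);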



  • @usertrue Thank you so much! Very clever and elegant!
    For the port, the ng-ocr Node module or the ocr.space API works for me (a rough sketch of the API call is below).
    I'm definitely going to read your answers on the forum and learn from them! Thanks again!
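
    As a rough illustration of the ocr.space route mentioned above (not the poster's actual script): the request follows ocr.space's public parse/image endpoint, which accepts an API key and a base64Image data URI. The API key placeholder, the PNG data-URI prefix, and the use of Node 18+'s global fetch are assumptions to verify against the current documentation.

    // ocr-space-port.js - rough sketch of resolving one port image via the ocr.space API.
    async function ocrPort(base64Png, apiKey) {
    	const body = new URLSearchParams({
    		apikey: apiKey,
    		// ocr.space expects a full data URI, not the bare base64 payload.
    		base64Image: 'data:image/png;base64,' + base64Png,
    		language: 'eng'
    	});
    	const res = await fetch('https://api.ocr.space/parse/image', { method: 'POST', body });
    	const json = await res.json();
    	// ParsedResults[0].ParsedText holds the recognized text; keep digits only.
    	const text = json.ParsedResults && json.ParsedResults[0] ? json.ParsedResults[0].ParsedText : '';
    	return text.replace(/\D/g, '');
    }

    // Example usage with a port value taken from the browser snippet above:
    // ocrPort(proxy.port, 'YOUR_OCRSPACE_API_KEY').then(port => console.log(port));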



  • Can anyone help me get the port number automatically, or post a working project file here?

