THE TEMPLATE DETECTION AND CONTENT EXTRACTION BENCHMARK SUITE README
====================================================================

1.Introduction
==============
Template extraction and content extraction techniques extract information from real webpages. They need to be tested continually in order to improve both the quality of their results and their performance. This testing is done on sets of webpages prepared for the purpose, so a benchmark suite is an essential requirement for measuring the performance of these techniques.

2.How to obtain TECO 4.0
========================
TECO 4.0 can be downloaded from the following URL:
https://mist.dsic.upv.es/teco

3.Structure
===========
TECO 4.0 was created by downloading 130 websites from the Internet. Once all the webpages were downloaded, four different engineers explored the key page of each website and the webpages accessible from it to decide which part of the webpage is the template and which part is the main content. Using the results of this experiment, each website was prepared for template extraction, content detection and menu detection.

On the one hand, all elements of the key page that do not belong to the template were assigned to an HTML class called TECO_notTemplate. This way, a template extraction tool can compare its output to the nodes that do not belong to the TECO_notTemplate class. On the other hand, all elements that belong to the main content were assigned to an HTML class called TECO_mainContent. Therefore, a content extraction tool can easily compare its output to the nodes that belong to this class. In addition, the main menu of the key page was assigned to an HTML class called TECO_mainMenu.

The suite covers different kinds of websites, such as blogs, company websites, forums, personal websites, sports websites, newspapers, etc. Some of the websites are well known, like the BBC website or the Unicef website, and others are less known, like personal blogs or small company websites.
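The class-based annotation described above can be read back with a few lines of Python. The following is only a minimal sketch, not part of TECO: the names `ClassSplitter`, `split_by_class` and `precision_recall` are hypothetical, the comparison by text chunks and the precision/recall metric are assumptions (this README does not prescribe a metric), and the depth counting assumes reasonably well-formed markup in which every non-void element is explicitly closed. A real evaluation would more likely compare DOM nodes with a full HTML parser.

```python
from html.parser import HTMLParser

# Void elements never have an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassSplitter(HTMLParser):
    """Split the text of a page into chunks inside/outside a marked class."""

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)
        self.class_name = class_name
        self.depth = 0      # 0 means "outside any marked element"
        self.inside = []    # text chunks within elements of the class
        self.outside = []   # every other text chunk

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a marked element, or go one level deeper inside one.
        if self.depth > 0 or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())   # normalize whitespace
        if text:
            (self.inside if self.depth > 0 else self.outside).append(text)


def split_by_class(html, class_name):
    """Return (chunks inside class_name, chunks outside class_name)."""
    parser = ClassSplitter(class_name)
    parser.feed(html)
    parser.close()
    return parser.inside, parser.outside


def precision_recall(retrieved, gold):
    """Set-based precision and recall of a tool's output against the gold set."""
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

Under these assumptions, the gold main content of a key page is the first list returned by `split_by_class(html, "TECO_mainContent")`, while the gold template is the second list returned by `split_by_class(html, "TECO_notTemplate")`, that is, everything outside the TECO_notTemplate class.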
TECO 4.0 is organized in directories. There is a main directory called pages, which contains 129 directories, one for each website domain. Note that two of the websites share the same domain.

4.How to use TECO
=================
The installation is very simple: the zip file has to be extracted to the hard drive, a pendrive or other media. Once extracted, it creates a directory called pages. It is recommended to extract the file on Linux or OS X systems because Windows-based systems do not allow the directory structure used to store the benchmarks.

5.Key pages
===========
The following paths indicate the path to the key page of each benchmark:

web.mit.edu/institute-events/visitor
www.isoc-es.org
www.museodelprado.es/index.html
www.jdi.org.za/index.html
www.u-tokyo.ac.jp/en/about/history.html
www.savethechildren.net/what-we-do/our-humanitarian-work.html
college.harvard.edu/financial-aid.html
www.unicef.org/where-we-work.html
www.linuxfoundation.org/about.1.html
clinicaltrials.gov/ct2/search/index/index.html
cordis.europa.eu/fp7/ict/fire.html
www.informatik.uni-trier.de/~ley/pers/hd/s/Silva_Josep.html
parents.berkeley.edu/advice/babies/laundry.html
amref.org/strategic-pillars/index.html
cpoepalencia.es/federaciones-y-asociaciones-confederadas-asociaciones/index.html
www.icann.org/history.html
www.gip-jci-justice.fr/en/about-us/support-council/index.html
www.einstein.yu.edu/leadership/index.html
www.americanacademy.de/about/index.html
www.mensa.es/cms/pages/%C2%BFqu%C3%A9-es-mensa.html
www.bcrf.org/breast-cancer-research.html
www.ielts.org/what-is-ielts/ielts-introduction.html
fr.unesco.org/about-us/introducing-unesco.html
www.ccbe.eu/about/who-we-are/index.html
www.fraud.org/get_involved.html
www.avaasaja.org/index.php/quienes-somos.html
es.sharelatex.com/learn/Uploading_a_project
github.com/DawidStankiewicz/forum.1
forum.skyscraperpage.com/index.html
en.citizendium.org/index.html
www.filmaffinity.com/es/main.html
stackoverflow.com/index.html
www.meneame.net/faq-es.html
www.strangehorizons.com/2004/20040906/greenglass-f.shtml.html
www.accountkiller.com/en/delete-activision-account.html
study.com/learn/science-questions-and-answers.html
c.mi.com/it/index.html
frances.forosactivos.net/index.html
alumni.harvard.edu/help/message-board.html
www.spacetimestudios.com/forumdisplay.php%3f29-Websites-and-Forum-Discussion.html
www.gimpforum.de/index.html
www.emaildiscussions.com/index.html
forums.debian.net/viewforum.php%3ff%3d5.html
forums.mozillazine.org/viewforum.php%3ff%3d23.html
forums.tomsguide.com/forums/laptop-general-discussion.15/index.html
forums.mysql.com/list.php%3f21.html
lawstudents.ca/forums.html
www.japanesepod101.com/forum/viewforum.php%3ff%3d26.html
forums.opera.com/index.html
forums.linuxmint.com/viewforum.php%3ff%3d72.html
www.wysiwygwebbuilder.com/forum/viewforum.php%3ff%3d10.html
communities.apple.com/es/community/mac_os/os_x_el_capitan.html
www.eclipse.org/index.html
www.swimmingpool.com/index.html
www.emmaclothes.com/index.html
www.arduino.cc/en/Main/Software.html
today.java.net/pub/a/today/2004/07/06/3ddesktop.html
clotheshor.se/index.html
ruzafagallery.com/calendario/index.html
www.raspberrypi.org/resources/teach/index.html
doodle.com/online-calendar.html
www.newprosoft.com/web-content-extractor.htm
worryfreelabs.com/about.1.html
www.intelligencetest.com/index.htm
www.ikea.com/gb/en.html
www.nubbeo.com.ar/index.html
www.mulberry.com/es/shop/sale/sale-mens-accessories.html
www.tous.com/es-es/novedades/relojes/c/59.html
preferenceweb.com/collections/all-sneakers.html
www.trekbikes.com/us/en_US/bikes/mountain-bikes/electric-mountain-bikes/c/B512/index.html
addons.prestashop.com/es/2-modulos.html
us.pandora.net/en/charm-bracelets/pandora-moments/pandora-moments-bracelets/index.html
kawaiipenshop.com/index.html
www.vam.ac.uk/shop/lindsay-philip-butterfield-blue-flower-silk-scarf.html
shop.fendt.com/kids-toys/clothing/shirts.html
www.euroholds.com/it/29-prese-arrampicata.html
naranjascarcaixent.com/tienda.html
www.usedbooksfactory.com/buy-second-hand-old-books/category/ENGINEERING-FIRST-YEAR-BOOKS.html
www.cocinaconmarta.com/2015/04/empanadillas-chinas-de-gambas-y-verduras.html
www.javiercelaya.es/index.html
markahall.blogspot.com.es
www.trendencias.com
users.dsic.upv.es/~dinsa/en/index.html
googleblog.blogspot.com.es
www.robyncarr.com/qa.html
www.annmalaspina.com/index.html
users.dsic.upv.es/~jsilva/wwv2013/index2.html
foodsense.is/a-list.html
diarium.usal.es/lguich/pagina-personal-de-luis-arturo-guichard
www.folj.com/puzzles/difficult-logic-problems.htm
oneminutelist.com/16-browser-alternatives-to-desktop-programs/index.html
artsonline.uwaterloo.ca/jburbidg/index.html
benjamincongdon.me/blog.html
michael.tsikerdekis.com/index.html
www.beeorganisee.com/reprendre-en-main-le-nettoyage/index.html
www.danielgrindrod.com/about.html
ofdollarsanddata.com/index.html
blog.mint.com/updates/enter-our-newdecadenewyou-meme-sweepstakes-for-a-chance-to-win-5000/index.html
elainesir.com/best-korean-beauty-blogs-bloggers-follow/index.html
www.vindame.com.br/semana-riesling/uva-riesling/index.html
www.rosamontero.es/obra-rosa-montero.html
www.almezzer.com/libros/literatura-infantil/a-partir-de-4-anos/index.html
johnboyne.com/about/index.html
johngardnerathome.info/index.htm
edition.cnn.com/index.html
www.neoteo.com/star-wars-the-force-awakens-el-regreso-de-viejos-personajes/
riotimesonline.com/index.html
www.bbc.co.uk/news/index.html
techcrunch.com/gadgets
www.turfparadise.com/index.html
www.cleanclothes.org/index.html
www.afp.com/es/contact.html
news.discovery.com/tech/robotics/artificial-intelligences-hawkings-fears-stir-debate-141206.htm
www.history.com/index.html
detroit.cbslocal.com/2018/12/04/high-school-newspaper-suspended-after-publishing-disruptive-investigation/index.html
www.rocklists.com/91x-1983.html
www.lashorasperdidas.com/index.html
www.journalism.org/2014/03/13/social-search-direct/index.html
www.socialmediatoday.com/news/facebook-adds-new-features-for-instant-articles-including-links-to-more-pu/569786/index.html
www.diariodeburgos.es/Noticia/Z1C5D6DE9-D1E6-B03A-61236AF21520B8B2/202002/Un-programa-verde-dedicado-a-Felix-Rodriguez-de-la-Fuente.html
wordofmouthmendo.com/word-of-mouth-stories/2018/5/31/travellers-fare.html
www.usine-digitale.fr/article/la-start-up-americaine-clearview-ai-illustre-deja-les-derives-de-la-reconnaissance-faciale.N921119.html
1015fm.com.au/2020/02/steve-mickenbecker-interest-rates-on-hold-2020-02-07/index.html
www.dw.com/de/lebron-james-vom-pflegekind-zum-basketball-superstar/a-52088565.html
www.theday.com/movies--tv/20200203/super-bowl-ads-dialed-up-fun-as-antidote-to-politics.html
nltimes.nl/2019/12/16/chocolate-spread-babies-wins-misleading-product-award.html
biztechmagazine.com/article/2019/12/why-byod-makes-endpoint-security-crucial-small-businesses.html
www.diariandorra.ad/noticies/nacional/2020/01/23/descontrol_156338_1125.html
www.wishtv.com/news/flu-is-widespread-across-the-us/index.html
es.gizmodo.com/google-nearby-sharing-por-fin-una-alternativa-a-airdro-1841219114.html
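After extracting the suite, it can be useful to check that every key page listed above is actually present on disk. The sketch below is an illustration only: the function name `classify_key_pages` is hypothetical, and it assumes that the key page paths are relative to the extracted pages directory and that percent-encoded segments such as `%3f` appear literally in the file names. It does not guess which file inside a directory is the key page. It only reports whether each path is a file, a directory, or missing.

```python
from pathlib import Path


def classify_key_pages(pages_dir, key_paths):
    """Report, for each key page path, what exists at pages_dir/<path>."""
    root = Path(pages_dir)
    status = {}
    for rel in key_paths:
        candidate = root / rel
        if candidate.is_file():
            status[rel] = "file"
        elif candidate.is_dir():
            # Some listed paths have no file extension and may be directories.
            status[rel] = "directory"
        else:
            status[rel] = "missing"
    return status
```

For example, `classify_key_pages("pages", ["www.eclipse.org/index.html", "www.isoc-es.org"])` returns a dictionary with one entry per path, which makes an incomplete extraction (for instance on a file system that rejects some of the names) easy to spot.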