A new way to search and navigate TGSC

PeeWee678

Well-known member
Jan 7, 2022
445
275
where would I find that code?
Like Zongo wrote: you have to edit those lines in James' file. Have you downloaded that already?
Then change lines 156 and 7841 the way I did above. Of course your edit depends on how and to what location on your hard drive you scraped TGSC.

I used HTTrack to scrape TGSC and stored it to: F:\TGSC\TGSC\
That's also the exact folder I unzipped Charles' "AllFragrance-embedded-v1.html" to.

Hope this helps. If not, please let me know.
 

Zongo

Member
Mar 11, 2023
63
8
Tech note: the HTTrack action worked here, but with over 4000 errors and warnings, and a number of missing files, like
Code:
Warning:     Retry after error -4 (Connect Error) at link www.thegoodscentscompany.com/data/rw1008501.html
Which might explain different numbers for people here. Anyway, I fixed this by rerunning HTTrack a second time. It changes automatically into 'update mode'. This time, it still had thousands of errors, but they were merely of the type
Code:
Error:     "Not Found" (404) at link www.thegoodscentscompany.com/data/false1280301.html
which doesn't seem to lead to any missing files. There is not a single error for actual files like rw1664231.html. As of now, I have ~41,000 files in ~1.07GB. Before stripping it down it was ~1.85GB, but you can delete the folders selected in the attached screenshot, which are unrelated to the TGSC website, and that brings it down to 1.07GB. To better compare numbers: the core folder www.thegoodscentscompany.com\data has 25,235 items on my drive. How about yours? P.S. I'm still missing some files, so my numbers aren't final. E.g. es1615211.html "psiadia altissima".
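To tell the two kinds of log entries apart without scrolling through thousands of lines, something like the following might help. It is only a sketch: the two patterns are modeled on the two log lines quoted above, any other HTTrack message is simply ignored, and the `retry:` / `http:` group names are made up for illustration.

```python
import re
from collections import defaultdict

# Patterns modeled on the two log lines quoted in this post (assumption:
# other entries of the same kind follow the same wording).
RETRY_RE = re.compile(r'Warning:\s+Retry after error (-?\d+) \((.+?)\) at link (\S+)')
HTTP_ERR_RE = re.compile(r'Error:\s+"(.+?)" \((\d+)\) at link (\S+)')

def classify_log(lines):
    """Group the reported links by the kind of problem in the log line."""
    groups = defaultdict(set)
    for line in lines:
        m = RETRY_RE.search(line)
        if m:
            # e.g. "retry:Connect Error" -> pages that may really be missing
            groups[f"retry:{m.group(2)}"].add(m.group(3))
            continue
        m = HTTP_ERR_RE.search(line)
        if m:
            # e.g. "http:404" -> links the server itself reports as not found
            groups[f"http:{m.group(2)}"].add(m.group(3))
    return dict(groups)

sample = [
    'Warning:     Retry after error -4 (Connect Error) at link www.thegoodscentscompany.com/data/rw1008501.html',
    'Error:     "Not Found" (404) at link www.thegoodscentscompany.com/data/false1280301.html',
]
result = classify_log(sample)
for kind, urls in sorted(result.items()):
    print(kind, len(urls))
```

To run it on a real log, pass the lines of the log file instead of `sample`. The links in the `retry:` groups are the ones worth checking against the files on disk.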
 

Attachments

  • _screen.png

tensor9

Basenotes Plus
Feb 18, 2014
3,129
176
If I had the money and time, I’d take over TGSC and make it right.
 

rococo

New member
Jan 1, 2010
92
94
I have a scrape of all the materials and formulae, parsed and cleaned into JSON blobs as well as a relational database, and my own frontend that I run locally. I’ve thought of making this available (either the data alone or a full hosted site with search etc), but don’t want to overstep if Bill’s family still have an interest in the site.
 

PeeWee678

Well-known member
Jan 7, 2022
445
275
I'm still missing 11 thousand core files
If you're interested in what items I have in the main folder, see the attachment.
You can run it through a file comparison tool of your choice and compare with a list of yours to see what's missing in your download.

P.S. I included all 13,521 images (mostly of the molecular structures) in my file count and generated the list after "expanding" the folder structure (I use Total Commander for this kind of thing), so it's a recursive list of all files in the folder "www.thegoodscentscompany.com".
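If you'd rather not use a file comparison tool, a few lines of Python can do the same comparison. This is a rough sketch that assumes both listings have one path per line. The exact format of the attached listing may differ, and the sample entries below are invented just to show the idea.

```python
def normalize(entry):
    """Unify separators and case so Windows and Mac/Linux listings compare equal."""
    return entry.strip().replace("\\", "/").lower()

def compare_listings(mine, theirs):
    """Return (only_in_mine, only_in_theirs) as sorted lists."""
    a = {normalize(e) for e in mine if e.strip()}
    b = {normalize(e) for e in theirs if e.strip()}
    return sorted(a - b), sorted(b - a)

# Invented sample data; real listings would be read from the two text files.
mine = ["data\\rw1008501.html", "data\\es1615211.html", ""]
theirs = ["data/rw1008501.html", "data/rw1664231.html"]

only_mine, only_theirs = compare_listings(mine, theirs)
print("only in my download:", only_mine)
print("only in the other download:", only_theirs)
```

For real files, each list could be loaded with `open(path, encoding="utf-8").read().splitlines()`, where `path` is wherever you saved the listing.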
 

Attachments

  • files in www.thegoodscentscompany.com folder.txt
    552.7 KB · Views: 10

Zongo

Member
Mar 11, 2023
63
8
@PeeWee678 That is helpful! In your listing, I did a spot check with 5 random items that my HTTrack action didn't scrape.
Code:
http://www.thegoodscentscompany.com/data/es1615211.html ( psiadia altissima leaf oil )
http://www.thegoodscentscompany.com/data/es1390731.html ( citrus medica peel oil )
http://www.thegoodscentscompany.com/data/rw1454231.html ( melozol formate )
http://www.thegoodscentscompany.com/data/rw1473041.html ( 3-octyl formate )
http://www.thegoodscentscompany.com/data/es1051851.html ( yuzu peel oil )
All of them are missing from your listing, so there are probably plenty of items missing on your side too. Correct me if I'm wrong, but AFAIK HTTrack just follows links, so it can neither see nor scrape pages that aren't linked from anywhere else. Maybe it would be best to contact the family and ask directly for permission to mirror their site.
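A spot check like this can also be scripted. The sketch below assumes the mirror stores each page under a folder named after the host, as in the www.thegoodscentscompany.com\data folder mentioned earlier in the thread. The helper names are made up, and the demo uses a temporary folder rather than a real download.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit

def local_path(url, mirror_root):
    """Map a page URL to where a host-named mirror folder would store it."""
    parts = urlsplit(url)
    return Path(mirror_root) / parts.netloc / parts.path.lstrip("/")

def missing_pages(urls, mirror_root):
    """Return the URLs whose mirrored file does not exist on disk."""
    return [u for u in urls if not local_path(u, mirror_root).is_file()]

urls = [
    "http://www.thegoodscentscompany.com/data/es1615211.html",
    "http://www.thegoodscentscompany.com/data/rw1454231.html",
]

# Demo in a throwaway folder: create only the first page, then check both.
with tempfile.TemporaryDirectory() as root:
    page = local_path(urls[0], root)
    page.parent.mkdir(parents=True)
    page.write_text("<html></html>")
    still_missing = missing_pages(urls, root)

print(still_missing)
```

With a real download, `mirror_root` would be the folder that contains the www.thegoodscentscompany.com folder.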
 

PeeWee678

Well-known member
Jan 7, 2022
445
275
In your listing, I did a spot check with 5 random items that my HTTrack action didn't scrape
Thanks!

They're indeed missing on my side. Those items are linked from the alphabetic pages, though, so I have no idea why my scrape didn't pick them up.
I think I'll scrape again, in several iterations over the coming days, to see if that helps.

There is already a mirror BTW: http://www.perflavory.com
I'm scraping that now.
 

Kittycat74

New member
May 27, 2020
116
8
At the risk of incurring the wrath of anyone who thinks I haven't read all the previous excellent posts, and because I've already posted this in the other thread about the lack of TGSC search engine functionality: I'm using Perflavory, which appears to have all (or most) of the same info as TGSC. If anyone is looking for an analogue of TGSC and wants to search for stuff, Perflavory is a good replacement.
 

Zongo

Member
Mar 11, 2023
63
8
@PeeWee678 Alright, I've started scraping perflavory and will see how I'm going to merge both sources (original TGSC and perflavory). If perflavory turns out to have no disadvantages, even better, and we can discard TGSC?
 

pkiler

Basenotes Plus
Dec 5, 2007
13,542
2,351
I have not spoken about my favorite search tool in this thread, so as not to draw attention away from James' brilliance and from getting his tool into your hands.
But if you aren't going to use his new tool and still want to search TGSC, my main method to search TGSC is through the www.perfumersearch.com portal. I use it every day. Bless you James for your brilliant work as well! ;-)
 

Culpa Ire

Active member
Nov 11, 2022
202
228
@Culpa Ire You swap "http://" for "file:///" in two occurrences in the AllFragrance-embedded-v1.html file. The path given with "file:///..." must point to your local data, so take care to adapt it to the directory where HTTrack downloaded the website.

P.S. my answer was maybe a bit too simplified, as you actually have to make sure that 'www.' is in here
Code:
$('#iframe').attr('src', `file:///F:/TGSC/TGSC/www.thegoodscentscompany.com/data/${ref}.html`)
... which wasn't in there before.
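If you're unsure what the "file:///..." part should look like for your own folder, this small sketch builds it from a folder path. It assumes the download contains a www.thegoodscentscompany.com/data folder as in the line above. The function name and the Mac path are made-up examples.

```python
from urllib.parse import quote

def file_url_template(mirror_root):
    """Build the file:/// template for the folder the site was downloaded to."""
    path = mirror_root.replace("\\", "/").rstrip("/")
    if not path.startswith("/"):
        path = "/" + path  # Windows drive paths need a leading slash in a file URL
    data_dir = f"{path}/www.thegoodscentscompany.com/data"
    # Percent-encode characters such as spaces; keep slashes and the drive colon.
    return "file://" + quote(data_dir, safe="/:") + "/${ref}.html"

print(file_url_template("F:\\TGSC\\TGSC\\"))   # the Windows folder from this thread
print(file_url_template("/Users/me/TGSC"))     # hypothetical Mac folder
```

The printed text is what would go between the backticks in the line above, adapted to your own download location.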

Thanks to both of you for your reply.

My problem is that I don't know how or where to edit that file. I assume I open it with a browser and edit using the page inspector. If not, can you tell me which program to use to make the edit? I have the path to the folder where I copied the site; I'm just totally green when it comes to editing .html files. I'm sure it's easy enough, but treat me like I was born yesterday.

I'm on a Mac if it makes any difference.
 
