Mock Turtles All the Way Down
Today's article delves into the intricacies of creating mock hotel data for my travelectable website. I originally started using the Amadeus Developers' API to populate the site but it faces a few issues:
- It's quota restricted. If my site gathers any kind of traffic (granted that's a big if), I'll have to rate limit my users or pay to increase the limit without any expected revenue to cover the new expenses.
- Because it's a third-party aggregator for hotel chains and airlines, its descriptive content (at least in the test environment I'm using) can leave me wanting. Sometimes the hotel descriptions are simply a grouping of nearly disjoint keywords.
- The test sandbox limits certain API calls to certain locations. Star ratings are only available for London and New York and, even then, further restricted to certain hotels.
- Since the data is sourced from vendors' live systems, I can't control the rate availability. This is a problem for a proof of concept website.
User: Generate a list of 40 top hotels in New York. For each hotel, generate a list of 4 room rates with amenities and an estimated room rate.
No.
LLMs, though they're better at understanding natural language than the average computer, are still horribly literal. And since they can keep generating text unabated, a literal misinterpretation becomes incomprehensible by the time they've finished. They're also horrible at counting. And, unless directed specifically, they're shy about being wrong, so they may refuse to answer your question at all (and yet they have no problem lying with unmitigated confidence in completely unrelated areas).
When I tried variations of the above query, I often got a response that I should check various travel sites for rates. This isn't helpful when you're attempting to programmatically generate as much data as possible in one go.
LLMs also suffer from what I'll deem "token fatigue," where something like 'estimated room rate': $120 becomes 'intimate room rate': $120 as the length of the output increases. As a public service announcement: beware of 500-word essays discussing the root causes of WWII. By the end of the essay, you may be informed that an underlying cause of the war was ill will between the Weimaraner Republic and French pastries.
And then, sometimes, they'll just stop generating output altogether. Lovely.
I spent nearly a day massaging test queries for domestic US vacation spots only to receive answers for Quebec City, Cancun, and... Mongolia. When I asked for 50 destinations, I'd logically get 18 back. Or 37. Or 50, where Savannah was listed twice.
I finally recognized the folly in my original strategy and understood that I didn't need to drink from the firehose in order to succeed. I could build up my data slowly with my own safeguards (say, restricting the total number of locations I query, or no longer asking for hotels once I've reached 40 in a given location).
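As a rough illustration of those safeguards, here is a minimal Python sketch that builds up one city's hotel list in small batches, skips repeats, and stops at the cap. The `fetch_batch` callable is a hypothetical stand-in for whatever function actually sends the query to the LLM; the batch size and attempt limit are assumptions of mine, not tuned values.

```python
from typing import Callable, Dict, List

MAX_HOTELS_PER_CITY = 40  # the cap mentioned above
BATCH_SIZE = 10           # assumption: ask for a small batch per query
MAX_ATTEMPTS = 8          # assumption: give up rather than loop forever


def collect_hotels(
    city: str,
    fetch_batch: Callable[[str, int], List[Dict]],
) -> List[Dict]:
    """Accumulate unique hotels for one city until the cap is reached.

    fetch_batch(city, n) is a hypothetical callable that asks the LLM
    for n hotels and returns them as a list of dicts with a "name" key.
    """
    seen = set()
    hotels: List[Dict] = []
    for _ in range(MAX_ATTEMPTS):
        if len(hotels) >= MAX_HOTELS_PER_CITY:
            break
        for hotel in fetch_batch(city, BATCH_SIZE):
            # Normalize the name so "The Plaza" and "the plaza " match.
            key = str(hotel.get("name") or "").strip().lower()
            if not key or key in seen:
                continue  # skip blank names and repeats
            seen.add(key)
            hotels.append(hotel)
            if len(hotels) >= MAX_HOTELS_PER_CITY:
                break
    return hotels
```

Because the loop counts what actually came back rather than trusting the model to count, a short or duplicated batch just means one more small query instead of a ruined data set.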
Asking the LLM to structure the data also seems to make its output more reliable and has the added bonus of my being able to write code to store it easily.
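To show why structured output is easier to store, here is a small Python sketch of a parser for the replies. It assumes the model may wrap the data in chatty text and may answer in either strict JSON or a Python-style literal with single quotes (the formats requested in the prompts use both). The function name `parse_structured_reply` is my own invention for illustration.

```python
import ast
import json
import re
from typing import Any


def parse_structured_reply(reply: str) -> Any:
    """Pull the outermost {...} or [...] span out of a reply and parse it."""
    # Greedy match: first opening bracket through the last closing one.
    match = re.search(r"[\[{].*[\]}]", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no structured data found in reply")
    payload = match.group(0)
    try:
        return json.loads(payload)  # strict JSON (double quotes)
    except json.JSONDecodeError:
        pass
    try:
        # Python-style literal (single quotes); literal_eval only
        # evaluates literals, it never executes code.
        return ast.literal_eval(payload)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"could not parse reply: {exc}") from exc


city_blurb = parse_structured_reply(
    "Sure! {'description': 'The city that never sleeps.'} Enjoy your trip."
)
```

A reply that only suggests checking travel sites contains no brackets at all, so it raises `ValueError` and can be retried instead of being silently stored.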
Here are four example queries I'm using to build my hotel mock data. I'll curate the list of selected destinations ahead of time (probably 50-100 US domestic and 50 international), but these queries should allow me to build up a rich, plausible data set for the site:
Write a 100-word marketing description for visiting New York. Format it as {'description':<description>}
Provide the lat and long for New York as well as 5 points of interest. Structure your output like this:
{'city':<city>,'state':<state>,'lat':<latitude>,'long':<longitude>,'points_of_interest':[<point_of_interest>,...]}
If the city doesn't have a state (like Washington, D.C.), leave the state blank.
List 10 hotels within 5 miles of New York's lat/long (40.7128,-74.0060). Provide their star ratings, addresses, distance in miles from New York's lat/long, a 50-word description of the hotel, and an estimated teaser rate per night for both summer and winter seasons. The rate doesn't need to be up-to-date, but estimate the value based on historical rates. Also account for the rate being off-season or in-season as appropriate.
Format the response as:
[{"name":<name>,"address":<address>,"distance":<distance>,"star_rating":<star_rating>,"description":<description>,"winter_offer_rate":<winter_offer_rate>,"summer_offer_rate":<summer_offer_rate>},...]
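Since key names can drift in long outputs and the numbers aren't guaranteed to be sensible, each hotel record is worth checking before it is stored. The following Python sketch is one possible check, not a definitive one: the key set and the 5-mile radius come from the prompt above, while the helper names and the 1 to 5 star range are my assumptions.

```python
from typing import Dict, List

EXPECTED_KEYS = {
    "name", "address", "distance", "star_rating", "description",
    "winter_offer_rate", "summer_offer_rate",
}
MAX_DISTANCE_MILES = 5.0  # radius requested in the prompt


def to_number(value) -> float:
    """Coerce values such as 120, "120" or "$1,200" to a float."""
    if isinstance(value, (int, float)):
        return float(value)
    return float(str(value).replace("$", "").replace(",", "").strip())


def hotel_problems(hotel: Dict) -> List[str]:
    """Return a list of problems found in one hotel record (empty if OK)."""
    problems = []
    missing = EXPECTED_KEYS - set(hotel)
    unexpected = set(hotel) - EXPECTED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if unexpected:
        problems.append(f"unexpected keys: {sorted(unexpected)}")
    if missing:
        return problems  # the range checks below need every field
    try:
        distance = to_number(hotel["distance"])
        stars = to_number(hotel["star_rating"])
        rates = [
            to_number(hotel[key])
            for key in ("winter_offer_rate", "summer_offer_rate")
        ]
    except ValueError:
        problems.append("non-numeric distance, star rating, or rate")
        return problems
    if not 0 <= distance <= MAX_DISTANCE_MILES:
        problems.append(f"distance {distance} is outside the 5-mile radius")
    if not 1 <= stars <= 5:  # assumption: ratings use a 1 to 5 scale
        problems.append(f"star rating {stars} is out of range")
    if any(rate <= 0 for rate in rates):
        problems.append("rates must be positive")
    return problems
```

A record with a drifted key such as `intimate_offer_rate` shows up as both a missing and an unexpected key, which makes it easy to spot and re-request. Values like "2.3 miles" are reported as non-numeric here; stripping units would be a reasonable extension.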
For "The Langham Chicago Hotel", provide 4 different room offers. Format it as follows:
[{"room_type":<room_type>,"room_description":<room_description>,"occupancy":<occupancy>,"amenities":[<list_of_amenities>],"winter_rate":<winter_rate>,"summer_rate":<summer_rate>,"cancellation_policy":<cancellation_policy>}]
Don't worry if the information isn't up-to-date. Provide a best estimate that matches historical information. For the rates, take into account the seasonality of Chicago, so higher rates aren't present in the off-season.
- Eddie Vedder's Escape
- Kanye's Kingdom
- Oprah's Oasis
- Capone's Castle