A Complete Hotel Universe in 153 Lines

Honestly, I may prefer the Overlook Hotel

Well, not a complete hotel universe, but I wouldn't be a showman if I didn't exaggerate a little bit.  Previously, I broke down how I used AI to help me find locations for travelectable and gather data about the locations that would otherwise be difficult to source.

Today, I show how I added hotel data to each of those locations, providing a more complete set than anything I've encountered elsewhere (other than using real inventory, which is cost-prohibitive).

If you want to view the entire source file, you can find it here on GitHub.

Most of the magic happens in the queries I send to Llama-3:


def get_location_description_query(location):
    return f"""
    Write a 100-word marketing description for visiting {location}.  
    
    Format it as [<description>]

    Do not add a summary or disclaimer at the beginning or end of your reply. Do not deviate from the format.
    """

def get_location_points_of_interest_query(location):
    return f"""
    Provide 5 points of interest for {location}

    Format it as <point_of_interest>|<point_of_interest>|<point_of_interest>|<point_of_interest>|<point_of_interest>

    Do not add a summary or disclaimer at the beginning or end of your reply. Do not deviate from the format.
    """

def get_hotel_query(radius,location,lat,long):
    return f"""
    List 10 hotels within {radius} miles of {location[1]} at lat/long ({lat},{long}).

    Provide their star ratings, addresses, distance in miles from lat/long ({lat},{long}), 
    and a 50-word description of the hotel.  You must provide an answer even if it's an estimate.

    Format the response as:

    [{{"name":<name>,"address":<address>,"distance":<distance>,"star_rating":<star_rating>,"description":<description>}},...]

    Do not add a summary or disclaimer at the beginning or end of your reply. Do not deviate from the format.
    """

def get_room_rate_query(location,hotel_name,address):
    return f"""
    For {hotel_name} at {address} in {location[1]}, provide 4 different room offers. Format it as follows:
    [{{"room_type":<room_type>,"room_description":<room_description>,"amenities":[<list_of_amenities>],"winter_rate":<winter_rate>,"summer_rate":<summer_rate>,"cancellation_policy":<cancellation_policy>}},...]

    Don't worry if the information isn't up-to-date. Provide a best estimate that matches historical information.  
    For the rates, take into account the seasonality of {location[1]} so higher rates aren't present in the off-season.

    Do not add a summary or disclaimer at the beginning or end of your reply. Do not deviate from the format.
    """

The data returned by the LLM has been pretty consistent, which is a huge win for a generative AI machine (Who's a good bot?  You are!  Yes you are!).  The key to that is the tagline I've added to all of the queries:

Do not add a summary or disclaimer at the beginning or end of your reply. Do not deviate from the format.

There's no guarantee that the LLM will always follow my instructions, but in the runs I've done so far, it skips the filler at the beginning of its response (things like the pseudo-friendly "Here you go!") and at the end (things like the pseudo-legal "These are just estimates.  Please check the relevant sites for up-to-date information."), which makes the data easier to parse.

Even if it does ignore the instruction, the number of exceptions I encounter while collecting mock data should be small and manageable.
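For the occasional reply that does arrive wrapped in filler, here is a minimal sketch of a defensive parser. It is not part of the original source file: `parse_llm_list` is a hypothetical helper, and it assumes the payload is the outermost `[...]` span in the reply.

```python
import ast
import json


def parse_llm_list(reply):
    """Extract the outermost [...] span from an LLM reply and parse it.

    Hypothetical helper: returns None when nothing parsable is found,
    so the caller can skip the reply or retry the query.
    """
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end <= start:
        return None
    payload = reply[start:end + 1]
    # Try strict JSON first, then fall back to Python literal syntax
    # (single-quoted strings, for example).
    for parser in (json.loads, ast.literal_eval):
        try:
            result = parser(payload)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(result, list):
            return result
    return None
```

A caller could treat a `None` result as a miss and move on, rather than letting one malformed reply stop the whole collection run.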

The city description Llama-3 provides is surprisingly well-composed.  That's probably because I only asked for 100 words, and the text isn't some Didion-esque love letter but standard marketing copy.

It certainly serves its purpose for mock data (though I still haven't decided what to do with descriptions on the site), but it's something I'd prefer to have a real writer address if the site were to go live.  The text doesn't do anything that evokes passion, and leisure travel is definitely about passion.  I don't differentiate myself from other sites if my copy is generic.

One could argue that online travel agencies are just conduits for booking travel and are indistinguishable otherwise.  While that can be true, if you had the option to use a site that provided wonderful recommendations, and it made you genuinely excited to visit new places or old favorites, wouldn't that be the first site you'd visit?  Even if you had to pay a dollar or two extra on your trip to support that content?

The hotel results are also of good quality.  I'm getting real hotels with real addresses and real distances from the city center.  The descriptions are generic and I have no idea if the star ratings are up-to-date, but, again, it's consistency and verisimilitude that matter most when creating fake data.

I've added a few wrinkles that Codeium assisted me with when generating the data:


import ast
import sqlite3

import utilities  # project helper module that wraps the LLM call


def populate_hotels(location):
    with sqlite3.connect('travel_data.db') as conn:
        curr = conn.cursor()
        curr.execute("SELECT id,name,address FROM hotels WHERE location_id = ?", (location[0],))
        rows = curr.fetchall()
        # We're gating the number of hotels for each location to 40
        if len(rows) >= 40:
            return
        
        hotel_names = [row[1] for row in rows]
        hotel_addresses = [row[2] for row in rows]
        
        curr.execute("SELECT latitude,longitude,hotel_retries FROM destinations WHERE id = ?", (location[0],))
        data = curr.fetchall()
        (lat, long,hotel_retries) = data[0]

        # We've tried enough times to get 40 hotels.  We'll go with what we have.
        if hotel_retries >= 10:
            return
        
        radii = [5,10,15,20,25,30,35,40]
        new_hotels_found = False

        for radius in radii:
            query = get_hotel_query(radius,location,lat,long)
            hotels = utilities.execute_llm_query(query,max_tokens = 1024)
            hotels = ast.literal_eval(hotels)
            for hotel in hotels:
                # There's no guarantee that everything will be formatted properly, but
                # we can limit the number of duplicates by checking the name and address.
                if hotel['name'] in hotel_names or hotel['address'] in hotel_addresses:
                    continue
                else:
                    new_hotels_found = True
                    new_hotel = (hotel['name'],hotel['address'],hotel['distance'],hotel['star_rating'],
                                 hotel['description'],location[0])
                    curr.execute("INSERT INTO hotels (name,address,distance,star_rating,description,location_id) "
                                 "VALUES (?,?,?,?,?,?)", new_hotel)
                    conn.commit()
                    hotel_names.append(hotel['name'])
                    hotel_addresses.append(hotel['address'])

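            if new_hotels_found:
                # A productive search resets the retry counter, in memory and in the database.
                hotel_retries = 0
                curr.execute("UPDATE destinations SET hotel_retries = 0 WHERE id = ?", (location[0],))
                conn.commit()
                # If we've got 10 total hotels, we're done for now.
                if len(hotel_names) >= 10:
                    return
            else:
                # Nothing new so far: record the miss before widening the search radius.
                hotel_retries += 1
                curr.execute("UPDATE destinations SET hotel_retries = ? WHERE id = ?", (hotel_retries,location[0]))
                conn.commit()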

I've decided that I don't want to populate my mock data set with more than 40 hotels per location, but I also don't want to spend the amount of time needed to dynamically populate each location with 40 hotels when someone first searches for the location.  If I've got somewhere between 10 and 40 hotels, I make an incremental call for a few more hotels to build up my inventory slowly every time someone searches for that location. 

If I don't find more hotels near the city center, I expand my search to greater distances (this will prove useful for more remote locations like national parks).  If, after several tries across several sessions (saved in the database as hotel_retries), I don't find anything new, I'll assume we've mined everything possible and just use what we have, regardless of the number.

I'm employing a similar on-demand tactic for room rates.  I'll randomly fetch 10 available hotels from my set and pre-populate those room rates (4 rates per hotel).  I'll also fetch room rates if people click on a hotel without an existing data set to back it.
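The random pre-population step could be sketched roughly like this. The `room_rates` table and its `hotel_id` column are assumptions for illustration (the post doesn't show how rates are stored), and `pick_hotels_needing_rates` is a hypothetical name.

```python
import sqlite3


def pick_hotels_needing_rates(conn, location_id, limit=10):
    """Return up to `limit` random (id, name, address) rows for hotels
    at a location that have no stored room rates yet.

    Assumes a hypothetical room_rates table keyed by hotel_id.
    """
    curr = conn.cursor()
    curr.execute(
        "SELECT h.id, h.name, h.address FROM hotels h "
        "WHERE h.location_id = ? "
        "AND NOT EXISTS (SELECT 1 FROM room_rates r WHERE r.hotel_id = h.id) "
        "ORDER BY RANDOM() LIMIT ?",
        (location_id, limit),
    )
    return curr.fetchall()
```

Each returned row has what get_room_rate_query needs (the hotel name and address), so the same selection works for the click-through case by filtering on a single hotel id instead of picking at random.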

Of all the data I'm asking for, room rates are the most dubious in their authenticity.  I don't think for a second that the room types and room descriptions are available at the listed hotels.  In addition, I'm asking for static rates, which, as anyone who has ever stayed at a hotel knows, don't exist.

But, again, I'm going for consistency and verisimilitude, not accuracy.  And, Llama-3 returns winter and summer rates for some variety.  I could build some randomness into the rates to simulate demand and make changes for occupancy, but that's something I'll consider further down the line.  For now, I'm happy simply to have the data.
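If I do get around to it, the demand simulation might look something like the sketch below. Everything here is an assumption rather than existing code: the `simulated_rate` name, the April-through-September summer window, the 30% occupancy premium, and the 5% noise band are all placeholders.

```python
import random


def simulated_rate(winter_rate, summer_rate, month, occupancy, rng=random):
    """Return a nightly rate adjusted for season, occupancy, and a little noise.

    Assumptions (illustrative only): months 4 through 9 use the summer rate,
    a full hotel costs up to 30% more, and random noise stays within 5%.
    """
    base = summer_rate if 4 <= month <= 9 else winter_rate
    occupancy = min(max(occupancy, 0.0), 1.0)  # clamp to the 0..1 range
    demand_multiplier = 1.0 + 0.30 * occupancy
    jitter = rng.uniform(-0.05, 0.05)
    return round(base * demand_multiplier * (1.0 + jitter), 2)
```

Passing a seeded `random.Random` instance as `rng` keeps the generated rates reproducible between runs, which matters more for mock data than realism does.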

In closing, I can't emphasize enough how powerful the queries are in this exercise.  I'm asking the LLM to indulge its sweet spot - well-known data that doesn't need to be entirely correct and that can be returned in digestible chunks.  

Attempting to create mock data without something that's readily available to participate in my mass deception would require much more effort.  I'd need to go the fake text route (e.g. "Lorem ipsum..." for all city descriptions), build out exquisite and byzantine screen scrapers, or spend time researching and/or writing my own corpus.

There's still a glaring void regarding visual content.  I'll make a valiant attempt to use image generators as part of my content, but, assuming that the results would scare even Kafka on LSD, I'll fall back to freely licensed stock images (Unsplash has a pretty solid API offering).

Until next time, my human and robot friends.
