Michal Bryxí 🌱
1 year ago

I'm going to shoot myself in the foot here: AI as a tool to give good, concise, descriptive and accurate alt text to images has greatly surpassed my own abilities. See my previous toot.

At this point I think it's a hindrance to the quality of an alt text if **I** make it by hand.

Prediction: Alt text will be generated by AI directly on the consumer's side so that *they* can tell what detail, information density, parts of the picture are important for *them*. And pre-written alt text will be frowned upon.

#AI #AltText

Glenn

in reply to Michal Bryxí 🌱 • 1 year ago

Perhaps this is true. Just remember to verify that the alt text is not completely incorrect before you post.

Michal Bryxí 🌱

in reply to Glenn • 1 year ago

@glennsills Yup. A bit 😰 triple checking before hitting send on this one. To be honest...

Jupiter Rowland

in reply to Michal Bryxí 🌱 • 1 year ago

@Michal Bryxí 🌱

Prediction: Alt text will be generated by AI directly on the consumer's side so that *they* can tell what detail, information density, parts of the picture are important for *them*. And pre-written alt text will be frowned upon.

Won't happen.

Maybe AI sometimes happens to be as good as humans when it comes to describing generic, everyday images that are easy to describe. By the way, I keep seeing AI miserably failing to describe cat photos.

But when it comes to extremely obscure niche content, AI can only produce useless train wrecks. And this will never change. When it comes to extremely obscure niche content, AI not only requires full, super-detailed, up-to-date-by-the-minute knowledge of all aspects of the topic, down to niches within niches within the niche, but it must be able to explain it, and it must know whether and to what extent it's necessary to explain it.

I've pitted LLaVA against my own hand-written image descriptions. Twice. Not simply against the short image descriptions in my alt-texts, but against the full, long, detailed, explanatory image descriptions in the posts.

And LLaVA failed so, so miserably. What little it described, it often got wrong. More importantly, LLaVA's descriptions were nowhere near explanatory enough for a casual audience with no prior knowledge of the topic to really understand the image.

500+ characters generated by LLaVA in five seconds are no match against my own 25,000+ characters that took me eight hours to research and write.

1,100+ characters generated by LLaVA in 30 seconds are no match against my own 60,000+ characters that took me two full days to research and write.

When I describe my images, I put abilities to use that AI will never have, including, but not limited to, the ability to join and navigate 3-D virtual worlds. Not to mention that an AI would have to be able to deduce from a picture where exactly a virtual world image was created, and how to get there.

So no, ChatGPT won't write circles around me by next year. Or ever. Neither will any other AI out there.

#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImagDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI

LLaVA

Visual Instruction Tuning

llava-vl.github.io

Jupiter Rowland

2024-03-05 19:27:53

LLaVA vs my own image description

Okay, I've done it.
I've pitted an image-describing AI against myself and one of my own image descriptions.
As the test object, I've used my latest in-world picture, which was first published here, including the exact same full-length image description and explanation from that post. The short description in the alt-text is the same, too, all shortcomings included, but I had to modify the alt-text itself for this specific use case.
Users of Pleroma and its forks, Friendica, Hubzilla and (streams) can see the image here:

Users of Mastodon, Misskey and their respective forks can see it as a file attachment below this post.
My AI of choice was the Large Language and Vision Assistant (LLaVA). I've fed it the image at the resolution at which I've also posted it, namely at a resolution of 800x533 pixels.
The AI took a few seconds to produce this 558-character image description:
The image appears to be a 3D rendering or a screenshot from a video game or a virtual environment. It shows a character standing on a paved path with a brick-like texture. The character is facing away from the viewer, looking towards a sign or information board on the right side of the image. The environment is forested with tall trees and a dense canopy, suggesting a natural, possibly park-like setting. The lighting is subdued, with shadows cast by the trees, indicating either early morning or late afternoon. The overall atmosphere is calm and serene.

I'll talk about this description in more detail in a follow-up.
Now, here comes my description. I didn't write it while looking at the scaled-down image. I didn't write it while looking at the image at its original resolution of 2100x1400 pixels either.
I wrote it while I was still in-world. I could walk around and zoom around and see everything from different angles and at much, much higher resolutions. For example, one of the pictures on the advertising board to the right shows a building that's nearby but outside the borders of the image. I could describe that picture by walking to that very building.
All in all, I took eight hours to produce this 25,271-character image description:
Image description

The picture in this post is a digital rendering from inside a 3-D virtual world based on OpenSimulator, generated in a regular client for this kind of virtual world, also known as a viewer, using shaders and generated shadows, but without ray-tracing. It shows my avatar on a paved path surrounded by conifer trees and with cliffs in the background. Everything in the picture is in monochrome like an old black-and-white film. However, the image is unaltered and shows both my avatar and the scene as they are in-world.
What OpenSimulator is

OpenSimulator is a free, open-source, cross-platform server-side re-implementation of the technology of Second Life. The latter is a commercial 3-D virtual world created by Philip Rosedale, also known as Philip Linden, of Linden Lab and launched in 2003. It is a so-called "pancake" virtual world which is accessed through desktop or laptop computers using standard 2-D screens rather than virtual reality headsets. Second Life had its heyday in 2007 and 2008. It is often believed to have shut down in late 2008 or early 2009 when the constant stream of news about it broke off, but in fact, it celebrated its 20th birthday in 2023, and it is still evolving.
OpenSimulator, OpenSim in short, was first published in January, 2007. Unlike Second Life, it is not one monolithic, centralised world. It is rather a server application for worlds or "grids" like Second Life which anyone could run on either rented Web space or at home, given a sufficiently powerful computer and a sufficiently fast and reliable land-line Internet connection. This makes OpenSim as decentralised as the Fediverse. The introduction of the Hypergrid in 2008 made it possible for avatars registered on one OpenSim grid to travel to most other OpenSim grids.
Second Life and the OpenSim-based worlds are called "grids" because they are flat worlds divided into square areas of 256 by 256 metres each which is roughly 280 by 280 yards. These areas are called "regions".
Where the picture was made

The picture displays a part of Black White Castle, a fairly recent sim in Pangea Grid. "Sim" is short for "simulator" which refers to what is running in a region so that something can be built in it, and avatars can enter it. In Second Life, a sim is always one region. In OpenSim, so-called varsims can span multiple regions, always in a square arrangement with the same number of regions in both directions. Up to 32x32 regions in one sim are possible. Black White Castle only covers one region.
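For those who like to check the numbers: the region and varsim sizes mentioned above can be worked out with a few lines of Python. This is only a rough sketch based on the figures in this description (256-metre regions, square sims of up to 32x32 regions); the function and constant names are made up for this example and are not part of OpenSimulator.

```python
REGION_SIDE_M = 256          # side length of one region in metres (from the text)
MAX_VARSIM_REGIONS = 32      # maximum regions per side of a varsim (from the text)
METRES_PER_YARD = 0.9144     # exact by international definition


def metres_to_yards(metres: float) -> float:
    """Convert a length in metres to yards."""
    return metres / METRES_PER_YARD


def sim_side_metres(regions_per_side: int) -> int:
    """Side length in metres of a square sim spanning n x n regions."""
    if not 1 <= regions_per_side <= MAX_VARSIM_REGIONS:
        raise ValueError("regions_per_side must be between 1 and 32")
    return regions_per_side * REGION_SIDE_M


if __name__ == "__main__":
    # Compare a single-region sim with the largest possible varsim.
    for n in (1, MAX_VARSIM_REGIONS):
        side = sim_side_metres(n)
        print(f"{n}x{n} regions: {side} m, about {metres_to_yards(side):.0f} yd per side")
```

A one-region sim like Black White Castle comes out at 256 metres per side, while the largest possible varsim would be 8,192 metres per side.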
Pangea Grid is a German OpenSim grid with a special focus on arts, architecture and landscaping.
The name "Black White Castle" is most likely borrowed from a section of the innuendo-saturated German comedy film Neues vom Wixxer from 2007, the sequel to Der Wixxer from 2004. Both films are parodies of the German black-and-white mystery thrillers Der Hexer from 1964 and its sequel Neues vom Hexer from 1965. These films, in turn, are part of a series commonly referred to as "Edgar Wallace films" as they're based on crime novels written by the British author Richard Horatio Edgar Wallace. These two films are based on the novel The Ringer from 1926, a revised version of a 1925 novel known by the titles The Gaunt Stranger and Police Work.
Neues vom Wixxer, while generally in colour, picks up black and white, which has grown into a style element of the classic Edgar Wallace films, in a place named "Black-White Castle". As the name indicates, it is entirely black and white for reasons of tradition, everyone and everything inside it included.
The eponymous sim was built in small parts by making entirely new assets with monochrome textures, but mostly by taking existing objects, extracting their textures, exporting them from OpenSim, using an external image editor to reduce their saturation to zero, re-uploading them to OpenSim and replacing the original textures on the objects with their new monochrome versions. The basic ground texture was altered in the same way, and even the sky and the sunlight are devoid of colour. Likewise, it's common for visitors like me to try and make their own avatars entirely black and white.
The sim was built by Bink Draconia who had previously built a sim with the TV series The Good Place, started in 2016, as its theme.
My avatar

My avatar is standing in the middle of the image, the head right of centre by about two or three percent of the image's width due to most of the weight resting on the right foot, the feet a few percent above the bottom edge of the image, roughly centred on average and slightly apart. His back is turned towards the camera, and he is facing away from the camera, so his face is entirely invisible.
He is a male human with fair skin that was altered to light grey and short black hair. He is wearing a dark grey tweed suit with a very large herringbone pattern on the jacket and an even larger herringbone pattern on the trousers. Underneath the jacket, he is wearing a white button-down shirt, of which only a part of the collar above the collar of the jacket and the cuffs below the sleeves of the jacket are visible. In addition, he is wearing a black bowler hat and a pair of dark grey, slightly shiny formal dress shoes.
The ground

Beneath the avatar, there is a straight path with irregular edges that is about five metres or 16 feet wide and leads about 40 metres or 130 feet forward, ahead of and away from the avatar. Its texture shows pavement made of medium grey, rectangular concrete pavers, placed in alternating orientations in a 90-degree herringbone pattern, but rotated against the region's coordinate axes by 45 degrees and against the rough direction of the path by about 20 degrees to the right. The pavers are about twice as large as they would be in real life.
On both sides of the avatar, the pathway widens into a crossing, but the other three paths are beyond the edges of the image.
The ground on the sides of the paved path has a blurry light grey texture with a coarse resolution that is either a desaturated, very light grass texture or thin, dirty snow.
The scenery to the left

To the very left, there is a wooden arrow sign that is approximately rectangular except for the rough shape of the wood, including four notches on the left-hand side, and the extra corner protruding from the right-hand edge that points into the distance along the path. The bottom edge of the sign is at roughly the same height as the middle of my avatar's thighs, and the top edge is a little bit more than twice as high.
The sign has "BlackWhite Castle" written on it in a Fraktur blackletter typeface, reminiscent of bright, shiny embossed metal with some dark shading surrounding it, but with a hard-to-identify texture on it. "BlackWhite" is written as one word, but in Pascal Case with the first letters of both "Black" and "White" as capitals. The writing is a bit less than a third of the height of the sign and about as long as its top and bottom edges. The medium-grey paint has partly come off again, especially near the top, but the writing is still intact. The sign shows the way to the building after which the whole sim is named. The sign is placed on top of a lighter piece of wood with a rectangular cross-section that is a bit thinner than the sign itself and serves as its sign pole.
The sign is surrounded by three identical groups of eight bushels of high grass each, one to its left, one behind it, one to its right and partly in front of it. Most of the grass is less tall than the sign, but some of it, especially in the bushels behind the sign which have been enlarged, is taller. Also, in front of the sign, there is a group of six stone mushrooms at six different sizes which, given the colour-less setting, appear like actual rock. The two biggest ones have a diameter larger than that of my avatar's bowler hat.
There are three mountain pines to the left of the path which are identical, save for their size. The one the farthest away is about 12 metres or 40 feet tall. It is mostly obscured by another pine which is standing a little further to the left in the image and closer to the on-looker, and which is roughly 14 metres or 47 feet tall. Just right of the arrow sign and behind the right-hand grass bushel, there is a pine of about 20 metres or 70 feet, tall enough for its treetop to be beyond the borders of the image. All three cast a shadow on the ground around them and the pathway, as does a fourth 12-metre pine way to the left whose trunk is entirely outside the borders of the image, but whose shadow ends at my avatar's feet.
Between the second and the third pine, closer to the edge of the pathway than any of the pines, there are two rocks lying on the ground. Both take up the same ground area, but the one to the right is about knee-high, and the one to the left is roughly 60 percent higher. There is another group of eight grass bushels, four of which are in front of these two rocks while the other four seem to have fused with the rocks. More grass bushels surround the first pine.
The scenery to the right

To the right of the end of the path, there is another set of six stone mushrooms.
Another mountain pine, just a little shorter than the second one, is standing opposite the second one. Further to the right and further up-front, there are several more conifers of various heights, some only nine metres or 30 feet, others twice as high. The closest of these conifers, also one of the smallest, is at about a quarter of the width of the image away from the right-hand border, and it is the closest to the edge of the pathway.
All trees are made the traditional Second Life and OpenSim way: The trunk is a textured 3-D model. Everything else consists of the same partly transparent texture with branches, twigs and needles on flat surfaces that pass through the trunk and have the texture on both sides. The mountain pines have three such surfaces at angles of 60 degrees from one another, the other conifers have four which are 45 degrees apart. Within the context of the scenery, however, this is hardly noticeable, and it puts less strain on the graphics hardware.
To the right of the path, the ground is covered by a lot more grass, only that most of it is simpler, using one partly transparent 2-D surface for each bushel, and only a bit higher than knee-high at its maximum.
A mostly wooden outdoor info board is protruding to the left from behind the closest of the conifers. Two vertical wooden beams have between them, from top to bottom, a longer but smaller horizontal beam, six rows of two slightly darker horizontal planks and another two horizontal beams, nine much smaller vertical bars standing between these two and connecting them. On top of each of the big vertical beams, two short beams mounted in a 90-degree arrangement carry a roof with a texture that seems to suggest slate shingles. The rooftop is a bit more than 3.60 metres or 12 feet above the ground. On the second row of planks, "Info Board" is written in a lighter tone of grey than the planks themselves. The last two letters are behind the trunk of the conifer in front of the sign.
Below the writing, at eye height, there are three square info panels on the board, each with a wooden frame around it. Only the ones on the left and in the middle are visible; the one to the right is fully obscured by the conifer again.
The panel on the left carries a worn-out advertising poster for BlackWhite Motel which is on the sim as well, in the opposite direction of where my avatar is looking. It is a two-storey building which is shaped like the letter L laid on the ground. On the short side in the left of the picture, it has room 101 and the office on the ground floor and rooms 201 and 202 upstairs. On the long side, it has eight more rooms, numbers 103 through 106 and 203 through 206, only six of which are in the picture; if you visit the motel itself, it becomes clearer why.
There is a parking-lot in front of the building with spaces for eleven cars, separated by white lines. One of the spaces in front of room 104, in the middle of the poster, is occupied by a two-tone white-and-white 1957 Chevrolet Impala four-door sedan which not only lacks hubcaps on its steel wheels, but also has opaque windows in a tone of grey just slightly lighter than the asphalt. Three spaces further to the right, at the right-hand edge of the poster, there is an almost identical car which is only darker all over from the car body to the chrome trim to even the white walls on the tyres.
The ground floor of the building has a concrete walkway in front of itself which is a bit higher than the parking-lot. The upper floor can be accessed via 180-degree angled stairs in the corner of the building and an open gallery on the parking-lot side. Both the gallery and the actual front of the building are supported by vertical columns made of dark grey concrete with a square cross-section, save for one square cut-out in each corner.
All rooms, the office included, have dark grey doors which face the parking-lot, as do their windows which have very dark grey wooden frames and always come in pairs. Rooms 101 and 201 have two pairs of windows, one on each side of the door. The office and room 202 are window-less. The other rooms have one pair of windows to the right of the door. The doors are framed by two columns with a dark grey concrete panel above them. The wall sections with windows are otherwise filled with very light grey brick walls with unusually long bricks. The same bricks are used for the wall sections to the left of rooms 103 through 106 and 203 through 206 which also feature shiny black wall lamps with energy-saving bulbs.
The low walls that surround the gallery between the columns are made of eight long, very bright grey horizontal panels of probably some kind of metal each, topped with dark grey wooden handrails. The one in front of room 201 carries a flashing neon sign which reads "Vacancy" in all-caps with a brighter rectangular frame around it. Also, on top of the roof, near its front edge, in front of rooms 204 and 205, there's a "motel" sign with no caps and a rather unreliable illumination. Both signs are glowing on the poster.
All windows have blinds on the inside which are mostly closed. Only the right-hand blinds of rooms 101 and 201 and the blinds of rooms 103 and 204 are open. For those who want to know, even though it's outside the advertising poster: The blinds of room 106 are pulled up.
On the gallery-supporting column in front of the door to room 104, a medium-grey surveillance camera facing the parking-lot is moving into various positions. Next to the door of room 105, there is a refrigerated container for packaged ice with two side-hinged but actually unmoving bulb plate hatches on it.
On the ground in front of rooms 101 and 104, there are arrows consisting of seven chevrons each which point towards the office door. In order to enhance their effect, a gradient texture scrolls along them on each of them.
In the background behind the motel, a mountain pine rises above the roof in the middle to the left of the "motel" sign. Behind the sign and all the way to the left, there are three more conifers. These trees are basically identical to the ones in this image. Also, left of centre, a snow-covered mountain top rises further in the background.
The left half of the poster is covered by a dark overlay. Near its top, there is a very light grey rectangle aligned with the right-hand edge of the overlay. It has "Best Price" written on it in dark grey letters. Below that, there are five slightly lighter grey dingbats, either teardrop-spoked asterisks (Unicode U+273B) or sparkles (Unicode U+2747), which imply a five-star rating. An almost identical rectangle is just as close to the bottom and aligned with the left-hand edge of the poster, only that it has "Book Now" written on it. Between them, in the middle of the darkened half, "The BlackWhite Motel" is written in three lines with "BlackWhite" joined to one word again. All writing is done in the same narrow slab-serif typeface, and all characters including the dingbats have lighter lines around them that make them appear embossed.
Back to the panels on the info board: The one in the middle is mounted a bit lower than the one on the left. It shows what appears to be a late medieval sea map of a place which I couldn't identify. Due to the limitation of in-world texture sizes to a maximum of 1024x1024 pixels, the rather small writing on the map is indecipherable. Most of it is ocean with some land in the upper half. On the land and in the bottom right corner, there are typical illustrations for maps from those days. The map shows its age with its darker tint and its jagged edges, and its shading makes it appear like it had been folded to a sixteenth of its original size before.
In front of the info board from the on-looker's point of view and actually between it and the conifer nearby, there is another group of eight grass bushels.
In front and partly to the right of the conifer, there is an object which doesn't exist in real life, but which is typical for OpenSimulator: an official OpenSimWorld beacon of the latest generation, but modified to fit the style of the sim.
This particular device has a shiny black foot with a long rectangular footprint which is about 80 percent as high as it is deep and tapered upward, and which has rounded edges. It carries the less shiny main body of the device. It starts narrower than the top surface of the foot in all directions. From bottom to top, it first protrudes forward and immediately increases slightly in depth, curves backward and continues in a straight slope which still goes more upward than backward. Eventually, it curves upward and ends in a slim, rounded top. Transversally, it keeps the same width all the way. Both sides are carved out and illuminated, normally in cyan, here in almost white. Otherwise, it comes in its standard dark grey. However, it's actually a brownish anthracite grey, and the very top shows some light blue, so while it clearly hasn't received the monochrome treatment all over, a closer look also reveals that it should have. The same goes for the foot which is slightly bluish.
The straight section of the main body carries a shiny black frame with the central element of each OpenSimWorld beacon: the touch display with a ratio of 4:3. When not in use, this specimen shows the standard idle screen, only that it was modified to monochrome. Slightly above the middle, there is the official OpenSimWorld logo, namely the word "OpenSimWorld" itself with no actual caps. However, the "O" at the beginning is replaced with a circle matching the rounded sans-serif typeface which contains a stylised globe tilted to the left by an angle similar to Earth's inclination and showing three parallels and two meridians, but no land underneath. The last five letters, "world", are darker than the rest. Below it, in the same typeface, but in an even lighter grey, and without caps again, but a bit smaller, "teleporter" is written. Both lines also have shaded outlines that make them appear imprinted.
Further below, "Click for destinations" is written, still in the same typeface and in about the same shade of grey as "OpenSim" above, but small enough to appear shorter than "teleporter" above. The background of the screen is a very light grey on the top 35 percent, medium grey on the bottom 35 percent and a gradient between the two. Clicking the screen breaks the monochrome theme, though, because the user interface which then appears has not been modified.
Lastly, there's a light grey panel on the front side of the foot which is scripted, too. It has "Like or comment this region" written on it in two lines in the same typeface as the writing on the touch screen, but with medium grey outlines. On the left, there is a medium grey thumb-up symbol, and on the right, there is a speech bubble with three dots in it in two shades of medium grey.
An OpenSimWorld beacon serves several purposes. For one, it transmits information about the sim to the website OpenSimWorld. This information includes not only the name of the sim and whether it's currently online, but also how many avatars are currently on the sim. The identities of these avatars are not transmitted, only how many they are. This makes finding sims with activity on them easier for users who want to go to parties or otherwise get into contact with others, for OpenSim's general population density is much, much lower than Second Life's. This feature also helps generate rather controversial statistics about how popular any given sim is.
OpenSimWorld itself can be seen as the third-party centre of the decentralised Hypergrid. It started out about a decade ago as a sim catalogue, making navigating the Hypergrid and finding places much easier and more convenient than previous solutions like teleport stations or simply exchanging landmarks. Sims must be listed manually by registered users, and they need one OpenSimWorld beacon in-world. For example, this is the entry for Black White Castle.
In addition, OpenSimWorld offers discussion forums, user-created information and discussion groups for various topics, announcements of in-world events, information about free or paid land rentals other than whole sim rentals by grids, a catalogue for in-world scripts etc.
The other purpose of an OpenSimWorld beacon is as a teleporter which gives you access to currently about 1,700 sims all over the Hypergrid by means of a crowd-sourced sim list, namely that on OpenSimWorld itself. If you click the touch screen, it shows a list with the ten sims known to OpenSimWorld with the most avatars on them. Each sim is listed with its activity ranking, its name, the letter "A" in square brackets if it is Adult-rated and the number of avatars on it. The list can be navigated page by page with always ten sims on them. However, while it gets the information it shows directly from OpenSimWorld, it doesn't show any further information, not about the sim and not about whatever event may be on-going on any given sim. Clicking on a listed sim will immediately teleport you there, but it won't tell you what the place is where the beacon is taking you.
After a while of inactivity, the touch screen switches back into its idle mode.
Clicking the panel on the foot leaves a like on the entry of the sim.
The shadow of the tallest mountain pine on the left-hand side of the pathway is cast on the OpenSimWorld beacon.
All the way to the right, two leaves of an otherwise out-of-frame fern reach into the picture. Further above and in the background, the lower one of another pair of rocks appears with the higher one being to the right of it and hidden behind the trunk of a conifer.
The background

Just right of the avatar's head in the picture, the paved path ends at the foot of a rock cliff which spans the whole width of the image. The cliff is about nine metres or thirty feet high. A narrower, rocky path leads to the right and upward from just right of the middle of the paved path to about 40 percent of the height of the cliff. Then, hidden behind the mountain pine tree to the right of the end of the paved path, it takes a sharp turn of roughly 180 degrees to the left and ascends to about 80 percent of the height of the cliff. Right above where the paved path ends, almost right above my avatar's head, the cliff path takes another sharp U-turn to the right and ascends in a fairly gentle slope until it reaches the top of the cliff about as far right as the first turn.
The cliff extends to both sides at a constant height, save for its jagged upper edge. On the left and within the borders of the image, it does so roughly parallel to the paved path. It surrounds a largely snow-covered plateau with more mountain pines and other conifers on it and Black White Castle itself.
To the left of the top of the second mountain pine from the centre, upward from the pair of rocks left of the paved path, the top right corner of Black White Castle's dark grey roof of unidentified material peeks through a gap between the trees. The whole rest of the building is either hidden behind the forest, outside the image borders or both.
Further in the background, snow-covered mountains rise high above the treetops. These are actually already outside the sim, reaching into regions with no sim running on them. A little bit of sky appears to the right of the mountains. It is clear, but true to the visual theme of the whole sim, it is deep grey.
The final details

The camera is roughly at realistic eye height and oriented south-by-southwest-ward. The position of the Sun as the only directed light source in the picture is unusual for OpenSim, namely in the southeast. It is permanently fixed in this place because making one single setting for the sky is a great deal easier than making settings for a whole day. But if it were moving, it would not do what it almost always does in OpenSim and pass through the zenith. Still, judging by the length of the tree shadows since the Sun is absent as an actual celestial body in the sky, it is too high up for winter.

Now, if you've made it all the way down here, I ask you: Which description is more accurate? Which description is more detailed? Which description is more informative? Which description actually helps you understand the image?
#Long #LongPost #CWLong #CWLongPost #BlackAndWhite #Monochrome #OpenSim #OpenSimulator #Metaverse #VirtualWorlds #AltText #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

Okay, so what is this OpenSim thing?
The free, decentralised metaverse is older than you may think
hub.netzgemeinde.eu

Michal Bryxí 🌱

in reply to Jupiter Rowland • 1 year ago

@jupiter_rowland So tell me *exactly* where I'm getting this wrong: I designed a single test, for a single image, for a single LLM. And then I myself executed said test, comparing said LLM+image+execution against myself as the benchmark baseline. I intentionally ignored all the infinite variations of the possible results *I* as an author can request from *the machine* and let it produce _some_ output. More than that, I intentionally made the test run under _quantitatively_ different conditions where I, either by ignorance or lack of knowledge, did not allow the machine to expand its output to be of a comparable size to mine (which is trivial to achieve btw). I am also willingly admitting that I used *more* information than said image for my benchmark and I intentionally _did not_ provide said information to the machine I was benchmarking against.

And from there I concluded that my results are better.

I can see what you tried to achieve there. But I'm sorry to say that my own benchmark, designed by me, executed by me and summed up also by me, came up with a result: Nah.

@Jupiter Rowland

in reply to Michal Bryxí 🌱

Michal Bryxí 🌱

in reply to Michal Bryxí 🌱 • 1 year ago • •

Let's try a simple test: Without any context, if you're a user who *relies* on alt text to understand the content of pictures here on the Fediverse, would you rather have alt text that is:

#AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta

558-characters long (0%, 0 votes)
25,271-characters long (0%, 0 votes)
🍿 (100%, 1 vote)

1 voter. Poll end: 1 year ago

in reply to Michal Bryxí 🌱

Jupiter Rowland

in reply to Michal Bryxí 🌱 • 1 year ago • •

@Michal Bryxí 🌱

Without any context

The context matters. A whole lot.

A simple real-life cat photograph can be described in a few hundred characters, and everyone knows what it's all about. It doesn't need much visual description because it's mainly only the cat that matters. Just about everyone knows what real-life cats generally look like, except from the ways they differ from one another. Even people born 100% blind should have a rough enough idea what a cat is and what it looks like from a) being told it if they inquire and b) touching and petting a few cats.

Thus, most elements of a real-life cat photograph can safely be assumed to be common knowledge. They don't require description, and they don't require explanation because everyone should know what a cat is.

Now, let's take the image which LLaVA has described in 558 characters, and which I've previously described in 25,271 characters.

For one, it doesn't focus on anything. It shows an entire scene. If the visual description has to include what's important, it has to include everything in the image because everything in the image is important just the same.

Besides, it's a picture from a 3-D virtual world. Not from the real world. People don't know anything about this kind of 3-D virtual world in general, and they don't know anything about this place in particular. In this picture, nothing can safely be assumed to be common knowledge. For blind or visually-impaired users even less.

People may want to know where this image was made. AI won't be able to figure that out. AI can't examine that picture and immediately and with absolute certainty recognise that it was created on a sim called Black-White Castle on an OpenSim grid named Pangea Grid, especially seeing as that place was only a few days old when I was there. LLaVA wasn't even sure if it's a video game or a virtual world. So AI won't be able to tell people.

AI doesn't know either whether any of the location information can be considered common knowledge or whether it needs to be explained so that humans will understand it.

I, the human describer, on the other hand, can tell people where exactly this image was made. And I can explain it to them in such a way that they'll understand it with zero prior knowledge about the matter.

Next point: text transcripts. LLaVA didn't even notice that there is text in the image, much less transcribe it. Not transcribing every bit of text in an image is sloppy; not transcribing any text in an image is ableist.

No other AI will even be able to transcribe the text in this image, however. That's because no AI can read any of it. It's all too small and, on top of that, too low-contrast for reliable OCR. All that AI has is the image I've posted at a resolution of 800x533 pixels.

I myself can see the scenery at nigh-infinite resolution by going there. No AI can do that, and no LLM AI will ever be able to do that. And so I can read and transcribe all text in the image 100% verbatim with 100% accuracy.

However, text transcripts require some room in the description, also because they additionally require descriptions of where the text is.

I win again. And so does the long, detailed description.

Would you rather have alt text that is:

I'm not sure if this is typical Mastodon behaviour because it's impossible for Mastodon users to imagine that images can be described elsewhere than in the alt-text (they can, and I have), or if it's intentional trolling.

The 25,271 characters did not go into the alt-text! They went into the post.

I can put so many characters into a post. I'm not on Mastodon. I'm on Hubzilla which has never had and still doesn't have any character limits.

In the alt-text, there's a separate, shorter, still self-researched and hand-written image description to satisfy those who absolutely demand there be an image description in the alt-text.

25,271 characters in alt-text would cause Mastodon to cut 23,771 characters off and throw them away.

in reply to Jupiter Rowland

Jupiter Rowland

in reply to Jupiter Rowland • 1 year ago • •

@Michal Bryxí 🌱 And since you obviously haven't actually read anything I've linked to, here's a quote-post of my comment in which I dissect the first AI description.
#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImagDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI

LLaVA vs my own image description -

^{hub.netzgemeinde.eu}

Jupiter Rowland

2024-03-05 19:28:12

(This is actually a comment. Find another post further up in this thread.)
Now let's pry LLaVA's image description apart, shall we?
The image appears to be a 3D rendering or a screenshot from a video game or a virtual environment.

Typical for an AI: It starts vague. That's because it isn't really sure what it's looking at.
This is not a video game. It's a 3-D virtual world.
At least, LLaVA didn't take this for a real-life photograph.
It shows a character

It's an avatar, not a character.
standing on a paved path with a brick-like texture.

This is the first time that the AI is accurate without being vague. However, there could be more details to this.
The character is facing away from the viewer,

And I can and do tell the audience in my own image description why my avatar is facing away from the viewer. Oh, and that it's the avatar of the creator of this picture, namely myself.
looking towards a sign or information board on the right side of the image.

Nope. Like the AI could see the eyeballs of my avatar from behind. The avatar is actually looking at the cliff in the background.
Also, it's clearly an advertising board.
The environment is forested with tall trees and a dense canopy, suggesting a natural, possibly park-like setting.

If I'm generous, I can let this pass as not exactly wrong. Only that there is no dense canopy, and this is not a park.
The lighting is subdued, with shadows cast by the trees, indicating either early morning or late afternoon.

Nope again. It's actually late morning. The AI doesn't know because it can't tell that the Sun is in the southeast, and because it has got no idea how tall the trees actually are, what with almost all treetops and half the shadow cast by the avatar being out of frame.
The overall atmosphere is calm and serene.

In a setting inspired by thrillers from the 1950s and 1960s. You're adorable, LLaVA. Then again, it was quiet because there was no other avatar present.
There's a whole lot in this image that LLaVA didn't mention at all. First of all, the most blatant shortcomings.
To begin with, the colours. Or the lack of them. LLaVA doesn't say with a single word that everything is monochrome. What it's even less aware of is that the motif itself is monochrome, i.e. this whole virtual place is actually monochrome, and the avatar is monochrome, too.
Next, what does my avatar look like? Gender? Skin? Hair? Clothes?
Then there's that thing on the right. LLaVA doesn't even mention that this thing is there.
It doesn't mention the sign to the left, it doesn't mention the cliff at the end of the path, it doesn't mention the mountains in the background, and it's unaware of both the bit of sky near the top edge and the large building hidden behind the trees.
And it does not transcribe even one single bit of text in this image.
And now for what I think should really be in the description, but what no AI will ever be able to describe from looking at an image like this one.
A good image description should mention where an image was taken. AIs can currently only tell that when they're fed famous landmarks. AI won't be able to tell anytime soon, just from looking at this image, that it was taken at the central crossroads at Black White Castle, a sim in the OpenSim-based Pangea Grid. And I'm not even talking about explaining OpenSim, grids and all that to people who don't know what it is.
Speaking of which, the object to the right. LLaVA completely ignores it. However, it should be able to not only correctly identify it as an OpenSimWorld beacon, but also describe what it looks like and explain to the reader what an OpenSimWorld beacon is, what OpenSimWorld is etc. because it should know that this cannot be expected to be common knowledge. My own description does that in roughly 5,000 characters.
And LLaVA should transcribe what's written on the touch screen which it should correctly identify as a touch screen. It should also mention the sign on the left and transcribe what's written on it.
In fact, all text anywhere within the borders of the picture should be transcribed 100% verbatim. Since there's no rule against transcribing text that's so small that it's illegible or that's so tiny that it's practically invisible or that's partially obscured or partially out of frame, a good AI should be capable of transcribing such text 100% verbatim in its entirety as well. Unless text is too small for me to read in-world, I can and do that.
And how about not only knowing that the advertising board is an advertising board, but also mentioning and describing what's on it? Technically speaking, there's actually a lot of text on that board, and in order to transcribe it, its context needs to be described. That is, I must admit I was sloppy myself and omitted a whole lot of transcriptions in my own description.
Still, AI has a very very long way to go. And it will never fully get there.
#Long #LongPost #CWLong #CWLongPost #AltText #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA

in reply to Jupiter Rowland

Jupiter Rowland

in reply to Jupiter Rowland • 1 year ago • •

@Michal Bryxí 🌱 And while I'm at it, here's a quote-post of my comment in which I review the second AI description.
#Long #LongPost #CWLong #CWLongPost #VirtualWorlds #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImagDescriptionMeta #LLaVA #AI #AIVsHuman #HumanVsAI

Jupiter Rowland

2024-05-17 22:24:46

It's almost hilarious how clueless the AI was again. And how wrong.
First of all, the roof isn't curved in the traditional sense. The end piece kind of is, but the roof behind it is more complex. Granted, unlike me, the AI can't look behind the roof end, so it doesn't know.
Next, the roof end isn't reflective. It isn't even glossy. And brushed stainless steel shouldn't really reflect anything.
The AI fails to count the columns that hold the roof end, and it claims they're evenly spaced. They're anything but.
There are three letters "M" on the emblem, but none of them is stand-alone. There is visible text on the logo that does provide additional context: "Universal Campus", "patefacio radix" and "MMXI". Maybe LLaVA would have been able to decipher at least the former, had I fed it the image at its original resolution of 2100x1400 pixels instead of the one I've uploaded with a resolution of 800x533 pixels. Decide for yourself which was or would have been cheating.
"Well-maintained lawn". Ha. The lawn is painted on, and the ground is so bumpy that I wouldn't call it well-maintained.
The entrance of the building is visible. In fact, three of the five entrances are. Four if you count the one that can be seen through the glass on the front. And the main entrance is marked with that huge structure around it.
The "few scattered clouds" are mostly one large cloud.
At least LLaVA is still capable of recognising a digital rendering and tells us how. Just you wait until PBR is out, LLaVA.
#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #AltText #AltTextMeta #CWAltTextMeta #ImageDescription #ImageDescriptions #ImageDescriptionMeta #CWImageDescriptionMeta #AI #LLaVA