Today, we’re headed to the frozen north. Despite the snow on the ground, the sun is out and the light is perfect for a brisk shoot at the weather-worn cabins of Colter.
Two months ago, I fell into the trap that is Stable Diffusion. Today, I released my first trained model, based on the snowbound town of Colter from Red Dead Redemption 2. For anyone interested in SD image generation, you can grab a copy at CivitAI: https://civitai.com/models/137327. I’d appreciate you taking a look and giving it a like or a rating if you’re so inclined. The LoRA is stylistically versatile, and there are a bunch of SFW examples I made showing its range.
As always, images link to full-size PNGs that contain prompt metadata.
I…have got to learn more about this AI shit….
AckbarItsATrap.gif
Too time-consuming?
Thanks for your contributions to the community!
I have questions if you don’t mind.
I’m really trying to get into LoRA training in general, and there are a lot of things I can’t intuitively work out or find solid answers to.
For example, with this, what is your “class”? And did you use regularization images? (I want to make a habit of using them). If you did, what did you use for them? Like places that aren’t this? Like deserts and forests, etc?
Would you consider elaborating on batch size, repeats, epochs, etc, too?
Thanks again!
There’s not much out there on training LoRAs that aren’t anime characters, and that just isn’t my thing. I don’t know a chibi from a booru, and most of those tutorials sound like gibberish to me. So I’m kind of just pushing buttons and seeing what happens over lots of iterations.
For this, I settled on the class of “place”. I tried “location”, but it gave me strange results, like lots of pictures of maps and GPS-type screens. I didn’t use any regularization images; like you mentioned, I couldn’t think of what to use. I think regularization would be more useful in face training anyway.
I read that a batch size of one gave more detailed results, so I set it there and never changed it. I also didn’t use any repeats, since I had 161 images.
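For anyone reproducing this in Kohya_ss, the repeats and class aren’t standalone settings so much as they’re encoded in the training folder name. Here’s a sketch of the layout I mean (the “colter” trigger word and the paths are placeholders, not my exact folders):

```
img/
  1_colter place/    <- "<repeats>_<trigger> <class>": 1 repeat, class "place"
    0001.png
    0001.txt         <- BLIP caption for 0001.png
    ...              (161 image/caption pairs)
```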
I did carefully tag each photo with a caption .txt file using Utilities > BLIP Captioning in Kohya_ss. That improved results over the versions I made with no tags. Results improved dramatically again when I went back and manually cleaned up the captions to be more consistent, for instance consolidating “building”, “structure”, “barn”, “church”, and “house” all to just “cabin”.
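If anyone wants to do that cleanup pass in bulk instead of by hand, here’s a minimal sketch; the folder path and word list are assumptions to adapt, not my exact setup:

```python
# Collapse the terms BLIP used inconsistently into one consistent word.
import re
from pathlib import Path

CAPTION_DIR = Path("img/1_colter place")  # Kohya-style training folder (placeholder)
SYNONYMS = ["building", "structure", "barn", "church", "house"]
pattern = re.compile(r"\b(" + "|".join(SYNONYMS) + r")\b")

for txt in CAPTION_DIR.glob("*.txt"):
    txt.write_text(pattern.sub("cabin", txt.read_text()))
```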
Epochs was 150, which gave me 24,150 steps. Is that high or low? I have no idea. They say 2000 steps or so for a face, and a full location is way more complex than a single face… It seems to work, but it took me 8 different versions to get a model I was happy with.
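For anyone checking my math, the step count falls straight out of the other settings:

```python
# steps = images * repeats * epochs / batch size
images, repeats, epochs, batch_size = 161, 1, 150, 1
print(images * repeats * epochs // batch_size)  # 24150
```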
Let me know what ends up working for you. I’d love to have more discussions about this stuff. As a reward for reading this far, here’s a sneak peek at my next LoRA, based on RDR2’s Guarma island: https://files.catbox.moe/w1jdya.png. Still a work in progress.
Oof. Dude. You’re not wrong about what is and isn’t available online. But it’s okay. New frontier or whatever. Haha.
I’ve been mulling over the regularization image thing, so I created a reddit post asking about it. Basically I asked: are these images supposed to represent what the model thinks “this” thing is, in which case regularization images serve the role of “this, but not this”? Or is it more like they fill in the gaps when the LoRA is lacking?
I suspect it’s more like the first. That said, it might actually make sense to include all the defective and diverse images for the purpose of basically instructing the LoRA/model to be like, “I know you think I’m asking for ‘this,’ but in reality, that’s not what I want.”
If that’s the case, it might make sense to ENSURE your regularization images are way off base and messed up or whatever. Or at least anything in the class that you know you def don’t want.
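For whatever it’s worth, my current mental model is the DreamBooth-style prior-preservation idea: the reg images feed a second loss term that penalizes the model for drifting on the generic class while it learns your specific thing. Something like this rough sketch (my naming and weighting, not kohya’s actual internals):

```python
import torch.nn.functional as F

# Prior preservation as I understand it: learn the instance, but also keep
# predictions on regularization (class) images close to their targets.
def training_loss(pred_instance, target_instance, pred_reg, target_reg,
                  prior_weight=1.0):
    instance_loss = F.mse_loss(pred_instance, target_instance)
    prior_loss = F.mse_loss(pred_reg, target_reg)  # don't forget the generic class
    return instance_loss + prior_weight * prior_loss
```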
I don’t have confirmation of any of this. I’m VERY new here (like ran my first LoRA training yesterday).
I like the idea of your batch size.
Ah. The captioning is something I REALLY need to think about. With the cabin caption you used, I’m guessing you lost flexibility but gained accuracy by going that route? I wonder if you could tag it “cabin, church” and retain some of both?
The steps, to me, sound very high, but I can’t say for sure. Ahaha. Because for people, I’ve heard 1,500 to 3,000.
I’ll be sure to come back and share findings once I have more. I think to really “do this right” you HAVE to train some of your own shit, but to do it well, as you’ve quickly realized, you’ve got to understand the methodology/philosophy of how it’s done.
Well. Maybe scratch some of what I said above. As with many things, the answer is simply more complicated than that.
I found this video fairly useful in helping me understand the process. I hope it helps.
Thanks for always sharing the process. These are really good. The backgrounds look great. Always impressed you pull these from games.
Thank you, it still feels like magic to me, so it’s fun to see how SD reacts to different inputs.
First, another great album, thank you!
One thing I always find interesting is to look at the clothes generated. They’re usually kinda bizarre, but I think the jacket/jumper thing from #6 would actually look pretty cool in real life.
I get a lot of half shirts and sweaters with very unconventional cut outs for sure. SD has trouble “bunching” fabric for the lift/reveal shots, so it likes to just cut things off.
I do like that black & blue knit+leather number though. It’s unusual, but really cute. I did a double take when that came out of the diffusion soup.
This is really amazing work!
Hey, these look great, and that 1st one is just… wow. What model do you use?
Thank you! I use a bunch of different custom merges. See this very large X/Y comparison grid I posted earlier: https://files.catbox.moe/1k6mmr.jpg
I’d recommend Absolute Reality or LazyMix for an off-the-shelf model.